Table of Contents for Programming Languages: a survey

Chapter : History and families of programming languages


ctm (close-to-the-metal) (low-level vs high-level)

What is a low-level language? Two definitions are:

More specifically, a low-level language:

One common class of CPU-time overhead is automatic memory management. In languages with garbage collection, a call to the garbage collector sometimes takes a long time, and in languages with reference counting, a single reference decrement can in the worst case trigger an arbitrarily long cascade of memory deallocations. Automatic memory management may also fail to deallocate memory as soon as the programmer knows that it could be deallocated, which adds memory overhead. Furthermore, a programmer who wants to pass a pointer to memory allocated by their program to an outside process must interact closely with the memory manager, and in many languages with automatic memory management the necessary primitives do not exist in the base language (most languages have an extension API that includes these facilities, but using the extension API often requires another language, such as C). For these reasons, 'lack of mandatory automatic memory management' is considered by many to be another criterion for a language to be considered 'low-level'.
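To make the reference-counting cascade concrete, here is a small Python sketch (CPython's reference counting is assumed; the `Node` chain is a hypothetical example): dropping the last external reference to the head of a linked chain deallocates every node in turn.

```python
import weakref

class Node:
    def __init__(self, next_node=None):
        self.next = next_node

# build a chain a -> b -> c
c = Node()
b = Node(c)
a = Node(b)

finalized = []
weakref.finalize(a, finalized.append, "a")
weakref.finalize(b, finalized.append, "b")
weakref.finalize(c, finalized.append, "c")

del b, c   # the interior of the chain is now reachable only through 'a'
del a      # in CPython, this single decrement cascades down the whole chain

print(finalized)  # in CPython: ['a', 'b', 'c']
```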

If there are two languages A and B, and A compiles to B, then according to our definition, B <= A, because if B does not give you the capability to control the CPU or memory in some way, then neither can A. One might think that the ordering is strict, that is, B < A, that is, that A must be at least a little higher-level than its compilation target language, but this is not true; for example, you can write a 'compiler' that compiles a language to itself by merely returning as output the same source code that it was given. One can imagine less trivial examples, too: there could be two languages which are equally able to express low-level constructs but which express them in different ways; such a pair of languages could compile to each other.

Some fans of high-level languages might define things a little differently; instead of enumerating abilities of low-level languages, they might enumerate lists of things that their favorite high-level languages 'take care of for you' so that you 'don't have to worry about' them.

todo leaky abstractions

see also Part Implementation, Chapter Efficiency.

continuing on..

discipline (safe vs unsafe; tradeoff between programming time and debugging time (part of this: protect programmers from themselves), also flexibility vs self-documenting architecture for programming-in-the-large, also making others' code easy to read) (Yegge's liberal vs. conservative: Yegge gives some specific features that are polarized in this spectrum:


performant/not (fast/slow)


functional/not note: the word 'functional' is used in multiple ways when talking about programming languages:

Often, the latter two concepts coincide; because their incidence is highly correlated, some confusion has developed between them, and the word 'functional' has come to mean either or both. Languages which are both (pure function)-centric and (first class function)-centric often are based on a lambda calculus-like fundamental model.

An example of how one tends to compute by applying higher-order functions to other functions, instead of by iteration, can be found in, which shows how one could calculate a moving average in Haskell:

delay n lst = replicate n 0 ++ lst
deltas n lst = map (uncurry (-)) (zip lst (delay n lst))
slidingSums n lst = scanl1 (+) (deltas n lst)
movingAverage n list = map (/ (fromIntegral n)) (slidingSums n list)
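For readers unfamiliar with Haskell, here is a rough Python translation of the same pipeline (the function name mirrors the Haskell definitions; `itertools.accumulate` plays the role of `scanl1 (+)`):

```python
from itertools import accumulate

def moving_average(n, lst):
    delayed = [0] * n + lst                         # delay n lst
    deltas = [x - d for x, d in zip(lst, delayed)]  # deltas n lst
    sliding_sums = list(accumulate(deltas))         # slidingSums n lst
    return [s / n for s in sliding_sums]            # movingAverage n lst

print(moving_average(2, [1, 2, 3, 4]))  # [0.5, 1.5, 2.5, 3.5]
```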

some make the point that state cannot be avoided completely; the point of functional languages is not to eliminate state entirely, but to avoid state as a default and to make state more explicit

(supposed) advantages of the functional style (deriving from eschewing implicit state):

statically typed/dynamically typed/gradually typed/mostly untyped

should we list models of computation here? maybe just the most 'important' ones in terms of programming language families:

turing, assembly, imperative
lambda calc, ?, haskell/purely functional
concatenative, ?, forth?, joy
relational, ?, SQL
logic programming, ?, prolog
combinatorial, Nock, Hoon
mu-recursive, ?, ?
grammar-based, O-meta, ?

(see also the list of Turing-equivalent systems in the formal chapter)

early binding/late binding (static/dynamic)

strictly typed/not

scripting/systems/application/end-user/embedded scripting/glue scripting: one-liners from the command line?

??? size types/design goals ???

typical practical languages

a language that is invented for people to write programs in.

(note: one might say 'general purpose' but this is more often used to talk about application domain; e.g. general purpose vs. systems vs. scripting)

agglomerative vs. spartan

agglomerative: 'offer all features that programmers find useful'

spartan: 'be opinionated.'

motivations for spartan can be:

agglomerative: non-orthogonal features are okay, because they save programmers time

spartan: minimal orthogonal subset of features

the defining characteristic of a spartan language design is to say, "Yeah, we agree that this feature would be commonly and practically useful, but it is not strictly necessary and we're not going to include it, ever". (note the 'ever': all programming languages have to initially omit some desired features because their designers don't have time to implement them, or are as yet unsure what the best design for the feature is) (note the 'commonly': all programming languages omit features that they think would be of only niche usefulness, such that their complexity cost outweighs their benefit) (note the 'agree': often a language designer feels that some proposed feature would not really serve any significant use, and that existing mechanisms are just as good) (that being said, these criteria are rarely absolutely met; there is almost always some argument that adding feature X has disadvantage Y). The usual reason given for the omission is that some other way of accomplishing the same thing is more in keeping with the desired style.

'spartan' is similar to the term 'Fluchtpunkt' as used in

examples: Python's OOWTDI ("one obvious way to do it"), e.g. iteration over recursion, exclusion of tail call optimization

examples: Golang's dislike of exceptions, possibly Golang's exclusion of generics

examples of some agglomerative languages: 'practical lisps' from


a language with a very heavy focus on simplicity.

usually because it is an intermediate target of compilers and/or interpreters, for interoperability or for other reasons. Sometimes just for the sake of simplicity (e.g. Nock).

what sets this apart from 'spartan' languages is that a 'spartan' language will still include non-orthogonal features if their exclusion would make the language just too annoying to program in; a kernel language, by contrast, isn't expected to be sensible for humans to program in, or even to read; even the language designer prefers to program in something higher-level.

a kernel language is not necessarily completely orthogonal.

A common case of non-orthogonality is the inclusion of multiple ways to do the same thing for reasons of efficiency, e.g. a kernel language might provide an integer addition operator even though Successor (+1), along with basic control constructs, is theoretically sufficient to express addition, because on most platforms integer addition can be accomplished in O(1) (constant time) by an optimized primitive, whereas calling Successor n times to add n to a number is an O(n) operation. Similarly, a kernel language might provide floating-point data types and arithmetic even though it could theoretically only provide bitstreams or only provide integers, and expect a library to handle floating point, because by providing floating-point primitives it can map these to optimized implementations on the underlying platform.
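A sketch of the efficiency argument in Python (the successor encoding is a hypothetical illustration; the point is only the asymptotics): addition built from Successor takes n steps, while the primitive takes one.

```python
def succ(x):
    return x + 1

def add_via_succ(a, n):
    # O(n): apply Successor n times to add n to a
    for _ in range(n):
        a = succ(a)
    return a

# same result as the O(1) primitive addition operator
assert add_via_succ(3, 4) == 3 + 4
```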


interoperable common targets:

intermediate targets and cross-compilation bootstrap subsets of various languages:

languages which have spartan motivations to be kernel languages:

languages which have pedagogical motivations to be a kernel language:

embeddable and scripting languages

languages which are designed to be embedded within, or used as scripting for, other applications or devices.

such languages are sometimes very small in order to make them efficient and portable.

sometimes, like DSLs, they have features for specific niche application domains.



a language with an even more extreme focus on simplicity than kernel languages. Generally these calculi are absolutely orthogonal.

The purpose of the calculi is usually not to program in or to compile to, but rather to serve as a mathematical abstraction about which theorems may be proved. The smallness of the language is crucial because proofs often have to treat each part of the language by cases. The length, readability, and often the efficiency of the resulting programs are immaterial.

examples: Turing machines, lambda calculus, SKI combinator calculus

Turing-level DSL

a "domain-specific language" is a term for a language which, rather than being "general purpose" or even with a tilt towards a domain such as 'systems programming', is narrowly targeted at a niche domain.

todo examples:

sub-Turing DSL

a sub-Turing DSL is a DSL that is not Turing complete

todo examples:

data language

a data language is a language intended to express data, but not computation on that data

examples: HTML, XML, the literal-defining subset of any other language


todo: merge with above?

(big ideas; constraints)

these are not mutually exclusive

Structured/procedural programming


GOTO considered harmful ; ban (discipline) on GOTO; flow control via primitives for selection (if) and iteration (e.g. while) and subroutine call and return

call stack


differing opinions on what it is

encapsulation of data with methods

forced encapsulation (access modifiers) vs. consenting adults

object scope; self object

message passing


Bob Martin claims that OOP also involves a ban (discipline) on calling functions via pointers; he claims that, because polymorphism covers the main reason to call functions via pointers, the polymorphism in OOP tends to come with a removal of the ability (or of the cultural encouragement) to do this at all, improving discipline (
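A minimal sketch of the claim in Python (class and function names hypothetical): the function-pointer style passes a raw callable around, while the OOP style routes the same dispatch through the method table, so no explicit pointer is ever passed.

```python
# function-pointer style: the caller passes a raw callable
def render_all_fp(shapes, draw_fn):
    return [draw_fn(s) for s in shapes]

# OOP style: polymorphism does the dispatch; no callable is passed around
class Shape:
    def draw(self):
        raise NotImplementedError

class Circle(Shape):
    def draw(self):
        return "circle"

class Triangle(Shape):
    def draw(self):
        return "triangle"

def render_all(shapes):
    return [s.draw() for s in shapes]

print(render_all([Circle(), Triangle()]))  # ['circle', 'triangle']
```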

prototype vs. class-based

single vs. multiple inheritance

inheritance vs. composition

functions of OOP:

"OOP to me means only messaging, local retention and protection and hiding of state-process, and extreme late-binding of all things. It can be done in Smalltalk and in LISP. There are possibly other systems in which this is possible, but I’m not aware of them." -- Alan Kay (E-mail to Stefan Ram, 2003,

more on how Kay defines OOP: " Which didn’t even require inheritance, which is not like we know it today:

    I’m sorry that I long ago coined the term “objects” for this topic because it gets many people to focus on the lesser idea.

Where the big missing piece lacking in mainstream typed OO languages today is:

    The big idea is “messaging”

He advocates focus should instead be on messaging and the loose-coupling and interactions of modules rather than their internal object composition:

    The key in making great and growable systems is much more to design how its modules communicate rather than what their internal properties and behaviors should be." --

more Kay:

The concepts of instances of classes, of objects with properties, and of different types of things which act autonomously according to their nature, go back at least to Plato and Aristotle. See An Aristotelian Understanding of Object-Oriented Programming, Object-Oriented Programming and Objectivist Epistemology: Parallels and Implications.

"This approach recognises that programming is neither the breaking down of calculations (imperative approach) nor the definition of functions (functional approach) but the definition and manipulation of domain objects (like numbers, strings, arrays, files, graphs)." -- Sergi Valverde and Ricard Solé, Punctuated equilibrium in the large scale evolution of programming languages, SFI working paper 2014-09-030

" SIMULA contains the core of the concepts now available in mainstream object-oriented languages such as C++, Eiffel, Java, and C#:

One exception to the broad adoption of SIMULA concepts is the notion of an active object with its own action sequence, which, strangely enough, has not been adapted to other languages. For Dahl & Nygaard, having active objects was an essential facility to be able to simulate concurrent processes from the real world. " -- [1]

encapsulation vs. generic data structures: "More generic data types generally have well-known and useful functions associated with them. We could enumerate the keys or values on a map, filter an array by a function, or reduce across it. Alan Perlis [allegedly] said “It is better to have 100 functions operate on one data structure than 10 functions on 10 data structures,” and I tend to agree. One benefit of using a more generic data structure instead of hiding that data behind a class is that it removes an extra step in applying well-known functions to that data." [2]
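A small Python illustration of the quoted point (the inventory example is hypothetical): with a plain dict, well-known generic functions apply directly, whereas the class needs a bespoke method for each operation.

```python
# encapsulated version: each operation needs its own method
class Inventory:
    def __init__(self):
        self._counts = {}
    def add(self, item, n):
        self._counts[item] = self._counts.get(item, 0) + n
    def total(self):
        return sum(self._counts.values())

# generic version: well-known functions work on the bare dict
inventory = {"apple": 3, "pear": 2}
total = sum(inventory.values())                       # reduce over values
big = {k: v for k, v in inventory.items() if v >= 3}  # filter by a predicate
print(total, big)  # 5 {'apple': 3}
```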

some constructs substituting for OOP functionality in non-OOP languages:

an argument against encapsulation:

" Objects bind functions and data structures together in indivisible units. I think this is a fundamental error since functions and data structures belong in totally different worlds. Why is this?

    Data structures just are. They don’t do anything. They are intrinsically declarative. “Understanding” a data structure is a lot easier than “understanding” a function.
    Functions are understood as black boxes that transform inputs to outputs. If I understand the input and the output then I have understood the function. This does not mean to say that I could have written the function.
    Functions are usually “understood” by observing that they are the things in a computational system whose job is to transfer data structure of type T1 into data structure of type T2.
    Since functions and data structures are completely different types of animal it is fundamentally incorrect to lock them up in the same cage." -- [3]

another one:

" And encapsulation is usually not what you want either. Usually the problem is you're writing code in one part of the program, and suddenly realize it needs data from a different part of the program. If you made your data structures separate from your code, you can just go get the data you need. Putting it behind an "encapsulation" wall and making you go through object methods usually just results in unneeded complexity in the form of object instantiations and calling getter/setter functions and so forth. Now I know, people will argue about encapsulation in distributed programs, but process boundaries and machine boundaries impose encapsulation on their own and usually in places where it makes sense. Where it doesn't make sense is within the same program, hiding the data of one part of the program from other parts of the same program. " -- [4]

some arguments against OOP:

Imperative programming

based on Turing machine

Functional programming

based on expression evaluation

referential transparency

  don't use variable assignment
  discipline/ban on (most) side-effects
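A minimal Python sketch of the contrast (the `next_id` example is hypothetical): the impure version depends on hidden mutable state, so two identical calls give different results; the referentially transparent version threads the state explicitly and always gives the same output for the same input.

```python
# impure: depends on hidden mutable state
counter = 0
def next_id():
    global counter
    counter += 1
    return counter

# referentially transparent: the state is an explicit input
def next_id_pure(counter):
    return counter + 1

assert next_id_pure(0) == next_id_pure(0)  # same input, same output
assert next_id() != next_id()              # hidden state: results differ
```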

higher-order functions
function composition
maps

tail call optimization

partial function application


anonymous functions

based on lambda calculus


Logic programming

Dataflow programming

The word 'dataflow' means at least four things.

In the context of computer hardware, "dataflow" often means a "dataflow architecture", where "When all of the tagged operands of an instruction become available (that is, output from previous instructions and/or user input), the instruction is marked as ready for execution by an execution unit" [5].

In software, "dataflow" can be used to mean "the idea that changing the value of a variable should automatically force recalculation of the values of variables which depend on its value", like a spreadsheet [6]. This idea is also called reactive programming.
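A toy Python sketch of this spreadsheet idea (the `Cell` class is hypothetical): assigning to a cell automatically recomputes the cells that depend on it.

```python
class Cell:
    """spreadsheet-like cell: dependents recompute when it changes"""
    def __init__(self, value=None, formula=None, inputs=()):
        self.formula, self.inputs, self.dependents = formula, inputs, []
        for c in inputs:
            c.dependents.append(self)
        self._value = value if formula is None else self._compute()

    def _compute(self):
        return self.formula(*[c.value for c in self.inputs])

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, v):
        self._value = v
        for d in self.dependents:
            d.recompute()

    def recompute(self):
        self._value = self._compute()
        for d in self.dependents:
            d.recompute()

a, b = Cell(1), Cell(2)
total = Cell(formula=lambda x, y: x + y, inputs=(a, b))
a.value = 10        # assignment forces recalculation downstream
print(total.value)  # 12
```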

So, dataflow, the hardware architecture, seems to be about how to sequence a set of steps in a computation that may be run only once, whereas dataflow, in software, seems to be about how to implicitly maintain invariants relating multiple pieces of state, which presumably may lead to the need to redo a calculation many times.

In the context of programming language families, the term "dataflow programming" "is a programming paradigm that models a program as a directed graph of the data flowing between operations" [7].

In the context of compiler implementation, data-flow analysis refers to analysis of the propagation of data values through parts of a computer program.


Active data


"Here's the gist: in most mainstream languages, you describe how to solve a particular problem; in declarative languages, you merely describe the result you want, and the language itself figures out how to get there." --

examples: prolog, SQL

disadvantages: if the way that the language figures out how to solve the problem is inefficient, you end up having to learn what it is doing and optimize it anyway; in that case the resulting program is even harder to work with than an imperative one, because of the additional layer of abstraction.

some question the usage of the term 'declarative':


advantages: less syntax, opportunities for metaprogramming, close to the metal

disadvantages: programmer has to keep the contents of the stack in mind at all times. Named variables (and keyword arguments) can be seen as an invention to avoid having to do this.


SK combinators Y combinator


??? what family is Nock in? is it combinatory? is it its own thing?

Links about paradigms

systems programming

types of type systems

dependent typing: can capture properties of values, e.g. you can say 'a list with 3 elements' or 'a list with at least one element' or 'a number less than 255'. fully generalized dependent typing can capture arbitrary computable properties of values. Inference with fully generalized dependent typing is intractable.
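Python cannot express dependent types, but a runtime analogue may clarify what such a type would capture (the `checked_vector` helper is hypothetical): a dependent type system would reject the bad call at compile time, rather than at run time.

```python
def checked_vector(n):
    """runtime stand-in for the dependent type 'list of exactly n elements'"""
    def check(xs):
        if len(xs) != n:
            raise TypeError(f"expected a list of length {n}")
        return xs
    return check

vec3 = checked_vector(3)
vec3([1, 2, 3])   # ok
# vec3([1, 2])    # raises TypeError here; a dependent type system
#                 # would reject this call at compile time
```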

slide 12

classify type systems into dimensions static/dynamic, weak/strong:

dynamic, weak: assembly
dynamic, moderate: js, ruby
dynamic, strong: python, clojure
dynamic-ish but somewhat static, moderate: typescript, dart
static-ish but somewhat dynamic, moderate: java, C#
static-ish but somewhat dynamic, strong: scala
static, weak: C
static, strong: OCaml, Haskell

slide 13

classify type systems into dimensions coarse/detailed, weak/strong:

coarse, weak: C
coarse, strong ("my way or the highway"): Go, Java 4, Pascal
detailed, weak ("cutting corners"): eiffel, typescript, dart
detailed, strong ("type it to the max"): scala, F#, haskell, ocaml
moderate, moderate: C#, java 5+

My Way or The Highway (coarse, strong):

" Simple type systems No generics Not that extensible by users

Type it to the Max: " Rich* language to write types Type combination forms, including generics. Type systems often inspired by logic.

Where dynamic languages had the upper hand: – No type-imposed limits to expressiveness  Rich type system + escape hatches such as casts – No boilerplate  Type Inference – Easier for exploration  Bottom type Nothing, ???

Other Strengths of Dynamic • Simpler languages − Rich types add complexity • Fewer puzzling compiler errors

Cutting corners: Appeal to user’s intuitions (covariant generics). – Employee are Persons – So functions from Employees to Employers are also functions from Persons to Employers, right? • Embrace unsoundness. • Easy, and superficially simple. • But, fundamentally, a hack. • Can we build great designs on false theories?

Precision Soundness Simplicity Take Any Two?


hmm what "take any two" would i say? maybe:

permissive of power (e.g. dynamic languages permit generics, even though their type system doesn't 'understand' this concept)
simple/easy to learn and use
precise (e.g. you can have generics in dynamic languages, but they aren't typechecked; it's more precise to have a language whose type system knows about generics specifically)

i guess that's the same as what he said; he's replacing my 'precise' with 'sound', and my 'precision' with 'permissive of power'


"What real functional programmers do is “multi-paradigm”– mostly functional, but with imperative techniques used when appropriate. What the debate comes down to is the question of what should be the primary, default “building block” of a program. To a functional programmer, it’s a referentially-transparent (i.e. returning the same output every time per input, like a mathematical function) function. In imperative programming, it’s a stateful action. In object-oriented programming, it’s an object, a more general construct that might be a referentially-transparent function, might represent an action in a hand-rolled domain-specific language (DSL) or might be something else entirely. " --

" In general, functional programming is right. The functional approach is not right for every problem, but there is a right answer regarding the default abstractions for building most high-level programs: immutable data, and referentially transparent functions should be the default, except in special cases where something else is clearly more appropriate. Why? A computational action is, without more knowledge, impossible to test or reason about, because one cannot control the environment in which it exists. One needs to be able to know (and usually, to control) the environment in which the action occurs in order to know if it’s being done right. Usually, the tester wants to be able to cover all special cases of this action, in which case knowing the environment isn’t enough; controlling it is also necessary. But at this point, the state of the environment in which the test happens is an implicit parameter to the action, and making it explicit would make it a referentially-transparent function. In many cases, it’s better to do so– when possible. It might not be. For example, the environment (e.g. the state in a computer cluster database) might be too large, complex, or volatile to explicitly thread into a function. Much of real-world functional programming is not about eliminating all state, but about managing what state is intrinsic, necessary or appropriate. " --

should i maybe have stereotypes of different types of programmers, or at least, dimensions along which they stereotypically vary?

the person who likes fancy IDEs vs fancy languages?

the person who likes new languages vs old?

the person who likes unix-y plaintext vs binary formats?

the person who likes static typing vs dynamic typing?

the person who likes extensive tests vs not?

" The answer is that we tend to insist on good design, to such a degree that we avoid taking jobs where we’re at risk of having to deal with bad designs. This isn’t a vice on our part; it’s a learned necessity not to waste one’s time or risk one’s career trying to “fix” hopeless systems or collapsing companies. Generally, we’ve come to know the signs of necrosis. We like the JVM languages Clojure and Scala, and we might use Java-the-language when needed, but we hate “Java shops” (i.e. Java-the-culture) with a passion, because we know that they generate intractable legacy messes in spite of their best efforts. Say “POJO” or “Visitor pattern”, and you’ve lost us. This seems like arrogance, but for a person with a long-term view, it’s necessary. There are a million “new new things” being developed at any given time, and 995,000 of them are reincarnations of failed old things that are going to crash and burn. If I could characterize the mindset of a functional programmer, it’s that we’re conservative. We don’t trust methodologies or “design patterns” or all-purpose frameworks promising to save the world, we don’t believe in “silver bullets” because we know that software is intrinsically difficult, and we generally don’t believe that the shiniest IDE provides us enough to compensate for its shortfalls and false comforts. For example, an IDE is useless for resolving a production crisis occurring on a server 3,000 miles away. We’d rather use vim or emacs, and the command line, because we know they work pretty much everywhere, and because they give us enough power in editing to be productive." --



Sergi Valverde and Ricard Solé, [ Punctuated equilibrium in the large scale evolution of programming languages], SFI working paper 2014-09-030

The authors start with a dataset of which programming languages are said to have been influenced by which, and from this construct a network model of programming language evolution which matches various statistics from the data. However, actual programming language evolution is found to be more 'bursty' than the model, which is taken as evidence for a quality of 'punctuated equilibrium' in programming language evolution.

Supplement 1 contains lists of language families and some descriptions of some of them:

Figure 1 lists 20 connected components (although they say there are only 18). Table 1 summarizes them, in order from smallest number of languages in the cluster to largest. I will reproduce part of Table 1, except instead of giving the earliest language (the root) as the example of each component, i'll give what i believe to be the most well-known one:

They then describe some of these clusters ("language families"). Here, i'll just note the subfamilies and languages they mention (again, sometimes renaming families from their earliest member to a more well-known member):

They give cladograms for many of these.

In Figure 1, the graph, the two largest clusters have multiple subclusters with a major language in larger type. I don't know if the larger font size just means that the authors subjectively decided to emphasize these, or if it means something objective. These two clusters and their larger-type children are:



In their dynamic "backbone explorer" at , they give another list of families:

They assert (presumably from other sources) that "There are four major PLs categories: imperative, functional, logic and object-oriented (see below)."; however, they note that their method does not produce a separate cluster for OOP. They say that "The concept of programming paradigm was first introduced by Robert W. Floyd during his Turing Award lecture in 1978....In his lecture `The Paradigms of Programming', Floyd urged software practitioners and teachers of programming to identify what styles (or schools) of programming are they using (the so-called programming paradigms)....Unfortunately, there has been no systematic attempt to classify PLs according to programming paradigms."

They note that "Compatibility, flexibility, memory requirements and third-party support are key factors of PL success and main drivers of technological innovation".

Compare to, which has families:

todo: cover 'term rewriting'

eg Nock. 'Computation via macro substitution' (combinators)

term rewriting vs. lambda calculus: :

" How does term-rewriting based evaluation work?

The Pure programming language is apparently based on term rewriting, instead of the lambda-calculus that traditionally underlies similar-looking languages. ...

The matching of patterns, and substitution into output expressions, superficially looks a bit like syntax-rules to me (or even the humble #define), but the main feature of that is obviously that it happens before rather than during evaluation, whereas Pure is fully dynamic and there is no obvious phase separation in its evaluation system (and in fact otherwise Lisp macro systems have always made a big noise about how they are not different from function application). Being able to manipulate symbolic expression values is cool'n'all, but also seems like an artifact of the dynamic type system rather than something core to the evaluation strategy (pretty sure you could overload operators in Scheme to work on symbolic values; in fact you can even do it in C++ with expression templates).

So what is the mechanical/operational difference between term rewriting (as used by Pure) and traditional function application, as the underlying model of evaluation, when substitution happens in both?

1 Answer

Term rewriting doesn't have to look anything like function application, but languages like Pure emphasise this style because a) beta-reduction is simple to define as a rewrite rule and b) functional programming is a well-understood paradigm.

A counter-example would be a blackboard or tuple-space paradigm, which term-rewriting is also well-suited for.

One practical difference between beta-reduction and full term-rewriting is that rewrite rules can operate on the definition of an expression, rather than just its value. This includes pattern-matching on reducible expressions:

-- Functional style
map f nil = nil
map f (cons x xs) = cons (f x) (map f xs)

-- Compose f and g before mapping, to prevent traversing xs twice
result = map (compose f g) xs

-- Term-rewriting style: spot double-maps before they're reduced
map f (map g xs) = map (compose f g) xs
map f nil = nil
map f (cons x xs) = cons (f x) (map f xs)

-- All double maps are now automatically fused
result = map f (map g xs)

Notice that we can do this with LISP macros (or C++ templates), since they are a term-rewriting system, but this style blurs LISP's crisp distinction between macros and functions.

CPP's #define isn't equivalent, since it's not safe or hygienic (syntactically-valid programs can become invalid after pre-processing).


Another practical consideration is that rewrite rules must be confluent if we want deterministic results, ie. we get the same result regardless of which order we apply the rules in. No algorithm can check this for us (it's undecidable in general) and the search space is far too large for individual tests to tell us much. Instead we must convince ourselves that our system is confluent by some formal or informal proof; one way would be to follow systems which are already known to be confluent.

For example, beta-reduction is known to be confluent (via the Church-Rosser Theorem), so if we write all of our rules in the style of beta-reductions then we can be confident that our rules are confluent. Of course, that's exactly what functional programming languages do!
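The map-fusion rule quoted above can be sketched as a term-rewriting pass in Python (the encoding of terms as nested tuples is hypothetical): the rewriter operates on the definition of the expression, before evaluation, fusing the two maps into one traversal.

```python
# terms are nested tuples: ('map', f, xs)
def compose(f, g):
    return lambda x: f(g(x))

def rewrite(term):
    # rule: map f (map g xs)  ->  map (compose f g) xs
    if (isinstance(term, tuple) and term[0] == 'map'
            and isinstance(term[2], tuple) and term[2][0] == 'map'):
        f, (_, g, xs) = term[1], term[2]
        return rewrite(('map', compose(f, g), xs))
    return term

def evaluate(term):
    if isinstance(term, tuple) and term[0] == 'map':
        _, f, xs = term
        return [f(x) for x in evaluate(xs)]
    return term

prog = ('map', lambda x: x + 1, ('map', lambda x: x * 2, [1, 2, 3]))
fused = rewrite(prog)             # a single map over the original list
print(evaluate(fused))            # [3, 5, 7], one traversal instead of two
```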


btw the first few pages to this paper describe Aspect Oriented Programming more clearly than most:




one example of when dynamism/late binding is useful:

" Bill Venners: In Ruby, I can add methods and variables to objects at runtime. I have certainly added methods and variables to classes, which become objects at runtime, in other languages. But in Java, for example, once a class is loaded or an object is instantiated, its interface stays the same. Allowing the interface to change at runtime seems a bit scary to me. I’m curious how I would use that feature in Ruby. What’s the benefit of being able to add methods at runtime?

    Yukihiro Matsumoto: One example is a proxy. Instead of designing individual proxy classes for each particular class, in Ruby you can create an all purpose proxy that can wrap any object. The proxy can probe the object inside of it and just morph into the proxy for that object. The proxy can add methods to itself so it has the same interface as the wrapped object, and each of those methods can delegate to the corresponding method in the wrapped object. So an all-purpose proxy, which can be used to wrap any object, is an example of how a library class can adapt to the environment. " -- [9]
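A rough Python analogue of Matsumoto's all-purpose proxy (Python's `__getattr__` substitutes for Ruby's runtime method definition; the call-logging behavior is a hypothetical illustration):

```python
class Proxy:
    """all-purpose proxy: delegates to whatever object it wraps"""
    def __init__(self, wrapped):
        self._wrapped = wrapped
        self.calls = []

    def __getattr__(self, name):
        # invoked only when normal lookup fails, so every method of the
        # wrapped object is delegated, with each call logged
        attr = getattr(self._wrapped, name)
        if callable(attr):
            def logged(*args, **kwargs):
                self.calls.append(name)
                return attr(*args, **kwargs)
            return logged
        return attr

p = Proxy([3, 1, 2])   # wraps any object, no per-class proxy needed
p.sort()
print(p._wrapped, p.calls)  # [1, 2, 3] ['sort']
```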





todo: this blog post clearly separates the three senses of 'inheritance' and also touches upon two other things that this book should mention: the 'square is not a subclass of rectangle' paradox of OOP, and the diamond problem of OOP inheritance:

" Why inheritance never made any sense Graham / March 16, 2018 / OOP

There are three different types of inheritance going on.

    Ontological inheritance is about specialisation: this thing is a specific variety of that thing (a football is a sphere and it has this radius)
    Abstract data type inheritance is about substitution: this thing behaves in all the ways that thing does and has this behaviour (this is the Liskov substitution principle)
    Implementation inheritance is about code sharing: this thing takes some of the properties of that thing and overrides or augments them in this way. The inheritance in my post On Inheritance is this type and only this type of inheritance.

These are three different, and frequently irreconcilable, relationships. Requiring any, or even all, of them, presents no difficulty. However, requiring one mechanism support any two or more of them is asking for trouble.

A common counterexample to OO inheritance is the relationship between a square and a rectangle. Geometrically, a square is a specialisation of a rectangle: every square is a rectangle, not every rectangle is a square. For all s in Squares, s is a Rectangle and width of s is equal to height of s. As a type, this relationship is reversed: you can use a rectangle everywhere you can use a square (by having a rectangle with the same width and height), but you cannot use a square everywhere you can use a rectangle (for example, you can’t give it a different width and height).

Notice that this is incompatibility between the inheritance directions of the geometric properties and the abstract data type properties of squares and rectangles; two dimensions which are completely unrelated to each other and indeed to any form of software implementation. We have so far said nothing about implementation inheritance, so haven’t even considered writing software.

Smalltalk and many later languages use single inheritance for implementation inheritance, because multiple inheritance is incompatible with the goal of implementation inheritance due to the diamond problem (traits provide a reliable way for the incompatibility to manifest, and leave resolution as an exercise to the reader). On the other hand, single inheritance is incompatible with ontological inheritance, as a square is both a rectangle and an equilateral polygon. " -- [10]
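The square/rectangle incompatibility can be seen directly in code; a minimal Python sketch (class and method names hypothetical): `Square` must break `Rectangle`'s behavioral contract to preserve its geometric invariant, so the subtype direction fails even though the ontological direction holds.

```python
class Rectangle:
    def __init__(self, w, h):
        self.w, self.h = w, h
    def set_width(self, w):
        self.w = w
    def area(self):
        return self.w * self.h

class Square(Rectangle):
    def __init__(self, side):
        super().__init__(side, side)
    def set_width(self, w):
        # must keep the square invariant, breaking Rectangle's contract
        # that only the width changes
        self.w = self.h = w

def stretch(rect):
    # valid for any Rectangle per its contract: height is untouched
    rect.set_width(10)
    return rect.area()

print(stretch(Rectangle(2, 3)))  # 30
print(stretch(Square(3)))        # 100, not 30: substitution fails
```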