Towards a new linguistics

There is a quiet revolution taking place in the “Languages” section of curriculum vitae: the traditional list of languages consisting of “English”, “Spanish” or “French” are now competing with esoteric and exotic sounding languages. Python, Java, C, Javascript and a host of other programming languages have become necessary keywords in a competitive job market, with their knowledge becoming paramount to be recruited for the more interesting jobs. It is tempting to relate programming languages to “natural” languages, and it is true that they share many common features. Indeed, just as natural languages are grouped under different families, programming languages can also be grouped under various languages families. Furthermore, just as the knowledge of one natural language helps you in the acquisition of a newer natural language, acquiring a new programming language becomes faster once you have acquired a certain number of them.

The goal of this blog is to explore and explain, to those who might program but always wondered how things worked under the hood, the linguistics of these new languages, what makes them diverse and different. Beyond the intellectual curiosity, this understanding is in my opinion essential for everyone who will need to write programs more complex than scripts of a dozen lines of code, because it allows to make informed choices based on more than just the latest trend.

Preamble

It is possible to write a whole book on programming language theory without ever going into the details of their final utility: helping people produce programs to be executed on a computer. To begin with and to understand what will follow, it is crucial to fix a few ideas about programs and computers.

Indeed, if we visualize a computer as a desktop full of icons, that graphical interface is only the tip of the iceberg. We will be interested here in what happens under the hood: when you double-click on a desktop icon, it triggers the execution of a computer program. Concretely, this program is a sequence of instructions written beforehand by a programmer and these instructions are then executed by the processor, the brain of the computer. The processor performs billions of isolated computations on numbers, the results of which can have many consequences: printing something on the screen, producing a sound using the speakers, changing the contents of memory, etc.

In this blog, we will focus on the way the programmer uses to specify instructions to the processors. This is where programming languages kick in since they enable the programmer to write computer programs.

Different levels of abstraction

Let us start our study by a programming language that is not often under the spotlight: assembly. Assembly (or one of its forms) is the oldest programming language, as old as computers themselves. Indeed, the encoded form of assembly, machine code, is the only thing a computer can understand. To understand what assembly is, it is useful to use a culinary metaphor. Let us compare a computer program to a cooking recipe. The ingredients are the data and the utensils (pans, spatulas, knives) are the memory cells. Here is a roux recipe that a very tedious chef (the processor) can understand:

Ingredients:
A - 1 cup of butter
B - 1 cup of flour
Utensils:
1 - knive
2 - pan
3 - burner
4 - spatula
Recipe:
* Cut A in pieces with 1.
* Put A in 2 over 3.
* Heat 3 to thermostat 6.
* Wait 1 min 30 s.
* Add B.
* Mix with 4.
* Heat 3 to thermostat 4.
* Wait 4 min.

This level of detail where everything is explicit corresponds to what happens inside the processor. Everything has to be specified, that is why assembly (and thus machine code) is a “low level” language. This term is used in contrast to other programming languages, which are called “high level”. But what does high level looks like? To have an idea, let’s go back to our kitchen metaphor and look at a advanced cooking book, Le répertoire de la cuisine by Gringoire and Saulnier. The roux’s recipe is described as follows:

Clarified butter mixed with sifted flour, moderately heated until the mixture reaches a slightly blond color.

We notice that in this “high level” recipe, quantities and utensils are implicit; however the recipe remains understandable and describes exactly the same operations as the low-level recipe. The chef, when cooking, performs an implicit translation of the high level recipe steps into concrete actions involving the utensils.

A not so literal translation

Computer programs follow a similar logic: if we write a program in a high-level language (other than assembly), we will have to translate it into machine code to execute it. Higher the level of the original programming language, more complex is the translation. However, this complexity is not just a mechanical and excruciating explicitation of the details.

For instance, in the higher level recipe the concept of utensils is absent, even though it is paramount for the concrete execution of the recipe. In a computer these utensils take the shape of memory registers, and these are in limited number. To afford ourselves the luxury of an unlimited number of variables when we write our computer programs, the translation to assembly must put in correspondance an arbitrary high number of variables to a small number of physical memory registers that will hold these variables. This problem known as register allocation is a classic one in the field of compilers, and we’ll surely talk about it later. The way to solve this problem, which traditionally entails a graph coloring, epitomizes what makes the translation between programming languages so interesting: the creating a bridge between two seemingly unrelated concepts, often through the use of powerful mathematical theories.

What about performance?

Let’s get back to our kitchen metaphor. When dealing with a complex recipe, we want to optimize our execution time, for instance by cutting vegetables while the meat is cooking. We can cut down our utensils utilisation by reusing a pan instead of taking another. Two executions of the same recipe can produce the same food output but one can produce twice as more dirty dishes than the other. In the computing world, execution time, size of memory used and number of assembly instructions in the program are all parameters whose minimization we seek.

This quest for performance when translating implies important decisions that impact the programming language’s ergonomics and ease of use. For instance, with our recipes story, we can choose to declare explicitly at what time each of our utensils is to be used for the first time, and at which time it becomes useless. Thanks to this information, it is possible to better optimize the usage of utensils throughout the recipe. Nevertheless, it is painful to write recipes in this manner and for the most part we don’t. It is then the cook’s task to deduce their usage from the recipe. Similarly, a programming language in which memory is dealt with manually will in general have better performance than a programming language where memory management is done implicitly as part of the translation to assembly. But the programmer must still know how to efficiently manage memory for this to be true.

Considering performance as an important part of the equation is essential because it imposes constraints on the design of programming languages; we must find a balance ergonomics and expressiveness. If we cannot have it both ways, the next articles will try to showcase linguistic innovations that are clear victories for programmers, going beyond petty inter-language wars.

Tales of compilation and computational transcriptions

Français/English

Towards a new linguistics

Preamble

Different levels of abstraction

A not so literal translation

What about performance?