A deep dive into the heart of the processor and its language

Now that the previous post has shown you the two ways of running a computer program, it’s time to look under the carpet where I hid some subtleties regarding assembly. I repeatedly described the assembler as the only programming language understood by the computer. In reality, things are a tad more complicated: there are several different assemblers, which share some common properties. As we shall see below, this hardly simplifies the compiling process…

## Dis-assembling a program

In the first article, we illustrated, rather modestly it must be said, the idea of the assembler by a “low level” cooking recipe. Unfortunately, we now have to peer into the belly of the beast and see what the assembler really looks like. The following snippet, extracted from the output of the objdump -d command run on the UNIX program ls, is called the *disassembly* of an assembler program (x86 in this case).

Let’s first get over the initial shock. Now concentrate a little. Good. We notice four columns in this fragment of code. Reading from left to right, we have: the address of the instruction in the executable, the binary encoding of the instruction, the type of instruction, and its arguments. The computer goes over the program one instruction at a time, in the order in which the instructions appear, and executes each of them in turn. Let us take a closer look at the first instruction.

• 40cc2 is the reference to the instruction in the disassembly, i.e. its address within the executable.
• 48 89 44 24 08 is literally the instruction, encoded in binary format (and here displayed in hexadecimal format). The details of this encoding are of little interest; what matters is that it is in this form that the program is read by the computer. The display of the assembler program in textual format is a convenience for the humans who must read it, and is exactly equivalent to the binary form.
• mov is the type of the instruction. Here mov, naturally, moves (in fact, copies) data from one place to another, between registers and memory.
• %rax,0x8(%rsp) are the arguments of the instruction. objdump prints them with the source first and the destination second, so %rax is the source of the move. %rax is actually a register: a small memory cell physically located in the processor, which constitutes the immediate memory of the computer. In 0x8(%rsp), which is the destination, %rsp is another register. This expression means “the location in RAM at the address %rsp + 8”, i.e. the memory address found 8 bytes past the address held in %rsp.
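To make the link between the second column and the last two a little less mysterious, here is a minimal sketch in Python that decodes those five bytes back into text. It only handles this one shape of instruction (a 64-bit prefix, opcode 0x89, and an address made of a base register plus a small offset); the function name decode_store is made up for the occasion, and a real disassembler such as objdump covers vastly more cases.

```python
# Minimal sketch: decodes only the "REX.W prefix + opcode 0x89 + ModRM + SIB +
# 8-bit displacement" form. It is not a general x86 decoder.

REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_store(code: bytes) -> str:
    if len(code) != 5:
        raise ValueError("expected exactly 5 bytes")
    rex, opcode, modrm, sib, disp = code

    # REX prefix: high nibble is 0100; bit 3 (W) selects 64-bit operands.
    if rex >> 4 != 0b0100 or not rex & 0b1000:
        raise ValueError("expected a REX.W prefix")
    if rex & 0b0111:
        raise ValueError("extended registers (r8 to r15) are not handled")

    # Opcode 0x89: copy a register into a register-or-memory operand.
    if opcode != 0x89:
        raise ValueError("only opcode 0x89 is handled")

    # ModRM byte: 2 bits of mode, 3 bits of register, 3 bits of r/m.
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    if mod != 0b01 or rm != 0b100:
        raise ValueError("only the SIB + 8-bit displacement form is handled")

    # SIB byte: 2 bits of scale, 3 bits of index, 3 bits of base.
    index, base = (sib >> 3) & 0b111, sib & 0b111
    if index != 0b100:
        raise ValueError("indexed addressing is not handled")

    # The displacement is a signed 8-bit number.
    offset = disp - 256 if disp >= 128 else disp
    sign = "-" if offset < 0 else ""
    return f"mov %{REG64[reg]},{sign}{hex(abs(offset))}(%{REG64[base]})"


print(decode_store(bytes.fromhex("48 89 44 24 08")))
```

Changing a single byte changes the text: replacing 44 with 4c, for instance, swaps %rax for %rcx.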

This instruction will therefore take the data held in a register and put it in RAM. It is easy to understand that reading or writing assembler by hand causes a feeling of boredom superimposed upon a sense of disgust. But how does the computer understand and execute something like 48 89 44 24 08? A description of the architecture of a processor would be too complicated for this modest blog; let us simply say that the processor goes over the code instruction by instruction and that each bit of each instruction passes through a network of logic gates (built from those famous transistors) magically conceived so that at the output, we indeed have the result of what the instruction was supposed to do.
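For the curious, the "instruction by instruction" part can be imitated in a few lines of Python. This is a toy model and nothing more: the two instruction names (store and load), the tuple format and the run function are all invented here, and a real processor does this with circuits rather than with a loop.

```python
# Toy model of a processor: a few registers, a block of RAM, and a loop that
# executes instructions in order. All names here are hypothetical.

def run(program, registers, ram):
    pc = 0  # program counter: index of the next instruction to execute
    while pc < len(program):
        op, *args = program[pc]  # fetch the instruction and split it up
        if op == "store":        # register -> RAM
            src, base, offset = args
            addr = registers[base] + offset
            ram[addr:addr + 8] = registers[src].to_bytes(8, "little")
        elif op == "load":       # RAM -> register
            dst, base, offset = args
            addr = registers[base] + offset
            registers[dst] = int.from_bytes(ram[addr:addr + 8], "little")
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc += 1                  # move on to the next instruction
    return registers, ram


registers = {"rax": 42, "rbx": 0, "rsp": 16}
ram = bytearray(64)
program = [
    ("store", "rax", "rsp", 8),  # plays the role of: mov %rax,0x8(%rsp)
    ("load", "rbx", "rsp", 8),   # plays the role of: mov 0x8(%rsp),%rbx
]
run(program, registers, ram)
print(registers["rbx"])
```

The value travels from rax to the RAM cell at address 16 + 8 = 24, then back from RAM into rbx.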

## To each her own compiler

Going forward, we must remember that the assembler is intimately connected to the characteristics of the processor itself. However, the types of instructions and their binary encoding vary from processor to processor! There are even different schools of assembler design: RISC and CISC. The former offers only the bare minimum of instructions (but ensures that they are executed quickly), whilst the latter offers countless baroque instructions (but may be slower). The two most common assembly languages are those of the x86 family, introduced by Intel and used mostly by conventional computers, and of the ARM family, designed by the eponymous company and used mostly by mobile phones. By the way, the difference between 32-bit and 64-bit that you may have heard about refers to the basic size of the processor’s registers. Registers are the fundamental unit for manipulating memory in a computer. Their size has changed over time too, with most machines moving from 32 to 64 bits.
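If you are wondering which side of the 32-bit/64-bit divide your own machine sits on, here is a small sketch using only Python's standard library. One caveat: it reports the pointer size of the Python interpreter you are running, which usually (but not necessarily) matches the register size of the processor underneath.

```python
import platform
import struct

# Width of a native pointer in the running Python interpreter, in bits.
# This reflects how the interpreter was built, not the hardware directly.
pointer_bits = struct.calcsize("P") * 8

# Largest unsigned integer that fits in a single register of each size.
max_32 = 2**32 - 1
max_64 = 2**64 - 1

print(f"machine: {platform.machine()}, pointer size: {pointer_bits} bits")
print(f"largest 32-bit value: {max_32}")
print(f"largest 64-bit value: {max_64}")
```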

Different assemblers are not compatible with each other: running an ARM assembler program on an x86 processor will not work. But it gets worse. An executable file contains not only the assembly code of the program, but also information about the initial data and how the memory has to be structured during execution. Alas, there are different formats for executable files too, which depend on the operating system used. So the same program has to be translated into four different executables to run on a Windows machine, a Mac, an iPhone or an Android phone… Twice that if you have to take into account the difference between 32-bit and 64-bit! Each of these different execution environments is called an architecture.
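These executable formats can be told apart by their first few bytes, which act as a signature. Here is a minimal sketch that recognises the three main families: ELF (Linux and Android), PE (Windows) and Mach-O (Mac and iPhone). The function names detect_format and detect_file are invented for this post, and the check is deliberately naive: it looks only at the signature and ignores, for example, the "universal" Mach-O files that bundle several architectures together.

```python
def detect_format(header: bytes) -> str:
    """Guess an executable format from the first bytes of a file."""
    if header.startswith(b"\x7fELF"):
        return "ELF"      # Linux, Android and most other Unix systems
    if header.startswith(b"MZ"):
        return "PE"       # Windows: the file begins with a DOS "MZ" header
    macho_magics = {
        bytes.fromhex("feedface"),  # 32-bit, big-endian byte order
        bytes.fromhex("feedfacf"),  # 64-bit, big-endian byte order
        bytes.fromhex("cefaedfe"),  # 32-bit, little-endian byte order
        bytes.fromhex("cffaedfe"),  # 64-bit, little-endian byte order
    }
    if header[:4] in macho_magics:
        return "Mach-O"   # macOS and iOS
    return "unknown"


def detect_file(path: str) -> str:
    """Read the first four bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return detect_format(f.read(4))


print(detect_format(b"\x7fELF\x02\x01\x01"))
```

Calling detect_file on the path of any program installed on your machine should name the format used by your operating system.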

## One compiler to rule them all

Thus, from a single source program, we must have as many different translations as platforms on which we want to execute the program. But on closer inspection, all these assemblers are not conceptually different from each other: they do not all have the same operators or the same formats, but they all consist of a sequence of instructions executed in order by the processor, manipulating registers and memory addresses along the way.

To avoid writing as many compilers as there are target assemblers, we use a fundamental and powerful idea in compilation: the creation of an intermediate language. In this case, this intermediate language will resemble an assembler: it will be a sequence of instructions manipulating registers and memory. Nevertheless, the instructions present in this intermediate language will be chosen in such a way that each of them can easily be translated into any of the “real” assemblers. From there on, the compilation becomes a translation in two stages: first from the source language to the intermediate language, then from the intermediate language to a particular assembler. To use a linguistic metaphor, suppose we want to interpret spoken English into standard French, Quebec French, Belgian French and Swiss French. Rather than hiring four different translators to translate English into each of these French dialects, we choose standard French as an intermediate language, hire a single translator from English to French and recruit locals - Quebecers, Belgians and Swiss - who will then have the much easier task of translating standard French into their local dialects.
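Here is what the second stage might look like in miniature. The intermediate language below is entirely made up for this post (two operations, store and load, acting on virtual registers v0 and v1), and so are the register mappings and function names; only the output lines follow the real x86-64 and 64-bit ARM notations. The first stage, from source language to intermediate language, is left out.

```python
# Hypothetical intermediate language: each instruction is a tuple
# (operation, virtual register, base register, byte offset).
IR_PROGRAM = [
    ("store", "v0", "sp", 8),  # write v0 to memory at sp + 8
    ("load", "v1", "sp", 8),   # read memory at sp + 8 into v1
]

# Each back end maps virtual registers onto real ones (arbitrary choices).
X86_REGS = {"v0": "%rax", "v1": "%rbx", "sp": "%rsp"}
ARM_REGS = {"v0": "x0", "v1": "x1", "sp": "sp"}


def to_x86_64(ir):
    """Translate the intermediate language to x86-64, as objdump prints it."""
    out = []
    for op, reg, base, offset in ir:
        mem = f"{hex(offset)}({X86_REGS[base]})"
        if op == "store":
            out.append(f"mov {X86_REGS[reg]},{mem}")
        elif op == "load":
            out.append(f"mov {mem},{X86_REGS[reg]}")
        else:
            raise ValueError(f"unknown operation: {op}")
    return out


def to_arm64(ir):
    """Translate the intermediate language to 64-bit ARM assembly."""
    mnemonics = {"store": "str", "load": "ldr"}
    out = []
    for op, reg, base, offset in ir:
        if op not in mnemonics:
            raise ValueError(f"unknown operation: {op}")
        out.append(f"{mnemonics[op]} {ARM_REGS[reg]}, [{ARM_REGS[base]}, #{offset}]")
    return out


for line in to_x86_64(IR_PROGRAM) + to_arm64(IR_PROGRAM):
    print(line)
```

Each back end is only a handful of lines because the intermediate instructions were chosen to map almost one-to-one onto real ones; supporting another processor would mean writing one more small function, without touching the rest.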

Using an intermediate language, we divide the problem into pieces of manageable complexity. In turn, it becomes easier to add a small translator if ever a new architecture is launched with its own assembly format. This is a pretty standard process, present in modern compiler suites like LLVM. We see an example here of how to add a small translator to the compiler should we ever require it.

Beyond these mini-translations, it is an interesting exercise to think about what capabilities to provide in this intermediate language. This choice is crucial, since it determines the difficulty of the translations to and from the intermediate language. Thus, in computer science, the translator must become a linguist and create her own language!

• Use objdump -d (on Linux or Unix, unavailable on Windows) on various executables and figure out what the instructions mean (for Intel processors) using this