Introduction to Assembly Language

In the first part of this course we are going to introduce the rudiments of assembly language, focusing on basic programming concepts and on the difference between compiling and assembling.

What is Assembly Language?

Assembly language is a low-level programming language for a computer. Low-level means that its instructions correspond closely to the operations of the machine itself, so the computer can execute them with very little translation. Using an assembler, assembly language is converted into machine language, the lowest-level language of all.

Why study Assembly?

Anyone might ask: “why study assembly when there is Python, Java, C++ or [insert other language]?”. The answer is short: assembly gives the developer direct, unfiltered access to the machine’s resources. Unlike languages that are compiled (C++) or interpreted (Python, Java), assembly does not need to be “pre-processed” in any comparable way: the assembler translates it almost directly into machine code.

Let us now introduce the concepts of a low-level language and a high-level language. Given a problem, the developer wants to build a program that performs some particular computation. To structure the code, they start from a specification in natural language (i.e. a language understandable by people) and begin to write: variables, functions and more. How close this code stays to natural language depends on the language used.

A high-level language is one that is very close to human reasoning and provides a significant level of abstraction from the details of how a computer works; this means that a developer does not need to know how a particular part of the architecture works when writing code. Three of the major languages defined as high-level are C, C++, and Java.

The relative simplicity of these three programming languages makes them a common choice for teaching people how to start programming. C, C++ and Java do not require any special prior knowledge and their constructs are easy to understand, since they are close to human thinking.

For example, let’s analyze the following code:

/* Adds two integers and returns their sum. */
int sum_two_numbers(int a, int b) {
    int sum;
    sum = a + b;
    return sum;
}

We declare a variable sum and store in it the result of adding a and b. As developers we do not know how, in practice, the program handles the two variables a and b, let alone where it stores the three of them. As you can see, it is the language itself that provides very simple abstractions for implementing the sum: int sum to create a variable, the + sign to add, and the ability to return a value, without ever worrying about saving, deleting or writing to memory.

It is a different matter for low-level languages, which are closer to the way a computer works and operate directly on its resources. To use them, the programmer must know the hardware structure of the computer and the operation and architecture of the processor, in particular memory addresses and CPU registers. In the next lessons we will therefore go deeper into how a processor works, in order to appreciate the great advantage of assembly: giving “raw” instructions to the CPU.
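
To get a first taste of that “raw” access, here is a minimal sketch of the same sum written with GCC extended inline assembly. It assumes GCC or Clang on an x86-64 machine, and the function name sum_two_numbers_asm is made up for this example; the point is only that the addl instruction handed to the CPU is now visible in the source.

/* A sketch only: assumes GCC/Clang on x86-64. */
#include <stdio.h>

static int sum_two_numbers_asm(int a, int b) {
    int sum;
    __asm__ ("addl %2, %0"          /* add the second input to the register holding a   */
             : "=r" (sum)           /* output: sum, kept in a general-purpose register  */
             : "0" (a), "r" (b));   /* inputs: a shares the output register, b another  */
    return sum;
}

int main(void) {
    printf("%d\n", sum_two_numbers_asm(2, 3));   /* prints 5 */
    return 0;
}

Compare this with the C version above: the computation is identical, but here we had to state explicitly which registers hold the operands, something the high-level language normally hides from us.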

Comparing the capabilities of the Assembly language with those of a high-level programming language, the following advantages/disadvantages can be identified:

Pros:

  • It is propaedeutic: it helps you understand how a computer really works;
  • It helps you write better code in higher-level languages (e.g., C), because you understand how that code is ultimately executed;
  • It is the ultimate low-level programming language because it allows the widest access to all the resources of the computer;
  • Because of the previous points, it allows you to perform fine-grained performance optimizations;

Cons:

  • Writing an Assembly program is very complex and requires non-trivial knowledge;
  • Each architecture has a specific instruction set, so assembly code is not portable to different platforms;
  • Increased length and reduced code readability;

Compiling vs Assembling

An ordinary person might ask: if both high-level and low-level languages exist, which one does the computer actually understand? To answer this question we first need to take a step back.

As is well known, a computer is very useful for solving problems. A problem is solved by an algorithm, i.e. a series of elementary instructions whose execution produces the desired result. Examples of algorithms range from finding the first 10 prime numbers to sorting a set of numbers. The algorithm is described in a programming language, an artificial language designed to express instructions in a format the computer can process.

Because of the electronic nature of the executor, a program can only be described as a sequence of electrical signals, 0s and 1s, that are physically interpreted by the circuitry. The code must therefore be transformed into machine language (i.e., a sequence of 0s and 1s). This transformation is carried out by external programs called “compilers” and “interpreters” (depending on the language used, one of them or a combination of both comes into play).

An interpreter is in charge of evaluating the program: it follows the program’s execution flow and, command by command, translates it into machine language and executes it on the fly. What the interpreter returns is the result of the program’s execution. Interpreters are used, for example, by languages like Python, Ruby, Perl and PHP (which are called interpreted languages for this reason).

A compiler, on the other hand, creates object code (a binary) from the source language. Execution of the resulting program is faster because the translation phase has already taken place. Compilers are used for languages such as C and C++.

To compensate for the weaknesses of the two approaches, there is the so-called just-in-time (JIT) compiler. This particular type of compiler, sometimes also called a compreter (from compiler and interpreter), translates the program code like an interpreter does, i.e. only during execution. In this way, high execution speed (thanks to compilation) is combined with a simplified development process.

One of the best known examples of a language based on just-in-time compilation is Java: as a component of the Java Runtime Environment (JRE), the JIT compiler improves the performance of Java applications by converting the previously generated bytecode into machine language at runtime.

Let’s see in detail how a program is built starting from a source written in C language and its actual execution.

Compiling and executing a program

The transition from source code to program execution goes through three steps: compilation, linking, and loading/execution.

During the compilation phase, the code is analyzed and, for each instruction, a portion of machine language that implements it is generated. Instructions involving data declarations and allocations are also translated appropriately. The output is an object file in which the symbols used in the code (such as the mnemonic labels associated with data) are retained.
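
As a hedged sketch of what that means (the file name math_utils.c and the symbol names below are invented for illustration), a small translation unit like the following would be compiled into an object file that still records counter and sum_two_numbers as named symbols, even though their bodies are already machine code.

/* math_utils.c -- illustrative translation unit.
 * Compiling it on its own (e.g. with "cc -c math_utils.c") would produce
 * an object file that keeps "counter" as a data symbol and
 * "sum_two_numbers" as a code symbol for the linker to refer to later. */
int counter = 0;                       /* data declaration: becomes a data symbol */

int sum_two_numbers(int a, int b) {    /* becomes a code (text) symbol            */
    counter = counter + 1;             /* the access to counter stays symbolic     */
    return a + b;
}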

A compiler performs four main steps:

  • Scanning: the scanner reads the source code one character at a time and keeps track of which line each character belongs to.

  • Lexical analysis: the compiler converts the sequence of characters that appears in the source code into a series of character strings known as tokens, grouped according to specific rules by a program called the lexical analyzer. The lexical analyzer uses a symbol table to store the words of the source code that correspond to the generated tokens (see the sketch after this list).

  • Syntactic analysis: in this step the compiler checks whether the tokens produced during lexical analysis appear in an order allowed by the language. The correct order of a set of keywords, the one able to produce a desired result, is called syntax. The compiler must check the source code to ensure syntactic accuracy.

  • Semantic analysis: this step consists of several intermediate sub-steps. First, the structure of the tokens is checked, together with their order with respect to the grammar of the language. The meaning of the token structure is then interpreted by the parser, which finally generates an intermediate code from which the object code is produced.
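
To make the lexical-analysis step more concrete, here is a minimal, self-contained tokenizer for a toy expression language. It is only a sketch, not code from any real compiler: the token names and the tiny grammar (integers, + and -) are invented for this example.

/* A toy lexical analyzer: turns a stream of characters into a stream of tokens. */
#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_NUMBER, TOK_PLUS, TOK_MINUS, TOK_END } TokenType;

typedef struct {
    TokenType type;
    int       value;   /* only meaningful for TOK_NUMBER */
} Token;

/* Reads one token starting at *src and advances the pointer past it. */
static Token next_token(const char **src) {
    while (isspace((unsigned char)**src)) (*src)++;   /* skip whitespace */

    if (**src == '\0') return (Token){ TOK_END, 0 };
    if (**src == '+')  { (*src)++; return (Token){ TOK_PLUS, 0 }; }
    if (**src == '-')  { (*src)++; return (Token){ TOK_MINUS, 0 }; }

    if (isdigit((unsigned char)**src)) {
        int value = 0;                     /* a run of digits becomes one NUMBER token */
        while (isdigit((unsigned char)**src)) {
            value = value * 10 + (**src - '0');
            (*src)++;
        }
        return (Token){ TOK_NUMBER, value };
    }

    /* Any other character is a lexical error in this toy language. */
    fprintf(stderr, "unexpected character '%c'\n", **src);
    (*src)++;
    return (Token){ TOK_END, 0 };
}

int main(void) {
    const char *program = "12 + 7 - 3";
    Token t;
    while ((t = next_token(&program)).type != TOK_END)
        printf("token type=%d value=%d\n", t.type, t.value);
    return 0;
}

A real compiler’s parser would then take this stream of tokens and check, during syntactic analysis, that they appear in an order the grammar allows.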

The object code contains the instructions that represent the processor’s actions for the corresponding tokens detected in the program. Finally, the entire code is analyzed to see whether any optimizations are possible; once the optimizations have been performed, the final object code is generated and saved to a file.

The object files (already translated into machine language) are then linked together by a program called a linker, which resolves the references to the external symbols used in the various object files. Typical external symbols are invocations of library functions (for example printf()) that are not defined within the source code. The result is an executable file, i.e. a file that contains, in addition to the code, information about the memory location where the program should be loaded, as well as any symbols not yet “resolved”.
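
As a hedged illustration (the file names helper.c and main.c and the commands in the comments are just examples), consider a program split across two source files: main.c uses two symbols it does not define, helper() from the other file and printf() from the C library, and it is the linker that ties them together into one executable.

/* helper.c -- defines the symbol "helper" */
int helper(int x) {
    return x * 2;
}

/* main.c -- refers to two external symbols, "helper" and "printf".
 * The compiler only records the references; the linker resolves them,
 * e.g. "cc -c main.c helper.c" followed by "cc main.o helper.o -o program". */
#include <stdio.h>

int helper(int x);   /* declaration only: the definition lives in helper.c */

int main(void) {
    printf("%d\n", helper(21));   /* prints 42 once the symbols are resolved */
    return 0;
}

On systems that use dynamic linking, a call such as printf() may in fact remain “unresolved” until load time, which is exactly the situation described below.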

Even though the compiler proper is used only in this first “phase”, the term “compilation” often refers to the entire process of translating a high-level language into machine language.

For execution, the loader, a component of the operating system, loads the program into the memory hierarchy and then passes control of the CPU to the first program instruction (on systems with dynamic libraries, such as Windows, it first invokes the dynamic linker to resolve the missing symbols; more complex mechanisms involve further procedures). From the moment the first instruction is executed, a special CPU register, the program counter, holds a very useful memory address: that of the next machine language instruction to be executed.

So far we have described compilation without mentioning assembly. How does it fit into all of this? And why is there a need for a low-level language when it is so easy to build programs using high-level ones?

The Levels of Execution

In computer science, a great many concepts can be viewed at the physical, electronic, hardware, operating-system, and application levels. Imagine opening your laptop with a magnifying glass: you could watch the electrons flowing through the electronic components or, at another level, observe in real time the instructions executed by the CPU.

We are going to introduce 5 levels of code execution. As we will discuss in more detail in the next lessons, the “central” component of our architecture is the CPU (Central Processing Unit), whose task is to process the instructions given to it by the program.

Each level consists of an interface, i.e. what is visible from the outside and actually used by the level above, and an implementation, which relies on the interface of the level below.
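
As a loose analogy in code (all the names here are invented; this is not how the real levels are implemented), the idea of an upper level that only sees the interface of the lower one can be sketched like this:

#include <stdio.h>

/* Lower level: its interface is the single function read_sector(). */
static int read_sector(int sector) {
    return sector * 2;                 /* stand-in for "ask the hardware for data" */
}

/* Upper level: implemented only in terms of the lower level's interface. */
static int read_file_block(int block) {
    return read_sector(block + 100);   /* it never touches the hardware directly */
}

int main(void) {
    printf("%d\n", read_file_block(1));   /* prints 202 */
    return 0;
}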
