Inside the Compiler
This article explains how source code is transformed into optimized machine instructions. We’ll touch on lexical analysis, syntax trees, intermediate representations (IR), and static single assignment (SSA).
The Compilation Pipeline
A typical compiler follows these stages:
Source Code → Lexer → Parser → AST → IR → Optimizer → Code Gen → Machine Code
1. Lexical Analysis (Lexer)
The lexer breaks raw source text into tokens — the atomic units of the language:
int x = 42;
→ [INT, IDENT("x"), ASSIGN, NUM(42), SEMICOLON]
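To make the tokenization step concrete, here is a minimal sketch of a regex-based lexer for the declaration above. The token names follow the example; the `TOKEN_SPEC` table, the `tokenize` function, and the tuple representation of tokens are assumptions made for illustration, not how any particular compiler does it.

```python
import re

# Token kinds paired with the regular expressions that recognize them.
# Order matters: the keyword pattern must be tried before the identifier pattern.
TOKEN_SPEC = [
    ("INT",       r"\bint\b"),
    ("IDENT",     r"[A-Za-z_]\w*"),
    ("NUM",       r"\d+"),
    ("ASSIGN",    r"="),
    ("SEMICOLON", r";"),
    ("SKIP",      r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source):
    """Split source text into a list of token tuples (hypothetical format)."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup      # name of the alternative that matched
        text = match.group()
        pos = match.end()
        if kind == "SKIP":          # whitespace separates tokens but is not one
            continue
        if kind == "IDENT":
            tokens.append((kind, text))
        elif kind == "NUM":
            tokens.append((kind, int(text)))
        else:
            tokens.append((kind,))
    return tokens
```

Calling `tokenize("int x = 42;")` returns `[("INT",), ("IDENT", "x"), ("ASSIGN",), ("NUM", 42), ("SEMICOLON",)]`, which mirrors the token stream shown above. Production lexers usually also record line and column positions for error reporting.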
2. Parsing
The parser arranges tokens into an Abstract Syntax Tree (AST) that captures the program’s hierarchical structure.
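As an illustration of that hierarchical structure, here is a minimal sketch of a recursive descent parser. It uses arithmetic expressions rather than a declaration because nesting is easier to see there. The grammar, the `parse` function, and the nested-tuple AST shape are all assumptions chosen for brevity.

```python
import re


def parse(source):
    """Parse an arithmetic expression into a nested-tuple AST."""
    # Simplification: characters that are not digits, operators, or
    # parentheses are silently ignored by this tokenizer.
    tokens = re.findall(r"\d+|[-+*/()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def expr():     # expr := term (('+' | '-') term)*
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():     # term := factor (('*' | '/') factor)*
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():   # factor := NUM | '(' expr ')'
        if peek() is None:
            raise SyntaxError("unexpected end of input")
        token = advance()
        if token == "(":
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if token.isdigit():
            return ("num", int(token))
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

For `parse("1 + 2 * 3")` the result is `("+", ("num", 1), ("*", ("num", 2), ("num", 3)))`. The multiplication sits deeper in the tree than the addition, so operator precedence is encoded in the tree's shape rather than in the token order.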
3. Intermediate Representation (IR)
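IR is the compiler's internal language: simpler and more uniform than source code, but still more abstract than machine code (for example, it typically works with an unlimited supply of virtual registers instead of a fixed hardware set). LLVM IR is a popular example: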
define i32 @add(i32 %a, i32 %b) {
  %result = add i32 %a, %b
  ret i32 %result
}
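To show how a tree-shaped program becomes a flat instruction list, here is a rough sketch that lowers a nested-tuple expression AST into three-address instructions. The output only imitates the look of LLVM IR: it omits types and is not valid LLVM input. The AST shape, the `OPCODES` table, and the `lower` function are hypothetical names introduced for this example.

```python
# Maps source-level operators to LLVM-flavored opcode names (illustrative only).
OPCODES = {"+": "add", "-": "sub", "*": "mul", "/": "sdiv"}


def lower(ast):
    """Flatten an expression AST into a list of three-address instructions."""
    instructions = []
    counter = 0

    def fresh():
        # Each intermediate result gets its own temporary name.
        nonlocal counter
        counter += 1
        return f"%t{counter}"

    def visit(node):
        if node[0] == "num":
            return str(node[1])
        if node[0] == "var":
            return f"%{node[1]}"
        op, left, right = node
        lhs = visit(left)       # operands are lowered first ...
        rhs = visit(right)
        dest = fresh()          # ... then the operation that combines them
        instructions.append(f"{dest} = {OPCODES[op]} {lhs}, {rhs}")
        return dest

    result = visit(ast)
    instructions.append(f"ret {result}")
    return instructions
```

Lowering `("+", ("var", "a"), ("*", ("var", "b"), ("num", 2)))` yields `["%t1 = mul %b, 2", "%t2 = add %a, %t1", "ret %t2"]`. Because every temporary is written exactly once, the output already has the single-assignment flavor that later passes rely on.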
Static Single Assignment (SSA)
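SSA is a form of IR in which every variable is assigned exactly once in the program text, so each use refers to a single, unambiguous definition. Where control-flow paths merge, a phi (φ) function selects the value that arrived from the path actually taken. This simplifies many optimizations and analyses: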
- Constant propagation — replacing variables with known constants
- Dead code elimination — removing unreachable or unused code
- Register allocation — mapping virtual registers to hardware registers
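The first two items in the list above can be sketched on a toy IR. The example below is a simplified illustration that assumes straight-line code only (no branches, so no phi functions are needed) and does not attempt register allocation. The instruction format `(dest, op, a, b)` and all function names are hypothetical.

```python
import operator

# Operations that can be folded when both operands are known integers.
FOLD = {"add": operator.add, "mul": operator.mul}


def to_ssa(program):
    """Rename destinations so that every name is assigned exactly once."""
    version = {}    # source variable -> number of definitions seen so far
    ssa = []

    def current(operand):
        # Uses refer to the latest version; undefined names are treated as inputs.
        if isinstance(operand, str) and operand in version:
            return f"{operand}.{version[operand]}"
        return operand

    for dest, op, a, b in program:
        a, b = current(a), current(b)   # rename uses before the new definition
        version[dest] = version.get(dest, 0) + 1
        ssa.append((f"{dest}.{version[dest]}", op, a, b))
    return ssa


def propagate_constants(ssa):
    """Replace uses of known constants and fold fully constant operations."""
    known = {}      # SSA name -> constant value; safe because names never change
    out = []
    for dest, op, a, b in ssa:
        a = known.get(a, a)
        b = known.get(b, b)
        if op == "copy" and isinstance(a, int):
            known[dest] = a
        elif op in FOLD and isinstance(a, int) and isinstance(b, int):
            known[dest] = FOLD[op](a, b)
            op, a, b = "copy", known[dest], None
        out.append((dest, op, a, b))
    return out


def eliminate_dead_code(ssa, live_out):
    """Keep only instructions whose results are (transitively) needed."""
    needed = set(live_out)
    kept = []
    for dest, op, a, b in reversed(ssa):
        if dest in needed:
            kept.append((dest, op, a, b))
            needed.update(x for x in (a, b) if isinstance(x, str))
    kept.reverse()
    return kept


program = [
    ("x", "copy", 2, None),    # x = 2
    ("y", "add", "x", 3),      # y = x + 3
    ("x", "mul", "y", "n"),    # x = y * n   (n is an input)
    ("z", "add", "y", 1),      # z = y + 1   (result never used)
]
ssa = to_ssa(program)
folded = propagate_constants(ssa)
optimized = eliminate_dead_code(folded, live_out={"x.2"})
```

After renaming, the two assignments to `x` become `x.1` and `x.2`, so the constant `2` can be attached to `x.1` without worrying about the later redefinition. With `x.2` as the only live result, `optimized` is `[("x.2", "mul", 5, "n")]`: the constant has been propagated and folded, and the now-unused definitions have been removed.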
Compilers are the ultimate translators between human and machine — efficient, precise, and ever-evolving.
Key Optimizations
| Optimization | What It Does |
|---|---|
| Inlining | Replaces function calls with the function body |
| Loop unrolling | Reduces loop overhead by duplicating the loop body |
| Vectorization | Uses SIMD instructions for data-parallel operations |
| Tail call optimization | Reuses the current stack frame for a call in tail position, so tail-recursive functions run like loops |
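Two of these transformations are easy to show by hand. The sketch below writes out "before" and "after" versions of loop unrolling and tail call optimization as ordinary Python functions. This is only an illustration of the shape of the rewritten code: a compiler performs these rewrites on its IR, CPython itself does not eliminate tail calls, and the function names and the unroll factor of 4 are assumptions.

```python
def dot(xs, ys):
    """Baseline: one multiply-add per loop iteration."""
    total = 0
    for i in range(len(xs)):
        total += xs[i] * ys[i]
    return total


def dot_unrolled(xs, ys):
    """Unrolled by 4: fewer loop-control steps per element processed."""
    total = 0
    n = len(xs)
    main = n - n % 4                # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        total += xs[i] * ys[i]
        total += xs[i + 1] * ys[i + 1]
        total += xs[i + 2] * ys[i + 2]
        total += xs[i + 3] * ys[i + 3]
    for i in range(main, n):        # remainder loop for the leftover elements
        total += xs[i] * ys[i]
    return total


def sum_to(n, acc=0):
    """Tail-recursive: the recursive call is the last action of the function."""
    if n == 0:
        return acc
    return sum_to(n - 1, acc + n)


def sum_to_loop(n, acc=0):
    """The loop a tail call optimizer would effectively produce."""
    while n != 0:
        n, acc = n - 1, acc + n     # rebind the parameters instead of calling
    return acc
```

Both pairs compute the same results for integer inputs. The loop form of `sum_to` also handles inputs such as `100000` that would exceed the default recursion limit in the recursive version, which is the practical benefit of reusing the stack frame.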
Going Deeper
Future articles will cover:
- Writing a toy compiler from scratch
- LLVM pass infrastructure
- JIT compilation and runtime optimization
- Profile-guided optimization (PGO)