Inside the Compiler

This article explains how a compiler transforms source code into optimized machine instructions. It covers lexical analysis, parsing into syntax trees, intermediate representations (IR), and static single assignment (SSA) form.

The Compilation Pipeline

A typical compiler follows these stages:

Source Code → Lexer → Parser → AST → IR → Optimizer → Code Gen → Machine Code

1. Lexical Analysis (Lexer)

The lexer breaks raw source text into tokens — the atomic units of the language:

int x = 42;
→ [INT, IDENT("x"), ASSIGN, NUM(42), SEMICOLON]
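
As a rough illustration of the step above, here is a minimal regex-based lexer sketch in Python. The token names mirror the example; the TOKEN_SPEC table and the tokenize function are hypothetical names chosen for this sketch, not part of any particular compiler.

```python
import re

# Hypothetical token table; order matters, so the keyword rule comes before IDENT.
TOKEN_SPEC = [
    ("INT", r"\bint\b"),
    ("NUM", r"\d+"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("SEMICOLON", r";"),
    ("SKIP", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) pairs; raise on an unrecognized character."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind = match.lastgroup  # name of the alternative that matched
        if kind != "SKIP":      # whitespace separates tokens but is not one itself
            tokens.append((kind, match.group()))
        pos = match.end()
    return tokens


print(tokenize("int x = 42;"))
```

Production lexers are often hand-written or generated state machines, but the idea is the same: scan left to right and emit one token per match.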

2. Parsing

The parser arranges tokens into an Abstract Syntax Tree (AST) that captures the program’s hierarchical structure.
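
To make the tree structure concrete, the sketch below is a small recursive-descent parser for integer arithmetic. It is only one of several parsing techniques, and the node classes Num and BinOp and the parse function are names invented for this example.

```python
import re
from dataclasses import dataclass


@dataclass
class Num:
    value: int


@dataclass
class BinOp:
    op: str
    left: object
    right: object


def parse(source):
    """Parse +, -, *, / and parentheses over integers into an AST."""
    # Simplification: characters outside this tiny grammar are silently skipped.
    tokens = re.findall(r"\d+|[-+*/()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = advance()
        if tok == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return Num(int(tok))

    def term():
        # '*' and '/' bind tighter than '+' and '-', so they sit lower in the tree.
        node = factor()
        while peek() in ("*", "/"):
            node = BinOp(advance(), node, factor())
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = BinOp(advance(), node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("1 + 2 * 3"))
```

Operator precedence shows up as nesting: in "1 + 2 * 3" the multiplication becomes a child of the addition, so it is evaluated first.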

3. Intermediate Representation (IR)

IR is the compiler's internal language: lower-level than source code, yet more structured than machine code (LLVM IR, for instance, keeps types and an unlimited supply of virtual registers). LLVM IR is a popular example:

define i32 @add(i32 %a, i32 %b) {
  %result = add i32 %a, %b
  ret i32 %result
}
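
One way to picture how an AST becomes IR is a lowering pass that walks the tree and emits one instruction per operation. The Python sketch below does this for nested tuples. Its output only imitates the look of LLVM IR (it omits types, so it is not valid LLVM), and the lower function and the %t temporaries are assumptions made for illustration.

```python
import itertools


def lower(ast):
    """Lower a nested-tuple expression into a list of three-address instructions.

    A node is an int literal, a value name (str), or a tuple (op, left, right).
    """
    instructions = []
    counter = itertools.count(1)

    def visit(node):
        if isinstance(node, (int, str)):
            return str(node)              # leaves are used directly as operands
        op, left, right = node
        lhs = visit(left)                 # operands are lowered before the operation
        rhs = visit(right)
        temp = f"%t{next(counter)}"       # fresh temporary, never reassigned
        instructions.append(f"{temp} = {op} {lhs}, {rhs}")
        return temp

    result = visit(ast)
    instructions.append(f"ret {result}")
    return instructions


# a + b * 2
for line in lower(("add", "%a", ("mul", "%b", 2))):
    print(line)
```

Each nested sub-expression is flattened into its own instruction with a fresh temporary, which is why IR is easier to analyze than a deeply nested tree.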

Static Single Assignment (SSA)

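SSA is a form of IR in which every variable is assigned exactly once; where control-flow paths merge, a phi (φ) function selects the value coming from the path actually taken. Because each name has a single definition, many analyses and transformations become simpler: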

Constant propagation — replacing variables with known constants
Dead code elimination — removing unreachable or unused code
Register allocation — mapping virtual registers to hardware registers
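
The first two items can be sketched on a toy straight-line SSA program. The instruction format (dest, op, args) and both function names below are hypothetical, and the sketch assumes every operation is pure (no side effects) and that there are no branches or phi functions.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def propagate_constants(program):
    """Replace uses of known constants and fold operations whose inputs are all constant."""
    known = {}  # SSA name -> constant; never invalidated, since names are assigned once
    out = []
    for dest, op, args in program:
        args = tuple(known.get(a, a) for a in args)
        if op == "const":
            known[dest] = args[0]
        elif op in OPS and all(isinstance(a, int) for a in args):
            known[dest] = OPS[op](*args)
            op, args = "const", (known[dest],)
        out.append((dest, op, args))
    return out


def eliminate_dead_code(program, live_out):
    """Keep only the instructions whose results are (transitively) needed by live_out."""
    needed = set(live_out)
    kept = []
    for dest, op, args in reversed(program):
        if dest in needed:
            kept.append((dest, op, args))
            needed.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept


program = [
    ("x1", "const", (4,)),
    ("y1", "const", (5,)),
    ("z1", "add", ("x1", "y1")),
    ("w1", "mul", ("z1", "n0")),  # n0 is an input whose value is unknown
    ("u1", "sub", ("x1", "y1")),  # result is never used
]
optimized = eliminate_dead_code(propagate_constants(program), live_out=["w1"])
print(optimized)
```

The single-definition rule is what keeps the bookkeeping this small: once a name is known to hold a constant, no later assignment can change that.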

Compilers are the ultimate translators between human and machine — efficient, precise, and ever-evolving.

Key Optimizations

Optimization	What It Does
Inlining	Replaces function calls with the function body
Loop unrolling	Reduces loop overhead by duplicating the loop body
Vectorization	Uses SIMD instructions for data-parallel operations
Tail call optimization	Reuses the current stack frame for calls in tail position, so tail-recursive functions run like loops
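
As an illustration of the last row, the Python sketch below writes out by hand the rewrite that a compiler with tail call optimization could apply to a self-recursive function. Both function names are made up for this example, and CPython itself does not perform this optimization, so the recursive version is still limited by the interpreter's recursion depth.

```python
def sum_to_recursive(n, acc=0):
    """Sum 1..n with an accumulator; the recursive call is the last action taken."""
    if n == 0:
        return acc
    return sum_to_recursive(n - 1, acc + n)  # tail call: nothing left to do afterwards


def sum_to_loop(n, acc=0):
    """The same function after the rewrite: rebind the parameters and jump back to the top."""
    while True:
        if n == 0:
            return acc
        n, acc = n - 1, acc + n  # right-hand side uses the old n, like the call's arguments


print(sum_to_recursive(10), sum_to_loop(10))
print(sum_to_loop(100_000))  # constant stack depth, so large inputs are fine
```

Because nothing remains to be done after a tail call returns, the caller's frame can be reused, which is why the loop version needs no extra stack space.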

Going Deeper

Future articles will cover:

Writing a toy compiler from scratch
LLVM pass infrastructure
JIT compilation and runtime optimization
Profile-guided optimization (PGO)