From silicon to serving — the GPU, the execution and memory model, how kernels get written, and the systems that stitch them into a server.


A CPU has a few powerful cores tuned for latency. A GPU flips the trade-off: thousands of simpler cores that run the same operation across huge batches of data at once. That throughput model is exactly what dense tensor math wants.
SIMT (single instruction, multiple threads) is NVIDIA’s spin on SIMD. Threads are programmed independently but executed in lockstep groups called warps (32 threads each on NVIDIA hardware), which is what lets you write scalar-looking code that still runs as a wide vector operation.
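
To make the lockstep idea concrete, here is a small pure-Python emulation of one warp running a branch. It is a sketch, not how the hardware is implemented: the function name run_warp and the step counter are made up for illustration. It shows that when lanes disagree on a branch, both sides are executed in turn with some lanes masked off.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def run_warp(values):
    """Emulate one warp computing: y = x * 2 if x is even else x + 1.

    Each step below is applied across the whole warp at once; a per-lane
    mask decides which lanes actually commit a result.
    """
    assert len(values) == WARP_SIZE
    out = list(values)
    steps = 0

    # Every lane evaluates the branch condition together.
    even = [x % 2 == 0 for x in values]
    steps += 1

    # "Even" side: only lanes whose input is even are active.
    if any(even):
        out = [x * 2 if m else o for x, m, o in zip(values, even, out)]
        steps += 1

    # "Odd" side: the remaining lanes run while the others sit idle.
    if not all(even):
        out = [x + 1 if not m else o for x, m, o in zip(values, even, out)]
        steps += 1

    return out, steps


uniform_out, uniform_steps = run_warp([2] * WARP_SIZE)         # all lanes agree
mixed_out, mixed_steps = run_warp(list(range(WARP_SIZE)))      # lanes diverge
print(uniform_steps, mixed_steps)
```

The uniform warp needs one fewer step than the divergent one, which is the reason kernel authors try to keep the threads of a warp on the same code path.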

Work on a GPU is organized as a hierarchy. You write one program, and the hardware runs it across many threads. Threads are grouped into blocks (also called cooperative thread arrays, or CTAs) that share resources and can cooperate, and blocks are grouped into a grid.
Shared memory plus barriers inside a block are what make fast kernels possible — tiling a matmul, staging data, and reducing across threads all depend on this cooperative layer.
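
The cooperative layer can be sketched with a block-level sum. The emulation below is plain Python under simplifying assumptions: a list stands in for shared memory, and each reduction step is computed in full before it is written back, which plays the role of the barrier. The names block_reduce_sum and grid_reduce_sum are hypothetical.

```python
def block_reduce_sum(data, block_size):
    """Emulate one thread block summing up to block_size values."""
    assert block_size & (block_size - 1) == 0, "block_size must be a power of two"

    # Stage: thread t copies element t into the block's shared buffer,
    # padding with zeros when the block overhangs the data.
    shared = [data[t] if t < len(data) else 0 for t in range(block_size)]
    # (barrier here on a real GPU, e.g. __syncthreads() in CUDA)

    stride = block_size // 2
    while stride > 0:
        # Threads 0..stride-1 each add in a partner element. Building the
        # whole step before writing back stands in for the barrier that
        # separates reduction steps.
        partial = [shared[t] + shared[t + stride] for t in range(stride)]
        shared[:stride] = partial
        stride //= 2
    return shared[0]


def grid_reduce_sum(data, block_size=8):
    """Split data across blocks, reduce each block, then combine the partials."""
    partials = [
        block_reduce_sum(data[start:start + block_size], block_size)
        for start in range(0, len(data), block_size)
    ]
    return sum(partials)


total = grid_reduce_sum(list(range(100)))
print(total)
```

Each block touches its inputs once and does the remaining log2(block_size) steps entirely in its own fast buffer, which is the pattern tiled matmuls and reductions build on.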

A GPU has tiers of memory that trade speed for size. The closer to the compute cores, the faster and smaller. Performance is mostly decided by which tier your data lives in.
| Tier | Scope | Latency / size |
|---|---|---|
| Registers | per-thread, on-chip | ~1 cycle, up to ~1 KB/thread |
| Shared Mem / L1 | per-block (SM); shared memory is programmer-managed | ~20-30 cycles, ~100-228 KB/SM |
| L2 Cache | shared across all SMs | ~200 cycles, tens of MB |
| Global Mem (HBM) | whole GPU, off-chip | ~400+ cycles, GBs @ TB/s |
The kernel engineer’s job is to move data up the hierarchy and keep it there: coalesce global loads, stage tiles in shared memory, and hold accumulators in registers.
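
A back-of-the-envelope count shows why staging tiles pays off. The sketch below assumes a square N × N matmul, ignores the hardware caches entirely, and counts only element loads from global memory; the function names are illustrative.

```python
def global_loads_naive(n):
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    return n * n * 2 * n


def global_loads_tiled(n, tile):
    """Each block computes a tile x tile output patch from staged tiles."""
    assert n % tile == 0, "sketch assumes the tile size divides n"
    blocks = (n // tile) ** 2          # one block per output tile
    k_steps = n // tile                # tiles walked along the shared dimension
    loads_per_step = 2 * tile * tile   # one tile of A plus one tile of B
    return blocks * k_steps * loads_per_step


n, tile = 1024, 32
naive = global_loads_naive(n)
tiled = global_loads_tiled(n, tile)
reduction = naive // tiled
print(naive, tiled, reduction)
```

Under these assumptions the global traffic drops by a factor equal to the tile edge, because every staged element is reused by a whole row or column of threads in the block.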
FlashAttention is the canonical example: Dao et al. (2022) reformulate attention as a tiled, online-softmax computation that never materializes the full N × N score matrix in HBM, which removes most of the HBM traffic that makes standard attention memory-bound.
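
The online-softmax trick can be sketched for a single query row in plain Python. This is not the FlashAttention kernel itself: it uses scalar values instead of vectors, and the function names and tile size are assumptions for illustration. It shows that a running max and running sum let you process scores tile by tile and still match the full softmax.

```python
import math


def attention_row_reference(scores, values):
    """Softmax over all scores at once, then a weighted sum of values."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def attention_row_online(scores, values, tile=4):
    """Same result, visiting the scores one tile at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    acc = 0.0           # unnormalized weighted sum of values
    for start in range(0, len(scores), tile):
        s_tile = scores[start:start + tile]
        v_tile = values[start:start + tile]
        new_max = max(running_max, max(s_tile))
        # Rescale everything accumulated under the old max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_tile, v_tile):
            p = math.exp(s - new_max)
            running_sum += p
            acc += p * v
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0, 1.5, 2.5, -0.5, 1.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
expected = attention_row_reference(scores, values)
online = attention_row_online(scores, values)
print(abs(online - expected) < 1e-9)
```

Only one tile of scores exists at any moment, which is what lets the real kernel keep its working set in on-chip memory.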

GPGPU means using the GPU’s thousands of cores for general computation, not just graphics. CUDA is NVIDIA’s platform and C++ language extension for exactly that: you write kernels that run across the thread / block / grid hierarchy, while the CPU orchestrates.
The mental model is host and device. The CPU (host) launches kernels and moves data. The GPU (device) runs the grid of blocks in parallel. One source file spans two worlds: __global__ functions run on the device, and the rest runs on the host.
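
The host/device split and the index arithmetic can be emulated in a few lines of Python. This is a sketch only: launch and saxpy_kernel are hypothetical names, and the nested loops enumerate work that a real GPU would run in parallel.

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Host side: run kernel once per (block, thread) pair."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def saxpy_kernel(block_idx, block_dim, thread_idx, a, x, y, out, n):
    """Device side: each thread handles at most one element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < n:                               # the last block may overhang the array
        out[i] = a * x[i] + y[i]


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n

block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim  # ceiling division: enough blocks to cover n
launch(saxpy_kernel, grid_dim, block_dim, 2.0, x, y, out, n)
print(out)
```

The bounds check matters because the grid is rounded up to whole blocks: here 3 blocks of 4 threads cover 10 elements, and the last two threads do nothing.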

A Domain-Specific Language trades generality for power inside one narrow problem. Instead of expressing everything from scratch, you describe what you want and the compiler handles the messy how. Familiar examples are SQL for querying data, RegEx for text matching, and HTML / CSS for documents and layout.
Why it matters for GPUs: writing raw CUDA is powerful but slow to iterate. You manage threads, shared memory, and synchronization by hand. A GPU DSL lets you write high-performance kernels at a higher level and leans on a compiler to emit efficient GPU code. The DSL covered here is Triton.

Triton (from OpenAI) lets you write GPU kernels in plain Python. You get close-to-CUDA performance without hand-managing the lowest-level details. The compiler handles thread mapping, memory coalescing, and scheduling.
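Kernels are Python functions decorated with @triton.jit, with no separate C++ toolchain.

```python
import triton
import triton.language as tl


@triton.jit
def add(x, y, out, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                    # which tile this program instance owns
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # element indices covered by that tile
    mask = offs < n                           # switch off lanes past the end of the tensor
    a = tl.load(x + offs, mask=mask)
    b = tl.load(y + offs, mask=mask)
    tl.store(out + offs, a + b, mask=mask)
```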

CUDA makes you reason about individual threads. Triton raises the unit of work to a block (tile): you load, compute on, and store whole chunks of data at once. The compiler maps that tile onto threads for you.
Picture a 1-D tensor split into BLOCK-sized tiles, with one program instance per tile (pid = 0, 1, 2, ...). The three primitives you lean on:
- tl.program_id answers “which tile am I?” and gives you your slice of the data.
- tl.arange plus a mask builds the index range and masks off the ragged tail.
- tl.load / tl.store move a whole tile to and from global memory.

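
The tile model can be traced by hand with an emulation in plain Python lists. This is a sketch for intuition, not Triton: add_program is a hypothetical stand-in for one program instance, and cdiv mirrors the ceiling division usually used to size the launch grid (Triton ships it as triton.cdiv).

```python
def cdiv(a, b):
    """Ceiling division: number of BLOCK-sized tiles needed to cover a elements."""
    return (a + b - 1) // b


def add_program(pid, x, y, out, n, BLOCK):
    """Emulate one program instance of a tiled elementwise add."""
    offs = [pid * BLOCK + i for i in range(BLOCK)]  # pid * BLOCK + arange(0, BLOCK)
    mask = [o < n for o in offs]                    # False on the ragged tail
    # Masked loads: inactive lanes get a placeholder that is never stored.
    a = [x[o] if m else 0 for o, m in zip(offs, mask)]
    b = [y[o] if m else 0 for o, m in zip(offs, mask)]
    # Masked store: only active lanes write.
    for o, m, ai, bi in zip(offs, mask, a, b):
        if m:
            out[o] = ai + bi


n, BLOCK = 10, 4
x = list(range(n))
y = [10 * i for i in range(n)]
out = [None] * n

grid = cdiv(n, BLOCK)          # one program instance per tile
for pid in range(grid):        # a GPU would run these in parallel
    add_program(pid, x, y, out, n, BLOCK)
print(grid, out)
```

With n = 10 and BLOCK = 4 the last program instance covers offsets 8 to 11, and the mask is what keeps offsets 10 and 11 from touching memory outside the tensor.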

Every kernel launch carries CPU-side overhead. In an LLM decode loop you launch the same sequence of small kernels thousands of times, and that overhead starts to dominate. CUDA Graphs let you record a sequence once and replay it as a single unit.
Without graphs, the CPU pays launch overhead before every kernel. With a graph, one replay fires the kernels back-to-back, so the CPU-side cost is paid once per replay instead of once per kernel.
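
A toy record-and-replay model makes the accounting visible. Nothing here is the CUDA Graphs API: ToyStream, ToyGraph, and the three tiny "kernels" are invented for illustration, and a counter of CPU submissions stands in for launch overhead.

```python
class ToyStream:
    """Counts how many times the CPU hands work to the device."""

    def __init__(self):
        self.submissions = 0

    def launch(self, kernel, *args):
        self.submissions += 1
        kernel(*args)


class ToyGraph:
    """Records a fixed kernel sequence, then replays it as one submission."""

    def __init__(self):
        self.nodes = []

    def record(self, kernel, *args):
        self.nodes.append((kernel, args))

    def replay(self, stream):
        stream.submissions += 1
        for kernel, args in self.nodes:
            kernel(*args)


def add_one(buf):
    buf[0] += 1


def double(buf):
    buf[0] *= 2


STEP = [add_one, double, add_one]  # stand-in for one decode step


def run_eager(steps):
    stream, buf = ToyStream(), [0]
    for _ in range(steps):
        for kernel in STEP:
            stream.launch(kernel, buf)
    return buf[0], stream.submissions


def run_graphed(steps):
    stream, buf = ToyStream(), [0]
    graph = ToyGraph()
    for kernel in STEP:
        graph.record(kernel, buf)  # same buffer object on every replay
    for _ in range(steps):
        graph.replay(stream)
    return buf[0], stream.submissions


eager_result, eager_submissions = run_eager(5)
graph_result, graph_submissions = run_graphed(5)
print(eager_submissions, graph_submissions)
```

Both runs compute the same value, but the graphed run submits once per step rather than once per kernel. The toy also hints at the usual constraint: the recorded kernels are bound to fixed buffers, so inputs have to be written into those buffers before each replay.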

Attention is the hot path of LLM inference. FlashInfer is a library of highly optimized attention kernels built specifically for serving: handling the paged KV cache, wildly varying sequence lengths, and both the prefill and decode phases.

A model’s forward pass is a pipeline of GPU kernels. An inference engine wires them together, then schedules and reuses that pipeline across many requests:
Tokens → Embedding → Transformer (× N layers) → Final Norm → LM Head → Sample
That pipeline is built from three groups of kernels and systems:

vLLM ties the stack together into a production inference server. Its key idea, PagedAttention, manages the KV cache like virtual memory, so the GPU stays busy across many concurrent requests.
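
The virtual-memory analogy can be sketched with a block table. The class below is a minimal illustration in plain Python, not vLLM's implementation: PagedKVCache, its method names, and the tiny block size are all assumptions. It shows that a request's logical token positions map to fixed-size physical blocks that need not be contiguous.

```python
class PagedKVCache:
    """Toy block-table allocator for per-request KV storage."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, req):
        """Reserve a slot for the request's next token; return (block, offset)."""
        table = self.block_tables.setdefault(req, [])
        length = self.lengths.get(req, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[req] = length + 1
        return self.locate(req, length)

    def locate(self, req, pos):
        """Translate a logical token position into (physical block, offset)."""
        block = self.block_tables[req][pos // self.block_size]
        return block, pos % self.block_size

    def free(self, req):
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req))
        del self.lengths[req]


cache = PagedKVCache(num_blocks=4, block_size=2)
slots = [cache.append_token(r) for r in ["a", "a", "b", "a"]]
print(slots, cache.block_tables)
cache.free("a")
print(cache.free_blocks)
```

Because requests "a" and "b" interleave, request "a" ends up in physical blocks 0 and 2. Memory is claimed one block at a time as sequences grow and is returned the moment a request finishes, which is how the cache stays packed across many concurrent requests.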
Where it shows up:

Reading from the serving API down to the silicon, the whole picture stacks like this:
| Layer | Role |
|---|---|
| vLLM | serving engine: batching, scheduling, the API |
| FlashInfer | optimized attention kernels on the hot path |
| CUDA Graphs | replay the decode loop with minimal overhead |
| Triton / DSL | custom fused kernels, written productively |
| CUDA / GPGPU | the platform: kernels across the thread hierarchy |
| Threads & CTAs | the cooperative execution and memory model |
| GPU | thousands of parallel cores |
Each layer exists to keep the one above it fed and the silicon below it busy. Understand the memory hierarchy and the execution model at the bottom, and every optimization higher up reads as the same idea applied at a different scale: keep the compute units saturated and stop paying for data movement you do not need.