This article covers the basics of High-Performance Computing (HPC). From understanding parallelism to using GPUs efficiently, HPC remains the foundation of large-scale simulation and AI workloads.

Why HPC Matters

Modern scientific computing, weather simulation, molecular dynamics, and deep learning all demand computational power far beyond what a single CPU core can provide. HPC is the discipline of harnessing thousands (or millions) of processing units to solve these problems. The TOP500 list ranks the world’s fastest supercomputers twice a year; as of mid-2024, Frontier at Oak Ridge National Laboratory held the #1 spot at roughly 1.2 exaFLOPS on the HPL benchmark.

Core Concepts

Parallelism

There are two fundamental types:

  • Data parallelism — the same operation applied across different data chunks
  • Task parallelism — different operations running concurrently

Memory Models

Model                Description
Shared memory        All processors access the same memory (OpenMP)
Distributed memory   Each processor has its own memory (MPI)
Hybrid               Combines both (MPI + OpenMP)

OpenMP is the de facto standard for shared-memory parallelism in C/C++ and Fortran. It uses compiler directives (#pragma omp) to parallelize loops and sections, making it one of the lowest-friction ways to add parallelism to existing code. MPI (Message Passing Interface), on the other hand, handles distributed-memory communication across nodes and has been the backbone of distributed HPC since the early 1990s. The MPI 4.0 standard introduced persistent collectives and partitioned communication.

GPU Computing with CUDA

GPUs excel at data-parallel workloads. A simple CUDA kernel:

__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    // Global thread index: each thread handles one element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the last block may extend past the end of the arrays.
    if (i < n) c[i] = a[i] + b[i];
}

Key CUDA concepts:

  1. Threads are organized into blocks
  2. Blocks are organized into a grid
  3. Shared memory within a block enables fast communication

CUDA shared memory is on-chip SRAM (~100 KB per SM on modern GPUs) with roughly 100× lower latency than global memory. Efficient use of shared memory is often the single biggest optimization lever.
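As an illustration of the shared-memory pattern, here is a sketch of a block-level sum reduction: each block stages its elements into shared memory, reduces them in a log2-step tree, and writes a single result to global memory. The kernel name and the fixed block size of 256 are illustrative assumptions (the code assumes blockDim.x is a power of two and equals 256), not a tuned implementation.

```cuda
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];          // on-chip, visible to the whole block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Stage one element per thread into fast shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction entirely in shared memory: log2(256) = 8 steps,
    // with no global-memory traffic until the final write.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // One global write per block instead of 256.
    if (tid == 0) out[blockIdx.x] = tile[0];
}
```

The payoff is that the 255 intermediate additions per block hit on-chip SRAM rather than global memory, which is exactly the latency gap described above.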

Profiling & Optimization

Before optimizing, measure. Nsight Systems provides a timeline view of GPU and CPU activity, making it easy to spot idle gaps and kernel launch overhead.


Essential tools:

  • nvprof (legacy) / Nsight Systems — GPU profiling
  • perf — CPU-level performance counters
  • gprof — function-level profiling

Premature optimization is the root of all evil, but late optimization is the root of all slow code.

What’s Next

Future posts will explore:

  • MPI communication patterns
  • CUDA memory hierarchy deep dive
  • Distributed training with NCCL
  • Profiling real AI workloads end to end