Getting Started with High-Performance Computing
This article covers the basics of High-Performance Computing (HPC), from understanding parallelism to using GPUs efficiently. HPC remains the foundation of large-scale simulation and AI workloads.
Why HPC Matters
Modern scientific computing, weather simulation, molecular dynamics, and deep learning all demand computational power far beyond what a single CPU core can provide. HPC is the discipline of harnessing thousands (or millions) of processing units to solve these problems. The TOP500 list ranks the world’s fastest supercomputers twice a year. As of 2024, Frontier at Oak Ridge holds the #1 spot at 1.2 exaFLOPS.
Core Concepts
Parallelism
There are two fundamental types:
- Data parallelism — the same operation applied across different data chunks
- Task parallelism — different operations running concurrently
Memory Models
| Model | Description |
|---|---|
| Shared memory | All processors access the same memory (OpenMP) |
| Distributed memory | Each processor has its own memory (MPI) |
| Hybrid | Combines both (MPI + OpenMP) |
OpenMP is the de facto standard for shared-memory parallelism in C/C++ and Fortran. It uses compiler directives (#pragma omp) to parallelize loops and sections, making it one of the lowest-friction ways to add parallelism to existing code.

MPI (the Message Passing Interface), on the other hand, handles distributed-memory communication across nodes and has been the backbone of distributed HPC since the early 1990s. The MPI-4.0 standard introduced persistent collectives and partitioned communication.
GPU Computing with CUDA
GPUs excel at data-parallel workloads. A simple CUDA kernel:
```cuda
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard against out-of-bounds threads
}
```
Key CUDA concepts:
- Threads are organized into blocks
- Blocks are organized into a grid
- Shared memory within a block enables fast communication

CUDA shared memory is on-chip SRAM (~100 KB per SM on modern GPUs) with ~100× lower latency than global memory. Efficient use of shared memory is often the single biggest optimization lever.
Profiling & Optimization
Before optimizing, measure. Nsight Systems provides a timeline view of GPU and CPU activity, making it easy to spot idle gaps and kernel launch overhead.

Essential tools:
- nvprof / Nsight Systems: GPU profiling
- perf: CPU-level performance counters
- gprof: function-level profiling
Premature optimization is the root of all evil, but late optimization is the root of all slow code.
What’s Next
Future posts will explore:
- MPI communication patterns
- CUDA memory hierarchy deep dive
- Distributed training with NCCL
- Profiling real AI workloads end to end