→ Next slide
← Previous slide
Home First slide
End Last slide
Esc Back to blog
Understanding the essence of data locality in GPU and the complexity of Tiling.
The basic idea of GPU is to get lots of things done very fast. Nothing less, nothing more.
To do that, a lot of things were put in place — from the hardware architectures to the software stack.
Let’s start with understanding the basic building blocks of a GPU program.
Imagine you have a matrix data that you want to process with the GPU. Let’s call that matrix A and it has the shape of (M × K) which we’ll assign to be 16 × 16 for this discussion.

The GPU has to find a way to process that data in parallel, and it does that by making different threads work on different parts of the data.
The GPU divides this work into grids. Our data is divided into a 2 × 2 grid. We’ll discuss how we arrive at that later.

Each grid is assigned a block of threads, and each block is of 8 × 8 dimension.

Each block is made up of threads, and each thread is responsible for processing a specific element of the matrix. In this case, each thread will process one element of matrix A.

Threads are grouped into warps, and each warp is made up of 32 threads. Since we’re dealing with a small matrix, we’ll assume our warps to be of 8 threads.

In practice, warps are of 32 threads, and matrix dimensions and block sizes are always in multiples of 32. We’re simplifying here.
Each thread processes one element of matrix A, and each block processes a specific part of the matrix:
Grid size = (M/8, K/8) = (16/8, 16/8) = (2, 2)
Block size = (8, 8)
Threads per block = 8 × 8 = 64
Grid and block can be expressed in 1D, 2D, or 3D. Here we use 2D.
Data locality refers to accessing data in a way that maximizes cache usage and minimizes memory latency.
Here are the levels of data locality in GPU programming:

Accessing data from registers is much faster than from global memory. If we want the best performance, our data needs to be closer to the compute area.
The SM is the basic unit of computation in a GPU. It is responsible for:
To optimize your GPU kernel to the fullest, you need to understand your GPU specification — total number of SMs, total registers per SM, etc.
When solving data locality — moving data from global memory → shared memory → registers — we need to consider:
This can be quite complex based on the tiling strategy we use.
Starting with a simple matrix multiplication kernel:
__global__ void sgemm_naive(int M, int N, int K,
float alpha, const float *A,
const float *B, float beta, float *C) {
const uint x = blockIdx.x * blockDim.x + threadIdx.x;
const uint y = blockIdx.y * blockDim.y + threadIdx.y;
if (x < M && y < N) {
float tmp = 0.0;
for (int i = 0; i < K; ++i) {
tmp += A[x * K + i] * B[i * N + y];
}
// C = α*(A@B)+β*C
C[x * N + y] = alpha * tmp + beta * C[x * N + y];
}
}
A is M×K, B is K×N, C is M×N. Using M=16, K=16, N=16.
To access data from the matrix, we determine the index by first getting the thread that has access to that index, then we need to know the block that thread is in.

Grid size = (2, 2)
Block = (8, 8) => (blockDim.x, blockDim.y)
From the image, the block in both x and y direction ranges from 0 to 1:
{(0,0), (0,1), (1,0), (1,1)}The thread in both x and y direction ranges from 0 to 7 for each block:
{(0,0) … (7,7)} per blockAccess element at index (0,0) in matrix A:
blockIdx.x = 0, blockIdx.y = 0
threadIdx.x = 0, threadIdx.y = 0
x = blockIdx.x * blockDim.x + threadIdx.x
= 0 * 8 + 0 = 0
The x is used to get the offset in matrix A:
A[x * K + i] = A[0 * 16 + 0] = A[0]
Access element in Block (1,0), second row and third column → thread (1,2):
blockIdx.x = 1
threadIdx.x = 1, threadIdx.y = 2
x = blockIdx.x * blockDim.x + threadIdx.x
= 1 * 8 + 1 = 9
The index in matrix A:
A[9 * 16 + i] = A[144 + i]
This maps to row 9 of matrix A — the second row inside the second block.
Read the full article on BrassinAI