CUDA basics

  • Thread: smallest unit of execution
  • Block: group of threads that can share memory
  • Grid: group of blocks that executes a kernel across a dataset
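
A minimal sketch tying the three levels together (the kernel name, array size, and scale factor here are illustrative, not from llm.c): each thread computes one global index from its block and thread coordinates, and the grid is sized so the blocks collectively cover the whole array.

#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of the array.
__global__ void scale_kernel(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] *= s;                        // guard: the grid may overshoot n
}

int main(void) {
    int n = 1000;
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));
    int block_size = 256;                               // threads per block
    int grid_size = (n + block_size - 1) / block_size;  // blocks per grid, ceil(n / block_size)
    scale_kernel<<<grid_size, block_size>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}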

Memory hierarchy

  • Registers: fastest memory, private to each thread
  • Shared memory: shared among threads within the same block, faster than global memory but limited in size
  • Global memory: accessible by all threads but slower than registers or shared memory
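
To see all three levels in one place, here is a sketch of a block-level sum reduction (a hypothetical kernel, not from llm.c; it assumes it is launched with exactly 256 threads per block): each thread's value lives in a register, the partial sums live in shared memory, and only one global-memory write happens per block.

__global__ void block_sum_kernel(const float* in, float* out, int n) {
    __shared__ float tile[256];                // shared memory: visible to all threads in this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;          // 'v' lives in a register, private to this thread
    tile[threadIdx.x] = v;                     // stage per-thread values in shared memory
    __syncthreads();
    // tree reduction carried out entirely in shared memory
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // one global-memory write per block
}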

Core reference: https://github.com/viraatdas/llm.c/blob/master/train_gpt2.c

  1. Header Inclusion and Definitions
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <stdint.h>
#include <assert.h>
#include <math.h>
#include <time.h>
#include <string.h>
#include <unistd.h>
#ifdef OMP
#include <omp.h>
#endif
  2. Core Functions (Forward and Backward Passes)

encoder_forward kernel (encoder_forward_kernel2)

__global__ void encoder_forward_kernel2(float* out,
                                        int* inp, float* wte, float* wpe,
                                        int B, int T, int C) {
    // one thread per output element; N = B * T * C elements in total
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int N = B * T * C;

    if (idx < N) {
        // decompose the flat index into (b, t, c) coordinates
        int bt = idx / C;
        int b = bt / T;
        int t = bt % T;
        int c = idx % C;

        // token id at position (b, t)
        int ix = inp[b * T + t];

        // out[b, t, c] = wte[ix, c] + wpe[t, c]
        float* out_btc = out + b * T * C + t * C + c;
        float* wte_ix = wte + ix * C + c;
        float* wpe_tc = wpe + t * C + c;
        *out_btc = *wte_ix + *wpe_tc;
    }
}

Parameters:

  • out: Output array where the embeddings will be stored.
  • inp: Input array containing token indices.
  • wte: Token embeddings matrix.
    • Each row corresponds to a unique token (word or subword) in the model's vocabulary.
  • wpe: Positional embeddings matrix.
    • Captures the position of each token within the input sequence.
  • B, T, C: Dimensions representing batch size, sequence length, and number of channels (embedding dimension), respectively.
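
For intuition about what this kernel parallelizes, a plain-C reference of the same computation (in the spirit of the loop version in train_gpt2.c; details there may differ) would look like:

void encoder_forward_cpu(float* out, int* inp, float* wte, float* wpe,
                         int B, int T, int C) {
    for (int b = 0; b < B; b++) {
        for (int t = 0; t < T; t++) {
            float* out_bt = out + b * T * C + t * C;  // out[b, t, :]
            int ix = inp[b * T + t];                  // token id at position (b, t)
            float* wte_ix = wte + ix * C;             // row ix of the token embedding matrix
            float* wpe_t = wpe + t * C;               // row t of the position embedding matrix
            for (int c = 0; c < C; c++) {
                out_bt[c] = wte_ix[c] + wpe_t[c];     // token embedding + position embedding
            }
        }
    }
}

The CUDA kernel flattens these three loops into a single index space of B * T * C threads, one per output element.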

Breakdown

  • Kernel declaration: __global__ indicates it’s a CUDA kernel, called from the host (CPU) and executed on the device (GPU).
  • Thread index calculation:
    • idx = blockIdx.x * blockDim.x + threadIdx.x;: Calculates the global index of the thread across blocks.
  • Dimension calculations:
    • N = B * T * C;: Total number of elements to process; the launch uses one thread per element.
    • int bt = idx / C; int b = bt / T; int t = bt % T; int c = idx % C;: Recover the (batch, time, channel) coordinates from the flat index, since out is laid out as a contiguous [B, T, C] array.
  • Core computation:
    • int ix = inp[b * T + t];: Fetches the index from the input array to select which embeddings to use.
    • float* out_btc = out + b * T * C + t * C + c;: Computes the pointer to the output location.
    • float* wte_ix = wte + ix * C + c;: Computes the pointer to the selected token embedding.
    • float* wpe_tc = wpe + t * C + c;: Computes the pointer to the position embedding.
    • *out_btc = *wte_ix + *wpe_tc;: Performs the addition of token and position embeddings and stores the result in the output array.
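
The kernel does not choose its own launch configuration; a host-side wrapper does. A sketch of such a launcher (the block size of 256 and the synchronization call are illustrative assumptions, not taken verbatim from llm.c):

void encoder_forward(float* out, int* inp, float* wte, float* wpe,
                     int B, int T, int C) {
    int N = B * T * C;                                  // one thread per output element
    const int block_size = 256;                         // illustrative choice of threads per block
    int grid_size = (N + block_size - 1) / block_size;  // ceil(N / block_size) blocks cover all elements
    encoder_forward_kernel2<<<grid_size, block_size>>>(out, inp, wte, wpe, B, T, C);
    cudaDeviceSynchronize();                            // real code would also check cudaGetLastError()
}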