| title | author | date | lang |
|---|---|---|---|
| Introduction to GPU Programming | CSC/ENCCS Training | 2021-11 | en |
**CPU**

- General purpose
- Good for serial processing
- Great for task parallelism
- Low latency per thread
- Large area dedicated to caches and control logic

**GPU**

- Highly specialized for parallelism
- Good for parallel processing
- Great for data parallelism
- High throughput
- Hundreds of floating-point execution units
Source: selkie.macalester.edu/csinparallel/modules/GPUProgramming/build/html/Introduction/Introduction.html
- CPU (host) and GPU (device) codes are mixed
- all calls are made from the host
- separate address spaces
- host allocates the memory
- host handles the memory transfers between CPU and GPU
- control is returned to the host after a kernel call
- kernels are executed sequentially
- Threads are executed on scalar processors
- Blocks are executed on multiprocessors
- Several blocks can reside on one multiprocessor (limited by the local resources)
- A kernel is executed as a grid of thread blocks
- Only one kernel is executed on a device at one time
- Warps (NVIDIA) or waves (AMD) of 32 (64) parallel threads
- Consecutive, increasing thread IDs
- All executing one common instruction at a time
- Conditional branches within a warp are executed serially, as sketched after this list
- Memory accesses are per warp (wave)
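Because all threads of a warp execute one common instruction at a time, a branch that splits a warp forces both paths to run one after the other, with the inactive threads masked off in each pass. A minimal CUDA sketch (hypothetical kernel, for illustration only):

```c
// Hypothetical kernel: even and odd lanes of a warp take different branches,
// so the warp executes the two paths serially.
__global__ void divergentBranch(float *out, const float *in, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even lanes run this path first ...
        else
            out[i] = in[i] + 1.0f;   // ... then odd lanes run this one
    }
}
```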
## CUDA C

```c
...
int *a_d, *b_d, *c_d;

// Allocate device memory
cudaMalloc((void **)&a_d, Nbytes);
cudaMalloc((void **)&b_d, Nbytes);
cudaMalloc((void **)&c_d, Nbytes);

// Copy inputs from host to device
cudaMemcpy(a_d, a, Nbytes, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b, Nbytes, cudaMemcpyHostToDevice);

// Launch the kernel, then wait for the device to finish
vecAdd<<<gridSize, blockSize>>>(a_d, b_d, c_d, N);
cudaDeviceSynchronize();
```
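The launch configuration must cover all N elements. A common way to compute it (a sketch; the block size of 256 is just an illustrative choice):

```c
int blockSize = 256;                              // threads per block (illustrative)
int gridSize  = (N + blockSize - 1) / blockSize;  // round up so every element gets a thread
```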
## HIP

```c
...
int *a_d, *b_d, *c_d;

// Allocate device memory
hipMalloc((void **)&a_d, Nbytes);
hipMalloc((void **)&b_d, Nbytes);
hipMalloc((void **)&c_d, Nbytes);

// Copy inputs from host to device
hipMemcpy(a_d, a, Nbytes, hipMemcpyHostToDevice);
hipMemcpy(b_d, b, Nbytes, hipMemcpyHostToDevice);

// Launch the kernel, then wait for the device to finish
hipLaunchKernelGGL(vecAdd,
                   dim3(gridSize), dim3(blockSize),
                   0, 0,
                   a_d, b_d, c_d, N);
hipDeviceSynchronize();
```
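In `hipLaunchKernelGGL`, the two extra launch arguments (`0, 0`) are the dynamic shared memory size in bytes and the stream to launch into (0 selects the default stream). The same `__global__` kernel definition serves both the CUDA and the HIP version: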
```c
// Each thread computes one element of the result
__global__ void vecAdd(int *a_d, int *b_d, int *c_d, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < N)  // guard threads that fall outside the array
    {
        c_d[i] = a_d[i] + b_d[i];
    }
}
```
- *Registers*: The fastest form of memory. Accessible only by the thread that owns them
- *Shared memory*: Almost as fast as registers. Visible to any thread within the same block
- *Global memory*: ~150x slower than registers/shared memory. Accessible from any thread or from the host
- Memory with a special access pattern; heavily cached on chip
- Memory transactions are done in contiguous blocks of 32B, 64B, or 128B (see the sketch after this list)
- The address of the first element is aligned to 16x the size of the first element
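Because transactions fetch contiguous 32B/64B/128B blocks, consecutive threads of a warp should access consecutive addresses. A sketch (hypothetical kernels) contrasting the two patterns:

```c
// Coalesced: thread i reads element i, so a warp's accesses
// fall into a few contiguous memory transactions.
__global__ void coalesced(float *out, const float *in, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) out[i] = in[i];
}

// Strided: consecutive threads touch addresses far apart,
// so many more transactions are needed for the same warp.
__global__ void strided(float *out, const float *in, int N, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < N) out[i] = in[i];
}
```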
- Shared memory is divided into banks (each bank can serve only one access per cycle)
- Parallel access: multiple addresses accessed over multiple banks
- Serial access: multiple addresses in the same bank
- Broadcast access: a single address read in a single bank (by the whole warp)
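A minimal sketch of shared memory in use (hypothetical kernel; assumes N is a multiple of the block size for simplicity). Each thread stages one element into the block's shared array, the block synchronizes, and each thread then reads an element written by another thread; consecutive threads hit consecutive banks, so the accesses proceed in parallel:

```c
#define BLOCK 256

// Hypothetical kernel: reverse the elements within each block.
__global__ void blockReverse(float *out, const float *in, int N)
{
    __shared__ float tile[BLOCK];   // one element per thread, spread across banks

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];      // stage data in shared memory
    __syncthreads();                // wait until the whole block has written

    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // read an element another thread wrote
}
```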
- Data movement appears more transparent to the application
- Creates a pool of managed memory
- Each allocation is accessible on both the CPU and GPU with the same pointer
- System automatically migrates data between the host and device, as needed
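A minimal managed memory sketch in CUDA (`hipMallocManaged` is the HIP counterpart); the `scale` kernel and the launch configuration are hypothetical. One allocation is used directly from both host and device code, with no explicit copies:

```c
// Hypothetical kernel used only to illustrate device-side access
__global__ void scale(float *x, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) x[i] *= 2.0f;
}

...
float *x;
cudaMallocManaged((void **)&x, N * sizeof(float));  // one pool, same pointer on CPU and GPU

for (int i = 0; i < N; i++) x[i] = 1.0f;            // initialize on the host, no cudaMemcpy

scale<<<gridSize, blockSize>>>(x, N);               // device reads/writes the same pointer
cudaDeviceSynchronize();                            // synchronize before the host touches x

printf("%f\n", x[0]);                               // pages migrate back on demand
cudaFree(x);
```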
- *Stream*: a sequence of asynchronous GPU operations that execute on a device in the order issued by the host code
- Operations within a stream are guaranteed to execute in the prescribed order
- Operations in different streams may run concurrently or be interleaved (see the sketch below)
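A sketch of two streams in CUDA (the buffers `x_d`, `y_d`, `z_d`, `x` and the reuse of `vecAdd` are hypothetical). Work issued to `s1` and `s2` may overlap, while order within each stream is preserved; for the copies to actually run asynchronously, the host buffers must be page-locked (e.g. allocated with `cudaMallocHost`):

```c
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// Each stream gets its own copy + kernel; the two sequences may overlap.
cudaMemcpyAsync(a_d, a, Nbytes, cudaMemcpyHostToDevice, s1);
vecAdd<<<gridSize, blockSize, 0, s1>>>(a_d, b_d, c_d, N);

cudaMemcpyAsync(x_d, x, Nbytes, cudaMemcpyHostToDevice, s2);
vecAdd<<<gridSize, blockSize, 0, s2>>>(x_d, y_d, z_d, N);

cudaDeviceSynchronize();   // wait for all streams to finish
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);
```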