
Awesome-GPU

Architecture
Algorithms
- BLAS
- Stencils
- Scans
Applications
- Deep Learning
Tools
Runtime
- Scheduling
Code Generation

Architecture

Resource Management

TECS'21-Reducing Energy in GPGPUs through Approximate Trivial Bypassing
ASPLOS'17-Locality-Aware CTA Clustering for Modern GPUs
ASPLOS'17-Dynamic Resource Management for Efficient Utilization of Multitasking GPUs
HPCA'17-Dynamic GPGPU Power Management Using Adaptive Model Predictive Control
ISCA'16-Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems

Parallelism

HPCA'18-Accelerate GPU Concurrent Kernel Execution by Mitigating Memory Pipeline Stalls
HPCA'17-Controlled Kernel Launch for Dynamic Parallelism in GPUs
GTC'17-COOPERATIVE GROUPS
ISCA'16-LaPerm: Locality Aware Scheduler for Dynamic Parallelism on GPUs
ISCA'16-Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit
Berkeley TechRpts'16-Understanding Latency Hiding on GPUs

Cache

ISCA'16-APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs
SC'15-Adaptive and Transparent Cache Bypassing for GPUs

Memory

White Papers

NVIDIA Hopper-NVIDIA H100 Tensor Core GPU Architecture
NVIDIA Ampere-NVIDIA A100 Tensor Core GPU Architecture
NVIDIA Turing-NVIDIA TURING GPU ARCHITECTURE
NVIDIA Volta-NVIDIA TESLA V100
NVIDIA Pascal-NVIDIA TESLA P100
NVIDIA Kepler-NVIDIA’s Next Generation CUDA Compute Architecture: Kepler
NVIDIA Fermi-NVIDIA’s Next Generation CUDA Compute Architecture: Fermi
AMD CDNA 2-INTRODUCING AMD CDNA 2 ARCHITECTURE
AMD CDNA-INTRODUCING AMD CDNA ARCHITECTURE

Algorithms

BLAS

GTC'20-DEVELOPING CUDA KERNELS TO PUSH TENSOR CORES TO THE ABSOLUTE LIMIT ON NVIDIA A100
IPDPS'20-Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
PPoPP'19-A Coordinated Tiling and Batching Framework for Efficient GEMM on GPU
GTC'18-CUTLASS: CUDA TEMPLATE LIBRARY FOR DENSE LINEAR ALGEBRA AT ALL LEVELS AND SCALES

Stencils

CGO'20-AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs
IPDPS'20-On Optimizing Complex Stencils on GPUs
PPoPP'18-Register Optimizations for Stencils on GPUs

Scans

NVResearch TechRpts'16-Single-pass Parallel Prefix Scan with Decoupled Look-back

Applications

Deep Learning

PPoPP'21-Understanding and bridging the gaps in current GNN performance optimizations
SC'21-E.T.: re-thinking self-attention for transformer models on GPUs
OSDI'21-GNNAdvisor: An Adaptive and Efficient Runtime System for GNN Acceleration on GPUs
SC'20-Sparse GPU Kernels for Deep Learning
PPoPP'18-SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks
HPCA'17-Towards Pervasive and User Satisfactory CNN across GPU Microarchitectures

Tools

Benchmarking

GTC'18-Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking
ISPASS'10-Demystifying GPU Microarchitecture through Microbenchmarking

Models

PMBS'19-Instruction Roofline: An insightful visual performance model for GPUs
ECP'19-Performance Tuning of Scientific Codes with the Roofline Model
GTC'18-VOLTA Architecture and performance optimization
Synthesis Lectures on Computer Architecture'12-Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
SC'10-Fundamental Optimizations

Simulators

ISPASS'10-Visualizing Complex Dynamics in Many-Core Accelerator Architectures
ISPASS'09-Analyzing CUDA Workloads Using a Detailed GPU Simulator

Profilers

PLDI'18-GPU Code Optimization using Abstract Kernel Emulation and Sensitivity Analysis
CGO'18-CUDAAdvisor: LLVM-based runtime profiling for modern GPUs
CCGRID'18-Exposing Hidden Performance Opportunities in High Performance GPU Applications
THPC'16-Monitoring Heterogeneous Applications with the OpenMP Tools Interface
Euro-Par'15-Identifying Optimization Opportunities Within Kernel Execution in GPU Codes
SC'13-Effective sampling-driven performance tools for GPU-accelerated supercomputers
ISPASS'12-Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures
ICPP'11-Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs
Vampir|Score-P
TAU
PAPI
Allinea MAP
Open|SpeedShop
HPCToolkit
NVIDIA Nsight Systems
NVIDIA Nsight Compute
SASSI
NVBit

Runtime

Scheduling

PPoPP'22-CASE: A Compiler-Assisted SchEduling Framework for Multi-GPU Systems
TPDS'20-cCUDA: Effective Co-Scheduling of Concurrent Kernels on GPUs

Code Generation

Compilers

AMD'21-Generating GPU Compiler Heuristics using Reinforcement Learning
TACO'21-Domain-Specific Multi-Level IR Rewriting for GPU: The Open Earth Compiler for GPU-accelerated Climate Simulation
LLVM'17-Implementing implicit OpenMP data sharing on GPUs
CGO'16-gpucc: An Open-Source GPGPU Compiler
LLVM'16-Offloading Support for OpenMP in Clang and LLVM
PMBS'15-Performance Analysis of OpenMP on a GPU using a CORAL Proxy Application
LLVM'15-Integrating GPU Support for OpenMP Offloading Directives into Clang
LLVM'14-Coordinating GPU Threads for OpenMP 4.0 in LLVM

Programming Models

CGO'21-C-for-Metal: High Performance SIMD Programming on Intel GPUs
ECRTS'19-Novel Methodologies for Predictable CPU-To-GPU Command Offloading
ASPLOS'14-Paraprox: Pattern-Based Approximation for Data Parallel Applications

Profile Guided Optimization

Geometry and Optimization'21-Cooperative Profile Guided Optimizations
IPDPS'13-Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)

Binaries

CGO'19-Decoding CUDA binary
ISCA'15-Flexible software profiling of GPU architectures

About

Awesome resources for GPUs

BSD-3-Clause license
