2025.05.07 - #34 - NVIDIA/AMD/Google/Furiosa GPU NPU TPU review, Differentiable rendering review

# NVIDIA GPU, Google TPU, AMD GPU, Furiosa NPU comparison

## NVIDIA GPU (A100, H100, H200, B200)

- A good resource for studying: https://comsys-pim.tistory.com/6

### Streaming multiprocessors (SM)
- CUDA core (FP32/FP64) + Tensor core (matrix math) 
- A100: 108 SMs -> FP64, FP32, FP16, BF16, INT8, INT4 matrix ops
- H100: 132 SMs -> FP64, FP32, FP16, BF16, **_FP8_** (transformer engine)
   - 2x the tensor FLOPS of A100
- H200: Better memory and power than H100
- B200: 2 x 132 SMs (dual die, connected by the NV-HBI die-to-die link) -> FP64, FP32, FP16, BF16, FP8, INT8, INT4, **_FP4_**
   - 2x the tensor FLOPS of H200

![Image](https://github.com/user-attachments/assets/a5456fd8-2aad-43f3-b5dd-c27ca6d32df2)
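The formats listed above differ mainly in how many bits go to the exponent and how many to the mantissa. As a rough illustration, BF16 keeps FP32's 8 exponent bits and only 7 mantissa bits, so a BF16 value can be sketched as the upper 16 bits of an FP32 value. The helper names below are hypothetical, and the sketch assumes finite inputs within FP32 range (NaN and overflow handling are left out):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x, rounding to nearest even.

    Assumes x is finite and representable in FP32; NaN is not handled.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Round to nearest even on the 16 low bits that are about to be dropped.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF

def bf16_bits_to_float(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return value

if __name__ == "__main__":
    for x in (1.0, 3.14159, -2.0):
        b = fp32_to_bf16_bits(x)
        print(f"{x} -> 0x{b:04X} -> {bf16_bits_to_float(b)}")
```

The round trip shows the precision that is lost: 3.14159 comes back as 3.140625, while values such as 1.0 and -2.0 survive exactly.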

### Memory

- A100: 40MB L2 cache, HBM2 (40GB @1.6TB/s, 80GB @2.0TB/s)
- H100: 50MB L2 cache, HBM3 (80GB @3.35TB/s, 94GB @3.9TB/s)
- H200: HBM3e (141 GB @4.8TB/s)
- B200: 100MB L2 cache, 8 HBM3e stacks (192GB @8TB/s)

![Image](https://github.com/user-attachments/assets/545737a7-8702-464b-8d4e-5e8c34b63912)
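One way to read the capacity and bandwidth pairs above: dividing capacity by bandwidth gives a lower bound on the time for one full pass over HBM, which matters for memory-bound workloads that re-read most of their weights on every step. The sketch below uses three of the figures from the list; the function name `hbm_sweep_ms` is hypothetical, and peak bandwidth is assumed to be fully achievable, which real kernels rarely manage.

```python
# (capacity in GB, peak bandwidth in TB/s), taken from the list above.
HBM_SPECS = {
    "A100 80GB": (80, 2.0),
    "H100 80GB": (80, 3.35),
    "H200": (141, 4.8),
}

def hbm_sweep_ms(capacity_gb: float, bandwidth_tb_s: float) -> float:
    """Minimum time in milliseconds to read the whole HBM once.

    GB / (TB/s) = 1e9 B / (1e12 B/s) = 1e-3 s, so the ratio is already in ms.
    """
    return capacity_gb / bandwidth_tb_s

if __name__ == "__main__":
    for name, (cap, bw) in HBM_SPECS.items():
        print(f"{name}: {hbm_sweep_ms(cap, bw):.1f} ms per full HBM pass")
```

Note that H200 takes longer per full pass than H100 even though it is faster, because its capacity grew more than its bandwidth.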


## AMD GPU (MI250, MI300)

- GPU-based, but focused purely on compute

### Compute Dies

- MI250X (2021 - CDNA2)
   - 2 GPU chiplets - 2 x 110 compute units (Graphics Compute Die - GCD), optimized for FP64 and matrix math
   - Matrix cores - FP16, BF16
- MI300 (2023 - CDNA3)
   - 8 GPU chiplets - 304 compute units (Accelerator Complex Dies - XCDs)
   - Matrix cores - FP16, BF16, INT8, FP8
   - Comparable to H100's raw compute

![Image](https://github.com/user-attachments/assets/754b5949-b94c-4cbf-9f2e-81cb9a6a5ba4)

![Image](https://github.com/user-attachments/assets/5e4b5192-48dd-4915-8fa4-85d9b74288d7)

### Memory

- MI250X: 8MB L2 cache per GCD, HBM2e (128GB @3.2TB/s)
- MI300X: 256MB AMD Infinity Cache, HBM3 (192GB @5.3TB/s)

## Google TPU (v4, v5e)

- Systolic array core 

![Image](https://github.com/user-attachments/assets/577a8e7e-d639-4d4a-a669-cbd79f7a5846)
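To make the systolic array idea concrete, here is a minimal cycle-by-cycle simulation of a small array computing a matrix product. It is an output-stationary variant chosen for readability: each processing element (PE) keeps one output value, rows of `A` stream in from the left and columns of `B` from the top, each skewed by one cycle per row or column. This is an illustrative sketch, not a description of how the TPU's matrix unit actually schedules data, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m systolic array."""
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A operand currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B operand currently held by each PE

    # The last PE (n-1, m-1) sees its final operands at cycle n + m + k - 3.
    for t in range(n + m + k_dim - 2):
        # Walk from the far corner so each PE reads its neighbour's old value.
        for i in reversed(range(n)):
            for j in reversed(range(m)):
                if j > 0:
                    a_in = a_reg[i][j - 1]           # from the left neighbour
                else:
                    k = t - i                        # row i is delayed by i cycles
                    a_in = A[i][k] if 0 <= k < k_dim else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]           # from the upper neighbour
                else:
                    k = t - j                        # column j is delayed by j cycles
                    b_in = B[k][j] if 0 <= k < k_dim else 0
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                acc[i][j] += a_in * b_in             # multiply-accumulate in place
    return acc

if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The point of the structure is that operands are fetched from memory once at the array edge and then reused as they pass from PE to PE, so no PE needs its own memory access per multiply.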

### Processors (TensorCores - large matrix multiply units, SparseCores - Sparse computation)

- TPU v4: 2 TensorCores per chip -> BF16, INT8 (275 TFLOPS per chip for BF16!)
- TPU v5e: 1 TensorCore per chip -> BF16, INT8 (197 TFLOPS for BF16, 393 INT8 TOPS)
- Both use SparseCores, which are specialized for embedding lookups and significantly speed up recommendation and language models with large embedding tables.
- Google uses BF16 for training and INT8 for inference; v4 and v5e do NOT support FP8.

### Memory

- TPU v4: HBM2 (32GB @1.2 TB/s), Local SRAM 64MB
- TPU v5e: HBM2e (16GB @819 GB/s)
- TPU v5p: HBM3 (95GB @2.8 TB/s)
- TPU v7 (Ironwood): HBM3 (192GB @7.4 TB/s)
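Combining the peak BF16 throughput from the Processors section with the HBM bandwidth above gives the roofline "ridge point": the arithmetic intensity (FLOPs per byte moved from HBM) above which a kernel becomes compute-bound rather than memory-bound. The sketch below uses only the v4 and v5e figures quoted in these notes; the function names are hypothetical, and peak numbers are assumed to be achievable.

```python
def ridge_point(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """FLOPs per byte needed to saturate compute instead of HBM bandwidth."""
    return peak_tflops / bandwidth_tb_s

def attainable_tflops(intensity: float, peak_tflops: float, bandwidth_tb_s: float) -> float:
    """Roofline bound: min(compute peak, intensity x memory bandwidth)."""
    return min(peak_tflops, intensity * bandwidth_tb_s)

if __name__ == "__main__":
    # (peak BF16 TFLOPS, HBM bandwidth in TB/s) as quoted in these notes.
    chips = {"TPU v4": (275, 1.2), "TPU v5e": (197, 0.819)}
    for name, (peak, bw) in chips.items():
        print(f"{name}: ridge point ~{ridge_point(peak, bw):.0f} FLOP/byte")
```

Both chips land at roughly 230 to 240 FLOP/byte, so a kernel at 100 FLOP/byte would be limited by HBM bandwidth on either one.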

### Architecture

- Scale up: 
   - TPU v4: 4096 chips in a 'pod'
      - The image below reportedly shows 1/8 of a single 'pod'
      - Chips are arranged in 4x4x4 blocks -> 3D torus network
   - TPU v5e: 256 chips in a 'pod' for smaller deployment
   - TPU v7 (Ironwood): 256 chips in a 'pod', or 9216 chips in a 'pod'


![](https://storage.googleapis.com/gweb-cloudblog-publish/images/1_Cloud_TPU_v4.max-1100x1100.jpg)

![Image](https://github.com/user-attachments/assets/73eb1223-b61a-4758-8ff4-62a24cfaff78)

![Image](https://github.com/user-attachments/assets/039fe726-f049-4480-8edb-05ec706f16ee)
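The 4x4x4 3D torus mentioned above can be sketched in a few lines: every chip has six neighbours (plus and minus one along each axis), and links wrap around at the edges. The code below treats a single 4x4x4 block as a self-contained torus, which is a simplifying assumption rather than a description of how a full pod is wired; the function names are hypothetical.

```python
def torus_neighbors(coord, shape=(4, 4, 4)):
    """Return the six wraparound neighbours of a chip at (x, y, z)."""
    neighbors = []
    for axis in range(3):
        for step in (-1, 1):
            n = list(coord)
            n[axis] = (n[axis] + step) % shape[axis]  # wrap at the edges
            neighbors.append(tuple(n))
    return neighbors

def torus_distance(a, b, shape=(4, 4, 4)):
    """Minimum hop count between two chips, using wraparound links."""
    hops = 0
    for x, y, size in zip(a, b, shape):
        d = abs(x - y)
        hops += min(d, size - d)  # go the short way around each ring
    return hops

if __name__ == "__main__":
    print(torus_neighbors((0, 0, 0)))
    print(torus_distance((0, 0, 0), (3, 3, 3)))
```

Because of the wraparound, opposite corners of the block are only 3 hops apart, and the worst case within a 4x4x4 block is 6 hops.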

## Furiosa NPU

### Processor (Tensor Contraction Processors - TCP)

- Warboy: 2 TCP cores, FP32, FP16, BF16, FP8, INT8
   - 32 TOPS INT8, 4 TFLOPS FP16
- RNGD: 8 TCP cores
   - 32 TFLOPS BF16 and 64 TFLOPS FP8 per core -> 256 TFLOPS BF16, 512 TFLOPS FP8, and 1024 TOPS INT4 per chip
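The per-chip RNGD figures follow directly from multiplying the per-core numbers by the core count. A small sketch of that arithmetic, using only the BF16 and FP8 values quoted above (the names are hypothetical, and linear scaling across cores is assumed):

```python
RNGD_CORES = 8
RNGD_PER_CORE_TFLOPS = {"BF16": 32, "FP8": 64}  # per TCP core, from the notes above

def chip_peak(per_core: dict, cores: int) -> dict:
    """Scale per-core peak throughput to the whole chip, assuming linear scaling."""
    return {fmt: tflops * cores for fmt, tflops in per_core.items()}

if __name__ == "__main__":
    print(chip_peak(RNGD_PER_CORE_TFLOPS, RNGD_CORES))
```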

### Memory

- Warboy: 16GB LPDDR4X (@66GB/s), 32MB on-chip SRAM
- RNGD: 48GB HBM3 (@1.5TB/s), 256MB on-chip SRAM
