* Revised estimation of batch count, retrieving it directly from len(train_dataloader) (see the sketch after this list).
* Deleted unused timer_handle argument in Trainer.
* Revised handling of the "max_seq_len" override in benchmarking.
* Added support for automatic switching between the lora and full-rank sharding schemes in benchmarking.
* Revised handling of unspecified max_seq_length.
* Added llama-3 to the benchmark model_list.
* Benchmarking: revised the benchmark script to ensure a consistent per-device train batch size.
* Benchmarking: replaced trainer.step with trainer.train_step to avoid eval overhead in benchmarking.
* Revised benchmark parsing logic; display the optimal batch size for each context width value.
* Benchmarking: updated reference throughput based on the updated logic.
* Benchmarking: updated reference throughput descriptions.
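As a rough illustration of the batch-count and trainer.train_step changes above, here is a minimal sketch assuming a PyTorch DataLoader and a Trainer-like object with a train_step method. The Trainer class below is a stand-in, not VectorLM's actual Trainer.

```python
"""Minimal sketch (not the actual VectorLM implementation) of two of the
changes above: taking the batch count directly from len(train_dataloader)
and running bare train steps without evaluation overhead. The Trainer class
below is a stand-in; VectorLM's real Trainer interface may differ."""
import torch
from torch.utils.data import DataLoader, TensorDataset


class Trainer:
    """Stand-in trainer exposing only the benchmarking-relevant method."""

    def train_step(self, batch: torch.Tensor) -> None:
        # The forward/backward pass and optimizer update would happen here.
        pass


train_dataloader = DataLoader(
    TensorDataset(torch.arange(1024).unsqueeze(1)), batch_size=8
)

# Batch count is read directly from the dataloader instead of being
# re-derived from dataset length and batch size.
num_batches = len(train_dataloader)

trainer = Trainer()
for (batch,) in train_dataloader:
    # Calling train_step (rather than a combined step that may also run
    # evaluation) keeps eval overhead out of throughput measurements.
    trainer.train_step(batch)
```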
We've benchmarked VectorLM on the Vaughan cluster for a number of model architectures across a variety of node configurations.
In experiments labelled as LoRA, we set the hidden dimension to 8. Below are the version numbers of the testing environment:
```bash
$ pip3 freeze | grep -E "(torch|flash-attn|nvidia)"
flash-attn==2.5.8
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-ml-py==12.550.52
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
torch==2.2.1
```
Entries that read NaN represent combinations where the node configuration does not have enough GPU memory for the training run to complete. An exception is gemma-2b, which currently does not support full-rank FSDP fine-tuning.

For each context width and hardware configuration, we experiment with per-device batch sizes of 2, 4, and 8. In the table below, we report the batch size that maximizes training throughput. All values in the table represent the median training throughput in tokens/second across all training steps, aggregated across all GPU devices; a sketch of this aggregation follows the table.
| (lora) Tesla T4 x1 | nan | nan | nan | nan | 148.272 | 571.471 |
| (lora) Tesla T4 x2 | nan | 101.126 | 102.859 | nan | 256.534 | 811.159 |
| (lora) Tesla T4 x4 | nan | 188.575 | 190.127 | nan | 495.755 | 1506.05 |
| (lora) Tesla T4 x8 | 196.709 | 372.375 | 351.361 | nan | 897.81 | 2945.86 |
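To make the reported numbers concrete, below is a minimal sketch of one plausible way to aggregate per-step measurements into the figure above: per-device tokens/second are summed across GPUs at each step, and the median over steps is taken. The per-step values here are made up, and the real profiler output format may differ.

```python
"""Sketch of the throughput aggregation described above, using made-up
per-step measurements; the real profiler output format differs."""
from statistics import median

# (tokens processed, wall-clock seconds) per training step, for each GPU.
per_device_steps = {
    "cuda:0": [(8192, 1.10), (8192, 1.08), (8192, 1.12)],
    "cuda:1": [(8192, 1.11), (8192, 1.09), (8192, 1.13)],
}

num_steps = len(next(iter(per_device_steps.values())))

# For each step, sum tokens/second over all devices, then take the median
# across steps -- one plausible reading of the aggregation described above.
step_throughputs = [
    sum(tokens / seconds
        for tokens, seconds in (steps[i] for steps in per_device_steps.values()))
    for i in range(num_steps)
]
print(f"median aggregated throughput: {median(step_throughputs):.1f} tokens/s")
```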
We provide tools for evaluating throughput across different context windows and hardware/model configurations. Refer to the `profiling` folder in this repository to get started.
In `profiling/README.md`:

```bash
$ python3 launch_benchmark.py
# to accept and automatically invoke the commands.
```
After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following to generate a Markdown summary of the results. If the benchmark results include multiple batch sizes for a given (model, context window, hardware) combination, the table lists the "optimal" batch size, i.e., the one that achieves the highest training throughput for that combination.
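The summary command itself is not shown in this excerpt. As a rough illustration of the batch-size selection described above, the sketch below groups hypothetical benchmark records by (model, context window, hardware) and keeps the highest-throughput batch size per combination; record fields and values are placeholders, not the repository's actual output format.

```python
"""Sketch of picking the "optimal" batch size per configuration; the records
below are placeholders, not real benchmark results."""
records = [
    {"model": "model-a", "context": 2048, "hardware": "gpu-x4",
     "batch_size": 4, "throughput": 1000.0},
    {"model": "model-a", "context": 2048, "hardware": "gpu-x4",
     "batch_size": 8, "throughput": 1200.0},
]

best = {}
for rec in records:
    key = (rec["model"], rec["context"], rec["hardware"])
    # Keep only the highest-throughput record for each combination.
    if key not in best or rec["throughput"] > best[key]["throughput"]:
        best[key] = rec

for (model, context, hardware), rec in best.items():
    print(f"{model} | context {context} | {hardware} -> "
          f"batch size {rec['batch_size']} ({rec['throughput']:.1f} tokens/s)")
```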