
Commit 9045f08

Add revised benchmarking logic and results (#9)
* Revised estimation of batch count, retrieving it directly from len(train_dataloader). Deleted the unused timer_handle argument in Trainer. Revised handling of the "max_seq_len" override in benchmarking. Added support for automatic switching between the LoRA and full-rank sharding schemes in benchmarking.
* Revised handling of an unspecified max_seq_length. Added Llama-3 to the benchmark model_list.
* Benchmarking: revised the benchmark script to ensure a consistent per-device train batch size.
* Benchmarking: replaced trainer.step with trainer.train_step to avoid eval overhead in benchmarking. Revised the benchmark parsing logic; display the optimal batch size for each context-width value.
* Benchmarking: updated reference throughput based on the updated logic.
* Benchmarking: updated the reference throughput descriptions.
1 parent ce1eaa3 commit 9045f08

9 files changed: +138 −81 lines

docs/reference_throughput.md

+29 −26
@@ -1,33 +1,36 @@
  # Reference Throughput

  We've benchmarked VectorLM on the Vaughan cluster for a number of model architectures across a variety of node configurations.
- In experiments labelled as LoRA, we set hidden dimension to 8. During the testing, the NVIDIA driver version was 525.105.17, CUDA Runtime 12.1.105, and torch 2.2.2.
+ In experiments labelled as LoRA, we set hidden dimension to 8. Below are version numbers of the testing environment:

- For consistency, we use a batch size of 8 and the maximum context length that the pre-trained LLM supports, capped at 65536. Note that especially for smaller models, it might be possible to further increase throughput by switching to a larger batch size.
+ ```bash
+ $ pip3 freeze|grep -E "(torch|flash-attn|nvidia)"
+ flash-attn==2.5.8
+ nvidia-cublas-cu12==12.1.3.1
+ nvidia-cuda-cupti-cu12==12.1.105
+ nvidia-cuda-nvrtc-cu12==12.1.105
+ nvidia-cuda-runtime-cu12==12.1.105
+ nvidia-cudnn-cu12==8.9.2.26
+ nvidia-cufft-cu12==11.0.2.54
+ nvidia-curand-cu12==10.3.2.106
+ nvidia-cusolver-cu12==11.4.5.107
+ nvidia-cusparse-cu12==12.1.0.106
+ nvidia-ml-py==12.550.52
+ nvidia-nccl-cu12==2.19.3
+ nvidia-nvjitlink-cu12==12.3.101
+ nvidia-nvtx-cu12==12.1.105
+ torch==2.2.1
+ ```

- Entries that read NaN represent combinations where the node configuration does not have enough GPU memory for the training run to complete. An exception is gemma-2b, which currently does not support full-rank FSDP fine-tuning.
+ For each context width and hardware configuration, we experiment with a per-device batch size of 2, 4, and 8. In the table below, we report the batch size that maximizes training throughput. All values in the table represent the median training throughput in tokens/second across all training steps, aggregated across all GPU devices.

- All values in the table below represent the median training throughput in tokens per second across all training steps, aggregated across all GPU devices.
+ |                                      | Meta-Llama-3-8B (2048) | Meta-Llama-3-8B (4096) | Meta-Llama-3-8B (8192) |
+ | :----------------------------------- | :--------------------- | :--------------------- | :--------------------- |
+ | (full_rank) NVIDIA A100-SXM4-80GB x1 | 3550.48 (batch: 8) | 3461.64 (batch: 4) | 3204.21 (batch: 2) |
+ | (full_rank) NVIDIA A100-SXM4-80GB x2 | 6346.00 (batch: 8) | 6182.59 (batch: 4) | 5772.91 (batch: 2) |
+ | (full_rank) NVIDIA A100-SXM4-80GB x4 | 12688.44 (batch: 8) | 12249.74 (batch: 4) | 11463.46 (batch: 2) |
+ | (lora) NVIDIA A100-SXM4-80GB x1 | 4079.28 (batch: 8) | 3682.15 (batch: 4) | 3528.93 (batch: 2) |
+ | (lora) NVIDIA A100-SXM4-80GB x2 | 7182.97 (batch: 8) | 6955.58 (batch: 4) | 6452.96 (batch: 2) |
+ | (lora) NVIDIA A100-SXM4-80GB x4 | 14299.47 (batch: 8) | 13834.43 (batch: 4) | 12769.23 (batch: 2) |

- | | Llama-2-13b-hf | Llama-2-7b-hf | Mistral-7B-v0.1 | Mixtral-8x7B-Instruct-v0.1 | gemma-2b | opt-350m |
- | :----------------------------------- | -------------: | ------------: | --------------: | -------------------------: | -------: | -------: |
- | (full_rank) NVIDIA A100-SXM4-80GB x1 | 424.726 | 570.818 | 528.747 | nan | nan | 780.045 |
- | (full_rank) NVIDIA A100-SXM4-80GB x2 | 660.355 | 919.19 | 794.566 | 275.459 | nan | 1227.67 |
- | (full_rank) NVIDIA A100-SXM4-80GB x4 | 1309.4 | 1744.39 | 1577.09 | 817.162 | nan | 2181.46 |
- | (full_rank) NVIDIA A40 x1 | nan | 47.6435 | 107.503 | nan | nan | 666.881 |
- | (full_rank) NVIDIA A40 x2 | nan | 313.074 | 322.624 | nan | nan | 854.672 |
- | (full_rank) NVIDIA A40 x4 | 345.96 | 570.977 | 553.658 | nan | nan | 1765.49 |
- | (full_rank) Tesla T4 x1 | nan | nan | nan | nan | nan | 475.51 |
- | (full_rank) Tesla T4 x2 | nan | nan | nan | nan | nan | 768.008 |
- | (full_rank) Tesla T4 x4 | nan | nan | nan | nan | nan | 1383.6 |
- | (full_rank) Tesla T4 x8 | nan | nan | nan | nan | nan | 2414.68 |
- | (lora) NVIDIA A100-SXM4-80GB x1 | 560.167 | 646.801 | 525.802 | nan | 851.678 | 859.379 |
- | (lora) NVIDIA A100-SXM4-80GB x2 | 871.993 | 1157.17 | 1105.68 | 239.431 | 1724.57 | 1463.82 |
- | (lora) NVIDIA A100-SXM4-80GB x4 | 1783.53 | 2091.03 | 2150.06 | 1309.74 | 2719.24 | 2381.01 |
- | (lora) NVIDIA A40 x1 | 272.931 | 435.386 | 336.507 | nan | 983.256 | 652.611 |
- | (lora) NVIDIA A40 x2 | 105.442 | 457.183 | 356.263 | nan | 725.723 | 1136.17 |
- | (lora) NVIDIA A40 x4 | 543.22 | 715.416 | 642.642 | nan | 1302.62 | 1647.57 |
- | (lora) Tesla T4 x1 | nan | nan | nan | nan | 148.272 | 571.471 |
- | (lora) Tesla T4 x2 | nan | 101.126 | 102.859 | nan | 256.534 | 811.159 |
- | (lora) Tesla T4 x4 | nan | 188.575 | 190.127 | nan | 495.755 | 1506.05 |
- | (lora) Tesla T4 x8 | 196.709 | 372.375 | 351.361 | nan | 897.81 | 2945.86 |
+ We provide tools for evaluating throughput on different context windows and hardware/model configurations. Refer to the profiling folder in this repository to get started.
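As an illustration of how each table entry is obtained, the sketch below computes the median per-step throughput in tokens/second for each run and keeps the per-device batch size whose run is fastest. This is a minimal sketch, not the repository's parsing code; the helper names and the sample numbers are hypothetical.

```python
# Hypothetical sketch: each step record is (tokens processed across all devices, seconds).
from statistics import median


def median_throughput(steps: list[tuple[int, float]]) -> float:
    """Median tokens/second across all training steps of one run."""
    return median(tokens / seconds for tokens, seconds in steps)


def best_batch_size(runs: dict[int, list[tuple[int, float]]]) -> tuple[int, float]:
    """Pick the per-device batch size whose run has the highest median throughput."""
    results = {bs: median_throughput(steps) for bs, steps in runs.items()}
    best = max(results, key=results.get)
    return best, results[best]


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the table above.
    runs = {
        2: [(4096, 1.40), (4096, 1.38), (4096, 1.41)],
        4: [(8192, 2.60), (8192, 2.55), (8192, 2.62)],
        8: [(16384, 4.70), (16384, 4.65), (16384, 4.72)],
    }
    print(best_batch_size(runs))
```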

profiling/README.md

+1 −1
@@ -13,7 +13,7 @@ $ python3 launch_benchmark.py
  # to accept and automatically invoke the commands.
  ```

- After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following the to generate a Markdown summary of the results:
+ After the SLURM jobs complete, profiler output can be found under `data/benchmark`. Invoke the following to generate a Markdown summary of the results. If the benchmark results include multiple batch sizes for each (model, context window, hardware) pair, the table will list the "optimal" batch size associated with the highest training throughput for that combination.

  ```bash
  $ python3 profiling/parse_benchmark.py --folder data/benchmark
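The "optimal" batch size selection described above can be thought of as grouping results by (model, context window, hardware) and keeping the fastest run per group. The sketch below illustrates the idea; the record layout and the numbers are assumptions, not the actual `parse_benchmark.py` implementation.

```python
# Hypothetical records: (model, context_window, hardware, per_device_batch_size, tokens_per_second).
records = [
    ("Meta-Llama-3-8B", 2048, "NVIDIA A100-SXM4-80GB x1", 4, 3400.0),  # illustrative numbers
    ("Meta-Llama-3-8B", 2048, "NVIDIA A100-SXM4-80GB x1", 8, 3500.0),
]

best: dict[tuple[str, int, str], tuple[int, float]] = {}
for model, ctx, hw, batch_size, throughput in records:
    key = (model, ctx, hw)
    # Keep only the highest-throughput batch size per (model, context, hardware) pair.
    if key not in best or throughput > best[key][1]:
        best[key] = (batch_size, throughput)

for (model, ctx, hw), (batch_size, throughput) in best.items():
    print(f"{hw} | {model} ({ctx}): {throughput:.2f} (batch: {batch_size})")
```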

profiling/benchmark.py

+35 −25
@@ -25,7 +25,6 @@
  from vectorlm.utils.model_utils import (
      get_lora_model_from_base_model,
      get_submodule_by_pattern,
-     hook_activation_checkpointing,
      load_model_and_tokenizer,
      shard_model,
  )
@@ -67,7 +66,7 @@ def parse_args() -> Namespace:
          default=1000,
      )
      parser.add_argument("--max_length", type=int)
-     parser.add_argument("--training_batch_size", type=int)
+     parser.add_argument("--per_device_batch_size", type=int)
      return parser.parse_args()


@@ -273,9 +272,26 @@ def load_datasets(self) -> None:

  setup(config.train_parameters.output_dir)

- if args.training_batch_size is not None:
-     config.dataset.train_bs = args.training_batch_size
-     write_metrics("training_batch_size", args.training_batch_size)
+ training_args = config.train_parameters
+
+ # set a seed
+ set_seed(training_args.seed)
+
+ # set CUDA related dependencies
+ local_rank = int(os.environ["LOCAL_RANK"])
+ rank = int(os.environ["RANK"])
+ world_size = int(os.environ["WORLD_SIZE"])
+
+ if args.per_device_batch_size is not None:
+     config.dataset.train_bs = args.per_device_batch_size
+     config.dataset.eval_bs = args.per_device_batch_size
+
+ write_metrics("training_batch_size", config.dataset.train_bs)
+ write_metrics("eval_batch_size", config.dataset.eval_bs)
+ write_metrics(
+     "training_batch_size_global",
+     config.dataset.train_bs * world_size,
+ )

  print(f"Writing metrics to {output_path}")
  write_metrics("model_name", args.model_name)
@@ -291,16 +307,6 @@ def load_datasets(self) -> None:
      repeat=2,
  )

- training_args = config.train_parameters
-
- # set a seed
- set_seed(training_args.seed)
-
- # set CUDA related dependencies
- local_rank = int(os.environ["LOCAL_RANK"])
- rank = int(os.environ["RANK"])
- world_size = int(os.environ["WORLD_SIZE"])
-
  with track_time("dist_init"):
      print(f"Rank: {rank}, World size: {world_size}")
      if dist.is_initialized():
@@ -314,17 +320,18 @@ def load_datasets(self) -> None:

  # load model and tokenizer
  lora_peft_config = config.train_parameters.get("lora_peft_config")
+ is_lora_enabled = lora_peft_config is not None

  with track_time("model_load"):
      model, tokenizer = load_model_and_tokenizer(
          args.model_name,
          training_args.use_mp,
          get_is_flash_attention_supported(),
-         training_args.max_seq_len,
+         args.max_length,
          local_rank,
          training_args.low_cpu_mem_usage,
      )
-     if lora_peft_config is not None:
+     if is_lora_enabled:
          print("Enabling LoRA Wrapper.")
          write_metrics("peft_method", "lora")
          model = get_lora_model_from_base_model(model, lora_peft_config)
@@ -348,12 +355,9 @@ def load_datasets(self) -> None:
          training_args.sharding_strategy,
          local_rank,
          training_args.low_cpu_mem_usage,
+         is_lora_enabled=is_lora_enabled,
      )

- with track_time("set_activation_checkpointing"):
-     if training_args.use_activation_checkpointing:
-         hook_activation_checkpointing(model, decoder_layer_module)
-
  # load dataset
  with track_time("dataset_load"):
      dataset = BenchmarkingDataset(
@@ -364,14 +368,17 @@ def load_datasets(self) -> None:
          max_length=args.max_length,
      )

+ print(
+     f"Sequence length: {dataset.max_length};"
+     f"Batch Size (per device): {config.dataset.train_bs}",
+ )
  write_metrics("max_length", dataset.max_length)

  # instantiate trainer
  trainer = Trainer(
      config=training_args,
      enable_wandb_logging=config.enable_wandb_logging,
      original_dataset_length=dataset.original_length,
-     timer_handle=track_time,
  )

  # load optimizer
@@ -412,15 +419,18 @@ def load_datasets(self) -> None:
  trainer.model.train()
  train_dl_iterator = iter(dataset.train_dataloader)
  for _ in tqdm(
-     range(args.num_train_examples),
+     range(len(dataset.train_dataloader)),
      disable=rank != 0,
      file=sys.__stdout__,
  ):
      batch = next(train_dl_iterator)
      num_tokens = len(batch["input_ids"].flatten())

-     with track_time("train_step", {"num_tokens": num_tokens}):
-         trainer.step(batch, epoch)
+     with track_time(
+         "train_step",
+         {"num_tokens": num_tokens * world_size},
+     ):
+         trainer.train_step(batch, epoch)

      profile_handle.step()
  write_metrics(
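In the revised loop, each `trainer.train_step` call is timed and tagged with the global token count (`num_tokens * world_size`), so throughput can later be computed as tokens divided by elapsed time. The actual `track_time` helper is imported from elsewhere in the repository and is not part of this diff; a minimal context-manager sketch of the same shape might look like the following (the output path and record fields are assumptions).

```python
import json
import time
from contextlib import contextmanager
from typing import Any


@contextmanager
def track_time(name: str, extra: dict[str, Any] | None = None):
    """Record the wall-clock duration of a named block, plus optional metadata."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        record = {"name": name, "seconds": elapsed, **(extra or {})}
        # Append one JSON line per measurement; the path here is illustrative.
        with open("benchmark_timings.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")


# Usage mirroring the benchmark loop: tokens across all devices for one step.
with track_time("train_step", {"num_tokens": 8192 * 4}):
    time.sleep(0.01)  # stand-in for trainer.train_step(batch, epoch)
```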

profiling/configs/benchmark.yaml

−1
@@ -6,7 +6,6 @@ wandb_config:

  train_parameters:
    output_dir: /dev/shm/lora-benchmark
-   max_seq_len: 128
    epochs: 1
    seed: 11

profiling/configs/lora-benchmark.yaml

−1
@@ -6,7 +6,6 @@ wandb_config:

  train_parameters:
    output_dir: /dev/shm/lora-benchmark
-   max_seq_len: 128
    epochs: 1
    seed: 11

profiling/launch_benchmark.py

+13 −11
@@ -22,12 +22,13 @@
  model_list = [
      "/model-weights/" + model_name
      for model_name in [
-         "opt-350m",
-         "gemma-2b",
-         "Llama-2-7b-hf",
-         "Llama-2-13b-hf",
-         "Mistral-7B-v0.1",
-         "Mixtral-8x7B-Instruct-v0.1",
+         # "opt-350m",
+         # "gemma-2b",
+         # "Llama-2-7b-hf",
+         "Meta-Llama-3-8B",
+         # "Llama-2-13b-hf",
+         # "Mistral-7B-v0.1",
+         # "Mixtral-8x7B-Instruct-v0.1",
      ]
  ]

@@ -37,27 +38,28 @@
  ]

  # Set to (-1) to fall back to the max context length of the pre-trained model.
- max_length_list = [1024, 2048, 4096, -1]
- batch_size = [8, 16, 32, 64, 128]
+ max_length_list = [8192, 4096, 2048]
+ # Per-device batch size for training
+ per_device_batch_size = [2, 4, 8]

  slurm_flags_options = {
      "nodes": [1],
      "mem-per-gpu": ["16GB"],
      "ntasks-per-node": [1],
      "cpus-per-gpu": [3],
-     "gres": [f"gpu:{n}" for n in [1, 2, 4, 8]],
+     "gres": [f"gpu:{n}" for n in [4, 2, 1]],
      "partition": partitions,
  }

- num_repeats = 2
+ num_repeats = 1
  slurm_flags_extra = {"time": "01:00:00", "qos": qos_selected}

  slurm_pos_args_options = [
      ["profiling/launch_benchmark.sh"],
      config_list,
      model_list,
      max_length_list,
-     batch_size,
+     per_device_batch_size,
  ]
  timestamp = int(time.time())
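The lists above define the benchmark sweep: one SLURM job per combination of config, model, context length, per-device batch size, and GPU count. The command-generation logic sits further down in `launch_benchmark.py` and is not part of this diff; conceptually it is a Cartesian product over the options, roughly as in the sketch below (the `sbatch` assembly shown here is an assumption, not the script's exact code).

```python
# Hypothetical sketch of expanding the sweep into sbatch commands.
import itertools

slurm_flags_options = {
    "nodes": [1],
    "gres": [f"gpu:{n}" for n in [4, 2, 1]],
}
slurm_pos_args_options = [
    ["profiling/launch_benchmark.sh"],
    ["profiling/configs/benchmark.yaml"],
    ["/model-weights/Meta-Llama-3-8B"],
    [8192, 4096, 2048],  # max_length
    [2, 4, 8],           # per-device batch size
]

flag_names = list(slurm_flags_options)
for flag_values in itertools.product(*slurm_flags_options.values()):
    flags = " ".join(f"--{name}={value}" for name, value in zip(flag_names, flag_values))
    for pos_args in itertools.product(*slurm_pos_args_options):
        # One sbatch invocation per (flags, positional args) combination.
        print(f"sbatch {flags} " + " ".join(str(arg) for arg in pos_args))
```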

profiling/launch_benchmark.sh

+1 −1
@@ -28,7 +28,7 @@ profiling/benchmark.py \
      --yaml_path $1 \
      --model_name $2 \
      --max_length $3 \
-     --training_batch_size $4
+     --per_device_batch_size $4

  # clean up benchmarking artifacts as ops have requested
  rm -rf /dev/shm/lora-benchmark
