Commit c598c81 (1 parent: d74a8c1): 23 changed files with 4,043 additions and 52 deletions.

Commit message:

* add dlrm v2
* fix license issue
* Update dlrm_dataloader.py
* Update dist_models.py
* Update dlrm_dataloader.py
* Update dist_models.py
* refactored to a new folder
* update Readme

Co-authored-by: Mahathi Vatsal <mahathi.vatsal.salopanthula@intel.com>
101 changes: 101 additions & 0 deletions
models_v2/pytorch/torchrec_dlrm/inference/gpu/README.md
# DLRM v2 Inference

DLRM v2 Inference best known configurations with Intel® Extension for PyTorch.

## Model Information

| **Use Case** | **Framework** | **Model Repo** | **Branch/Commit/Tag** | **Optional Patch** |
|:---:|:---:|:---:|:---:|:---:|
| Inference | PyTorch | https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm | - | - |
# Prerequisites

* Host has 4 Intel® Data Center GPU Max Series cards, with two tiles each.
* Host has the latest Intel® Data Center GPU Max Series drivers installed: https://dgpu-docs.intel.com/driver/installation.html
* Host has [Intel® Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) installed.
# Prepare Dataset

After downloading and uncompressing the Criteo 1TB Click Logs dataset (consisting of 24 files, from day 0 to day 23), process the raw tsv files into the proper format for training by running `./scripts/process_Criteo_1TB_Click_Logs_dataset.sh` with the necessary command line arguments.

Example usage:

```
bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
    ./criteo_1tb/raw_input_dataset_dir \
    ./criteo_1tb/temp_intermediate_files_dir \
    ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
```

The script requires 700GB of RAM and takes 1-2 days to run. We currently have features in development to reduce the preprocessing time and memory overhead. MD5 checksums of the expected final preprocessed dataset files are in `md5sums_preprocessed_criteo_click_logs_dataset.txt`.
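The checksum verification can be sketched as follows. This is an illustrative stand-in (a temporary directory and one fake shard file); in practice you would run `md5sum -c` inside the output dataset directory against `md5sums_preprocessed_criteo_click_logs_dataset.txt`:

```shell
# Illustrative checksum workflow with md5sum -c; the file name below is a
# stand-in for a real preprocessed shard.
dir=$(mktemp -d)
printf 'sample' > "$dir/day_0_dense.npy"
( cd "$dir" \
  && md5sum day_0_dense.npy > md5sums.txt \
  && md5sum -c md5sums.txt )   # prints "day_0_dense.npy: OK"
```

`md5sum -c` exits non-zero if any listed file is missing or its hash differs, so the check works cleanly in scripts.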
The final dataset directory will look like this:

```
dataset_dir
|_day_0_dense.npy
|_day_0_labels.npy
|_day_0_sparse_multi_hot.npz
```

This folder will be used as the `DATASET_DIR` parameter later.
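As a quick sanity check for that layout, a small stdlib-only helper (hypothetical, not part of the repo) can report which per-day files are missing:

```python
import os

def missing_day_files(dataset_dir, day):
    """Return the expected per-day files that are absent from dataset_dir.

    File names follow the layout shown above (dense/labels/sparse_multi_hot).
    """
    expected = [
        "day_{}_dense.npy".format(day),
        "day_{}_labels.npy".format(day),
        "day_{}_sparse_multi_hot.npz".format(day),
    ]
    return [f for f in expected
            if not os.path.exists(os.path.join(dataset_dir, f))]
```

An empty return value, e.g. from `missing_day_files(dataset_dir, 0)`, means day 0 is complete.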
Download the pretrained model weights:

```
wget https://cloud.mlcommons.org/index.php/s/XzfSeLgW8FYfR3S/download -O weights.zip
unzip weights.zip
```

The unzipped folder will be used as the `WEIGHT_DIR` parameter later.
## Inference

1. `git clone https://github.com/IntelAI/models.git`
2. `cd models/models_v2/pytorch/torchrec_dlrm/inference/gpu`
3. Run `setup.sh`. This will install all the required dependencies and create the virtual environment `venv`.
4. Activate the virtual environment: `. ./venv/bin/activate`
5. Set the required environment parameters:

| **Parameter** | **export command** |
|:---:|:---:|
| **MULTI_TILE** | `export MULTI_TILE=True` (True or False) |
| **PLATFORM** | `export PLATFORM=PVC` (PVC) |
| **WEIGHT_DIR** | `export WEIGHT_DIR=` |
| **DATASET_DIR** | `export DATASET_DIR=` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=32768` |
| **PRECISION** (optional) | `export PRECISION=FP16` (FP16 and FP32 are supported for PVC) |
| **OUTPUT_DIR** (optional) | `export OUTPUT_DIR=$PWD` |

6. Run `run_model.sh`
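Putting steps 5 and 6 together, a typical PVC multi-tile FP16 session might look like the fragment below. The paths are placeholders; substitute your own weight and dataset locations:

```shell
# Placeholder paths; WEIGHT_DIR and DATASET_DIR must point at the folders
# prepared earlier. The last three exports are optional.
export MULTI_TILE=True
export PLATFORM=PVC
export WEIGHT_DIR=/path/to/model_weights
export DATASET_DIR=/path/to/numpy_contiguous_shuffled_output_dataset_dir
export BATCH_SIZE=32768      # optional
export PRECISION=FP16        # optional: FP16 or FP32 on PVC
export OUTPUT_DIR=$PWD       # optional
bash run_model.sh
```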

## Output

Multi-tile output will typically look like:
```
[0] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03749502139621311 s
[6] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03693882624308268 s
[1] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03849502139621311 s
[3] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03693882624308268 s
[7] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03858977953592936 s
[2] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03870058589511448 s
[4] 2024-01-10 21:50:10,780 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.022177388932969836 s
[5] 2024-01-10 21:50:10,780 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.037547969818115236 s
[0] AUROC over test set: 0.8147445321083069.
[0] Number of test samples: 3276800
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalize atl-mpi
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalized atl-mpi
[3] 2024:01:10-21:50:11:(34782) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-21:50:11:(34786) |CCL_INFO| finalizing level-zero
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-21:50:11:(34785) |CCL_INFO| finalizing level-zero
[4] 2024:01:10-21:50:11:(34783) |CCL_INFO| finalizing level-zero
[3] 2024:01:10-21:50:11:(34782) |CCL_INFO| finalized level-zero
[5] 2024:01:10-21:50:11:(34784) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-21:50:11:(34786) |CCL_INFO| finalized level-zero
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalized level-zero
[2] 2024:01:10-21:50:11:(34781) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-21:50:11:(34785) |CCL_INFO| finalized level-zero
[4] 2024:01:10-21:50:11:(34783) |CCL_INFO| finalized level-zero
[5] 2024:01:10-21:50:11:(34784) |CCL_INFO| finalized level-zero
[2] 2024:01:10-21:50:11:(34781) |CCL_INFO| finalized level-zero
[1] 2024:01:10-21:50:11:(34780) |CCL_INFO| finalizing level-zero
[1] 2024:01:10-21:50:11:(34780) |CCL_INFO| finalized level-zero
```

Final results of the inference run can be found in the `results.yaml` file.

```
results:
 - key: throughput
   value: 1693411.31
   unit: samples/s
 - key: accuracy
   value: 0.815
   unit: AUROC
```
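If you want to consume those numbers programmatically, a hypothetical stdlib-only helper (PyYAML's `safe_load` would be the usual tool) can pull them out, assuming the exact `key`/`value`/`unit` layout shown above:

```python
# Hypothetical helper: parse the simple results.yaml shape above without
# third-party dependencies. Not part of the repo.
def read_results(text):
    """Map each result key to a (value, unit) tuple."""
    results, key, value = {}, None, None
    for raw in text.splitlines():
        line = raw.strip().lstrip("- ").strip()
        if line.startswith("key:"):
            key = line.split(":", 1)[1].strip()
        elif line.startswith("value:") and key is not None:
            value = float(line.split(":", 1)[1])
        elif line.startswith("unit:") and key is not None:
            results[key] = (value, line.split(":", 1)[1].strip())
            key = None
    return results
```

For the sample file above, `read_results(...)["throughput"]` yields `(1693411.31, 'samples/s')`.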
131 changes: 131 additions & 0 deletions
models_v2/pytorch/torchrec_dlrm/inference/gpu/cmd_distributed_terabyte_test.sh
#!/bin/bash
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
set -x

function Parser() {
    while [ $# -ne 0 ]; do
        case $1 in
            -b)
                shift
                export GLOBAL_BATCH_SIZE="$1"
                ;;
            -fp16)
                shift
                FP16="$1"
                ;;
            -d)
                shift
                DATA="$1"
                ;;
            -m)
                shift
                WEIGHT="$1"
                ;;
            -nd)
                shift
                ND="$1"
                ;;
            -sp)
                shift
                SP="$1"
                ;;
            -tf32)
                shift
                TF32="$1"
                ;;
            -tv)
                shift
                TV="$1"
                ;;
            -h | --help)
                echo "Usage: cmd_distributed_terabyte_test.sh [OPTION...]"
                echo "-b,     Optional    Specify the batch size. The default value is 32768"
                echo "-fp16,  Optional    Specify that the input dtype is fp16. The default value is true"
                echo "-d,     Optional    Specify the data file"
                echo "-m,     Optional    Specify the weight file"
                echo "-nd,    Optional    Specify the number of nodes"
                echo "-sp,    Optional    Specify the sharding plan of the embedding"
                echo "-tf32,  Optional    Specify that the input dtype is tf32. The default value is false"
                echo "-tv,    Optional    Train with validation. The default value is false"
                exit
                ;;
            --*|-*)
                echo ">>> New param: <$1>"
                ;;
            *)
                echo ">>> Parsing mismatch: $1"
                ;;
        esac
        shift
    done
}

torch_ccl_path=$(python -c "import torch; import oneccl_bindings_for_pytorch; import os; print(os.path.abspath(os.path.dirname(oneccl_bindings_for_pytorch.__file__)))")
source "$torch_ccl_path/env/setvars.sh"
export MASTER_ADDR='127.0.0.1'
#export WORLD_SIZE=2
export MASTER_PORT='10088'
export TOTAL_TRAINING_SAMPLES=4195197692
export GLOBAL_BATCH_SIZE=65536

ND=1
SP="round_robin"
#export CCL_LOG_LEVEL=DEBUG
#export CCL_OP_SYNC=1

DATA=${DATA-'/home/sdp/xw/dlrm-v2/'}
WEIGHT=${WEIGHT-'/home/sdp/xw/model_weights'}

# Use ':' so the default-assignment expansions are not executed as commands.
: "${FP16:=true}"
: "${TF32:=false}"
: "${TV:=false}"
Parser "$@"
ARGS+=" --embedding_dim 128"
ARGS+=" --dense_arch_layer_sizes 512,256,128"
ARGS+=" --over_arch_layer_sizes 1024,1024,512,256,1"
ARGS+=" --num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36"
ARGS+=" --validation_freq_within_epoch $((TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * 20 * 1000)))"
ARGS+=" --synthetic_multi_hot_criteo_path $DATA"
ARGS+=" --multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1"
#ARGS+=" --multi_hot_distribution_type uniform"
ARGS+=" --use_xpu"
ARGS+=" --epochs 1"
ARGS+=" --pin_memory"
ARGS+=" --mmap_mode"
ARGS+=" --batch_size $GLOBAL_BATCH_SIZE"
ARGS+=" --interaction_type=dcn"
ARGS+=" --dcn_num_layers=3"
ARGS+=" --adagrad"
ARGS+=" --dcn_low_rank_dim=512"
ARGS+=" --numpy_rand_seed=12345"
ARGS+=" --log_freq 10"
ARGS+=" --amp"
ARGS+=" --inference_only"
ARGS+=" --snapshot_dir ${WEIGHT}"
ARGS+=" --limit_test_batches 50"
ARGS+=" --sharding_plan ${SP}"
ARGS+=" --num_nodes ${ND}"
ARGS+=" --learning_rate 0.005"

[ "$TV" = true ] && ARGS+=" --train_with_val"
if [ "$TF32" = false ]; then
    [ "$FP16" = true ] && ARGS+=" --fp16"
    echo "${ARGS}"
    mpirun -np 8 -ppn 8 --prepend-rank python -u dlrm_main.py ${ARGS}
else
    echo "${ARGS}"
    IPEX_FP32_MATH_MODE=1 mpirun -np 8 -ppn 8 --prepend-rank python -u dlrm_main.py ${ARGS}
fi
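The script's `${VAR:=default}` lines assign a fallback only when the flag was never set on the command line. A standalone demonstration of that idiom:

```shell
# ${VAR:=default} assigns default only if VAR is unset or empty;
# the leading ':' consumes the expansion instead of executing it as a command.
unset FP16
: "${FP16:=true}"    # FP16 was unset, so it becomes "true"
TF32=false
: "${TF32:=true}"    # TF32 is already set, so it stays "false"
echo "$FP16 $TF32"   # prints: true false
```

This is why flags parsed earlier by `Parser` survive: the default assignment is a no-op for any variable that already holds a value.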