Wliao2/add dlrm (#1704)
* add dlrm v2

* fix license issue

* Update dlrm_dataloader.py

* Update dist_models.py

* Update dlrm_dataloader.py

* Update dist_models.py

* refactored to a new folder

* update Readme

---------

Co-authored-by: Mahathi Vatsal <mahathi.vatsal.salopanthula@intel.com>
wincent8 and Mahathi-Vatsal committed Jun 24, 2024
1 parent d74a8c1 commit c598c81
Showing 23 changed files with 4,043 additions and 52 deletions.
101 changes: 101 additions & 0 deletions models_v2/pytorch/torchrec_dlrm/inference/gpu/README.md
@@ -0,0 +1,101 @@
# DLRM v2 Inference

DLRM v2 Inference best known configurations with Intel® Extension for PyTorch.

## Model Information

| **Use Case** | **Framework** | **Model Repo** | **Branch/Commit/Tag** | **Optional Patch** |
|:---:| :---: |:--------------:|:---------------------:|:------------------:|
| Inference | PyTorch | https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm | - | - |

## Pre-Requisites
* Host has four Intel® Data Center GPU Max Series cards, each with two tiles.
* Host has the latest [Intel® Data Center GPU Max Series drivers](https://dgpu-docs.intel.com/driver/installation.html) installed.
* Host has [Intel® Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) installed.

## Prepare Dataset
After downloading and uncompressing the Criteo 1TB Click Logs dataset (24 files, day 0 through day 23), process the raw TSV files into the proper format for training by running `./scripts/process_Criteo_1TB_Click_Logs_dataset.sh` with the necessary command-line arguments.

Example usage:

```
bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
    ./criteo_1tb/raw_input_dataset_dir \
    ./criteo_1tb/temp_intermediate_files_dir \
    ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
```
The script requires 700GB of RAM and takes 1-2 days to run. We currently have features in development to reduce the preprocessing time and memory overhead. MD5 checksums of the expected final preprocessed dataset files are in `md5sums_preprocessed_criteo_click_logs_dataset.txt`.
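
To confirm the preprocessing finished intact, the checksums can be verified with `md5sum`; a minimal sketch, assuming the checksum file uses the standard `md5sum` format and sits alongside the output files:

```
cd ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
md5sum -c md5sums_preprocessed_criteo_click_logs_dataset.txt
```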

The final dataset directory will look like this:

```
dataset_dir
|_day_0_dense.npy
|_day_0_labels.npy
|_day_0_sparse_multi_hot.npz
```

This folder will be used as the `DATASET_DIR` parameter later.
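
A quick sanity check that every day produced the expected output can look like the sketch below, assuming each of the 24 days yields the same three files shown for day 0:

```
export DATASET_DIR=./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
for day in $(seq 0 23); do
    for suffix in dense.npy labels.npy sparse_multi_hot.npz; do
        [ -f "${DATASET_DIR}/day_${day}_${suffix}" ] || echo "missing day_${day}_${suffix}"
    done
done
```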

Download and unzip the pretrained model weights:

```
wget https://cloud.mlcommons.org/index.php/s/XzfSeLgW8FYfR3S/download -O weights.zip
unzip weights.zip
```

The unpacked folder will be used as the `WEIGHT_DIR` parameter later.
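
For example, assuming the archive unpacks into a folder named `model_weights` (the actual name may differ):

```
export WEIGHT_DIR=$PWD/model_weights   # hypothetical folder name; use the real unzip output
```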


## Inference
1. `git clone https://github.com/IntelAI/models.git`
2. `cd models/models_v2/pytorch/torchrec_dlrm/inference/gpu`
3. Run `setup.sh`. This will install all the required dependencies and create the virtual environment `venv`.
4. Activate the virtual environment: `. ./venv/bin/activate`
5. Set up the required environment parameters

| **Parameter** | **export command** |
|:---------------------------:|:------------------------------------------------------------------------------------:|
| **MULTI_TILE** | `export MULTI_TILE=True` (True or False) |
| **PLATFORM** | `export PLATFORM=PVC` (PVC) |
| **WEIGHT_DIR**              | `export WEIGHT_DIR=<path to the downloaded weights folder>`                           |
| **DATASET_DIR**             | `export DATASET_DIR=<path to the preprocessed dataset folder>`                        |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=32768` |
| **PRECISION** (optional) | `export PRECISION=FP16` (FP16 and FP32 are supported for PVC) |
| **OUTPUT_DIR** (optional) | `export OUTPUT_DIR=$PWD` |
6. Run `run_model.sh` (a consolidated example follows below)
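
Putting steps 3-6 together, a minimal end-to-end run might look like the sketch below; the two directory paths are placeholders for the folders prepared earlier:

```
./setup.sh
. ./venv/bin/activate
export MULTI_TILE=True
export PLATFORM=PVC
export WEIGHT_DIR=<path to the downloaded weights folder>
export DATASET_DIR=<path to the preprocessed dataset folder>
export PRECISION=FP16
export OUTPUT_DIR=$PWD
./run_model.sh
```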

## Output

Multi-tile output will typically look like:
```
[0] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03749502139621311 s
[6] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03693882624308268 s
[1] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03849502139621311 s
[3] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03693882624308268 s
[7] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03858977953592936 s
[2] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03870058589511448 s
[4] 2024-01-10 21:50:10,780 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.022177388932969836 s
[5] 2024-01-10 21:50:10,780 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.037547969818115236 s
[0] AUROC over test set: 0.8147445321083069.
[0] Number of test samples: 3276800
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalize atl-mpi
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalized atl-mpi
[3] 2024:01:10-21:50:11:(34782) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-21:50:11:(34786) |CCL_INFO| finalizing level-zero
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-21:50:11:(34785) |CCL_INFO| finalizing level-zero
[4] 2024:01:10-21:50:11:(34783) |CCL_INFO| finalizing level-zero
[3] 2024:01:10-21:50:11:(34782) |CCL_INFO| finalized level-zero
[5] 2024:01:10-21:50:11:(34784) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-21:50:11:(34786) |CCL_INFO| finalized level-zero
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalized level-zero
[2] 2024:01:10-21:50:11:(34781) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-21:50:11:(34785) |CCL_INFO| finalized level-zero
[4] 2024:01:10-21:50:11:(34783) |CCL_INFO| finalized level-zero
[5] 2024:01:10-21:50:11:(34784) |CCL_INFO| finalized level-zero
[2] 2024:01:10-21:50:11:(34781) |CCL_INFO| finalized level-zero
[1] 2024:01:10-21:50:11:(34780) |CCL_INFO| finalizing level-zero
[1] 2024:01:10-21:50:11:(34780) |CCL_INFO| finalized level-zero
```

Final results of the inference run can be found in the `results.yaml` file.
```
results:
 - key: throughput
   value: 1693411.31
   unit: samples/s
 - key: accuracy
   value: 0.815
   unit: AUROC
```
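
To pull a single metric out of that file, a minimal sketch assuming the key/value layout shown above:

```
grep -A 1 "key: throughput" ${OUTPUT_DIR:-.}/results.yaml | grep "value:"
```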
@@ -0,0 +1,131 @@
#!/bin/bash
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
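#
# Example usage (a sketch; paths below are placeholders):
#   bash cmd_infer.sh -d /path/to/preprocessed/dataset -m /path/to/model_weights -b 32768 -fp16 true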
set -x

function Parser() {
    while [ $# -ne 0 ]; do
        case $1 in
            -b)
                shift
                export GLOBAL_BATCH_SIZE="$1"
                ;;
            -fp16)
                shift
                FP16="$1"
                ;;
            -d)
                shift
                DATA="$1"
                ;;
            -m)
                shift
                WEIGHT="$1"
                ;;
            -nd)
                shift
                ND="$1"
                ;;
            -sp)
                shift
                SP="$1"
                ;;
            -tf32)
                shift
                TF32="$1"
                ;;
            -tv)
                shift
                TV="$1"
                ;;
            -h | --help)
                echo "Usage: cmd_infer.sh [OPTION...]"
                echo "-b,    Optional    Specify the batch size. The default value is 65536"
                echo "-fp16, Optional    Specify whether the input dtype is fp16. The default value is true"
                echo "-d,    Optional    Specify the dataset directory"
                echo "-m,    Optional    Specify the weights directory"
                echo "-nd,   Optional    Specify the number of nodes"
                echo "-sp,   Optional    Specify the sharding plan of the embedding tables"
                echo "-tf32, Optional    Specify whether the input dtype is tf32. The default value is false"
                echo "-tv,   Optional    Train with validation. The default value is false"
                exit
                ;;
            --*|-*)
                echo ">>> New param: <$1>"
                ;;
            *)
                echo ">>> Parsing mismatch: $1"
                ;;
        esac
        shift
    done
}

torch_ccl_path=$(python -c "import torch; import oneccl_bindings_for_pytorch; import os; print(os.path.abspath(os.path.dirname(oneccl_bindings_for_pytorch.__file__)))")
source $torch_ccl_path/env/setvars.sh
export MASTER_ADDR='127.0.0.1'
#export WORLD_SIZE=2 ;
export MASTER_PORT='10088'
export TOTAL_TRAINING_SAMPLES=4195197692;
export GLOBAL_BATCH_SIZE=65536;

ND=1
SP="round_robin"
#export CCL_LOG_LEVEL=DEBUG;
#export CCL_OP_SYNC=1

DATA=${DATA-'/home/sdp/xw/dlrm-v2/'}
WEIGHT=${WEIGHT-'/home/sdp/xw/model_weights'}

: ${FP16:=true}
: ${TF32:=false}
: ${TV:=false}
Parser "$@"
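
# Assemble the argument list handed to dlrm_main.py at the bottom of this script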
ARGS+=" --embedding_dim 128"
ARGS+=" --dense_arch_layer_sizes 512,256,128"
ARGS+=" --over_arch_layer_sizes 1024,1024,512,256,1"
ARGS+=" --num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36"
ARGS+=" --validation_freq_within_epoch $((TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * 20 * 1000)))"
ARGS+=" --synthetic_multi_hot_criteo_path $DATA"
ARGS+=" --multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1"
#ARGS+=" --multi_hot_distribution_type uniform"
ARGS+=" --use_xpu"
ARGS+=" --epochs 1"
ARGS+=" --pin_memory"
ARGS+=" --mmap_mode"
ARGS+=" --batch_size $GLOBAL_BATCH_SIZE"
ARGS+=" --interaction_type=dcn"
ARGS+=" --dcn_num_layers=3"
ARGS+=" --adagrad"
ARGS+=" --dcn_low_rank_dim=512"
ARGS+=" --numpy_rand_seed=12345"
ARGS+=" --log_freq 10"
ARGS+=" --amp"
ARGS+=" --inference_only"
ARGS+=" --snapshot_dir ${WEIGHT}"
ARGS+=" --limit_test_batches 50"
ARGS+=" --sharding_plan ${SP}"
ARGS+=" --num_nodes ${ND}"
ARGS+=" --learning_rate 0.005"

[ "$TV" = true ] && ARGS+=" --train_with_val"
if [ "$TF32" = false ]; then
[ "$FP16" = true ] && ARGS+=" --fp16"
echo "${ARGS}"
mpirun -np 8 -ppn 8 --prepend-rank python -u dlrm_main.py ${ARGS}
else
echo "${ARGS}"
IPEX_FP32_MATH_MODE=1 mpirun -np 8 -ppn 8 --prepend-rank python -u dlrm_main.py ${ARGS}
fi