Commit c598c81 (1 parent: d74a8c1): 23 changed files with 4,043 additions and 52 deletions.

Commit message:

* add dlrm v2
* fix license issue
* Update dlrm_dataloader.py
* Update dist_models.py
* Update dlrm_dataloader.py
* Update dist_models.py
* refactored to a new folder
* update Readme

Co-authored-by: Mahathi Vatsal <mahathi.vatsal.salopanthula@intel.com>
101 changes: 101 additions & 0 deletions
models_v2/pytorch/torchrec_dlrm/inference/gpu/README.md
# DLRM v2 Inference

DLRM v2 Inference best known configurations with Intel® Extension for PyTorch.

## Model Information

| **Use Case** | **Framework** | **Model Repo** | **Branch/Commit/Tag** | **Optional Patch** |
|:---:|:---:|:---:|:---:|:---:|
| Inference | PyTorch | https://github.com/facebookresearch/dlrm/tree/main/torchrec_dlrm | - | - |
# Prerequisites

* Host has 4 Intel® Data Center GPU Max Series cards, with two tiles each.
* Host has the latest Intel® Data Center GPU Max Series drivers installed: https://dgpu-docs.intel.com/driver/installation.html
* Host has [Intel® Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/xpu/latest/) installed.
# Prepare Dataset

After downloading and uncompressing the Criteo 1TB Click Logs dataset (consisting of 24 files, from day 0 to day 23), process the raw tsv files into the proper format for training by running `./scripts/process_Criteo_1TB_Click_Logs_dataset.sh` with the necessary command line arguments.

Example usage:

```
bash ./scripts/process_Criteo_1TB_Click_Logs_dataset.sh \
    ./criteo_1tb/raw_input_dataset_dir \
    ./criteo_1tb/temp_intermediate_files_dir \
    ./criteo_1tb/numpy_contiguous_shuffled_output_dataset_dir
```

The script requires 700GB of RAM and takes 1-2 days to run. We currently have features in development to reduce the preprocessing time and memory overhead. MD5 checksums of the expected final preprocessed dataset files are in `md5sums_preprocessed_criteo_click_logs_dataset.txt`.
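The checksum verification can be sketched as follows. This is an illustrative stand-in (a temporary directory and one fake shard file); in practice you would run `md5sum -c` inside the output dataset directory against `md5sums_preprocessed_criteo_click_logs_dataset.txt`:

```shell
# Illustrative checksum workflow with md5sum -c; the file name below is a
# stand-in for a real preprocessed shard.
dir=$(mktemp -d)
printf 'sample' > "$dir/day_0_dense.npy"
( cd "$dir" \
  && md5sum day_0_dense.npy > md5sums.txt \
  && md5sum -c md5sums.txt )   # prints "day_0_dense.npy: OK"
```

`md5sum -c` exits non-zero if any listed file is missing or its hash differs, so the check works cleanly in scripts.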
The final dataset directory will look like this:

```
dataset_dir
|_day_0_dense.npy
|_day_0_labels.npy
|_day_0_sparse_multi_hot.npz
```

This folder will be used as the `DATASET_DIR` parameter later.
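As a quick sanity check for that layout, a small stdlib-only helper (hypothetical, not part of the repo) can report which per-day files are missing:

```python
import os

def missing_day_files(dataset_dir, day):
    """Return the expected per-day files that are absent from dataset_dir.

    File names follow the layout shown above (dense/labels/sparse_multi_hot).
    """
    expected = [
        "day_{}_dense.npy".format(day),
        "day_{}_labels.npy".format(day),
        "day_{}_sparse_multi_hot.npz".format(day),
    ]
    return [f for f in expected
            if not os.path.exists(os.path.join(dataset_dir, f))]
```

An empty return value, e.g. from `missing_day_files(dataset_dir, 0)`, means day 0 is complete.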
Download the pretrained model weights:

```
wget https://cloud.mlcommons.org/index.php/s/XzfSeLgW8FYfR3S/download -O weights.zip
unzip weights.zip
```

The unzipped folder will be used as the `WEIGHT_DIR` parameter later.
## Inference

1. `git clone https://github.com/IntelAI/models.git`
2. `cd models/models_v2/pytorch/torchrec_dlrm/inference/gpu`
3. Run `setup.sh`. This will install all the required dependencies and create the virtual environment `venv`.
4. Activate the virtual environment: `. ./venv/bin/activate`
5. Set the required environment parameters:

| **Parameter** | **export command** |
|:---:|:---:|
| **MULTI_TILE** | `export MULTI_TILE=True` (True or False) |
| **PLATFORM** | `export PLATFORM=PVC` (PVC) |
| **WEIGHT_DIR** | `export WEIGHT_DIR=` |
| **DATASET_DIR** | `export DATASET_DIR=` |
| **BATCH_SIZE** (optional) | `export BATCH_SIZE=32768` |
| **PRECISION** (optional) | `export PRECISION=FP16` (FP16 and FP32 are supported for PVC) |
| **OUTPUT_DIR** (optional) | `export OUTPUT_DIR=$PWD` |

6. Run `run_model.sh`
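Putting steps 5 and 6 together, a typical PVC multi-tile FP16 session might look like the fragment below. The paths are placeholders; substitute your own weight and dataset locations:

```shell
# Placeholder paths; WEIGHT_DIR and DATASET_DIR must point at the folders
# prepared earlier. The last three exports are optional.
export MULTI_TILE=True
export PLATFORM=PVC
export WEIGHT_DIR=/path/to/model_weights
export DATASET_DIR=/path/to/numpy_contiguous_shuffled_output_dataset_dir
export BATCH_SIZE=32768      # optional
export PRECISION=FP16        # optional: FP16 or FP32 on PVC
export OUTPUT_DIR=$PWD       # optional
bash run_model.sh
```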

## Output

Multi-tile output will typically look like:
```
[0] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03749502139621311 s
[6] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03693882624308268 s
[1] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03849502139621311 s
[3] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03693882624308268 s
[7] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03858977953592936 s
[2] 2024-01-10 21:50:10,779 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.03870058589511448 s
[4] 2024-01-10 21:50:10,780 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.022177388932969836 s
[5] 2024-01-10 21:50:10,780 - __main__ - INFO - avg eval time per iter at ITER: 45, 0.037547969818115236 s
[0] AUROC over test set: 0.8147445321083069.
[0] Number of test samples: 3276800
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalize atl-mpi
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalized atl-mpi
[3] 2024:01:10-21:50:11:(34782) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-21:50:11:(34786) |CCL_INFO| finalizing level-zero
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-21:50:11:(34785) |CCL_INFO| finalizing level-zero
[4] 2024:01:10-21:50:11:(34783) |CCL_INFO| finalizing level-zero
[3] 2024:01:10-21:50:11:(34782) |CCL_INFO| finalized level-zero
[5] 2024:01:10-21:50:11:(34784) |CCL_INFO| finalizing level-zero
[7] 2024:01:10-21:50:11:(34786) |CCL_INFO| finalized level-zero
[0] 2024:01:10-21:50:11:(34779) |CCL_INFO| finalized level-zero
[2] 2024:01:10-21:50:11:(34781) |CCL_INFO| finalizing level-zero
[6] 2024:01:10-21:50:11:(34785) |CCL_INFO| finalized level-zero
[4] 2024:01:10-21:50:11:(34783) |CCL_INFO| finalized level-zero
[5] 2024:01:10-21:50:11:(34784) |CCL_INFO| finalized level-zero
[2] 2024:01:10-21:50:11:(34781) |CCL_INFO| finalized level-zero
[1] 2024:01:10-21:50:11:(34780) |CCL_INFO| finalizing level-zero
[1] 2024:01:10-21:50:11:(34780) |CCL_INFO| finalized level-zero
```

Final results of the inference run can be found in the `results.yaml` file.

```
results:
 - key: throughput
   value: 1693411.31
   unit: samples/s
 - key: accuracy
   value: 0.815
   unit: AUROC
```
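If you want to consume those numbers programmatically, a hypothetical stdlib-only helper (PyYAML's `safe_load` would be the usual tool) can pull them out, assuming the exact `key`/`value`/`unit` layout shown above:

```python
# Hypothetical helper: parse the simple results.yaml shape above without
# third-party dependencies. Not part of the repo.
def read_results(text):
    """Map each result key to a (value, unit) tuple."""
    results, key, value = {}, None, None
    for raw in text.splitlines():
        line = raw.strip().lstrip("- ").strip()
        if line.startswith("key:"):
            key = line.split(":", 1)[1].strip()
        elif line.startswith("value:") and key is not None:
            value = float(line.split(":", 1)[1])
        elif line.startswith("unit:") and key is not None:
            results[key] = (value, line.split(":", 1)[1].strip())
            key = None
    return results
```

For the sample file above, `read_results(...)["throughput"]` yields `(1693411.31, 'samples/s')`.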
131 changes: 131 additions & 0 deletions
models_v2/pytorch/torchrec_dlrm/inference/gpu/cmd_distributed_terabyte_test.sh
#!/bin/bash
# Copyright (c) 2023 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
set -x

function Parser() {
    while [ $# -ne 0 ]; do
        case $1 in
            -b)
                shift
                export GLOBAL_BATCH_SIZE="$1"
                ;;
            -fp16)
                shift
                FP16="$1"
                ;;
            -d)
                shift
                DATA="$1"
                ;;
            -m)
                shift
                WEIGHT="$1"
                ;;
            -nd)
                shift
                ND="$1"
                ;;
            -sp)
                shift
                SP="$1"
                ;;
            -tf32)
                shift
                TF32="$1"
                ;;
            -tv)
                shift
                TV="$1"
                ;;
            -h | --help)
                echo "Usage: cmd_distributed_terabyte_test.sh [OPTION...]"
                echo "-b,     Optional    Specify the batch size. The default value is 32768"
                echo "-fp16,  Optional    Specify that the input dtype is fp16. The default value is true"
                echo "-d,     Optional    Specify the data file"
                echo "-m,     Optional    Specify the weight file"
                echo "-nd,    Optional    Specify the number of nodes"
                echo "-sp,    Optional    Specify the sharding plan of the embedding"
                echo "-tf32,  Optional    Specify that the input dtype is tf32. The default value is false"
                echo "-tv,    Optional    Train with validation. The default value is false"
                exit
                ;;
            --*|-*)
                echo ">>> New param: <$1>"
                ;;
            *)
                echo ">>> Parsing mismatch: $1"
                ;;
        esac
        shift
    done
}

torch_ccl_path=$(python -c "import torch; import oneccl_bindings_for_pytorch; import os; print(os.path.abspath(os.path.dirname(oneccl_bindings_for_pytorch.__file__)))")
source "$torch_ccl_path/env/setvars.sh"
export MASTER_ADDR='127.0.0.1'
#export WORLD_SIZE=2
export MASTER_PORT='10088'
export TOTAL_TRAINING_SAMPLES=4195197692
export GLOBAL_BATCH_SIZE=65536

ND=1
SP="round_robin"
#export CCL_LOG_LEVEL=DEBUG
#export CCL_OP_SYNC=1

DATA=${DATA-'/home/sdp/xw/dlrm-v2/'}
WEIGHT=${WEIGHT-'/home/sdp/xw/model_weights'}

# Use ':' so the default-assignment expansions are not executed as commands.
: "${FP16:=true}"
: "${TF32:=false}"
: "${TV:=false}"
Parser "$@"
ARGS+=" --embedding_dim 128"
ARGS+=" --dense_arch_layer_sizes 512,256,128"
ARGS+=" --over_arch_layer_sizes 1024,1024,512,256,1"
ARGS+=" --num_embeddings_per_feature 40000000,39060,17295,7424,20265,3,7122,1543,63,40000000,3067956,405282,10,2209,11938,155,4,976,14,40000000,40000000,40000000,590152,12973,108,36"
ARGS+=" --validation_freq_within_epoch $((TOTAL_TRAINING_SAMPLES / (GLOBAL_BATCH_SIZE * 20 * 1000)))"
ARGS+=" --synthetic_multi_hot_criteo_path $DATA"
ARGS+=" --multi_hot_sizes 3,2,1,2,6,1,1,1,1,7,3,8,1,6,9,5,1,1,1,12,100,27,10,3,1,1"
#ARGS+=" --multi_hot_distribution_type uniform"
ARGS+=" --use_xpu"
ARGS+=" --epochs 1"
ARGS+=" --pin_memory"
ARGS+=" --mmap_mode"
ARGS+=" --batch_size $GLOBAL_BATCH_SIZE"
ARGS+=" --interaction_type=dcn"
ARGS+=" --dcn_num_layers=3"
ARGS+=" --adagrad"
ARGS+=" --dcn_low_rank_dim=512"
ARGS+=" --numpy_rand_seed=12345"
ARGS+=" --log_freq 10"
ARGS+=" --amp"
ARGS+=" --inference_only"
ARGS+=" --snapshot_dir ${WEIGHT}"
ARGS+=" --limit_test_batches 50"
ARGS+=" --sharding_plan ${SP}"
ARGS+=" --num_nodes ${ND}"
ARGS+=" --learning_rate 0.005"

[ "$TV" = true ] && ARGS+=" --train_with_val"
if [ "$TF32" = false ]; then
    [ "$FP16" = true ] && ARGS+=" --fp16"
    echo "${ARGS}"
    mpirun -np 8 -ppn 8 --prepend-rank python -u dlrm_main.py ${ARGS}
else
    echo "${ARGS}"
    IPEX_FP32_MATH_MODE=1 mpirun -np 8 -ppn 8 --prepend-rank python -u dlrm_main.py ${ARGS}
fi
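The script's `${VAR:=default}` lines assign a fallback only when the flag was never set on the command line. A standalone demonstration of that idiom:

```shell
# ${VAR:=default} assigns default only if VAR is unset or empty;
# the leading ':' consumes the expansion instead of executing it as a command.
unset FP16
: "${FP16:=true}"    # FP16 was unset, so it becomes "true"
TF32=false
: "${TF32:=true}"    # TF32 is already set, so it stays "false"
echo "$FP16 $TF32"   # prints: true false
```

This is why flags parsed earlier by `Parser` survive: the default assignment is a no-op for any variable that already holds a value.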