Commit 1def26d

[NCF/PyT] Adding new logging

1 parent: cef4bab

13 files changed: +75 -982 lines

PyTorch/Recommendation/NCF/.gitmodules

Whitespace-only changes.

PyTorch/Recommendation/NCF/README.md

Lines changed: 12 additions & 11 deletions
@@ -214,7 +214,7 @@ After the Docker container is launched, the training with the default hyperparam
 
 ```bash
 ./prepare_dataset.sh
-python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
+python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data /data/cache/ml-20m
 ```
 
 This will result in a checkpoint file being written to `/data/checkpoints/model.pth`.
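Note on the `--use_env` flag added throughout this README: it changes how `torch.distributed.launch` hands each worker its rank. Instead of appending a `--local_rank` argument to the script, the launcher exports the rank through the `LOCAL_RANK` environment variable. A minimal sketch of how a launched script can pick this up (illustrative only; the corresponding change to `ncf.py` is not shown in this commit view):

```python
# Illustrative sketch: reading the rank when launched with --use_env.
# torch.distributed.launch always sets RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT; with --use_env it also sets LOCAL_RANK instead of passing
# a --local_rank command-line argument.
import os

import torch
import torch.distributed as dist

local_rank = int(os.environ.get('LOCAL_RANK', 0))
torch.cuda.set_device(local_rank)

if int(os.environ.get('WORLD_SIZE', 1)) > 1:
    # init_method='env://' tells PyTorch to read the rendezvous
    # info from the environment variables set by the launcher
    dist.init_process_group(backend='nccl', init_method='env://')
```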
@@ -225,7 +225,7 @@ This will result in a checkpoint file being written to `/data/checkpoints/model.
 The trained model can be evaluated by passing the `--mode test` flag to the `run.sh` script:
 
 ```bash
-python -m torch.distributed.launch --nproc_per_node=1 ncf.py --data /data/cache/ml-20m --mode test --load_checkpoint_path /data/checkpoints/model.pth
+python -m torch.distributed.launch --nproc_per_node=1 --use_env ncf.py --data /data/cache/ml-20m --mode test --load_checkpoint_path /data/checkpoints/model.pth
 ```
 
 
@@ -330,13 +330,13 @@ For a smaller dataset you might experience slower performance.
 To download, preprocess and train on the ML-1m dataset run:
 ```bash
 ./prepare_dataset.sh ml-1m
-python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-1m
+python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data /data/cache/ml-1m
 ```
 
 ### Training process
 The name of the training script is `ncf.py`. Because of the multi-GPU support, it should always be run with the torch distributed launcher like this:
 ```bash
-python -m torch.distributed.launch --nproc_per_node=<number_of_gpus> ncf.py --data <path_to_dataset> [other_parameters]
+python -m torch.distributed.launch --nproc_per_node=<number_of_gpus> --use_env ncf.py --data <path_to_dataset> [other_parameters]
 ```
 
 The main results of the training are checkpoints stored by default in `/data/checkpoints/`. This location can be controlled
@@ -351,7 +351,7 @@ The HR@10 metric is the number of hits in the entire test set divided by the num
 
 Inference can be launched with the same script used for training by passing the `--mode test` flag:
 ```bash
-python -m torch.distributed.launch --nproc_per_node=<number_of_gpus> ncf.py --data <path_to_dataset> --mode test [other_parameters]
+python -m torch.distributed.launch --nproc_per_node=<number_of_gpus> --use_env ncf.py --data <path_to_dataset> --mode test [other_parameters]
 ```
 
 The script will then:
@@ -368,7 +368,7 @@ The script will then:
 NCF training on NVIDIA DGX systems is very fast; therefore, in order to measure train and validation throughput, you can simply run the full training job with:
 ```bash
 ./prepare_dataset.sh
-python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m --epochs 5
+python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data /data/cache/ml-20m --epochs 5
 ```
 
 At the end of the script, a line reporting the best train throughput is printed.
@@ -379,7 +379,7 @@ At the end of the script, a line reporting the best train throughput is printed.
 Validation throughput can be measured by running the full training job with:
 ```bash
 ./prepare_dataset.sh
-python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m --epochs 5
+python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data /data/cache/ml-20m --epochs 5
 ```
 
 The best validation throughput is reported to the standard output.
@@ -405,7 +405,7 @@ The training time was measured excluding data downloading, preprocessing, valida
 To reproduce this result, start the NCF Docker container interactively and run:
 ```bash
 ./prepare_dataset.sh
-python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
+python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data /data/cache/ml-20m
 ```
 
 ##### NVIDIA DGX-1 (8x V100 32G)
@@ -428,7 +428,7 @@ Here's an example validation accuracy curve for mixed precision vs single precis
 To reproduce this result, start the NCF Docker container interactively and run:
 ```bash
 ./prepare_dataset.sh
-python -m torch.distributed.launch --nproc_per_node=8 ncf.py --data /data/cache/ml-20m
+python -m torch.distributed.launch --nproc_per_node=8 --use_env ncf.py --data /data/cache/ml-20m
 ```
 
 ##### NVIDIA DGX-2 (16x V100 32G)
@@ -449,7 +449,7 @@ The training time was measured excluding data downloading, preprocessing, valida
 To reproduce this result, start the NCF Docker container interactively and run:
 ```bash
 ./prepare_dataset.sh
-python -m torch.distributed.launch --nproc_per_node=16 ncf.py --data /data/cache/ml-20m
+python -m torch.distributed.launch --nproc_per_node=16 --use_env ncf.py --data /data/cache/ml-20m
 ```
 
 
@@ -555,7 +555,8 @@ The following table shows the best inference throughput:
 4. September, 2019
    * Adjusting for API changes in PyTorch and APEX
    * Checkpoints loading fix
-
+5. January, 2020
+   * DLLogger support added
 
 ### Known issues
 
PyTorch/Recommendation/NCF/convert.py

Lines changed: 0 additions & 6 deletions
@@ -34,15 +34,10 @@
 import torch
 import tqdm
 
-from logger.logger import LOGGER
-from logger import tags
-
 MIN_RATINGS = 20
 USER_COLUMN = 'user_id'
 ITEM_COLUMN = 'item_id'
 
-LOGGER.model = 'ncf'
-
 def parse_args():
     parser = ArgumentParser()
     parser.add_argument('--path', type=str, default='/data/ml-20m/ratings.csv',
@@ -98,7 +93,6 @@ def main():
 
     print("Filtering out users with less than {} ratings".format(MIN_RATINGS))
     grouped = df.groupby(USER_COLUMN)
-    LOGGER.log(key=tags.PREPROC_HP_MIN_RATINGS, value=MIN_RATINGS)
    df = grouped.filter(lambda x: len(x) >= MIN_RATINGS)
 
     print("Mapping original user and item IDs to new sequential IDs")

PyTorch/Recommendation/NCF/inference.py

Lines changed: 17 additions & 11 deletions
@@ -17,17 +17,14 @@
 import torch.jit
 import time
 from argparse import ArgumentParser
-
+import numpy as np
 import torch
 
 from neumf import NeuMF
 
-from logger.logger import LOGGER, timed_block, timed_function
-from logger.autologging import log_hardware, log_args
-
 from apex import amp
 
-LOGGER.model = 'ncf'
+import dllogger
 
 
 def parse_args():
@@ -51,14 +48,19 @@ def parse_args():
     parser.add_argument('--opt_level', default='O2', type=str,
                         help='Optimization level for Automatic Mixed Precision',
                         choices=['O0', 'O2'])
+    parser.add_argument('--log_path', default='log.json', type=str,
+                        help='Path for the JSON training log')
 
     return parser.parse_args()
 
 
 def main():
-    log_hardware()
     args = parse_args()
-    log_args(args)
+    dllogger.init(backends=[dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE,
+                                                       filename=args.log_path),
+                            dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE)])
+
+    dllogger.log(data=vars(args), step='PARAMETER')
 
     model = NeuMF(nb_users=args.n_users, nb_items=args.n_items, mf_dim=args.factors,
                   mlp_layer_sizes=args.layers, dropout=args.dropout)
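The DLLogger setup above replaces the deleted custom logger: a JSON backend writes a machine-readable log to `--log_path` while the stdout backend mirrors it to the console. A stand-alone sketch of the same pattern, restricted to the API surface visible in this diff (`init`, `log`, `flush`, and the two backends); the file name and step values are illustrative:

```python
# Stand-alone sketch of the DLLogger pattern introduced in this commit.
import dllogger

dllogger.init(backends=[
    dllogger.JSONStreamBackend(verbosity=dllogger.Verbosity.VERBOSE,
                               filename='example_log.json'),
    dllogger.StdOutBackend(verbosity=dllogger.Verbosity.VERBOSE),
])

# Hyperparameters are logged once under the special 'PARAMETER' step
dllogger.log(data={'batch_size': 1024, 'opt_level': 'O2'}, step='PARAMETER')

# Run-level summaries use an empty tuple as the step, as in the diff below
dllogger.log(data={'best_inference_latency': 0.0101}, step=tuple())
dllogger.flush()
```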
@@ -85,10 +87,14 @@
         torch.cuda.synchronize()
         latencies.append(time.time() - start)
 
-    LOGGER.log(key='batch_size', value=args.batch_size)
-    LOGGER.log(key='best_inference_throughput', value=args.batch_size / min(latencies))
-    LOGGER.log(key='best_inference_latency', value=min(latencies))
-    LOGGER.log(key='inference_latencies', value=latencies)
+    dllogger.log(data={'batch_size': args.batch_size,
+                       'best_inference_throughput': args.batch_size / min(latencies),
+                       'best_inference_latency': min(latencies),
+                       'mean_inference_throughput': args.batch_size / np.mean(latencies),
+                       'mean_inference_latency': np.mean(latencies),
+                       'inference_latencies': latencies},
+                 step=tuple())
+    dllogger.flush()
     return
 
 
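The summary entry derives throughput directly from the recorded per-batch latencies: throughput is `batch_size` divided by latency, so the best throughput comes from the minimum latency and the mean throughput from the mean latency. The same arithmetic in isolation, with synthetic numbers:

```python
# The latency/throughput arithmetic behind the summary above, in isolation
# (synthetic latencies; one entry per timed batch, in seconds).
import numpy as np

batch_size = 1024
latencies = [0.012, 0.010, 0.011]

best_inference_latency = min(latencies)                          # fastest batch
best_inference_throughput = batch_size / best_inference_latency  # samples/sec
mean_inference_latency = float(np.mean(latencies))
mean_inference_throughput = batch_size / mean_inference_latency

print(best_inference_throughput, mean_inference_throughput)
```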
PyTorch/Recommendation/NCF/logger/analyzer.py

Lines changed: 0 additions & 125 deletions
This file was deleted.

PyTorch/Recommendation/NCF/logger/autologging.py

Lines changed: 0 additions & 61 deletions
This file was deleted.
