forked from NVIDIA/Megatron-LM
re-merge from NVIDIA main #68
Open
RaymondLi0 wants to merge 28 commits into multi-query-attention from NVIDIA-main2
Conversation
Switches the cache to using MD5 hashes of a text description, instead of hand-crafted filenames, to determine a "cache hit". Also changes the default location of these files to an "index-cache" directory inside the data root, which should leave the data directories cleaner now that the filenames are uglier. For the GPT dataset, the code first looks in this default location before building a new index and caching it in the specified data cache path (or in this default location if none is given). For the Blendable dataset, the indices are looked up and saved only if a data cache path is provided; otherwise they are rebuilt every time.
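A minimal sketch of the lookup scheme described above. The helper name, the `.idx` extension, and the exact description string are hypothetical, not the PR's actual code; the point is that the cache key is an MD5 digest of a text description of the index parameters, with the `index-cache` directory under the data root checked first.

```python
import hashlib
import os


def get_index_cache_path(description, data_root, data_cache_path=None):
    """Return the path where a dataset index would be cached.

    `description` is a text summary of the parameters that define the
    index (e.g. dataset path, split, sequence length); its MD5 digest
    serves as the filename, so any parameter change yields a new file.
    """
    digest = hashlib.md5(description.encode("utf-8")).hexdigest()
    # Default location: an "index-cache" directory inside the data root.
    default_path = os.path.join(data_root, "index-cache", digest + ".idx")
    # Prefer an existing cached index in the default location; otherwise
    # fall back to the user-specified cache path, if one was given.
    if os.path.exists(default_path) or data_cache_path is None:
        return default_path
    return os.path.join(data_cache_path, digest + ".idx")
```

Because the digest is deterministic, two runs with the same description resolve to the same file (a cache hit), while any change to the description produces a different filename and forces a rebuild.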
- Add option to overlap p2p communication. See merge request ADLR/megatron-lm!621
- Add option to specify a data cache path separate from data directory. See merge request ADLR/megatron-lm!608
- Fix GPTDataset assert. See merge request ADLR/megatron-lm!624
- Fixed rotary_pos_emb's position in layer's forward args. See merge request ADLR/megatron-lm!625
  Signed-off-by: Abhinav Khattar <aklife97@gmail.com>
- Fix indexation for output tensor after gradscaler call. See merge request ADLR/megatron-lm!627
- Perform grad sync at correct place in interleaved pipeline parallelism. See merge request ADLR/megatron-lm!628
- Support loading checkpoints without add_position_embedding arg. See merge request ADLR/megatron-lm!623
- Add workarounds for non-determinism in Megatron training. See merge request ADLR/megatron-lm!607
- Update gitlab to catch pytest errors. See merge request ADLR/megatron-lm!635
- Remove use of deprecated np.float in indexed_dataset.py. See merge request ADLR/megatron-lm!634
- Retro fix for tensor parallelism. See merge request ADLR/megatron-lm!632
Among other things, this fixes a backward-compatibility issue in the checkpoint merging tools that was introduced by the previous merge.