
Conversation

@RaymondLi0 (Collaborator)

No description provided.

shanmugamr1992 and others added 30 commits October 6, 2022 17:02
different encoder/decoder num-layers support

See merge request ADLR/megatron-lm!453
Validation dataset update 1

See merge request ADLR/megatron-lm!455
Adding proper test cases

See merge request ADLR/megatron-lm!460
Core merge main

See merge request ADLR/megatron-lm!464
Remove noop used to try to force scheduling and check for environment variable instead.

See merge request ADLR/megatron-lm!463
inverse_square_root learning param schedule

See merge request ADLR/megatron-lm!466
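
For context, an inverse-square-root schedule typically ramps the learning rate up linearly during warmup and then decays it proportionally to 1/sqrt(step). A minimal sketch of that shape, with illustrative constants and names (this is not Megatron-LM's actual scheduler, which also handles min-LR clamping and other decay styles):

```python
import math

def inverse_square_root_lr(step: int, base_lr: float = 3e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup, then decay proportional to 1/sqrt(step).

    Illustrative sketch only; the constants are assumptions.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Scale by sqrt(warmup_steps) so the curve is continuous at the handoff.
    return base_lr * math.sqrt(warmup_steps / step)
```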
Disable newline after colon

See merge request ADLR/megatron-lm!469
Sending in prompts with the wrong type hangs the server. This adds a check to make sure the prompt payload is a list

See merge request ADLR/megatron-lm!473
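
The guard described above is conceptually simple: validate the request payload before it reaches the generation loop. A hedged sketch of such a check, assuming a Flask-style endpoint (the route, key names, and response format are illustrative, not the server's actual code):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api", methods=["PUT"])
def generate():
    data = request.get_json(force=True) or {}
    prompts = data.get("prompts")
    # Reject non-list payloads up front; a bare string or dict passed
    # into the generation loop is what hangs the server.
    if not isinstance(prompts, list):
        return jsonify({"message": "prompts must be a list of strings"}), 400
    return jsonify({"text": ["generated text" for _ in prompts]})  # placeholder
```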
Move most of mpu functionality into a new "Megatron core"

See merge request ADLR/megatron-lm!462
Fix merge error.

See merge request ADLR/megatron-lm!478
ViT Backbone Tensor Shape Fix

See merge request ADLR/megatron-lm!479
Support for variable sequence lengths across micro-batches

See merge request ADLR/megatron-lm!472
ksivaman and others added 18 commits May 27, 2023 00:17
Signed-off-by: Kirthi Shankar Sivamani <smkirthishankar@gmail.com>
Check if RoPE support is available in TE

See merge request ADLR/megatron-lm!614
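
A common way to implement this kind of guard is to compare the installed Transformer Engine version against the first release that shipped the feature. A minimal sketch, assuming the PyPI distribution name "transformer-engine" and an illustrative version threshold (check TE's changelog for the real one):

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version

def te_supports_rope(min_version: str = "0.10.0") -> bool:
    """Return True if the installed Transformer Engine is new enough for RoPE.

    The "0.10.0" threshold is an assumption for illustration.
    """
    try:
        return Version(version("transformer-engine")) >= Version(min_version)
    except PackageNotFoundError:
        # TE is not installed at all, so there is no RoPE support either way.
        return False
```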
Signed-off-by: Kirthi Shankar Sivamani <smkirthishankar@gmail.com>
Bug fixes for full activation recompute using TransformerEngine

See merge request ADLR/megatron-lm!615
Signed-off-by: Kirthi Shankar Sivamani <smkirthishankar@gmail.com>
Check TE version for rope during recompute

See merge request ADLR/megatron-lm!619
@mayank31398

Hey @RaymondLi0, thanks a lot for this effort.
Any ETA on this?

@RaymondLi0 (Collaborator, Author)

Hopefully early next week :)

RaymondLi0 changed the title from "WIP: merge from Nvidia main" to "merge from Nvidia main" on Jun 19, 2023
@RaymondLi0 (Collaborator, Author)

Some things are broken (I created a branch named before-merge in case someone needs to do one of these):

  • loading previous checkpoints that used the distributed optimizer will not work
  • the checkpoint merging tools will not work on previous checkpoints (may be fixed later with another merge from NVIDIA's repo)

@jlamypoirier (Collaborator) left a comment

Let's just merge and hope for the best 🤷‍♂️

RaymondLi0 merged commit 3e22c9f into multi-query-attention on Jun 19, 2023
RaymondLi0 deleted the NVIDIA-main branch on June 19, 2023 at 18:52