From NVIDIA Megatron-LM for visibility #18

Open
wants to merge 4,810 commits into
base: multi-query-attention
f8c8c9c
ADLR/megatron-lm!2812 - Inference functional test: 580M Minitron
mathemakitten May 9, 2025
f25dceb
Merge branch 'helenn-inference-functional-test' into 'main'
ko3n1g May 9, 2025
16aeade
Revert "ADLR/megatron-lm!2812 - Inference functional test: 580M Minit…
chtruong814 May 10, 2025
861f574
ADLR/megatron-lm!2812 - Inference functional test: 580M Minitron
mathemakitten May 9, 2025
b6212fd
ADLR/megatron-lm!3277 - Invalidate cached SSM tensors if batch size c…
santhnm2 May 12, 2025
d68c474
Merge branch 'mamba_variable_batch_size_fix' into 'main'
shanmugamr1992 May 12, 2025
0b084c6
ADLR/megatron-lm!3291 - ci: Move unit test logic to file
ko3n1g May 12, 2025
460e961
Merge branch 'ko3n1g/ci/unit-tests-script' into 'main'
ko3n1g May 12, 2025
f8b1172
ADLR/megatron-lm!3243 - Adapt _write_item call to new signature with …
skierat May 12, 2025
a3609ee
Merge branch 'skierat/write_item_signature' into 'main'
ko3n1g May 12, 2025
d87ba91
ADLR/megatron-lm!2711 - Add in-process restart
szmigacz May 13, 2025
0bdebc0
Merge branch 'inprocess_mr' into 'main'
deepakn94 May 13, 2025
5c7ecad
ci(hotfix): Update Dockerfile.ci.dev
ko3n1g May 13, 2025
e41dde6
Revert "ADLR/megatron-lm!2711 - Add in-process restart"
ko3n1g May 13, 2025
f61b17c
ADLR/megatron-lm!3292 - ci: Run on multiple clusters
ko3n1g May 13, 2025
c552e21
Merge branch 'ko3n1g/ci/multi-cluster' into 'main'
ko3n1g May 13, 2025
55343df
ADLR/megatron-lm!3302 - ci: Allow specific TE-ref
ko3n1g May 13, 2025
d50e830
Merge branch 'ko3n1g/ci/te-nightly' into 'main'
ko3n1g May 13, 2025
8c4875f
ADLR/megatron-lm!3299 - ci(fix): Write logs to log_dir
ko3n1g May 13, 2025
d6eb60b
Merge branch 'ko3n1g/ci/unit-tests-locally' into 'main'
ko3n1g May 13, 2025
c58e57f
ADLR/megatron-lm!3253 - Address dist checkpointing PyT 24.08 failure
ananthsub May 14, 2025
4a114e6
Merge branch 'dist-ckpt-2408' into 'main'
deepakn94 May 14, 2025
d2cbe5a
ADLR/megatron-lm!3307 - ci(hotfix): Downstream pipeline
ko3n1g May 14, 2025
53d55fb
Merge branch 'ko3n1g/ci/fix-downstream-pipeline' into 'main'
ko3n1g May 14, 2025
9c586bf
ADLR/megatron-lm!3308 - MR feedback: added units for arguments, optio…
rhewett-nv May 14, 2025
8416bff
Merge branch 'inprocess_mr' into 'main'
ko3n1g May 14, 2025
07b1992
ADLR/megatron-lm!2966 - Allow process group as optional argument for …
ZhiyuLi-Nvidia May 16, 2025
175497e
Merge branch 'zhiyul/orthotope/ssm' into 'main'
ko3n1g May 16, 2025
7f9f2bf
ADLR/megatron-lm!2588 - Add NVTX ranges to categorize execution
May 16, 2025
8a9e864
Merge branch 'llama31_automated_breakdown' into 'main'
jaredcasper May 16, 2025
1ff5a37
ADLR/megatron-lm!3116 - Move fsdp 2 import from _composable to public
BoxiangW May 16, 2025
ed0d528
Merge branch 'boxiangw/public_fsdp_import' into 'main'
ko3n1g May 16, 2025
d70e2e4
ADLR/megatron-lm!3321 - ci: Add nemo-image to `ci-rebuild-mcore-nemo-…
ko3n1g May 16, 2025
054fad5
Merge branch 'ko3n1g/ci/fix-rebuild-job' into 'main'
ko3n1g May 16, 2025
e494219
ADLR/megatron-lm!3197 - ci: Re-enable tests that failed on memory
ko3n1g May 16, 2025
bfc751a
Merge branch 'ko3n1g/ci/re-enable-broken-tests' into 'main'
ko3n1g May 16, 2025
a73b4d2
tests: Disable flaky test
ko3n1g May 16, 2025
407e504
ADLR/megatron-lm!3254 - Engine updates
shanmugamr1992 May 18, 2025
7fe8f69
Merge branch 'engine_updates' into 'main'
shanmugamr1992 May 18, 2025
ee1d765
ADLR/megatron-lm!3312 - ci: Onboard mr-slim to h100
ko3n1g May 19, 2025
861a8fa
Merge branch 'ko3n1g/ci/dev-on-h100' into 'main'
ko3n1g May 19, 2025
cf03fb2
ADLR/megatron-lm!3334 - chore: Deprecate T5 tests
ko3n1g May 19, 2025
8e1c3df
Merge branch 'ko3n1g/chore/remove-t5-from-lts' into 'main'
ko3n1g May 19, 2025
6eb0bcf
ADLR/megatron-lm!3062 - Fix wrong fp8_meta info when resume training …
BestJuly May 19, 2025
dfc0a3d
Merge branch 'lit/fix_fp8_moe_resume_training' into 'main'
ko3n1g May 19, 2025
ee4815f
ADLR/megatron-lm!3198 - Adding Audio Submodules for MiMO.
yashaswikarnati May 19, 2025
3895909
Merge branch 'yash/mimo_audio_submodules' into 'main'
jaredcasper May 19, 2025
5749637
ADLR/megatron-lm!3314 - Bugfix in huggingface hf_llava converter
May 20, 2025
5da8c0b
Merge branch 'tpoon/hf-small-fix' into 'main'
jaredcasper May 20, 2025
22d0305
ADLR/megatron-lm!3339 - ci: Remove deprecated bert tests
ko3n1g May 20, 2025
4d1a4e8
Merge branch 'ko3n1g/tests/deprecated-bert-tests' into 'main'
ko3n1g May 20, 2025
9eebd51
ADLR/megatron-lm!3252 - Skyw/force vp stage passing in core
skyw May 20, 2025
bed7dbd
Merge branch 'skyw/force_vp_stage_passing_in_core' into 'main'
ko3n1g May 20, 2025
000c978
ADLR/megatron-lm!3134 - Fix MMMU prompt and inference context
May 21, 2025
eb20e24
Merge branch 'matthieul/fix_mmodal_inference' into 'main'
jaredcasper May 21, 2025
3a49d53
ADLR/megatron-lm!3336 - fix: Make NVTX optional
ko3n1g May 21, 2025
5a676b3
Merge branch 'ko3n1g/fix/guard-nvtx' into 'main'
ko3n1g May 21, 2025
0db0e83
ADLR/megatron-lm!3329 - Fix CUDA_DEVICE_MAX_CONNECTIONS check on Blac…
duncanriach May 22, 2025
75b1ca1
Merge branch 'fix-cdmc-check-on-blackwell' into 'main'
deepakn94 May 22, 2025
40df28b
ADLR/megatron-lm!2902 - Userbuffer registration for MCore-FSDP
youngeunkwon0405 May 22, 2025
20304a3
Merge branch 'fsdp-ubr' into 'main'
ko3n1g May 22, 2025
6b6e9db
ci(hotfix):switch runner
ko3n1g May 22, 2025
497c3e2
ADLR/megatron-lm!3346 - Fix text generation
May 24, 2025
b8be0af
Merge branch 'matthieul/fix_text_generation' into 'main'
trintamaki May 24, 2025
194b2be
ADLR/megatron-lm!3343 - ci: Run tests on H100
ko3n1g May 24, 2025
3996ec2
Merge branch 'ko3n1g/ci/dev-on-h100-2' into 'main'
ko3n1g May 24, 2025
022bcb5
ADLR/megatron-lm!3274 - feat: add force-load-balancing for MoE router
Victarry May 24, 2025
18b32aa
Merge branch 'denliu/router_force_balance' into 'main'
ko3n1g May 24, 2025
37587af
ADLR/megatron-lm!3353 - tests: Onboard gpt-nemo test
ko3n1g May 24, 2025
957dc60
Merge branch 'ko3n1g/tests/gpt-nemo' into 'main'
ko3n1g May 24, 2025
32b6d48
ADLR/megatron-lm!3309 - Add user guide for Multi-Storage Client integ…
shunjiad May 24, 2025
cd88296
Merge branch 'chore-msc-doc' into 'main'
ko3n1g May 24, 2025
de7945b
ADLR/megatron-lm!3335 - tests: Onboard MoE memory test
ko3n1g May 25, 2025
1e05700
Merge branch 'ko3n1g/tests/moe-memory' into 'main'
ko3n1g May 25, 2025
c8cc2c6
ADLR/megatron-lm!3354 - ci: Restart on segfault
ko3n1g May 26, 2025
42ccbb8
Merge branch 'ko3n1g/ci/restart-on-segfault' into 'main'
ko3n1g May 26, 2025
1c7d3db
ADLR/megatron-lm!3364 - chore: add pre-commit config file
ko3n1g May 27, 2025
0e6223d
Merge branch 'ko3n1g/chore/add-precommit' into 'main'
ko3n1g May 27, 2025
f1c74a6
ADLR/megatron-lm!3359 - ci: Update nightlies
ko3n1g May 27, 2025
0d77a93
Merge branch 'ko3n1g/ci/update-nightlies-2' into 'main'
ko3n1g May 27, 2025
e582852
ADLR/megatron-lm!3231 - Multiple touches for TensorRT Model Optimizer…
ChenhanYu May 28, 2025
7b4fbe9
Merge branch 'chenhany/heterogenous_sharded_ckpt' into 'main'
ko3n1g May 28, 2025
fb68b3a
ADLR/megatron-lm!3360 - add more nemo2 tests
dimapihtar May 28, 2025
27c33cb
Merge branch 'add_nemo2_tests' into 'main'
ko3n1g May 28, 2025
e019f82
ADLR/megatron-lm!3348 - fix: handle checkpoint_dir path as a string
shunjiad May 29, 2025
0cb4da1
Merge branch 'fix-torch-msc-checkpointing' into 'main'
deepakn94 May 29, 2025
90e768c
ADLR/megatron-lm!3367 - Update dataset helper for online video decoding
May 29, 2025
705d312
Merge branch 'matthieul/fix_text_generation' into 'main'
trintamaki May 29, 2025
7c1baea
ADLR/megatron-lm!3365 - Do not use eval on arbitrary user input.
jaredcasper May 29, 2025
c820c68
Merge branch 'safer-eval' into 'main'
jaredcasper May 29, 2025
c6b08c2
ADLR/megatron-lm!3363 - tests: Update frozen-checkpoints
ko3n1g May 30, 2025
8a39761
Merge branch 'ko3n1g/tests/frozen-cpkt' into 'main'
ko3n1g May 30, 2025
8d08685
ADLR/megatron-lm!3375 - Consolidate eval methods across train and gen…
May 30, 2025
13898cb
Merge branch 'matthieul/consolidate_eval' into 'main'
trintamaki May 30, 2025
de245df
ADLR/megatron-lm!3388 - ci: Auto-restart on nan
ko3n1g May 30, 2025
0a438ed
Merge branch 'ko3n1g/ci/restart-on-nan' into 'main'
ko3n1g May 30, 2025
23e6471
ADLR/megatron-lm!2949 - perf(mla, experimental): MLA RoPE fusion and …
hxbai Jun 2, 2025
9c1a535
Merge branch 'hongxiaob/mla_rope' into 'main'
ko3n1g Jun 2, 2025
da3f0ff
ADLR/megatron-lm!3280 - Fix custom FSDP float8 tensor set_item
shjwudp Jun 3, 2025
549d637
Merge branch 'fix_cfsdp_fp8_param_load' into 'main'
chtruong814 Jun 3, 2025
24c60db
ADLR/megatron-lm!3401 - ci: Move queue blocker
ko3n1g Jun 3, 2025
cfea2ea
Merge branch 'ko3n1g/ci/move-queue-blocker' into 'main'
ko3n1g Jun 3, 2025
37b0afd
ADLR/megatron-lm!3400 - ci: Improve error-handling of missing logs
ko3n1g Jun 4, 2025
6a62a54
Merge branch 'ko3n1g/ci/better-log-failure-handling' into 'main'
ko3n1g Jun 4, 2025
4648912
ADLR/megatron-lm!3408 - ci: Control job concurrency
ko3n1g Jun 4, 2025
cde60ce
Merge branch 'ko3n1g/ci/job-concurrency' into 'main'
ko3n1g Jun 4, 2025
eab047c
ADLR/megatron-lm!3412 - ci: Catch missing logs
ko3n1g Jun 4, 2025
25a26ca
Merge branch 'ko3n1g/ci/fix-no-log' into 'main'
ko3n1g Jun 4, 2025
9bdfe31
ADLR/megatron-lm!3411 - ci: Remove tests from A100
ko3n1g Jun 4, 2025
ff64f96
Merge branch 'ko3n1g/ci/move-tests' into 'main'
ko3n1g Jun 4, 2025
d960800
ADLR/megatron-lm!3393 - Add an option to skip counting zeros in grad …
erhoo82 Jun 5, 2025
b47a9bb
Merge branch 'no_count_zeros' into 'main'
ko3n1g Jun 5, 2025
bc80491
ADLR/megatron-lm!3326 - Add an interface to set high priority stream …
youngeunkwon0405 Jun 5, 2025
957f348
Merge branch 'comm-priority-setting' into 'main'
ko3n1g Jun 5, 2025
7af72f9
ADLR/megatron-lm!3241 - Llama4 inference
wdykas Jun 6, 2025
4eb36f8
Merge branch 'llama4-inference' into 'main'
chtruong814 Jun 6, 2025
61a42f6
ADLR/megatron-lm!3421 - Change default value of high_priority_stream_…
youngeunkwon0405 Jun 6, 2025
7c64be3
Merge branch 'comm-priority-patch' into 'main'
jaredcasper Jun 6, 2025
92d68da
ADLR/megatron-lm!3170 - [feat, moe]: FP8 padding optimization of MoE …
Victarry Jun 9, 2025
140dce2
Merge branch 'denliu/router_pad' into 'main'
ko3n1g Jun 9, 2025
9e3adb5
ADLR/megatron-lm!3306 - Remove deprecated alltoall_seq dispatcher.
Victarry Jun 9, 2025
823466e
Merge branch 'denliu/remove_alltoall_seq_dispatcher' into 'main'
ko3n1g Jun 9, 2025
db07e3f
ADLR/megatron-lm!3347 - Fix flash decode bug caused by unnecessary ro…
santhnm2 Jun 9, 2025
2e15d12
Merge branch 'hybrid_example' into 'main'
ko3n1g Jun 9, 2025
1589517
ADLR/megatron-lm!3404 - Fix perf issues with NVTX range profiling
Jun 9, 2025
b04c901
Merge branch 'nvtx_perf_fix' into 'main'
ko3n1g Jun 9, 2025
791454d
ADLR/megatron-lm!3385 - Enforce param group ordering after checkpoint…
skierat Jun 9, 2025
40cb6e7
Merge branch 'skierat/fix_param_groups' into 'main'
ko3n1g Jun 9, 2025
54cdc7a
ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype, embeddi…
cuichenx Jun 10, 2025
d1409db
Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'
ko3n1g Jun 10, 2025
629b615
Revert "Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'"
ko3n1g Jun 10, 2025
50a1247
Reapply "Merge branch 'chcui/llama-nemotron-nano-vl-8b' into 'main'"
ko3n1g Jun 10, 2025
5ae21f8
Revert "ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype,…
ko3n1g Jun 10, 2025
62e7e60
ADLR/megatron-lm!3332 - fix(mtp): Fix issue with MTP+VPP after !3108 …
shifangx Jun 11, 2025
ad36348
Merge branch 'shifang/fix_vp_stage' into 'main'
ko3n1g Jun 11, 2025
0f4f095
ADLR/megatron-lm!3384 - Expose TE fused MLP with module spec
timmoon10 Jun 11, 2025
0595ef2
Merge branch 'mfutrega/fused_swiglu' into 'main'
ko3n1g Jun 11, 2025
9e5fe7a
ADLR/megatron-lm!3403 - Moe inference functional tests
wdykas Jun 12, 2025
0dea9a5
Merge branch 'moe-tests' into 'main'
ko3n1g Jun 12, 2025
80d66ec
ADLR/megatron-lm!3458 - ci: Benchmark release tests suite with TE2.2 …
ko3n1g Jun 12, 2025
a3e2222
Merge branch 'ko3n1g/chore/release-benchmarks-dev' into 'main'
ko3n1g Jun 12, 2025
15e4446
ADLR/megatron-lm!3371 - Move data to GPU for TP data processing
parthmannan Jun 12, 2025
d58f062
Merge branch 'pmannan/improve_data_processing' into 'main'
ko3n1g Jun 12, 2025
f5cfc10
Reapply "ADLR/megatron-lm!3399 - [MM] [Bug Fix] model parameter dtype…
ko3n1g Jun 12, 2025
5bb6cf3
update golden values
ko3n1g Jun 12, 2025
603592a
ADLR/megatron-lm!3366 - Optimize dummy weight tensors for cudagraph a…
gdengk Jun 12, 2025
40bfaf5
Merge branch 'gaod/llama4/cudagraph_optimize' into 'main'
ko3n1g Jun 12, 2025
6782fe4
ADLR/megatron-lm!3377 - Add --enable-experimental to args.
Victarry Jun 12, 2025
32737be
Merge branch 'denliu/add_enable_experimental_flag' into 'main'
ko3n1g Jun 12, 2025
e63aee4
ADLR/megatron-lm!3281 - perf(MLA): MLA down proj switch back to TELinear
yuzhongw-nvidia Jun 13, 2025
ae63c41
Merge branch 'mla_down_proj_telinear' into 'main'
ko3n1g Jun 13, 2025
9042182
ADLR/megatron-lm!3463 - ci: Retry on network errors
ko3n1g Jun 13, 2025
819f752
Merge branch 'ko3n1g/ci/wait-resources-resiliency' into 'main'
ko3n1g Jun 13, 2025
b8605c6
ADLR/megatron-lm!3361 - Add TE functional tests
ko3n1g Jun 13, 2025
107fc72
Merge branch 'ko3n1g/guyueh/te_functional_tests' into 'main'
ko3n1g Jun 13, 2025
effa991
revert
ko3n1g Jun 13, 2025
ad7d1df
ci: Restart on cuda error
ko3n1g Jun 13, 2025
f21a28b
Revert "ADLR/megatron-lm!3281 - perf(MLA): MLA down proj switch back …
ko3n1g Jun 13, 2025
a4fc916
Merge branch 'ko3n1g/ci/restart-on-cuda'
ko3n1g Jun 13, 2025
7f7ffcf
Merge branch 'ko3n1g/chore/re-apply-3399'
ko3n1g Jun 13, 2025
73558db
ci: Set gpt-nemo tests as allowed to fail
ko3n1g Jun 13, 2025
42f7f7f
ci: Fix while loop
ko3n1g Jun 13, 2025
0bbcbb1
ADLR/megatron-lm!3024 - Added support for offloading Swiglu activatio…
sanandaraj5597 Jun 13, 2025
fdcf52b
Merge branch 'swiglu_offload' into 'main'
ericharper Jun 13, 2025
cfe7b06
ADLR/megatron-lm!3279 - Fix MoE Aux loss
aklife97 Jun 13, 2025
aaddc23
Merge branch 'akhattar/auxloss_fix' into 'main'
ko3n1g Jun 13, 2025
db8cd9a
ADLR/megatron-lm!3429 - llama 3p1 nemotron nano vl 8b v1 instructions
Jun 13, 2025
dca59c6
Merge branch 'matthieul/llama_3p1_nemotron_nano_vl_8b_v1' into 'main'
ko3n1g Jun 13, 2025
9caa5d3
ADLR/megatron-lm!3289 - Fix attention unit test
santhnm2 Jun 14, 2025
8a03b29
Merge branch 'attention_unit_test_fix' into 'main'
ko3n1g Jun 14, 2025
04c93ae
ADLR/megatron-lm!3265 - Handle strict argument for local checkpointing
Jun 14, 2025
59ae4e3
Merge branch 'jszulc/local-ckpt-strict-loading' into 'main'
ko3n1g Jun 14, 2025
77732c3
ADLR/megatron-lm!2795 - feat(Pipeline Parallel, MoE): Flexible Asymme…
Shunkangz Jun 14, 2025
aec50ee
Merge branch 'flexible_vpp' into 'main'
ko3n1g Jun 14, 2025
19d30fa
ADLR/megatron-lm!3317 - Fix version check of test_fp8_param.py
kunlunl Jun 14, 2025
48396b2
Merge branch 'kunlunl/fix_fp8_param_ut_version_check' into 'main'
ko3n1g Jun 14, 2025
0d549aa
ADLR/megatron-lm!3461 - Fix common state comparison primitive
mikolajblaz Jun 14, 2025
de3da90
Merge branch 'mblaz/fix-dict-utils-diff' into 'main'
ko3n1g Jun 14, 2025
f2116e2
ADLR/megatron-lm!3153 - Update inference README
mathemakitten Jun 14, 2025
a981bf8
Merge branch 'helenn-update-inference-readme' into 'main'
jaredcasper Jun 14, 2025
d920c0d
ADLR/megatron-lm!3345 - M4 Taskforce: update get_rank & get_size of PG
yaoyu-33 Jun 14, 2025
fabb0a0
Merge branch 'yuya/m4_get_rank_get_size_of_pg_update' into 'main'
ko3n1g Jun 14, 2025
03322c1
ADLR/megatron-lm!3448 - CRADIO-g support
Jun 14, 2025
c85b6e7
Merge branch 'tpoon/cradio-g-mr' into 'main'
ko3n1g Jun 14, 2025
9d509a0
ADLR/megatron-lm!3127 - feat(optimizer): Support bf16 dtype for optim…
BestJuly Jun 14, 2025
083b1dc
Merge branch 'lit/support_bf16_optimzer_states' into 'main'
ko3n1g Jun 14, 2025
9900d9a
ADLR/megatron-lm!3379 - Megatron SFT
Jun 14, 2025
775a1d1
Merge branch 'megatron-main-sft' into 'main'
ko3n1g Jun 14, 2025
ee56591
ADLR/megatron-lm!3376 - Fix cuda graph for MambaLayer
guyueh1 Jun 14, 2025
5b4e466
Merge branch 'fix_cuda_graph_for_ssm' into 'main'
ko3n1g Jun 14, 2025
e3ec174
ADLR/megatron-lm!2276 - Add Mamba context parallel
duncanriach Jun 14, 2025
55080a3
Merge branch 'duncan/mamba-context-parallel' into 'main'
ericharper Jun 14, 2025
d559555
ADLR/megatron-lm!3415 - [MXFP8]Reduce memory footprint by initializin…
Jun 14, 2025
bcf96e3
Merge branch 'qiyuw/mxfp8-param' into 'main'
ko3n1g Jun 14, 2025
66194b7
ADLR/megatron-lm!3462 - Add hybrid functional inference test
wdykas Jun 14, 2025
d738935
Merge branch 'mamba-inference-test' into 'main'
ko3n1g Jun 14, 2025
bf6e998
ADLR/megatron-lm!3316 - added llama model training example with FP8
sbhavani Jun 14, 2025
38e30f5
Merge branch 'main' into 'main'
ko3n1g Jun 14, 2025
0f05866
ADLR/megatron-lm!3387 - feat(MoE): Using `te_general_gemm` to handle …
hxbai Jun 14, 2025
dc8372b
Merge branch 'hongxiaob/custom_router_gating' into 'main'
ko3n1g Jun 14, 2025
1674ce3
ADLR/megatron-lm!3190 - Mark weights from vision encoder to be non-te…
wdykas Jun 14, 2025
a165235
Merge branch 'hf-diverge-fix' into 'main'
ko3n1g Jun 14, 2025
0431153
ADLR/megatron-lm!2850 - Granular upcycling implementation
shifangx Jun 15, 2025
c2fb1de
Merge branch 'shifang/granular_upcycling' into 'main'
ko3n1g Jun 15, 2025
a0937dd
ADLR/megatron-lm!3424 - Add GPU energy (and ~power) monitoring for tr…
Jun 15, 2025
cca17b7
Merge branch 'energy-monitoring' into 'main'
ko3n1g Jun 15, 2025
8333bd5
ADLR/megatron-lm!3217 - feat(MoE): Support ep a2a overlap - (01) Add …
Wohox Jun 16, 2025
3e55583
Merge branch 'pingtianl/fine_grained_transformer_layer_submodules' in…
ko3n1g Jun 16, 2025
5005416
ADLR/megatron-lm!3397 - build: Switch to uv
ko3n1g Jun 16, 2025
0df9325
Merge branch 'ko3n1g/build/refactor-setup' into 'main'
ko3n1g Jun 16, 2025
59f2093
ADLR/megatron-lm!3468 - build: Simplify nemo image
ko3n1g Jun 16, 2025
df7401b
Merge branch 'ko3n1g/build/simplify-nemo-image' into 'main'
ko3n1g Jun 16, 2025
2b1c2d6
ADLR/megatron-lm!3272 - Make completions endpoint use MCore inference…
santhnm2 Jun 16, 2025
c40f31f
Merge branch 'completions_endpoint_fix' into 'main'
ko3n1g Jun 16, 2025
2b11af0
ADLR/megatron-lm!3420 - Implement dist-ckpt content versioning
mikolajblaz Jun 16, 2025
83a0f5a
Merge branch 'mblaz/dist-ckpt-content-versioning' into 'main'
ko3n1g Jun 16, 2025
8c1d0c7
ADLR/megatron-lm!3451 - fix (ckpt): Fix `_extra_state` for TE 2.5
yaox12 Jun 16, 2025
6bf889f
Merge branch 'xiny/fix_extra_state' into 'main'
ko3n1g Jun 16, 2025
6dc6050
ADLR/megatron-lm!3081 - Add Hybrid Shard Data-Parallel Support for Cu…
shjwudp Jun 16, 2025
aad967f
Merge branch 'custom_fsdp_hsdp_support' into 'main'
ko3n1g Jun 16, 2025
c7cf075
ADLR/megatron-lm!3450 - Revert `fork` to `spawn` based on stability i…
sbak5 Jun 16, 2025
c8f2f56
Merge branch 'sbak/ckpt_manager_fix' into 'main'
jaredcasper Jun 16, 2025
f7e4641
ADLR/megatron-lm!3301 - Add kitchen extension with per-layer configur…
kwyss-nvidia Jun 16, 2025
8c15450
Merge branch 'kwyss/megatron_kitchen_extension' into 'main'
jaredcasper Jun 16, 2025
1e8e9a4
ADLR/megatron-lm!3474 - Add deprecation warning for legacy inference
santhnm2 Jun 17, 2025
b87f147
Merge branch 'legacy_deprecation_warning' into 'main'
ko3n1g Jun 17, 2025
ab77e52
ADLR/megatron-lm!3181 - Change naming of original_max_position_embedd…
BoxiangW Jun 17, 2025
2386c6c
Merge branch 'boxiangw/mla-yarn-change-option-name' into 'main'
ericharper Jun 17, 2025
fee5600
ADLR/megatron-lm!3472 - Make cudagraph replay check more descriptive …
mathemakitten Jun 17, 2025
c3dc507
Merge branch 'helenn-flag-specific-error-for-cudagraph-replay' into '…
ericharper Jun 17, 2025
db70ed4
ADLR/megatron-lm!3414 - M4 Taskforce: Disable T5 and encoder_and_deco…
yaoyu-33 Jun 17, 2025
5615930
Merge branch 'yuya/m4_remove_encoder_pp_tests_ci_add_deprecation' int…
ko3n1g Jun 17, 2025
e0b2c60
ADLR/megatron-lm!3444 - Quick fix for NeMo: handle alternate key name…
skierat Jun 17, 2025
bfa39e8
Merge branch 'skierat/quick_nemo_fix' into 'main'
ko3n1g Jun 17, 2025
0e3af7e
ADLR/megatron-lm!3477 - chore: Bump version 0.14.0
ko3n1g Jun 17, 2025
27c9b6c
Merge branch 'ko3n1g/chore/release-version-0.14.0' into 'main'
ericharper Jun 17, 2025
3987e89
ADLR/megatron-lm!3071 - Added offloading support for MCore layers
sanandaraj5597 Jun 17, 2025
4a91173
Merge branch 'lora_offload' into 'main'
ericharper Jun 17, 2025
115785f
ADLR/megatron-lm!3437 - Bug fix to reset kv chunks assigned to -1 and…
shanmugamr1992 Jun 18, 2025
3b0f763
Merge branch 'bugFixDE' into 'main'
shanmugamr1992 Jun 18, 2025
642a181
ADLR/megatron-lm!3483 - chore: Add init to tools
ko3n1g Jun 18, 2025
0710137
Merge branch 'ko3n1g/chore/tool-init' into 'main'
ko3n1g Jun 18, 2025
171c351
ADLR/megatron-lm!3480 - Fix unit test test_fp8_param.py blockwise sca…
guyueh1 Jun 18, 2025
57082f9
Merge branch 'fix_2425' into 'main'
ko3n1g Jun 18, 2025
9f1c4b2
ADLR/megatron-lm!3492 - chore: Add init to examples
ko3n1g Jun 18, 2025
6ac5633
Merge branch 'ko3n1g/chore/examples-init' into 'main'
ko3n1g Jun 18, 2025
2074d19
ADLR/megatron-lm!3493 - build: Force pin down setuptools
ko3n1g Jun 18, 2025
0600a3c
Merge branch 'ko3n1g/build/fix-setuptools-version' into 'main'
ko3n1g Jun 18, 2025
a002d50
ADLR/megatron-lm!3341 - Pad input tensors and enable fp8 weights for …
santhnm2 Jun 18, 2025
6a6cd47
Merge branch 'fp8_inference' into 'main'
ko3n1g Jun 18, 2025
5 changes: 0 additions & 5 deletions .coveragerc

This file was deleted.

4 changes: 4 additions & 0 deletions .flake8
@@ -0,0 +1,4 @@
[flake8]
max-line-length = 100
extend-ignore = E203,E501,F401,E402,E714
per-file-ignores = __init__.py:F401
32 changes: 32 additions & 0 deletions .github/ISSUE_TEMPLATE/bug.md
@@ -0,0 +1,32 @@
---
name: BUG
about: Report a bug that needs attention
title: "[BUG]"
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Expected behavior**
A clear and concise description of what you expected to happen.

**Stack trace/logs**
If applicable, add the stack trace or logs from the time of the error.

**Environment (please complete the following information):**
- Megatron-LM commit ID
- PyTorch version
- CUDA version
- NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
23 changes: 23 additions & 0 deletions .github/ISSUE_TEMPLATE/enhancement.md
@@ -0,0 +1,23 @@
---
name: ENHANCEMENT
about: Suggest an idea to improve this project
title: "[ENHANCEMENT]"
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Proposed implementation**
If you have a proposed implementation for the feature state it here or link to a PR.

**Additional context**
Add any other context or screenshots about the feature request here.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,12 @@
---
name: QUESTION
about: Ask a question about Megatron-LM that is not a bug, regression or enhancement request
title: "[QUESTION]"
labels: ''
assignees: ''

---

**Your question**
Ask a clear and concise question about Megatron-LM.
39 changes: 39 additions & 0 deletions .github/ISSUE_TEMPLATE/regression.md
@@ -0,0 +1,39 @@
---
name: REGRESSION
about: Report a regression in speed or accuracy due to a Megatron-LM update
title: "[REGRESSION]"
labels: ''
assignees: ''

---

**Describe the regression**
A clear and concise description of what the regression is.

**To Reproduce**
Steps to reproduce the behavior. The easier it is to reproduce the faster it will get maintainer attention.

**Previous performance**
What speed or accuracy did you previously see.

**New performance**
What speed or accuracy do you see after the update.

**Stack trace/logs**
If applicable, add the stack trace or logs related to the regression.

**Environment (please complete the following information):**
- Previous Megatron-LM commit ID
- New Megatron-LM commit ID
- Previous PyTorch version
- New PyTorch version
- Previous CUDA version
- New CUDA version
- Previous NCCL version
- New NCCL version

**Proposed fix**
If you have a proposal for how to fix the issue state it here or link to a PR.

**Additional context**
Add any other context about the problem here.
31 changes: 31 additions & 0 deletions .github/workflows/stale.yml
@@ -0,0 +1,31 @@
# This workflow warns and then closes issues and PRs that have had no activity for a specified amount of time.
#
# You can adjust the behavior by modifying this file.
# For more information, see:
# https://github.com/actions/stale
name: Mark stale issues and pull requests

on:
  schedule:
  - cron: '15 18 * * *'

jobs:
  stale:

    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write

    steps:
    - uses: actions/stale@v5
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        days-before-stale: 60
        stale-issue-message: 'Marking as stale. No activity in 60 days.'
        stale-pr-message: 'Marking as stale. No activity in 60 days.'
        stale-issue-label: 'stale'
        stale-pr-label: 'stale'
        remove-stale-when-updated: true
        operations-per-run: 1000
        days-before-close: -1
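
As a rough illustration of what the settings above imply, here is a minimal Python sketch of the decision the workflow makes for each issue or PR. It is a simplified model, not the `actions/stale` implementation: the function name `stale_action` and its parameters are hypothetical, and it assumes only the documented behavior that a negative `days-before-close` disables automatic closing.

```python
# Simplified, hypothetical model of the stale.yml settings shown above.
# This is NOT the actions/stale source; names here are illustrative only.

DAYS_BEFORE_STALE = 60   # from `days-before-stale: 60`
DAYS_BEFORE_CLOSE = -1   # from `days-before-close: -1` (negative: never close)


def stale_action(idle_days: int, is_stale: bool,
                 days_before_stale: int = DAYS_BEFORE_STALE,
                 days_before_close: int = DAYS_BEFORE_CLOSE) -> str:
    """Return the action for an item that has been idle for `idle_days`."""
    if not is_stale:
        # An unlabeled item is marked stale once it has been idle long enough.
        if days_before_stale >= 0 and idle_days >= days_before_stale:
            return "mark_stale"
        return "none"
    # An already-stale item is closed only if closing is enabled (>= 0).
    if days_before_close >= 0 and idle_days >= days_before_close:
        return "close"
    return "none"


if __name__ == "__main__":
    print(stale_action(idle_days=59, is_stale=False))
    print(stale_action(idle_days=60, is_stale=False))
    print(stale_action(idle_days=400, is_stale=True))
```

Under this model, with `days-before-close: -1` an item is labeled `stale` after 60 idle days but is never closed automatically, even though the header comment in the file mentions closing.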
6 changes: 6 additions & 0 deletions .gitignore
@@ -6,3 +6,9 @@ build
*~
slurm*
logs
.vscode
local/
.gitmodules
wandb/
onelogger.log
onelogger.err