Insights: pytorch/torchtitan
Overview
24 Pull requests merged by 15 people
- Add logging for learning rates in MetricsProcessor (#1413, merged Jul 31, 2025)
- Refactor script to use 'overwrites' variable for command-line arguments in training scripts (#1473, merged Jul 31, 2025)
- Fix data_load_start position (#1481, merged Jul 31, 2025)
- validation support for pipeline parallelism [WIP] (#1490, merged Jul 31, 2025)
- fix creating leaf folder (#1502, merged Jul 31, 2025)
- remove dead code (#1501, merged Jul 31, 2025)
- [deepseek] integrate 16B tokenizer to match 16B official model (#1497, merged Jul 31, 2025)
- Refactor PP splitting (#1416, merged Jul 30, 2025)
- guard against nvidia-smi command exit code 1 (#1496, merged Jul 30, 2025)
- Change lr_min to min_lr_factor (#1471, merged Jul 30, 2025)
- Fixes the sd adapter in forge experiments (#1484, merged Jul 29, 2025)
- log cuda driver version for debugging (#1479, merged Jul 29, 2025)
- Fix tokenizer error message (#1476, merged Jul 29, 2025)
- Temporarily Disable Memory Tracking Test for FSDP2 (#1480, merged Jul 29, 2025)
- Log total number of tokens seen (#1474, merged Jul 29, 2025)
- improve reshard_after_forward logic (#1094, merged Jul 29, 2025)
- Re-enable pipeline parallel tests (#1477, merged Jul 29, 2025)
- [checkpoint] let user specify intial_load_path and initial_load_in_hf when using HF checkpoints (#1466, merged Jul 28, 2025)
- remove float8 force_recompute_fp8_weight_in_bwd flag (#1452, merged Jul 28, 2025)
- Fix a none pointer exception in checkpoint.py (#1465, merged Jul 28, 2025)
- make mxfp8 dim1 cast kernel configurable (#1427, merged Jul 25, 2025)
- publish instructions on adding a new model (#1451, merged Jul 25, 2025)
- Fix incorrect mapping of ffn_norm and attention_norm in HF Llama4 conversion script (#1455, merged Jul 24, 2025)
- added model definition conversion for llama3 (#1441, merged Jul 24, 2025)
12 Pull requests opened by 8 people
- [WIP] Integrate autoparallel into torchtitan (#1458, opened Jul 25, 2025)
- Autoparallel support for DP-only, DP+TP, or TP-only (#1459, opened Jul 25, 2025)
- [autoparallel] Enable bucketing passes for autoparallel, reorder and sink_waits. (#1463, opened Jul 25, 2025)
- [Evaluation] Adding evaluation feature to TorchTitan (#1470, opened Jul 28, 2025)
- minimal repro for fsdp + tp incorrect permutation (#1483, opened Jul 29, 2025)
- multi rank consolidation (#1485, opened Jul 29, 2025)
- perf testing (#1488, opened Jul 29, 2025)
- minimal repro of error saving pp in hf format (#1489, opened Jul 29, 2025)
- [a2av] Add autograd support for token dispatch op (#1491, opened Jul 30, 2025)
- minor fix (#1494, opened Jul 30, 2025)
- Refactor checkpoint load directory check (#1498, opened Jul 30, 2025)
- [deepseek] update to 16b base tokenizer (#1499, opened Jul 31, 2025)
9 Issues closed by 7 people
- Float8 training command (#1443, closed Jul 31, 2025)
- CUDA driver error during symmetric memory initialization (#1475, closed Jul 29, 2025)
- DeepSeek V3 EP Functionality Issues in Parallelization (#1472, closed Jul 29, 2025)
- FSDP2 root level parameter management (#1091, closed Jul 29, 2025)
- [Question] About SimpleFSDP and FSDP2 (#1426, closed Jul 26, 2025)
- [Deepseek] FlexAttention support (#1412, closed Jul 25, 2025)
- deepseek_v3 and llama4 moe router scores are not normalized (#1418, closed Jul 24, 2025)
- Minor incorrect mapping while converting Llama4 HF to DCP (#1454, closed Jul 24, 2025)
6 Issues opened by 4 people
- training gradient norm (#1500, opened Jul 31, 2025)
- PP Stage init hangs on multi-nodes (#1492, opened Jul 30, 2025)
- Is there documentation on what exactly are 'dp_shard_mod_ep' and 'dp_shard_in_ep'? (#1482, opened Jul 29, 2025)
- Is FSDP+TP+EP supported for Llama4? (#1478, opened Jul 28, 2025)
- [Question] TP only + amp by `fully_shard(..., ignored_params=...)` (#1469, opened Jul 28, 2025)
- possible memory leaking of DP2EP with recompute (#1467, opened Jul 26, 2025)
14 Unresolved conversations
Sometimes conversations happen on old items that aren't yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- model fragments for diloco (#1446, commented on Jul 31, 2025; 22 new comments)
- Initial compile support for llama4 (#1365, commented on Jul 24, 2025; 8 new comments)
- [Refactor] Modular Integration Test Framework with DeepSeek-v3 Support (#1431, commented on Jul 30, 2025; 7 new comments)
- Adding Qwen3 model to the experiments folder (#1429, commented on Jul 25, 2025; 6 new comments)
- How to adapt HuggingFace or other models for TorchTitan (#1322, commented on Jul 24, 2025; 0 new comments)
- Inconsistent loss when resume training with vocab size that is not divisible by world size. (#1136, commented on Jul 29, 2025; 0 new comments)
- OOM recovery under multi-node FSDP/HSDP (#1329, commented on Jul 29, 2025; 0 new comments)
- Any plans to support DPO training? (#756, commented on Jul 29, 2025; 0 new comments)
- [AMD] [TorchTitan] Builds pass on PR, but not on nightly builds (#1486, commented on Jul 29, 2025; 0 new comments)
- TP broken due to newly added fused RMSNorm op (#1421, commented on Jul 30, 2025; 0 new comments)
- Circular imports (#1383, commented on Jul 31, 2025; 0 new comments)
- [llama3] add configurations for Llama 3 1B and 3B models (#1376, commented on Jul 28, 2025; 0 new comments)
- [WIP][Optimizers] Unofficial implementation of DION optimizer - DIstributed OrthoNormal updates (#1417, commented on Jul 25, 2025; 0 new comments)
- add lr logging (#1453, commented on Jul 24, 2025; 0 new comments)