Every one of these improvements is an opportunity to make Fast-LLM better, faster, and smarter for everyone. Dive in and make an impact:
### Key Opportunities for Improvement
- [x] **Timeout Customization**: Distributed timeouts for dataset building, checkpoint saving, and tensor computations need to be set independently. These tasks have vastly different durations, and a one-size-fits-all timeout isn’t cutting it. #122
- [ ] **Faster Dataset Loading**: Current dataset loading and preprocessing in Fast-LLM are too slow for large-scale training (e.g., 10T tokens or more). A 10–100x speedup here would make a world of difference. See details: [#132](https://github.com/ServiceNow/Fast-LLM/issues/76).
- [ ] **Meaningful Train-Val-Test Splits**: Current splits (e.g., [99,1,0]) are random and only track training-loss trends. Instead, validation should evaluate loss on specific held-out datasets (C4, The Pile, WikiText-103) to provide actionable insights and allow meaningful comparisons across models and datasets. Proposed solution: [#65](https://github.com/ServiceNow/Fast-LLM/issues/65).
- [ ] **Full Configuration Dump**: Training runs use many parameters, and inspecting them currently requires digging into the code to check default values. By increasing the verbosity of configuration dumps, users could easily inspect all parameters used in a run. Potential solution: provide both a detailed config and a concise config summary for readability. See [#91](https://github.com/ServiceNow/Fast-LLM/issues/91).
- [ ] **Track Exact Training Version**: Ensure traceability by including a version string in the output directory of training runs. This should indicate the exact commit or tagged release used and should also be logged to wandb. See [#101](https://github.com/ServiceNow/Fast-LLM/issues/101).
- [ ] **Add Validation Step at the Beginning of Training**: When starting from a pretrained checkpoint, the initial validation loss is useful information to have; this could be enabled through an extra argument. See [#134]
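For the timeout item (#122), one way independent timeouts could look is a small config object with one field per task class. This is a hypothetical sketch; the field names and the config style are illustrative, not Fast-LLM's actual schema:

```python
import datetime
from dataclasses import dataclass


@dataclass
class DistributedTimeoutConfig:
    """One timeout per task class instead of a single global value (sketch for #122)."""

    # Default for frequent, short collective ops (e.g., tensor reductions).
    default: float = 60.0
    # Dataset building can take hours on the first run over a large corpus.
    dataset_building: float = 3600.0
    # Checkpoint saving scales with model size and filesystem speed.
    checkpoint_saving: float = 600.0

    def as_timedelta(self, task: str) -> datetime.timedelta:
        """Look up the timeout for `task`, falling back to the default."""
        seconds = getattr(self, task, self.default)
        return datetime.timedelta(seconds=seconds)
```

The resulting `timedelta` could then be passed to the relevant distributed barrier or process-group setup for each phase, rather than reusing one global value everywhere.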
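For the configuration-dump item (#91), the "detailed config plus concise summary" idea could be sketched as follows, assuming a dataclass-based config. The `TrainingConfig` fields here are placeholders, not Fast-LLM's real parameters:

```python
import dataclasses
import json


@dataclasses.dataclass
class TrainingConfig:
    """Placeholder config; stands in for the real training configuration."""

    learning_rate: float = 3e-4
    batch_size: int = 32
    train_iters: int = 100_000


def dump_full(config) -> str:
    """Every field, including defaults the user never set (the verbose dump)."""
    return json.dumps(dataclasses.asdict(config), indent=2, sort_keys=True)


def dump_summary(config) -> str:
    """Only fields that differ from their defaults, for quick readability."""
    defaults = type(config)()
    changed = {
        f.name: getattr(config, f.name)
        for f in dataclasses.fields(config)
        if getattr(config, f.name) != getattr(defaults, f.name)
    }
    return json.dumps(changed, indent=2, sort_keys=True)
```

Writing both dumps at the start of a run would let users inspect every effective parameter without reading the code, while the summary stays short enough to scan.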
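For the version-tracking item (#101), a minimal sketch of capturing the exact commit or tag and writing it into the run's output directory might look like this. The function names are illustrative; the same string would also be passed to the wandb run config:

```python
import subprocess
from pathlib import Path


def get_version_string() -> str:
    """Return the current tag if HEAD is tagged, else the commit hash.

    `--dirty` appends a suffix when there are uncommitted changes, so the
    string identifies the exact code that ran.
    """
    try:
        return subprocess.check_output(
            ["git", "describe", "--tags", "--always", "--dirty"],
            text=True,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        # Not a git checkout (e.g., an installed package without metadata).
        return "unknown"


def record_version(output_dir: Path) -> str:
    """Write the version string into the run's output directory for traceability."""
    version = get_version_string()
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "version.txt").write_text(version + "\n")
    return version
```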
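Finally, the opt-in initial validation step (#134) is mostly a control-flow change: run one evaluation pass before the first optimizer step when a flag is set. The trainer interface below is a stand-in so the sketch is runnable, not Fast-LLM's actual API:

```python
class DummyTrainer:
    """Minimal stand-in for a real trainer, so the control flow below runs."""

    def __init__(self):
        self.validation_log = []

    def evaluate(self) -> float:
        # Would run the real validation loop; constant here for illustration.
        return 2.5

    def train_step(self) -> None:
        pass  # one forward/backward/optimizer step in a real trainer

    def log_validation(self, step: int, loss: float) -> None:
        self.validation_log.append((step, loss))


def training_loop(trainer, num_steps: int, validate_at_start: bool = False) -> None:
    if validate_at_start:
        # Record the baseline loss of a (possibly pretrained) checkpoint
        # before any parameter update.
        trainer.log_validation(step=0, loss=trainer.evaluate())
    for _ in range(num_steps):
        trainer.train_step()
```

Exposing `validate_at_start` as a config argument keeps the default behavior unchanged while making the pretrained baseline loss available when wanted.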