Write-up:
- https://dudeperf3ct.github.io/projects/train_llm_part0/ (data)
- https://dudeperf3ct.github.io/projects/train_llm_part1/ (tokenizer)

Components:
- `codellm_data`: Parses and downloads datasets.
- `codellm_tokenizer`: Trains a custom byte-level BPE tokenizer on a subset of the `tokyotech-llm/swallow-code-v2` dataset.

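
For readers unfamiliar with byte-level BPE, the idea behind the tokenizer can be sketched in plain Python. This is an illustrative toy, not the code in `codellm_tokenizer`: the function names (`train_byte_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, and a real training run would normally rely on an optimized tokenizer library and a pre-tokenization step, both omitted here.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_byte_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a list of strings (hypothetical helper)."""
    # Byte-level: start from raw UTF-8 bytes, so the base vocabulary is ids 0..255.
    sequences = [list(text.encode("utf-8")) for text in corpus]
    merges = {}  # (left_id, right_id) -> new token id, in learned order
    next_id = 256
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so further merges would not compress
        merges[best_pair] = next_id
        sequences = [merge_pair(seq, best_pair, next_id) for seq in sequences]
        next_id += 1
    return merges


def encode(text, merges):
    """Tokenize `text` by replaying the merges in the order they were learned."""
    seq = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        seq = merge_pair(seq, pair, new_id)
    return seq


def decode(ids, merges):
    """Map token ids back to text by expanding each merged id into its bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")
```

Because the base vocabulary covers all 256 byte values, any input string can be encoded without an unknown token, which is one reason byte-level BPE suits source code with unusual identifiers and symbols.
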
This project is licensed under the MIT License - see the LICENSE file for details.