Refactor: Decouple Core Logic into a Reusable Library#15
Merged
dariocazzani merged 6 commits into main (Sep 12, 2025)
ayeganov commented Sep 12, 2025
dariocazzani approved these changes Sep 12, 2025
This PR refactors the project from a monolithic script into a well-defined, reusable library. The core training, data handling, and tokenizer management logic have been extracted from `train.py` into decoupled, object-oriented components. The goal is to create a clean API that can be easily used and extended in other projects. The `train.py` script is now a simple command-line client that demonstrates how to use the new library components.

Key Changes ✨

- **`Trainer` class:** A new `scratchgpt/training/trainer.py` module introduces the `Trainer` class, which encapsulates all logic for training loops, validation, pre-tokenization, and model checkpointing.
- **`DataSource` protocol:** A new, flexible `scratchgpt/data/datasource.py` module defines a protocol for data loading. We've included concrete `FileDataSource` and `FolderDataSource` implementations, replacing the old `TextProvider` classes.
- **Tokenizer factory:** The `get_tokenizer` function in `scratchgpt/model_io.py` has been updated to use a factory pattern. This makes creating a default tokenizer more robust and explicit.
- **Test reorganization:** Tests have been moved into a dedicated `tests/` directory (`test_tokenizer_io.py`, `tests/tokenizers/..`), improving maintainability.
- **Dynamic tokenizer loading:** The `train.py` script now uses a `--tokenizer` argument to dynamically load any tokenizer from the Hugging Face Hub, making it significantly more versatile.
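The PR body doesn't include the code itself, so here is a minimal sketch of what a `DataSource` protocol with `FileDataSource` and `FolderDataSource` implementations could look like. The method name `read`, its signature, and the `.txt` glob are illustrative assumptions, not the actual API in `scratchgpt/data/datasource.py`:

```python
from pathlib import Path
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class DataSource(Protocol):
    """Anything that can yield raw training text (assumed interface)."""

    def read(self) -> Iterator[str]: ...


class FileDataSource:
    """Yields the contents of a single text file."""

    def __init__(self, path: Path) -> None:
        self._path = path

    def read(self) -> Iterator[str]:
        yield self._path.read_text(encoding="utf-8")


class FolderDataSource:
    """Yields the contents of every .txt file in a folder, in sorted order."""

    def __init__(self, folder: Path) -> None:
        self._folder = folder

    def read(self) -> Iterator[str]:
        for file in sorted(self._folder.glob("*.txt")):
            yield file.read_text(encoding="utf-8")
```

Because `DataSource` is a structural protocol, the `Trainer` can accept any object with a matching `read` method; no inheritance from a common base class is required, which is what makes swapping in new sources (databases, HTTP endpoints) cheap.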
When reviewing, please pay special attention to:

- **`Trainer` API:** This is the new heart of the library. Is its interface clear? Does it correctly encapsulate the training logic?
- **`DataSource` protocol:** This is our core data abstraction. Is it flexible enough for future use cases?
- **`get_tokenizer` factory pattern:** Review the new signature in `model_io.py`. This is a key design pattern for how we manage object creation.
- **`train.py`:** As the first client of our new library, does it demonstrate a clean and intuitive workflow?