SFT Nemo - Qwen #17
base: main
Conversation
# Executor type
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local", "slurm"],
    help="Executor type to use"
)
We shouldn't have this since we don't support Slurm, right?
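A minimal sketch of the suggested simplification, assuming only the local executor should remain (hypothetical, not part of this diff):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical: restrict the flag to the supported executor instead of
# advertising a "slurm" option that isn't implemented.
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local"],  # re-add "slurm" here once Slurm is actually supported
    help="Executor type to use (only 'local' is currently supported)",
)
```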
# Model configuration
parser.add_argument(
    "--model",
    type=str,
    default="qwen2_7b",
    choices=["qwen2_7b"],  # Add more models as needed
    help="Model type to use"
)
This needs to be the same between import_ckpt and train, right? Is it better to explicitly set it in run.sh and add a comment?
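One possible way to keep the two entry points in sync (a sketch, assuming run.sh exports a single MODEL variable that both the import and training scripts read; the variable name is hypothetical):

```python
import argparse
import os

# Hypothetical: run.sh does `export MODEL=qwen2_7b` once, and both the
# checkpoint-import script and the training script pick up the same value,
# so the two entry points cannot drift apart.
DEFAULT_MODEL = os.environ.get("MODEL", "qwen2_7b")

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model",
    type=str,
    default=DEFAULT_MODEL,
    choices=["qwen2_7b"],  # Add more models as needed
    help="Model type to use (must match the checkpoint imported by import_ckpt)",
)
```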
training_project = definitions.TrainingProject(
    name="Nemo-qwen2.5-nemo 1node",
    job=training_job
)
(No newline at end of file)
Can you add a basic linter to this repo and clean up all the newline endings and formatting in a separate PR? Check with Nico what's preferred for our repos.
print(f"Number of GPUs: {torch.cuda.device_count()}")


### Dataset
from data import BespokeDataModule
Is there a reason we need to define this custom data class instead of relying on the pre-defined Hugging Face dataset one?
I tried using return run.Config(HFDatasetDataModule, path_or_dataset='bespokelabs/Bespoke-Stratos-17k', seq_length=seq_length, micro_batch_size=micro_batch_size, global_batch_size=global_batch_size, num_workers=num_workers) instead, but the run failed with missing num_micro_batches (or something like that).
If you want to try, you can replace line 15 in this file with something similar to the above.
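For reference, the attempted replacement described above, written out as code (the import path for HFDatasetDataModule is an assumption and should be checked against the installed NeMo version; the run reportedly failed with a missing num_micro_batches or similar):

```python
import nemo_run as run
from nemo.collections.llm import HFDatasetDataModule  # import path is an assumption

def data_module(seq_length: int, micro_batch_size: int,
                global_batch_size: int, num_workers: int) -> run.Config:
    # Attempted replacement for the custom BespokeDataModule; this reportedly
    # failed at runtime with a missing num_micro_batches (or similar) error.
    return run.Config(
        HFDatasetDataModule,
        path_or_dataset="bespokelabs/Bespoke-Stratos-17k",
        seq_length=seq_length,
        micro_batch_size=micro_batch_size,
        global_batch_size=global_batch_size,
        num_workers=num_workers,
    )
```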
    resume_if_exists=True,
)


def configure_finetuning_recipe(args):
Why define this recipe from scratch instead of re-using the default NeMo recipe?
It is using llm.finetune from NeMo; are you referring to something else?
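Possibly the question is about NeMo's prebuilt recipe factories rather than llm.finetune itself. A sketch of that alternative, assuming NeMo 2.0 ships a qwen2_7b recipe module (names and arguments should be verified against the installed version):

```python
import nemo_run as run
from nemo.collections import llm

# Hypothetical: start from the prebuilt recipe and only override what differs,
# instead of assembling model/trainer/data configs from scratch.
recipe = llm.qwen2_7b.finetune_recipe(
    name="qwen2_7b_sft",   # hypothetical run name
    num_nodes=1,
    num_gpus_per_node=8,
    peft_scheme=None,      # assumption: None/'none' selects full SFT rather than LoRA
)
# Individual fields (recipe.data, recipe.trainer, ...) can then be overridden
# before launching with the local executor.
run.run(recipe, executor=run.LocalExecutor())
```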
    return run.Config(llm.Qwen2Model, config=run.Config(llm.Qwen2Config7B))


# Configure the resume
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
Have we tested that this works with the cache? If so, can you add a comment about the expected behavior?
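For reference, a sketch of the behavior such a comment could document, assuming the usual NeMo 2.0 AutoResume/RestoreConfig pattern where a nemo:// path resolves against the local NEMO_HOME cache populated by import_ckpt (the cache semantics here are an assumption and should be verified):

```python
import nemo_run as run
from nemo import lightning as nl

def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
    # Assumption: "nemo://<model_id>" points at the checkpoint previously
    # converted by import_ckpt and cached under NEMO_HOME, so an existing
    # cache entry is reused instead of re-downloading or re-converting.
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig, path=f"nemo://{model_id}"),
        resume_if_exists=True,
    )
```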


def prepare_data(self) -> None:
    # if the train file already exists (and no forced re-download), nothing to do
    if not self.train_path.exists() or self.force_redownload:
This is a bit error-prone for partial download failures. I'd just delegate to the Hugging Face load_dataset logic and rely on it for skipping the download.
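A minimal sketch of that delegation, assuming the datasets library handles caching and partial downloads itself (the attribute names are taken from the diff above; this is a method sketch, not the module's actual implementation):

```python
from datasets import load_dataset

def prepare_data(self) -> None:
    # Delegate download and caching to Hugging Face datasets: load_dataset
    # reuses its local cache and re-fetches incomplete downloads on its own,
    # so no manual "does the train file exist" bookkeeping is needed here.
    self.dataset = load_dataset(
        "bespokelabs/Bespoke-Stratos-17k",
        split="train",
        download_mode="force_redownload" if self.force_redownload else None,
    )
```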
Proof: