Conversation

@nidhihrmth (Collaborator) commented on Sep 8, 2025

Proof:

[screenshot]

@nidhihrmth changed the title from "nemo working" to "SFT Nemo - Qwen" on Sep 8, 2025
Comment on lines 35 to 42
# Executor type
parser.add_argument(
    "--executor",
    type=str,
    default="local",
    choices=["local", "slurm"],
    help="Executor type to use"
)
Contributor

we shouldn't have this since we don't support slurm, right?
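
If the flag goes away, the executor could just be built directly, e.g. (untested sketch; the ntasks_per_node/launcher argument names are from memory and may differ between nemo_run versions):

import nemo_run as run

def local_executor(devices: int = 8) -> run.LocalExecutor:
    # Single-node local run only; no slurm branch needed.
    return run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")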

Comment on lines 10 to 17
# Model configuration
parser.add_argument(
    "--model",
    type=str,
    default="qwen2_7b",
    choices=["qwen2_7b"],  # Add more models as needed
    help="Model type to use"
)
Contributor

this needs to be the same between import_ckpt and train, right? Is it better to explicitly set it in run.sh and add a comment?
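
For example, one way to keep them in sync is a single variable exported from run.sh and read by both scripts (sketch; MODEL is a hypothetical variable name, not something in this PR):

import argparse
import os

parser = argparse.ArgumentParser()

# run.sh would export MODEL=qwen2_7b once, so import_ckpt and train
# always pick up the same value (MODEL is a hypothetical name).
parser.add_argument(
    "--model",
    type=str,
    default=os.environ.get("MODEL", "qwen2_7b"),
    choices=["qwen2_7b"],
    help="Model type to use (keep in sync with run.sh)",
)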

@nidhihrmth marked this pull request as ready for review on September 9, 2025 at 21:56
training_project = definitions.TrainingProject(
    name="Nemo-qwen2.5-nemo 1node",
    job=training_job
)
\ No newline at end of file
Contributor

can you add a basic linter to this repo and clean up all the newline ends and formatting in a separate PR? Check with Nico what's preferred for our repos.

print(f"Number of GPUs: {torch.cuda.device_count()}")

### Dataset
from data import BespokeDataModule
Contributor

is there a reason why we need to define this custom data class instead of relying on the pre-defined Hugging Face dataset one?

Collaborator Author

I tried using

return run.Config(
    HFDatasetDataModule,
    path_or_dataset='bespokelabs/Bespoke-Stratos-17k',
    seq_length=seq_length,
    micro_batch_size=micro_batch_size,
    global_batch_size=global_batch_size,
    num_workers=num_workers
)

instead, but the run failed complaining about a missing num_micro_batches (or something like that). If you want to try, you can replace line 15 in this file with something similar to the above.

    resume_if_exists=True,
)

def configure_finetuning_recipe(args):
Contributor

why define this recipe from scratch instead of re-using the default NeMo recipe?
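
Something along these lines is what I mean by the default recipe (untested sketch; the factory name and arguments are from memory and may differ between NeMo versions):

import nemo_run as run
from nemo.collections import llm

# Pre-built NeMo 2.0 fine-tuning recipe for Qwen2 7B (sketch).
recipe = llm.qwen2_7b.finetune_recipe(
    name="qwen2_7b_sft",
    num_nodes=1,
    num_gpus_per_node=8,
)
run.run(recipe, executor=run.LocalExecutor())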

Collaborator Author

It is using llm.finetune from NeMo; are you referring to something else?

return run.Config(llm.Qwen2Model, config=run.Config(llm.Qwen2Config7B))

# Configure the resume
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
Contributor

have we tested that this works with the cache? If so, can you add a comment about the expected behavior?
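
For example, a comment like this would help (sketch; the cache behavior described is an assumption based on the usual import_ckpt flow, not something verified here):

import nemo_run as run
from nemo import lightning as nl

# Expected cache behavior (assumption): llm.import_ckpt stores the converted
# checkpoint under NEMO_HOME (default ~/.cache/nemo), and the nemo:// path
# below resolves from that cache, so repeat runs should not re-download or
# re-convert the weights.
def resume(model_id: str = "Qwen/Qwen2.5-7B-Instruct") -> run.Config[nl.AutoResume]:
    return run.Config(
        nl.AutoResume,
        restore_config=run.Config(nl.RestoreConfig, path=f"nemo://{model_id}"),
        resume_if_exists=True,
    )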


def prepare_data(self) -> None:
    # if train file is specified, no need to do anything
    if not self.train_path.exists() or self.force_redownload:
Contributor

this is a bit error-prone for partial download failures. I'd just delegate to the Hugging Face load_dataset logic and rely on it for skipping downloads.
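
i.e. something like this (sketch; the dataset id is the one used elsewhere in this PR):

from datasets import load_dataset

def prepare_data(self) -> None:
    # Warm the Hugging Face cache; load_dataset handles caching and integrity
    # checks itself, so a partial or interrupted download is simply re-fetched
    # on the next call.
    load_dataset("bespokelabs/Bespoke-Stratos-17k", split="train")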
