[QEff. Finetune]: Removed samsum dataset references from FT code. #482

Open: wants to merge 3 commits into base: main

quic-meetkuma
Contributor

  • Removed all references to the samsum dataset from the finetuning code.
  • The samsum dataset can still be used via the custom dataset path.

quic-meetkuma changed the title from "Removed samsum dataset references from FT code." to "[QEff. Finetune]: Removed samsum dataset references from FT code." on Jun 27, 2025
quic-meetkuma marked this pull request as ready for review on June 30, 2025 07:57

DATASET_PREPROC = {
"alpaca_dataset": partial(get_alpaca_dataset),
"grammar_dataset": get_grammar_dataset,
"samsum_dataset": get_samsum_dataset,
Contributor

@quic-swatia quic-swatia Jun 30, 2025

Removing just this line is enough to keep finetune.py from throwing a DatasetNotFoundError when it is run with "--dataset samsum_dataset". Instead, it will fail at argument parsing with the following error: 'finetune.py: error: argument --dataset: invalid choice: 'samsum_dataset' (choose from 'alpaca_dataset', 'grammar_dataset', 'gsm8k_dataset', 'custom_dataset', 'imdb_dataset')'

The rest of the code changes in this PR are not required.

This way we can keep the samsum_dataset code for internal testing, and if Hugging Face restores the Samsum dataset, we would need only a single line of code to support it through QEfficient.
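
To illustrate the point about the registry entry, here is a minimal sketch of how a dataset registry can drive the `--dataset` choices, so that deleting one dictionary entry makes the argument parser reject that name. It assumes the parser derives its choices from the keys of `DATASET_PREPROC`; the loader functions, `build_parser`, and `parse_dataset` are hypothetical stand-ins, not the actual QEfficient code.

```python
import argparse
from functools import partial


# Hypothetical stand-ins for the real loader functions.
def get_alpaca_dataset():
    return "alpaca samples"


def get_grammar_dataset():
    return "grammar samples"


# Registry of supported datasets. The "samsum_dataset" entry has been removed.
DATASET_PREPROC = {
    "alpaca_dataset": partial(get_alpaca_dataset),
    "grammar_dataset": get_grammar_dataset,
}


def build_parser():
    """Build a parser whose --dataset choices come from the registry keys (assumption)."""
    parser = argparse.ArgumentParser(prog="finetune.py")
    parser.add_argument(
        "--dataset",
        choices=list(DATASET_PREPROC),
        default="alpaca_dataset",
    )
    return parser


def parse_dataset(argv):
    """Return the selected dataset name, or None if argparse rejects the arguments."""
    try:
        return build_parser().parse_args(argv).dataset
    except SystemExit:
        # argparse prints an "invalid choice" message to stderr and exits.
        return None
```

With this layout, `parse_dataset(["--dataset", "samsum_dataset"])` is rejected by the parser before any dataset loading code runs, while the loader function itself could stay in the codebase unused.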

Contributor Author

This was discussed with Anuj and VB, and the decision was to remove all references to this code. Users should use this dataset only via the custom_dataset path.
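
As a rough sketch of what the custom_dataset route could look like for samsum-style data, the example below loads a local JSON-lines file and formats each record into a prompt and target. Everything here is an assumption for illustration: the entry-point name `get_custom_dataset` and its signature, the file name `samsum_train.jsonl`, and the prompt template are hypothetical and are not confirmed by this PR. Only the `dialogue` and `summary` field names follow the samsum dataset's schema.

```python
import json
from pathlib import Path

# Hypothetical prompt template; adjust to whatever the finetuning setup expects.
PROMPT_TEMPLATE = "Summarize this dialog:\n{dialogue}\n---\nSummary:\n"


def format_sample(record):
    """Turn one samsum-style record into a prompt/summary pair."""
    return {
        "prompt": PROMPT_TEMPLATE.format(dialogue=record["dialogue"]),
        "summary": record["summary"],
    }


def load_samsum_jsonl(path):
    """Read a JSON-lines file with one {"dialogue": ..., "summary": ...} object per line."""
    samples = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(format_sample(json.loads(line)))
    return samples


def get_custom_dataset(dataset_config, tokenizer, split, data_path="samsum_train.jsonl"):
    """Hypothetical entry point: tokenize prompt + summary for every sample.

    dataset_config and split are accepted but unused in this sketch.
    """
    samples = load_samsum_jsonl(data_path)
    return [tokenizer(sample["prompt"] + sample["summary"]) for sample in samples]
```

The loading and formatting steps are kept separate from the entry point so that the same formatting can be reused if the expected function name or signature turns out to differ.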


@@ -171,6 +170,28 @@ pipeline {
}
}
}

Contributor Author

@quic-hemagnih, @vbaddi, @quic-rishinr: FYI, I made a separate environment for the FT tests.

Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
Signed-off-by: Meet Patel <meetkuma@qti.qualcomm.com>
quic-meetkuma requested a review from vbaddi on July 3, 2025 14:21
3 participants