-
Notifications
You must be signed in to change notification settings - Fork 61
[QEff. Finetune]: Added Base dataset class and SFT dataset classes along with its test cases. #647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: ft_experimental
Are you sure you want to change the base?
Conversation
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Updated the dataset.py file to only enable support for SFTDataset types. Created test file to check the functionalities accordingly. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
| train_dataset = SFTDataset( | ||
| dataset_name="ignored", | ||
| split="train", | ||
| split_ratio=0.8, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use constants for this.
quic-meetkuma
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it has really good changes and extended test cases. Some minor polishing is needed before merge. Thanks. :)
| """Subclasses should implement this to load and prepare the dataset.""" | ||
| raise NotImplementedError | ||
|
|
||
| @property |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can skip this and directly use
dataset_instance.dataset
Bdw, what extra hf_dataset brings? If it brings some more meta data information then lets us keep it else remove it.
| from QEfficient.finetune.experimental.core.component_registry import registry | ||
|
|
||
|
|
||
| class BaseDataset(Dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from abc import ABC, abstractmethod
class BaseDataset(Dataset, ABC):
Make the BaseDataset class inherit from ABC so that it will become abstract base class.
| """Return the underlying Hugging Face dataset object.""" | ||
| return self.dataset | ||
|
|
||
| def __len__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AbstractMethod
def len(self):
"""docstring"""
pass
same for getitem
| raise RuntimeError("Either provide completion_template or completion_func in the config.") | ||
|
|
||
| # Call parent class __init__ which will call _initialize_dataset | ||
| super().__init__(dataset_name, split, seed, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good cleanup in init
| self.dataset = splitted_dataset["train"] | ||
| else: | ||
| # Load dataset from HuggingFace | ||
| db = load_dataset_builder(self.dataset_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good addition over load_dataset.
| with pytest.raises(NotImplementedError): | ||
| _ = dataset[0] | ||
|
|
||
| def test_base_dataset_hf_dataset_property(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed
|
|
||
| @patch('QEfficient.finetune.experimental.core.dataset.load_dataset') | ||
| @patch('QEfficient.finetune.experimental.core.dataset.load_dataset_builder') | ||
| def test_sft_dataset_initialization_with_templates(self, mock_builder, mock_load, mock_hf_dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of mocking load_dataset and load_dataset_builder, can we use some dummy dataset from https://huggingface.co/hf-internal-testing.
That way we are able to track changes in the HF's API.
| @patch('QEfficient.finetune.experimental.core.dataset.load_dataset') | ||
| def test_sft_dataset_from_json_file(self, mock_load, temp_json_file): | ||
| """Test SFTDataset loading from JSON file.""" | ||
| mock_dataset = HFDataset.from_dict({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dont use mock, instead use your own sample json which is created above and let load_dataset actually read the json.
| assert len(dataset) > 0 | ||
|
|
||
| @patch('QEfficient.finetune.experimental.core.dataset.load_dataset') | ||
| def test_sft_dataset_train_test_split_from_json(self, mock_load): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above function
|
|
||
| @patch('QEfficient.finetune.experimental.core.dataset.load_dataset') | ||
| @patch('QEfficient.finetune.experimental.core.dataset.load_dataset_builder') | ||
| def test_sft_dataset_invalid_template_variable(self, mock_builder, mock_load, mock_hf_dataset): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not use mock, use actual APIs with dummy HF datasets.
Edited the SFTDataset class to enable custom dataset loading.
Updated the dataset.py file to only enable support for SFTDataset type.
Created test file to check the functionalities.