Skip to content

Conversation

@quic-dhirajku
Copy link
Contributor

Edited the SFTDataset class to enable custom dataset loading.
Updated the dataset.py file to only enable support for SFTDataset type.
Created test file to check the functionalities.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Updated the dataset.py file to only enable support for SFTDataset types.
Created test file to check the functionalities accordingly.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
@quic-meetkuma quic-meetkuma changed the title Ft datasets [QEff. Finetune]: Added Base dataset class and SFT dataset classes along with its test cases. Dec 2, 2025
train_dataset = SFTDataset(
dataset_name="ignored",
split="train",
split_ratio=0.8,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use constants for this.

Copy link
Contributor

@quic-meetkuma quic-meetkuma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall it has really good changes and extended test cases. Some minor polishing is needed before merge. Thanks. :)

"""Subclasses should implement this to load and prepare the dataset."""
raise NotImplementedError

@property
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can skip this and directly use

dataset_instance.dataset

Bdw, what extra hf_dataset brings? If it brings some more meta data information then lets us keep it else remove it.

from QEfficient.finetune.experimental.core.component_registry import registry


class BaseDataset(Dataset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from abc import ABC, abstractmethod

class BaseDataset(Dataset, ABC):

Make the BaseDataset class inherit from ABC so that it will become abstract base class.

"""Return the underlying Hugging Face dataset object."""
return self.dataset

def __len__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AbstractMethod
def len(self):
"""docstring"""
pass

same for getitem

raise RuntimeError("Either provide completion_template or completion_func in the config.")

# Call parent class __init__ which will call _initialize_dataset
super().__init__(dataset_name, split, seed, **kwargs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good cleanup in init

self.dataset = splitted_dataset["train"]
else:
# Load dataset from HuggingFace
db = load_dataset_builder(self.dataset_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good addition over load_dataset.

with pytest.raises(NotImplementedError):
_ = dataset[0]

def test_base_dataset_hf_dataset_property(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not needed


@patch('QEfficient.finetune.experimental.core.dataset.load_dataset')
@patch('QEfficient.finetune.experimental.core.dataset.load_dataset_builder')
def test_sft_dataset_initialization_with_templates(self, mock_builder, mock_load, mock_hf_dataset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of mocking load_dataset and load_dataset_builder, can we use some dummy dataset from https://huggingface.co/hf-internal-testing.

That way we are able to track changes in the HF's API.

@patch('QEfficient.finetune.experimental.core.dataset.load_dataset')
def test_sft_dataset_from_json_file(self, mock_load, temp_json_file):
"""Test SFTDataset loading from JSON file."""
mock_dataset = HFDataset.from_dict({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dont use mock, instead use your own sample json which is created above and let load_dataset actually read the json.

assert len(dataset) > 0

@patch('QEfficient.finetune.experimental.core.dataset.load_dataset')
def test_sft_dataset_train_test_split_from_json(self, mock_load):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above function


@patch('QEfficient.finetune.experimental.core.dataset.load_dataset')
@patch('QEfficient.finetune.experimental.core.dataset.load_dataset_builder')
def test_sft_dataset_invalid_template_variable(self, mock_builder, mock_load, mock_hf_dataset):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not use mock, use actual APIs with dummy HF datasets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants