Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,29 @@

logger = getLogger(__name__)

CALIBRATION_SAMPLE_INDEX = {
20232,
21162,
33584,
46825,
45190,
46143,
14189,
16658,
26406,
9565,
33733,
31057,
47465,
33503,
42293,
7768,
1962,
39746,
13568,
22527,
}


def _process_sample_to_row(sample: dict[str, Any]) -> dict[str, Any]:
"""Convert a single HF dataset sample to a row dict for parquet storage.
Expand Down Expand Up @@ -148,6 +171,8 @@ def generate(
desc=f"Converting images ({split_key})",
unit="rows",
):
if i in CALIBRATION_SAMPLE_INDEX:
continue
Comment on lines +174 to +175
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Filtering by absolute indices is fragile because it depends on the specific dataset split and ordering. If generate is called with a different split (e.g., ['test'] instead of the default ['train', 'test']), the index i will no longer correspond to the same samples, and the calibration set may not be correctly excluded. Consider using a unique identifier for filtering if available, or adding a check to ensure the exclusion only applies to the expected split.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dataset is fixed for this task and predefined dataset will be pulled in the same way always, these splits are not really configurable from any configs. So this should be OK for now.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who is the caller of ShopifyProductCatalogue.generate()?

Looking at the code alone, split is a parameter of ShopifyProductCatalogue.generate() which means it could change if the caller chooses to pass in an argument for split specifically. Taking this into consideration, a design that might make more sense is to have calibration_sample_indices also as an parameter of ShopifyProductCatalogue.generate(), and its default value would be CALIBRATION_SAMPLE_INDEX (or DEFAULT_CALIBRATION_SAMPLE_INDEX). When defining the v6.1 round benchmark settings, you could then put into the readme or the yaml that both split and calibration_sample_indices should not be specified to something other than None or the default values.

Copy link
Copy Markdown
Collaborator Author

@Victor49152 Victor49152 Apr 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The funny part is that at least in the current setup, when you call a predefined dataset, the individual kwargs of this subclass is not configurable by the user. That means if the user don't hack the source code, those kwargs for generate() is the default value all the time. That's why I didn't bother.

Your idea is absolutely a better design in general though.

sample = ds[i]
all_rows.append(_process_sample_to_row(sample))

Expand Down
Loading