feat(data): add kaputt dataset #3330
ashwinvaidya17 wants to merge 2 commits into open-edge-platform:main
Conversation
Pull request overview
Adds first-class support for the Kaputt dataset by introducing a new dataset + datamodule, wiring them into the public anomalib.data API, and adding a config and unit test scaffold.
Changes:
- Added KaputtDataset (Parquet-driven sample parsing) and the Kaputt Lightning datamodule.
- Integrated Kaputt into package exports and added an example Hydra config.
- Added dummy-data generation utilities and a unit test file for the new datamodule.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/unit/data/datamodule/image/test_kaputt.py | Adds unit test scaffold for Kaputt datamodule creation/config. |
| tests/helpers/data.py | Adds dummy Kaputt dataset generator including Parquet metadata. |
| src/anomalib/data/datasets/image/kaputt.py | Implements Kaputt dataset parsing from Parquet + sample construction. |
| src/anomalib/data/datasets/image/__init__.py | Exposes KaputtDataset in the image datasets package. |
| src/anomalib/data/datamodules/image/kaputt.py | Adds Kaputt datamodule with native split handling and dataset availability checks. |
| src/anomalib/data/datamodules/image/__init__.py | Exposes Kaputt and registers KAPUTT format. |
| src/anomalib/data/__init__.py | Re-exports Kaputt / KaputtDataset at top-level anomalib.data. |
| pyproject.toml | Adds datasets extra (pyarrow) and includes it in full. |
| examples/configs/data/kaputt.yaml | Adds example data config for Kaputt. |
```python
@pytest.fixture()
@staticmethod
def datamodule(dataset_path: Path) -> Kaputt:
```
The decorator order is likely broken: decorators apply bottom-up, so @staticmethod runs first and turns the function into a staticmethod object before @pytest.fixture() sees it. Pytest fixtures expect a callable function, and this pattern commonly fails at collection time. Fix by removing @staticmethod (recommended) or swapping decorator order (@staticmethod above @pytest.fixture).
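A minimal sketch of the recommended fix (dropping @staticmethod); the fixture body is elided since it is unchanged from the PR:

```python
from pathlib import Path

import pytest

from anomalib.data import Kaputt


@pytest.fixture()
def datamodule(dataset_path: Path) -> Kaputt:
    """Create a Kaputt datamodule from the dummy dataset path."""
    ...
```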
```python
@pytest.fixture()
@staticmethod
def fxt_data_config_path() -> str:
```
Same issue as the datamodule fixture: @staticmethod is applied before @pytest.fixture(), which can prevent pytest from treating this as a valid fixture. Remove @staticmethod or swap the decorator order.
```python
# Image extensions used in Kaputt dataset
IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")

# Material categories in Kaputt dataset (based on item_material field)
MATERIAL_CATEGORIES = (
    "cardboard",
    "glass",
    "metal",
    "paper",
    "plastic",
    "styrofoam",
    "wood",
)
```
IMG_EXTENSIONS and MATERIAL_CATEGORIES are defined but not used anywhere in this module. Either remove them to avoid dead code, or use them (e.g., resolve the actual image extension rather than hardcoding .jpg, and/or validate/normalize item_material).
Suggested change:

```diff
-# Image extensions used in Kaputt dataset
-IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")
-# Material categories in Kaputt dataset (based on item_material field)
-MATERIAL_CATEGORIES = (
-    "cardboard",
-    "glass",
-    "metal",
-    "paper",
-    "plastic",
-    "styrofoam",
-    "wood",
-)
```
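If the constants are kept instead, here is a hedged sketch of the "use them" option for item_material; the helper name _normalize_material is hypothetical:

```python
def _normalize_material(value: str) -> str:
    """Normalize an item_material value and validate it against MATERIAL_CATEGORIES."""
    material = str(value).strip().lower()
    if material and material not in MATERIAL_CATEGORIES:
        msg = f"Unknown item_material {value!r}; expected one of {MATERIAL_CATEGORIES}"
        raise ValueError(msg)
    return material
```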
```python
image_path = (
    root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
)
```
Image paths are hardcoded to .jpg, but the module defines IMG_EXTENSIONS and the dataset description implies multiple possible image formats. If a capture is stored as .png/.jpeg, this will silently produce non-existent paths and later failures. Prefer reading the filename/extension from metadata (if available) or resolving by checking for an existing file with any allowed extension before constructing the sample.
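A hedged sketch of extension resolution that both the query and reference paths could share; the helper name _resolve_image_path is hypothetical:

```python
from pathlib import Path


def _resolve_image_path(base_dir: Path, capture_id: str) -> Path:
    """Return the first existing image file for capture_id among IMG_EXTENSIONS."""
    for ext in IMG_EXTENSIONS:
        candidate = base_dir / f"{capture_id}{ext}"
        if candidate.exists():
            return candidate
    # Fall back to the current hardcoded behavior if nothing is found
    return base_dir / f"{capture_id}.jpg"
```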
```python
image_path = (
    root
    / f"reference-{image_type}"
    / "data"
    / split_name
    / "reference-data"
    / image_type
    / f"{capture_id}.jpg"
)
```
Same hardcoded .jpg issue for reference images. If reference images are not strictly .jpg in the real dataset, this will generate incorrect paths. Use metadata or extension resolution (possibly leveraging IMG_EXTENSIONS) to build correct paths.
Suggested change:

```diff
-image_path = (
-    root
-    / f"reference-{image_type}"
-    / "data"
-    / split_name
-    / "reference-data"
-    / image_type
-    / f"{capture_id}.jpg"
-)
+base_dir = (
+    root
+    / f"reference-{image_type}"
+    / "data"
+    / split_name
+    / "reference-data"
+    / image_type
+)
+image_path = None
+for ext in IMG_EXTENSIONS:
+    candidate = base_dir / f"{capture_id}{ext}"
+    if candidate.exists():
+        image_path = candidate
+        break
+if image_path is None:
+    # Fallback to original .jpg behavior if no file with a known extension is found
+    image_path = base_dir / f"{capture_id}.jpg"
```
```python
# Process query images
for _, row in query_df.iterrows():
    capture_id = row["capture_id"]

    # Determine if defective
    is_defective = row.get("defect", False)
    label = "abnormal" if is_defective else "normal"
    label_index = LabelName.ABNORMAL if is_defective else LabelName.NORMAL

    # Build image path
    image_path = (
        root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
    )

    # Build mask path for defective images
    mask_path = ""
    if is_defective:
        mask_path = str(
            root / "query-mask" / "data" / split_name / "query-data" / "mask" / f"{capture_id}.png",
        )

    # Convert split name back to anomalib format
    anomalib_split = "val" if split_name == "validation" else split_name

    sample = {
        "split": anomalib_split,
        "label": label,
        "image_path": str(image_path),
        "mask_path": mask_path,
        "label_index": int(label_index),
        "capture_id": capture_id,
        "defect_types": row.get("defect_types", []),
        "item_material": row.get("item_material", ""),
    }
    all_samples.append(sample)
```
iterrows() over a (potentially) 230k+ row dataset will be a bottleneck. Consider vectorizing sample construction (e.g., build columns with pandas ops and then to_dict('records')) to reduce Python-level looping and speed up dataset initialization.
Suggested change:

```diff
-# Process query images
-for _, row in query_df.iterrows():
-    capture_id = row["capture_id"]
-    # Determine if defective
-    is_defective = row.get("defect", False)
-    label = "abnormal" if is_defective else "normal"
-    label_index = LabelName.ABNORMAL if is_defective else LabelName.NORMAL
-    # Build image path
-    image_path = (
-        root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
-    )
-    # Build mask path for defective images
-    mask_path = ""
-    if is_defective:
-        mask_path = str(
-            root / "query-mask" / "data" / split_name / "query-data" / "mask" / f"{capture_id}.png",
-        )
-    # Convert split name back to anomalib format
-    anomalib_split = "val" if split_name == "validation" else split_name
-    sample = {
-        "split": anomalib_split,
-        "label": label,
-        "image_path": str(image_path),
-        "mask_path": mask_path,
-        "label_index": int(label_index),
-        "capture_id": capture_id,
-        "defect_types": row.get("defect_types", []),
-        "item_material": row.get("item_material", ""),
-    }
-    all_samples.append(sample)
+# Ensure optional columns exist with defaults
+if "defect" not in query_df.columns:
+    query_df["defect"] = False
+if "defect_types" not in query_df.columns:
+    query_df["defect_types"] = [[] for _ in range(len(query_df))]
+if "item_material" not in query_df.columns:
+    query_df["item_material"] = ""
+
+# Process query images using vectorized operations
+query_df["is_defective"] = query_df["defect"].astype(bool)
+query_df["label"] = query_df["is_defective"].map({True: "abnormal", False: "normal"})
+abnormal_value = int(LabelName.ABNORMAL)
+normal_value = int(LabelName.NORMAL)
+query_df["label_index"] = query_df["is_defective"].map({True: abnormal_value, False: normal_value})
+
+# Build image paths
+base_image_dir = root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type
+query_df["image_path"] = query_df["capture_id"].astype(str).apply(
+    lambda cid: str(base_image_dir / f"{cid}.jpg"),
+)
+
+# Build mask paths (only for defective samples)
+base_mask_dir = root / "query-mask" / "data" / split_name / "query-data" / "mask"
+query_df["mask_path"] = ""
+defective_mask = query_df["is_defective"]
+query_df.loc[defective_mask, "mask_path"] = query_df.loc[defective_mask, "capture_id"].astype(str).apply(
+    lambda cid: str(base_mask_dir / f"{cid}.png"),
+)
+
+# Convert split name back to anomalib format
+anomalib_split = "val" if split_name == "validation" else split_name
+query_df["split"] = anomalib_split
+
+# Collect samples from this split
+split_samples = query_df[
+    [
+        "split",
+        "label",
+        "image_path",
+        "mask_path",
+        "label_index",
+        "capture_id",
+        "defect_types",
+        "item_material",
+    ]
+].to_dict(orient="records")
+all_samples.extend(split_samples)
```
```python
self.train_data = KaputtDataset(
    split=Split.TRAIN,
    root=self.root,
    image_type=self.image_type,
    use_reference=self.use_reference,
)
self.test_data = KaputtDataset(
    split=Split.TEST,
    root=self.root,
    image_type=self.image_type,
    use_reference=False,  # Don't use reference for test
)
```
test_split_mode and test_split_ratio are accepted by the datamodule (and passed to super().__init__), but _setup always uses the native Split.TEST regardless of those settings. Either (a) explicitly enforce test_split_mode == FROM_DIR for Kaputt (raise a clear error otherwise), or (b) implement the alternative split behavior when users request it, so the public API inputs aren’t silently ignored.
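A hedged sketch of option (a), assuming TestSplitMode is importable from anomalib.data.utils and that the check sits at the top of _setup:

```python
from anomalib.data.utils import TestSplitMode

# Kaputt ships native train/test splits, so only FROM_DIR is meaningful here
if self.test_split_mode != TestSplitMode.FROM_DIR:
    msg = (
        f"Kaputt provides native splits; test_split_mode={self.test_split_mode} "
        "is not supported. Use TestSplitMode.FROM_DIR."
    )
    raise ValueError(msg)
```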
```toml
# dataset specific dependencies
datasets = ["pyarrow"]
```
The Kaputt dummy dataset generator writes Parquet files (to_parquet) and the dataset reader uses read_parquet, which typically requires pyarrow (or another parquet engine) to be installed in the test environment. Since pyarrow is only in the datasets extra (not test), CI that installs only [test] may fail. Consider adding pyarrow to the test extra or skipping Kaputt tests when a parquet engine isn’t available.
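A hedged sketch of the skip option on the test side, placed at the top of the Kaputt test module:

```python
import pytest

# Skip Kaputt tests when no Parquet engine is installed in the environment
pyarrow = pytest.importorskip("pyarrow", reason="Kaputt tests require a Parquet engine")
```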
```toml
# testing dependencies
test = [
    "prek",
    "pytest",
```
Same Parquet-engine concern as above: since pyarrow lives only in the datasets extra, add it to the test extra (or skip Kaputt tests when no Parquet engine is available).
| "pytest", | |
| "pytest", | |
| "pyarrow", |
```python
def _generate_dummy_kaputt_dataset(self) -> None:
    """Generate dummy Kaputt dataset with Parquet metadata files.
```
The new dummy Kaputt generator is added, but the PR’s unit test shown only validates datamodule construction; it doesn’t assert key Kaputt-specific behaviors (e.g., use_reference=True adds reference rows, image_type='crop' switches subdirectories, abnormal samples include non-empty mask_path). Add targeted assertions in the new Kaputt test file to cover these new behaviors.
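A hedged sketch of such assertions, assuming the fixture names and constructor parameters shown elsewhere in this PR's diff (the test names are hypothetical):

```python
from pathlib import Path

from anomalib.data import Kaputt


def test_abnormal_samples_have_masks(datamodule: Kaputt) -> None:
    """Every abnormal test sample should carry a non-empty mask_path."""
    datamodule.setup()
    samples = datamodule.test_data.samples
    abnormal = samples[samples["label"] == "abnormal"]
    assert (abnormal["mask_path"] != "").all()


def test_use_reference_adds_rows(dataset_path: Path) -> None:
    """use_reference=True should yield at least as many train samples as False."""
    with_ref = Kaputt(root=dataset_path, use_reference=True)
    without_ref = Kaputt(root=dataset_path, use_reference=False)
    with_ref.setup()
    without_ref.setup()
    assert len(with_ref.train_data) >= len(without_ref.train_data)
```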
📝 Description
Warning: untested, as I don't have access to the dataset yet.
✨ Changes
Select what type of change your PR is:
✅ Checklist
Before you submit your pull request, please make sure you have completed the following steps:
For more information about code review checklists, see the Code Review Checklist.