feat(data): add kaputt dataset #3330
ashwinvaidya17 wants to merge 2 commits into open-edge-platform:main
Conversation
Pull request overview
Adds first-class support for the Kaputt dataset by introducing a new dataset + datamodule, wiring them into the public anomalib.data API, and adding a config and unit test scaffold.
Changes:
- Added KaputtDataset (Parquet-driven sample parsing) and the Kaputt Lightning datamodule.
- Integrated Kaputt into package exports and added an example Hydra config.
- Added dummy-data generation utilities and a unit test file for the new datamodule.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| tests/unit/data/datamodule/image/test_kaputt.py | Adds unit test scaffold for Kaputt datamodule creation/config. |
| tests/helpers/data.py | Adds dummy Kaputt dataset generator including Parquet metadata. |
| src/anomalib/data/datasets/image/kaputt.py | Implements Kaputt dataset parsing from Parquet + sample construction. |
| src/anomalib/data/datasets/image/__init__.py | Exposes KaputtDataset in the image datasets package. |
| src/anomalib/data/datamodules/image/kaputt.py | Adds Kaputt datamodule with native split handling and dataset availability checks. |
| src/anomalib/data/datamodules/image/__init__.py | Exposes Kaputt and registers KAPUTT format. |
| src/anomalib/data/__init__.py | Re-exports Kaputt / KaputtDataset at top-level anomalib.data. |
| pyproject.toml | Adds datasets extra (pyarrow) and includes it in full. |
| examples/configs/data/kaputt.yaml | Adds example data config for Kaputt. |
```python
@pytest.fixture()
@staticmethod
def datamodule(dataset_path: Path) -> Kaputt:
```
The decorator order is likely broken: decorators apply bottom-up, so @staticmethod runs first and turns the function into a staticmethod object before @pytest.fixture() sees it. Pytest fixtures expect a callable function, and this pattern commonly fails at collection time. Fix by removing @staticmethod (recommended) or swapping decorator order (@staticmethod above @pytest.fixture).
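A minimal sketch of the recommended fix (dropping @staticmethod); the fixture body is elided since it is unchanged from the PR:

```python
from pathlib import Path

import pytest

from anomalib.data import Kaputt


@pytest.fixture()
def datamodule(dataset_path: Path) -> Kaputt:
    """Create a Kaputt datamodule from the dummy dataset path."""
    ...
```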
```python
@pytest.fixture()
@staticmethod
def fxt_data_config_path() -> str:
```
Same issue as the datamodule fixture: @staticmethod is applied before @pytest.fixture(), which can prevent pytest from treating this as a valid fixture. Remove @staticmethod or swap the decorator order.
```python
# Image extensions used in Kaputt dataset
IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")

# Material categories in Kaputt dataset (based on item_material field)
MATERIAL_CATEGORIES = (
    "cardboard",
    "glass",
    "metal",
    "paper",
    "plastic",
    "styrofoam",
    "wood",
)
```
IMG_EXTENSIONS and MATERIAL_CATEGORIES are defined but not used anywhere in this module. Either remove them to avoid dead code, or use them (e.g., resolve the actual image extension rather than hardcoding .jpg, and/or validate/normalize item_material).
Suggested change:

```diff
-# Image extensions used in Kaputt dataset
-IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")
-# Material categories in Kaputt dataset (based on item_material field)
-MATERIAL_CATEGORIES = (
-    "cardboard",
-    "glass",
-    "metal",
-    "paper",
-    "plastic",
-    "styrofoam",
-    "wood",
-)
```
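If the constants are kept instead, here is a hedged sketch of the "use them" option for item_material; the helper name _normalize_material is hypothetical:

```python
def _normalize_material(value: str) -> str:
    """Normalize an item_material value and validate it against MATERIAL_CATEGORIES."""
    material = str(value).strip().lower()
    if material and material not in MATERIAL_CATEGORIES:
        msg = f"Unknown item_material {value!r}; expected one of {MATERIAL_CATEGORIES}"
        raise ValueError(msg)
    return material
```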
```python
image_path = (
    root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
)
```
Image paths are hardcoded to .jpg, but the module defines IMG_EXTENSIONS and the dataset description implies multiple possible image formats. If a capture is stored as .png/.jpeg, this will silently produce non-existent paths and later failures. Prefer reading the filename/extension from metadata (if available) or resolving by checking for an existing file with any allowed extension before constructing the sample.
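A hedged sketch of extension resolution that both the query and reference paths could share; the helper name _resolve_image_path is hypothetical:

```python
from pathlib import Path


def _resolve_image_path(base_dir: Path, capture_id: str) -> Path:
    """Return the first existing image file for capture_id among IMG_EXTENSIONS."""
    for ext in IMG_EXTENSIONS:
        candidate = base_dir / f"{capture_id}{ext}"
        if candidate.exists():
            return candidate
    # Fall back to the current hardcoded behavior if nothing is found
    return base_dir / f"{capture_id}.jpg"
```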
```python
image_path = (
    root
    / f"reference-{image_type}"
    / "data"
    / split_name
    / "reference-data"
    / image_type
    / f"{capture_id}.jpg"
)
```
Same hardcoded .jpg issue for reference images. If reference images are not strictly .jpg in the real dataset, this will generate incorrect paths. Use metadata or extension resolution (possibly leveraging IMG_EXTENSIONS) to build correct paths.
Suggested change:

```diff
-image_path = (
-    root
-    / f"reference-{image_type}"
-    / "data"
-    / split_name
-    / "reference-data"
-    / image_type
-    / f"{capture_id}.jpg"
-)
+base_dir = (
+    root
+    / f"reference-{image_type}"
+    / "data"
+    / split_name
+    / "reference-data"
+    / image_type
+)
+image_path = None
+for ext in IMG_EXTENSIONS:
+    candidate = base_dir / f"{capture_id}{ext}"
+    if candidate.exists():
+        image_path = candidate
+        break
+if image_path is None:
+    # Fallback to original .jpg behavior if no file with a known extension is found
+    image_path = base_dir / f"{capture_id}.jpg"
```
```python
# Process query images
for _, row in query_df.iterrows():
    capture_id = row["capture_id"]

    # Determine if defective
    is_defective = row.get("defect", False)
    label = "abnormal" if is_defective else "normal"
    label_index = LabelName.ABNORMAL if is_defective else LabelName.NORMAL

    # Build image path
    image_path = (
        root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
    )

    # Build mask path for defective images
    mask_path = ""
    if is_defective:
        mask_path = str(
            root / "query-mask" / "data" / split_name / "query-data" / "mask" / f"{capture_id}.png",
        )

    # Convert split name back to anomalib format
    anomalib_split = "val" if split_name == "validation" else split_name

    sample = {
        "split": anomalib_split,
        "label": label,
        "image_path": str(image_path),
        "mask_path": mask_path,
        "label_index": int(label_index),
        "capture_id": capture_id,
        "defect_types": row.get("defect_types", []),
        "item_material": row.get("item_material", ""),
    }
    all_samples.append(sample)
```
iterrows() over a (potentially) 230k+ row dataset will be a bottleneck. Consider vectorizing sample construction (e.g., build columns with pandas ops and then to_dict('records')) to reduce Python-level looping and speed up dataset initialization.
Suggested change:

```diff
-# Process query images
-for _, row in query_df.iterrows():
-    capture_id = row["capture_id"]
-    # Determine if defective
-    is_defective = row.get("defect", False)
-    label = "abnormal" if is_defective else "normal"
-    label_index = LabelName.ABNORMAL if is_defective else LabelName.NORMAL
-    # Build image path
-    image_path = (
-        root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
-    )
-    # Build mask path for defective images
-    mask_path = ""
-    if is_defective:
-        mask_path = str(
-            root / "query-mask" / "data" / split_name / "query-data" / "mask" / f"{capture_id}.png",
-        )
-    # Convert split name back to anomalib format
-    anomalib_split = "val" if split_name == "validation" else split_name
-    sample = {
-        "split": anomalib_split,
-        "label": label,
-        "image_path": str(image_path),
-        "mask_path": mask_path,
-        "label_index": int(label_index),
-        "capture_id": capture_id,
-        "defect_types": row.get("defect_types", []),
-        "item_material": row.get("item_material", ""),
-    }
-    all_samples.append(sample)
+# Ensure optional columns exist with defaults
+if "defect" not in query_df.columns:
+    query_df["defect"] = False
+if "defect_types" not in query_df.columns:
+    query_df["defect_types"] = [[] for _ in range(len(query_df))]
+if "item_material" not in query_df.columns:
+    query_df["item_material"] = ""
+
+# Process query images using vectorized operations
+query_df["is_defective"] = query_df["defect"].astype(bool)
+query_df["label"] = query_df["is_defective"].map({True: "abnormal", False: "normal"})
+abnormal_value = int(LabelName.ABNORMAL)
+normal_value = int(LabelName.NORMAL)
+query_df["label_index"] = query_df["is_defective"].map({True: abnormal_value, False: normal_value})
+
+# Build image paths
+base_image_dir = root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type
+query_df["image_path"] = query_df["capture_id"].astype(str).apply(
+    lambda cid: str(base_image_dir / f"{cid}.jpg"),
+)
+
+# Build mask paths (only for defective samples)
+base_mask_dir = root / "query-mask" / "data" / split_name / "query-data" / "mask"
+query_df["mask_path"] = ""
+defective_mask = query_df["is_defective"]
+query_df.loc[defective_mask, "mask_path"] = query_df.loc[defective_mask, "capture_id"].astype(str).apply(
+    lambda cid: str(base_mask_dir / f"{cid}.png"),
+)
+
+# Convert split name back to anomalib format
+anomalib_split = "val" if split_name == "validation" else split_name
+query_df["split"] = anomalib_split
+
+# Collect samples from this split
+split_samples = query_df[
+    [
+        "split",
+        "label",
+        "image_path",
+        "mask_path",
+        "label_index",
+        "capture_id",
+        "defect_types",
+        "item_material",
+    ]
+].to_dict(orient="records")
+all_samples.extend(split_samples)
```
```python
self.train_data = KaputtDataset(
    split=Split.TRAIN,
    root=self.root,
    image_type=self.image_type,
    use_reference=self.use_reference,
)
self.test_data = KaputtDataset(
    split=Split.TEST,
    root=self.root,
    image_type=self.image_type,
    use_reference=False,  # Don't use reference for test
)
```
test_split_mode and test_split_ratio are accepted by the datamodule (and passed to super().__init__), but _setup always uses the native Split.TEST regardless of those settings. Either (a) explicitly enforce test_split_mode == FROM_DIR for Kaputt (raise a clear error otherwise), or (b) implement the alternative split behavior when users request it, so the public API inputs aren’t silently ignored.
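A hedged sketch of option (a), assuming TestSplitMode is importable from anomalib.data.utils and that the check sits at the top of _setup:

```python
from anomalib.data.utils import TestSplitMode

# Kaputt ships native train/test splits, so only FROM_DIR is meaningful here
if self.test_split_mode != TestSplitMode.FROM_DIR:
    msg = (
        f"Kaputt provides native splits; test_split_mode={self.test_split_mode} "
        "is not supported. Use TestSplitMode.FROM_DIR."
    )
    raise ValueError(msg)
```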
```toml
# dataset specific dependencies
datasets = ["pyarrow"]
```
The Kaputt dummy dataset generator writes Parquet files (to_parquet) and the dataset reader uses read_parquet, which typically requires pyarrow (or another parquet engine) to be installed in the test environment. Since pyarrow is only in the datasets extra (not test), CI that installs only [test] may fail. Consider adding pyarrow to the test extra or skipping Kaputt tests when a parquet engine isn’t available.
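A hedged sketch of the skip option on the test side, placed at the top of the Kaputt test module:

```python
import pytest

# Skip Kaputt tests when no Parquet engine is installed in the environment
pyarrow = pytest.importorskip("pyarrow", reason="Kaputt tests require a Parquet engine")
```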
```toml
# testing dependencies
test = [
    "prek",
    "pytest",
```
Same Parquet-engine concern as above: since pyarrow lives only in the datasets extra, add it to the test extra (or skip Kaputt tests when no Parquet engine is available).
| "pytest", | |
| "pytest", | |
| "pyarrow", |
```python
def _generate_dummy_kaputt_dataset(self) -> None:
    """Generate dummy Kaputt dataset with Parquet metadata files.
```
The new dummy Kaputt generator is added, but the PR’s unit test shown only validates datamodule construction; it doesn’t assert key Kaputt-specific behaviors (e.g., use_reference=True adds reference rows, image_type='crop' switches subdirectories, abnormal samples include non-empty mask_path). Add targeted assertions in the new Kaputt test file to cover these new behaviors.
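A hedged sketch of such assertions, assuming the fixture names and constructor parameters shown elsewhere in this PR's diff (the test names are hypothetical):

```python
from pathlib import Path

from anomalib.data import Kaputt


def test_abnormal_samples_have_masks(datamodule: Kaputt) -> None:
    """Every abnormal test sample should carry a non-empty mask_path."""
    datamodule.setup()
    samples = datamodule.test_data.samples
    abnormal = samples[samples["label"] == "abnormal"]
    assert (abnormal["mask_path"] != "").all()


def test_use_reference_adds_rows(dataset_path: Path) -> None:
    """use_reference=True should yield at least as many train samples as False."""
    with_ref = Kaputt(root=dataset_path, use_reference=True)
    without_ref = Kaputt(root=dataset_path, use_reference=False)
    with_ref.setup()
    without_ref.setup()
    assert len(with_ref.train_data) >= len(without_ref.train_data)
```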
📝 Description
Warning: untested, as I don't have access to the dataset yet.
✨ Changes
Select what type of change your PR is:
✅ Checklist
Before you submit your pull request, please make sure you have completed the following steps:
For more information about code review checklists, see the Code Review Checklist.