
feat(data): add kaputt dataset #3330

Draft

ashwinvaidya17 wants to merge 2 commits into open-edge-platform:main from ashwinvaidya17:ashwin/kaputt_dataset

Conversation

ashwinvaidya17 (Contributor) commented Feb 11, 2026

📝 Description

Warning

Untested as I don't have access to the dataset yet

✨ Changes

Select what type of change your PR is:

  • 🚀 New feature (non-breaking change which adds functionality)
  • 🐞 Bug fix (non-breaking change which fixes an issue)
  • 🔄 Refactor (non-breaking change which refactors the code base)
  • ⚡ Performance improvements
  • 🎨 Style changes (code style/formatting)
  • 🧪 Tests (adding/modifying tests)
  • 📚 Documentation update
  • 📦 Build system changes
  • 🚧 CI/CD configuration
  • 🔧 Chore (general maintenance)
  • 🔒 Security update
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)

✅ Checklist

Before you submit your pull request, please make sure you have completed the following steps:

  • 📚 I have made the necessary updates to the documentation (if applicable).
  • 🧪 I have written tests that support my changes and prove that my fix is effective or my feature works (if applicable).
  • 🏷️ My PR title follows conventional commit format.

For more information about code review checklists, see the Code Review Checklist.

Signed-off-by: Ashwin Vaidya <ashwin.vaidya@intel.com>
Copilot AI review requested due to automatic review settings February 11, 2026 11:03

Copilot AI left a comment


Pull request overview

Adds first-class support for the Kaputt dataset by introducing a new dataset + datamodule, wiring them into the public anomalib.data API, and adding a config and unit test scaffold.

Changes:

  • Added KaputtDataset (Parquet-driven sample parsing) and Kaputt Lightning datamodule.
  • Integrated Kaputt into package exports and added an example Hydra config.
  • Added dummy-data generation utilities and a unit test file for the new datamodule.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 11 comments.

Summary per file:

  • tests/unit/data/datamodule/image/test_kaputt.py: Adds a unit test scaffold for Kaputt datamodule creation/config.
  • tests/helpers/data.py: Adds a dummy Kaputt dataset generator, including Parquet metadata.
  • src/anomalib/data/datasets/image/kaputt.py: Implements Kaputt dataset parsing from Parquet and sample construction.
  • src/anomalib/data/datasets/image/__init__.py: Exposes KaputtDataset in the image datasets package.
  • src/anomalib/data/datamodules/image/kaputt.py: Adds the Kaputt datamodule with native split handling and dataset availability checks.
  • src/anomalib/data/datamodules/image/__init__.py: Exposes Kaputt and registers the KAPUTT format.
  • src/anomalib/data/__init__.py: Re-exports Kaputt / KaputtDataset at top-level anomalib.data.
  • pyproject.toml: Adds a datasets extra (pyarrow) and includes it in full.
  • examples/configs/data/kaputt.yaml: Adds an example data config for Kaputt.


Comment on lines +19 to +21
@pytest.fixture()
@staticmethod
def datamodule(dataset_path: Path) -> Kaputt:

Copilot AI Feb 11, 2026


The decorator order is likely broken: decorators apply bottom-up, so @staticmethod runs first and turns the function into a staticmethod object before @pytest.fixture() sees it. Pytest fixtures expect a callable function, and this pattern commonly fails at collection time. Fix by removing @staticmethod (recommended) or swapping decorator order (@staticmethod above @pytest.fixture).

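A minimal sketch of the recommended fix, assuming the fixture lives on a test class as in the PR's test file (the root argument and the "kaputt" subfolder are assumptions for illustration; Kaputt is re-exported from anomalib.data by this PR):

import pytest
from pathlib import Path

from anomalib.data import Kaputt


class TestKaputt:
    @pytest.fixture()  # no @staticmethod: pytest expects a plain callable
    def datamodule(self, dataset_path: Path) -> Kaputt:
        # `dataset_path` is injected by another fixture defined elsewhere in the suite
        return Kaputt(root=dataset_path / "kaputt")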
Comment on lines +35 to +37
@pytest.fixture()
@staticmethod
def fxt_data_config_path() -> str:

Copilot AI Feb 11, 2026


Same issue as the datamodule fixture: @staticmethod is applied before @pytest.fixture(), which can prevent pytest from treating this as a valid fixture. Remove @staticmethod or swap the decorator order.

Comment on lines +54 to +68
# Image extensions used in Kaputt dataset
IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")

# Material categories in Kaputt dataset (based on item_material field)
MATERIAL_CATEGORIES = (
    "cardboard",
    "glass",
    "metal",
    "paper",
    "plastic",
    "styrofoam",
    "wood",
)



Copilot AI Feb 11, 2026


IMG_EXTENSIONS and MATERIAL_CATEGORIES are defined but not used anywhere in this module. Either remove them to avoid dead code, or use them (e.g., resolve the actual image extension rather than hardcoding .jpg, and/or validate/normalize item_material).

Suggested change
# Image extensions used in Kaputt dataset
IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")

# Material categories in Kaputt dataset (based on item_material field)
MATERIAL_CATEGORIES = (
    "cardboard",
    "glass",
    "metal",
    "paper",
    "plastic",
    "styrofoam",
    "wood",
)

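If the constants are kept instead, a minimal sketch of the validate/normalize route the comment mentions (normalize_material is a hypothetical helper name; the item_material field comes from the PR description):

MATERIAL_CATEGORIES = ("cardboard", "glass", "metal", "paper", "plastic", "styrofoam", "wood")


def normalize_material(value: object) -> str:
    """Return the canonical material name, or an empty string for unknown or missing values."""
    material = str(value).strip().lower()
    return material if material in MATERIAL_CATEGORIES else ""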
Comment on lines +225 to +227
image_path = (
    root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
)

Copilot AI Feb 11, 2026


Image paths are hardcoded to .jpg, but the module defines IMG_EXTENSIONS and the dataset description implies multiple possible image formats. If a capture is stored as .png/.jpeg, this will silently produce non-existent paths and later failures. Prefer reading the filename/extension from metadata (if available) or resolving by checking for an existing file with any allowed extension before constructing the sample.

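A hedged sketch of an extension-resolving helper that both the query and reference call sites could share (resolve_image_path is a hypothetical name; IMG_EXTENSIONS is the module constant quoted earlier):

from pathlib import Path

IMG_EXTENSIONS = (".jpg", ".JPG", ".jpeg", ".JPEG", ".png", ".PNG")


def resolve_image_path(base_dir: Path, capture_id: str) -> Path:
    """Return the first existing image for `capture_id`, falling back to .jpg."""
    for ext in IMG_EXTENSIONS:
        candidate = base_dir / f"{capture_id}{ext}"
        if candidate.exists():
            return candidate
    # Preserve the current behavior when nothing matches on disk.
    return base_dir / f"{capture_id}.jpg"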
Comment on lines +261 to +269
image_path = (
    root
    / f"reference-{image_type}"
    / "data"
    / split_name
    / "reference-data"
    / image_type
    / f"{capture_id}.jpg"
)

Copilot AI Feb 11, 2026


Same hardcoded .jpg issue for reference images. If reference images are not strictly .jpg in the real dataset, this will generate incorrect paths. Use metadata or extension resolution (possibly leveraging IMG_EXTENSIONS) to build correct paths.

Suggested change
base_dir = (
    root
    / f"reference-{image_type}"
    / "data"
    / split_name
    / "reference-data"
    / image_type
)
image_path = None
for ext in IMG_EXTENSIONS:
    candidate = base_dir / f"{capture_id}{ext}"
    if candidate.exists():
        image_path = candidate
        break
if image_path is None:
    # Fallback to original .jpg behavior if no file with a known extension is found
    image_path = base_dir / f"{capture_id}.jpg"

Comment on lines +215 to +249
# Process query images
for _, row in query_df.iterrows():
    capture_id = row["capture_id"]

    # Determine if defective
    is_defective = row.get("defect", False)
    label = "abnormal" if is_defective else "normal"
    label_index = LabelName.ABNORMAL if is_defective else LabelName.NORMAL

    # Build image path
    image_path = (
        root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type / f"{capture_id}.jpg"
    )

    # Build mask path for defective images
    mask_path = ""
    if is_defective:
        mask_path = str(
            root / "query-mask" / "data" / split_name / "query-data" / "mask" / f"{capture_id}.png",
        )

    # Convert split name back to anomalib format
    anomalib_split = "val" if split_name == "validation" else split_name

    sample = {
        "split": anomalib_split,
        "label": label,
        "image_path": str(image_path),
        "mask_path": mask_path,
        "label_index": int(label_index),
        "capture_id": capture_id,
        "defect_types": row.get("defect_types", []),
        "item_material": row.get("item_material", ""),
    }
    all_samples.append(sample)

Copilot AI Feb 11, 2026


iterrows() over a (potentially) 230k+ row dataset will be a bottleneck. Consider vectorizing sample construction (e.g., build columns with pandas ops and then to_dict('records')) to reduce Python-level looping and speed up dataset initialization.

Suggested change
# Ensure optional columns exist with defaults
if "defect" not in query_df.columns:
    query_df["defect"] = False
if "defect_types" not in query_df.columns:
    query_df["defect_types"] = [[] for _ in range(len(query_df))]
if "item_material" not in query_df.columns:
    query_df["item_material"] = ""

# Process query images using vectorized operations
query_df["is_defective"] = query_df["defect"].astype(bool)
query_df["label"] = query_df["is_defective"].map({True: "abnormal", False: "normal"})
abnormal_value = int(LabelName.ABNORMAL)
normal_value = int(LabelName.NORMAL)
query_df["label_index"] = query_df["is_defective"].map({True: abnormal_value, False: normal_value})

# Build image paths
base_image_dir = root / f"query-{image_type}" / "data" / split_name / "query-data" / image_type
query_df["image_path"] = query_df["capture_id"].astype(str).apply(
    lambda cid: str(base_image_dir / f"{cid}.jpg"),
)

# Build mask paths (only for defective samples)
base_mask_dir = root / "query-mask" / "data" / split_name / "query-data" / "mask"
query_df["mask_path"] = ""
defective_mask = query_df["is_defective"]
query_df.loc[defective_mask, "mask_path"] = query_df.loc[defective_mask, "capture_id"].astype(str).apply(
    lambda cid: str(base_mask_dir / f"{cid}.png"),
)

# Convert split name back to anomalib format
anomalib_split = "val" if split_name == "validation" else split_name
query_df["split"] = anomalib_split

# Collect samples from this split
split_samples = query_df[
    [
        "split",
        "label",
        "image_path",
        "mask_path",
        "label_index",
        "capture_id",
        "defect_types",
        "item_material",
    ]
].to_dict(orient="records")
all_samples.extend(split_samples)

Comment on lines +184 to +195
self.train_data = KaputtDataset(
    split=Split.TRAIN,
    root=self.root,
    image_type=self.image_type,
    use_reference=self.use_reference,
)
self.test_data = KaputtDataset(
    split=Split.TEST,
    root=self.root,
    image_type=self.image_type,
    use_reference=False,  # Don't use reference for test
)

Copilot AI Feb 11, 2026


test_split_mode and test_split_ratio are accepted by the datamodule (and passed to super().__init__), but _setup always uses the native Split.TEST regardless of those settings. Either (a) explicitly enforce test_split_mode == FROM_DIR for Kaputt (raise a clear error otherwise), or (b) implement the alternative split behavior when users request it, so the public API inputs aren’t silently ignored.

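A minimal sketch of option (a), assuming anomalib's TestSplitMode enum and that the datamodule keeps the setting on self, as other anomalib datamodules do:

from anomalib.data.utils import TestSplitMode

# Inside Kaputt.__init__ (or _setup): fail loudly instead of silently ignoring the input.
if self.test_split_mode != TestSplitMode.FROM_DIR:
    msg = (
        "Kaputt ships native train/validation/test splits; "
        f"test_split_mode={self.test_split_mode} is not supported."
    )
    raise ValueError(msg)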
Comment on lines +78 to +79
# dataset specific dependencies
datasets = ["pyarrow"]

Copilot AI Feb 11, 2026


The Kaputt dummy dataset generator writes Parquet files (to_parquet) and the dataset reader uses read_parquet, which typically requires pyarrow (or another parquet engine) to be installed in the test environment. Since pyarrow is only in the datasets extra (not test), CI that installs only [test] may fail. Consider adding pyarrow to the test extra or skipping Kaputt tests when a parquet engine isn’t available.

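A minimal sketch of the skip-based alternative, placed at the top of the Kaputt test module (pytest.importorskip is standard pytest and skips collection when the import fails):

import pytest

# Skip every test in this module when no parquet engine is installed.
pyarrow = pytest.importorskip("pyarrow")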
# testing dependencies
test = [
    "prek",
    "pytest",

Copilot AI Feb 11, 2026


Same issue as the previous comment: pyarrow lives only in the datasets extra, so a test-only install may lack a parquet engine. The suggested change below adds pyarrow to the test extra.

Suggested change
    "pytest",
    "pytest",
    "pyarrow",

Comment on lines +689 to +690
def _generate_dummy_kaputt_dataset(self) -> None:
"""Generate dummy Kaputt dataset with Parquet metadata files.

Copilot AI Feb 11, 2026


The new dummy Kaputt generator is added, but the unit test shown in this PR only validates datamodule construction; it doesn't assert key Kaputt-specific behaviors (e.g., use_reference=True adds reference rows, image_type='crop' switches subdirectories, abnormal samples include a non-empty mask_path). Add targeted assertions in the new Kaputt test file to cover these behaviors.

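A hedged sketch of such assertions; the constructor arguments, the samples columns, and the dummy-dataset location follow the PR's file summaries above and are assumptions, not verified against the branch:

from pathlib import Path

from anomalib.data import KaputtDataset


def test_kaputt_behaviors(dataset_path: Path) -> None:
    root = dataset_path / "kaputt"  # hypothetical dummy-dataset location

    # use_reference=True should add reference rows on top of the query rows
    plain = KaputtDataset(root=root, split="train", use_reference=False)
    with_ref = KaputtDataset(root=root, split="train", use_reference=True)
    assert len(with_ref.samples) > len(plain.samples)

    # abnormal test samples should carry a non-empty mask_path
    test_set = KaputtDataset(root=root, split="test")
    abnormal = test_set.samples[test_set.samples.label_index == 1]
    assert (abnormal.mask_path != "").all()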

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

✨ Add the Kaputt dataset from "Kaputt: A Large-Scale Dataset for Visual Defect Detection"

1 participant