Feat size: Added detection for images which are way bigger than average. (#175)

wirthual · web-flow · commit fa42afd03a92 · 2023-05-30T17:24:51.000-07:00
* add issue type for bigger images

* add two images in description

* commit additional changes

* remove smaller tests for now.

* black format new files

* black formatted changes

* removed unused imports

* fixed typing problems and flake8.

* Increased expected column count by 8, those are ['width','height', 'width_score_raw', 'is_width_issue', 'width_score','height_score_raw', 'is_height_issue','height_score']

* add size to issue properties

* removed unused field

* update readme. Added a test with custom threshold.

* cleanup

* removed unnecessary report calls

* use float type as return for image area to be compatible with types.

* get rid of flake8 errors

* added changes requested. Use sqrt of area as score now. Rename to odd_size

* change naming of issue

* only show small image

* add new image
diff --git a/README.md b/README.md
@@ -78,16 +78,17 @@ use the [cleanlab](https://github.com/cleanlab/cleanlab/) package.
 
 In any collection of image files (most [formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) supported), CleanVision can detect the following types of issues:
 
-|     | Issue Type       | Description                                               | Issue Key        | Example                                                                                                             |
-|-----|------------------|-----------------------------------------------------------|------------------|---------------------------------------------------------------------------------------------------------------------|
-| 1   | Exact Duplicates | Images that are identical to each other            | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png) |
-| 2   | Near Duplicates  | Images that are visually almost identical          | near_duplicates  | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png)  |
-| 3   | Blurry           | Images where details are fuzzy (out of focus)                             | blurry           | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png)           |
-| 4   | Low Information  | Images lacking content (little entropy in pixel values) | low_information  | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png)  |
-| 5   | Dark             | Irregularly dark images (*under*exposed)                                   | dark             | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg)             |
-| 6   | Light            | Irregularly bright images (*over*exposed)                       | light            | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg)            |
-| 7   | Grayscale        | Images lacking color                                      | grayscale        | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg)        |
-| 8   | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide)            | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg) |
+|   | Issue Type       | Description                                                     | Issue Key        | Example                                                                                                                                 |
+|---|------------------|-----------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
+| 1 | Exact Duplicates | Images that are identical to each other                         | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png)                     |
+| 2 | Near Duplicates  | Images that are visually almost identical                       | near_duplicates  | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png)                      |
+| 3 | Blurry           | Images where details are fuzzy (out of focus)                   | blurry           | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png)                               |
+| 4 | Low Information  | Images lacking content (little entropy in pixel values)         | low_information  | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png)                      |
+| 5 | Dark             | Irregularly dark images (*under*exposed)                        | dark             | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg)                                 |
+| 6 | Light            | Irregularly bright images (*over*exposed)                       | light            | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg)                                |
+| 7 | Grayscale        | Images lacking color                                            | grayscale        | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg)                            |
+| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide)        | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg)                     |
+| 9 | Odd Size         | Images which are n times larger or smaller than the median size | odd_size         | <img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_size.png" width=20% height=20%> |
 
 This package is still a work in progress, so expect sharp edges.
 Feel free to submit any found bugs or desired functionality as an [issue][issue]!
diff --git a/src/cleanvision/imagelab.py b/src/cleanvision/imagelab.py
@@ -162,6 +162,7 @@ def _set_default_config(self) -> Dict[str, Any]:
                 IssueType.NEAR_DUPLICATES,
                 IssueType.BLURRY,
                 IssueType.GRAYSCALE,
+                IssueType.ODD_SIZE,
             ],
         }
 
diff --git a/src/cleanvision/issue_managers/__init__.py b/src/cleanvision/issue_managers/__init__.py
@@ -15,6 +15,7 @@ class IssueType(Enum):
     NEAR_DUPLICATES = "near_duplicates"
     BLURRY = "blurry"
     GRAYSCALE = "grayscale"
+    ODD_SIZE = "odd_size"
 
 
 ISSUE_MANAGER_REGISTRY: Dict[str, Type[IssueManager]] = {}
diff --git a/src/cleanvision/issue_managers/image_property.py b/src/cleanvision/issue_managers/image_property.py
@@ -1,3 +1,4 @@
+import math
 from abc import ABC, abstractmethod
 from typing import List, Dict, Any, Union, overload
 
@@ -292,6 +293,11 @@ def calc_color_space(image: Image) -> str:
     return get_image_mode(image)
 
 
+def calc_image_area_sqrt(image: Image) -> float:
+    size = image.size
+    return math.sqrt(size[0] * size[1])
+
+
 class ColorSpaceProperty(ImageProperty):
     name = "color_space"
 
@@ -329,6 +335,49 @@ def mark_issue(
         return is_issue
 
 
+class SizeProperty(ImageProperty):
+    name = "size"
+
+    @property
+    def score_columns(self) -> List[str]:
+        return self._score_columns
+
+    def __init__(self) -> None:
+        self._score_columns = [self.name]
+
+    def calculate(self, image: Image) -> Dict[str, Union[float, str]]:
+        return {self.name: calc_image_area_sqrt(image)}
+
+    def get_scores(
+        self,
+        raw_scores: pd.DataFrame,
+        issue_type: str,
+        **kwargs: Any,
+    ) -> pd.DataFrame:
+        super().get_scores(raw_scores, issue_type, **kwargs)
+        assert raw_scores is not None
+        scores = pd.DataFrame(index=raw_scores.index)
+        scores[get_score_colname(issue_type)] = raw_scores[self.score_columns[0]].apply(
+            lambda x: 1.0
+            / max(
+                x / raw_scores[self.score_columns[0]].median(),
+                raw_scores[self.score_columns[0]].median() / x,
+            )
+        )
+        return scores
+
+    def mark_issue(
+        self, scores: pd.DataFrame, threshold: float, issue_type: str
+    ) -> pd.DataFrame:
+        is_issue = pd.DataFrame(index=scores.index)
+        is_issue[get_is_issue_colname(issue_type)] = np.where(
+            scores[get_score_colname(issue_type)] < 1.0 / threshold,
+            True,
+            False,
+        )
+        return is_issue
+
+
 def get_image_mode(image: Image) -> str:
     if image.mode:
         image_mode = image.mode
diff --git a/src/cleanvision/issue_managers/image_property_issue_manager.py b/src/cleanvision/issue_managers/image_property_issue_manager.py
@@ -13,6 +13,7 @@
     BlurrinessProperty,
     ColorSpaceProperty,
     ImageProperty,
+    SizeProperty,
 )
 from cleanvision.utils.base_issue_manager import IssueManager
 from cleanvision.utils.constants import (
@@ -71,6 +72,7 @@ def get_default_params(self) -> Dict[str, Any]:
                 "color_threshold": 0.18,
             },
             IssueType.GRAYSCALE.value: {},
+            IssueType.ODD_SIZE.value: {"threshold": 10.0},
         }
 
     def update_params(self, params: Dict[str, Any]) -> None:
@@ -88,6 +90,7 @@ def _get_image_properties(self) -> Dict[str, ImageProperty]:
             IssueType.LOW_INFORMATION.value: EntropyProperty(),
             IssueType.BLURRY.value: BlurrinessProperty(),
             IssueType.GRAYSCALE.value: ColorSpaceProperty(),
+            IssueType.ODD_SIZE.value: SizeProperty(),
         }
 
     def _get_defer_set(
diff --git a/src/cleanvision/utils/constants.py b/src/cleanvision/utils/constants.py
@@ -2,13 +2,15 @@
 
 IMAGE_PROPERTY: str = "image_property"
 DUPLICATE: str = "duplicate"
+
 IMAGE_PROPERTY_ISSUE_TYPES_LIST: List[str] = [
     "dark",
     "light",
     "odd_aspect_ratio",
     "low_information",
     "blurry",
     "grayscale",
+    "odd_size",
 ]
 DUPLICATE_ISSUE_TYPES_LIST: List[str] = ["exact_duplicates", "near_duplicates"]
 SETS: str = "sets"
diff --git a/tests/conftest.py b/tests/conftest.py
@@ -35,9 +35,7 @@ def generate_single_image_file(tmpdir_factory, img_name="img.png", arr=None):
     return str(fn)
 
 
-@pytest.fixture(scope="session")
-def generate_local_dataset(tmp_path_factory, n_classes, images_per_class):
-    """Generates n temporary images for testing and returns dir of images"""
+def generate_local_dataset_base(tmp_path_factory, n_classes, images_per_class):
     tmp_image_dir = tmp_path_factory.mktemp("data")
     for i in range(n_classes):
         class_dir = tmp_image_dir / f"class_{i}"
@@ -48,3 +46,15 @@ def generate_local_dataset(tmp_path_factory, n_classes, images_per_class):
             fn = class_dir / img_name
             img.save(fn)
     return tmp_image_dir
+
+
+@pytest.fixture(scope="session")
+def generate_local_dataset(tmp_path_factory, n_classes, images_per_class):
+    """Generates n temporary images for testing and returns dir of images"""
+    return generate_local_dataset_base(tmp_path_factory, n_classes, images_per_class)
+
+
+@pytest.fixture(scope="function")
+def generate_local_dataset_once(tmp_path_factory, n_classes, images_per_class):
+    """Generates n temporary images per test function and returns dir of images"""
+    return generate_local_dataset_base(tmp_path_factory, n_classes, images_per_class)
diff --git a/tests/test_image_property_helpers.py b/tests/test_image_property_helpers.py
@@ -4,13 +4,15 @@
 from PIL import Image
 
 import cleanvision
+import math
 from cleanvision.issue_managers import IssueType
 from cleanvision.issue_managers.image_property import (
     BrightnessProperty,
     calculate_brightness,
     get_image_mode,
     calc_aspect_ratio,
     calc_entropy,
+    calc_image_area_sqrt,
     calc_blurriness,
 )
 from cleanvision.utils.utils import get_is_issue_colname, get_score_colname
@@ -50,6 +52,12 @@ def test_calc_bluriness():
     assert blurriness == 0
 
 
+def test_calc_area():
+    img = Image.new("RGB", (200, 200), (255, 0, 0))
+    area = calc_image_area_sqrt(img)  # sqrt(img.size[0] * img.size[1])
+    assert area == math.sqrt(200 * 200)
+
+
 @pytest.mark.parametrize(
     "image,expected_mode",
     [
diff --git a/tests/test_run.py b/tests/test_run.py
@@ -1,9 +1,10 @@
 import os
-
+import numpy as np
+from PIL import Image
 import pytest
 import torchvision
 from datasets import load_dataset
-
+from pathlib import Path
 from cleanvision.dataset.folder_dataset import FolderDataset
 from cleanvision import Imagelab
 from cleanvision.issue_managers.image_property import BrightnessProperty
@@ -28,6 +29,7 @@ def test_example1(capsys, generate_local_dataset):
         "near_duplicates",
         "blurry",
         "grayscale",
+        "odd_size",
     ]
     captured = capsys.readouterr()
 
@@ -171,7 +173,7 @@ def test_hf_dataset_run(generate_local_dataset, n_classes, images_per_class):
     imagelab = Imagelab(hf_dataset=hf_dataset, image_key="image")
     imagelab.find_issues()
     imagelab.report()
-    assert len(imagelab.issues.columns) == 16
+    assert len(imagelab.issues.columns) == 18
     assert len(imagelab.issues) == n_classes * images_per_class
 
 
@@ -181,7 +183,7 @@ def test_torch_dataset_run(generate_local_dataset, n_classes, images_per_class):
     imagelab = Imagelab(torchvision_dataset=torch_ds)
     imagelab.find_issues()
     imagelab.report()
-    assert len(imagelab.issues.columns) == 16
+    assert len(imagelab.issues.columns) == 18
     assert len(imagelab.issues) == n_classes * images_per_class
 
 
@@ -206,5 +208,96 @@ def test_filepath_dataset_run(generate_local_dataset, images_per_class):
     imagelab = Imagelab(filepaths=filepaths)
     imagelab.find_issues()
     imagelab.report()
-    assert len(imagelab.issues.columns) == 16
+    assert len(imagelab.issues.columns) == 18
     assert len(imagelab.issues) == images_per_class
+
+
+@pytest.mark.usefixtures("set_plt_show")
+def test_filepath_dataset_size_negative(generate_local_dataset_once, images_per_class):
+    """
+
+    All images are the same size, so no image should have a size issue
+
+    """
+    files = os.listdir(generate_local_dataset_once / "class_0")
+    filepaths = [
+        os.path.join(generate_local_dataset_once / "class_0", f) for f in files
+    ]
+    imagelab = Imagelab(filepaths=filepaths)
+    imagelab.find_issues()
+    assert len(imagelab.issues.columns) == 18
+    assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 0
+
+
+@pytest.mark.usefixtures("set_plt_show")
+def test_filepath_dataset_size_to_large(generate_local_dataset_once, images_per_class):
+    """
+    Size issue is defined based on the area of an image. If sqrt(width * height) is more than threshold (default 10) times
+    larger or smaller than the median sqrt(width * height), is_odd_size_issue is set to True. In this example, the median is sqrt(300 * 300) = 300.
+    A 3001 x 3001 image has a value of 3001, so it is more than 10x larger and thus should be flagged.
+    """
+    arr = np.random.randint(low=0, high=256, size=(3001, 3001, 3), dtype=np.uint8)
+    img = Image.fromarray(arr, mode="RGB")
+    img.save(Path(generate_local_dataset_once / "class_0" / "larger.png"))
+
+    files = os.listdir(generate_local_dataset_once / "class_0")
+    filepaths = [
+        os.path.join(generate_local_dataset_once / "class_0", f) for f in files
+    ]
+    imagelab = Imagelab(filepaths=filepaths)
+    imagelab.find_issues()
+    assert len(imagelab.issues.columns) == 18
+    assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 1
+
+
+@pytest.mark.usefixtures("set_plt_show")
+def test_filepath_dataset_size_to_small(generate_local_dataset_once, images_per_class):
+    """
+    Size issue is defined based on the area of an image. If sqrt(width * height) is more than threshold (default 10) times
+    larger or smaller than the median sqrt(width * height), is_odd_size_issue is set to True. In this example, the median is sqrt(300 * 300) = 300.
+    A 29 x 29 image has a value of 29, so it is more than 10x smaller and thus should be flagged.
+    """
+    arr = np.random.randint(
+        low=0,
+        high=256,
+        size=(29, 29, 3),
+        dtype=np.uint8,  # 29 x 29 pixel image should be detected
+    )
+    img = Image.fromarray(arr, mode="RGB")
+    img.save(Path(generate_local_dataset_once / "class_0" / "smaller.png"))
+
+    files = os.listdir(generate_local_dataset_once / "class_0")
+    filepaths = [
+        os.path.join(generate_local_dataset_once / "class_0", f) for f in files
+    ]
+    imagelab = Imagelab(filepaths=filepaths)
+    imagelab.find_issues()
+    assert len(imagelab.issues.columns) == 18
+    assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 1
+
+
+@pytest.mark.usefixtures("set_plt_show")
+def test_filepath_dataset_size_custom_threshold(
+    generate_local_dataset_once, images_per_class
+):
+    """
+    With the default threshold the small image would be flagged (see test_filepath_dataset_size_to_small). However,
+    with a custom threshold of 11 instead of 10, the image is within the allowed range and should not be flagged.
+    """
+    arr = np.random.randint(
+        low=0,
+        high=256,
+        size=(29, 29, 3),
+        dtype=np.uint8,  # 29 x 29 pixel image should not be detected with threshold 11
+    )
+    img = Image.fromarray(arr, mode="RGB")
+    img.save(Path(generate_local_dataset_once / "class_0" / "smaller.png"))
+
+    files = os.listdir(generate_local_dataset_once / "class_0")
+    filepaths = [
+        os.path.join(generate_local_dataset_once / "class_0", f) for f in files
+    ]
+    imagelab = Imagelab(filepaths=filepaths)
+    imagelab.find_issues({"odd_size": {"threshold": 11.0}})
+    assert len(imagelab.issues.columns) == 2  # Only size
+    assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 0
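
For reviewers who want to check the boundary cases in the size tests by hand, here is a minimal standalone sketch of the scoring rule from `SizeProperty`. It is not the library code: `odd_size_scores` and `flag_odd_sizes` are hypothetical names, it uses plain lists and the standard library instead of pandas, and it assumes every image has a non-zero area.

```python
import math
from statistics import median


def odd_size_scores(sizes):
    """Return one score per (width, height) pair; 1.0 means exactly median size."""
    roots = [math.sqrt(w * h) for w, h in sizes]
    med = median(roots)  # computed once for the whole dataset
    # Fold the ratio so that too-large and too-small images score alike.
    return [1.0 / max(r / med, med / r) for r in roots]


def flag_odd_sizes(sizes, threshold=10.0):
    """Flag images whose score is strictly below 1 / threshold."""
    return [score < 1.0 / threshold for score in odd_size_scores(sizes)]


# Hypothetical dataset: five 300 x 300 images plus one outlier.
base = [(300, 300)] * 5
print(flag_odd_sizes(base + [(3001, 3001)]))               # outlier flagged
print(flag_odd_sizes(base + [(29, 29)]))                   # outlier flagged
print(flag_odd_sizes(base + [(29, 29)], threshold=11.0))   # nothing flagged
print(flag_odd_sizes(base + [(30, 30)]))                   # nothing flagged
```

Because the comparison is strict, an image exactly 10x smaller than the median (30 x 30 here) is not flagged at the default threshold, which is why the tests use 29 x 29 and 3001 x 3001.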
