Commit fa42afd

Feat size: Added detection for images which are way bigger than average. (#175)
* add issue type for bigger images
* add two images in description
* commit additional changes
* remove smaller tests for now.
* black format new files
* black formatted changes
* removed unused imports
* fixed typing problems and flake8.
* Increased expected column count by 8, those are ['width','height', 'width_score_raw', 'is_width_issue', 'width_score','height_score_raw', 'is_height_issue','height_score']
* add size to issue properties
* removed unused field
* update readme. Added a test with custom threshold.
* cleanup
* removed unnecessary report calls
* use float type as return for image area to be compatible with types.
* get rid of flake8 errors
* added changes requested. Use sqrt of area as score now. Rename to odd_size
* change naming of issue
* only show small image
* add new image
1 parent 7666c17 commit fa42afd

9 files changed: +186 -18 lines changed

README.md

Lines changed: 11 additions & 10 deletions
Original file line number | Diff line number | Diff line change
@@ -78,16 +78,17 @@ use the [cleanlab](https://github.com/cleanlab/cleanlab/) package.
7878

7979
In any collection of image files (most [formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) supported), CleanVision can detect the following types of issues:
8080

81-
| | Issue Type | Description | Issue Key | Example |
82-
|-----|------------------|-----------------------------------------------------------|------------------|---------------------------------------------------------------------------------------------------------------------|
83-
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png) |
84-
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png) |
85-
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png) |
86-
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png) |
87-
| 5 | Dark | Irregularly dark images (*under*exposed) | dark | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg) |
88-
| 6 | Light | Irregularly bright images (*over*exposed) | light | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg) |
89-
| 7 | Grayscale | Images lacking color | grayscale | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg) |
90-
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg) |
81+
| | Issue Type | Description | Issue Key | Example |
82+
|---|------------------|-----------------------------------------------------------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------|
83+
| 1 | Exact Duplicates | Images that are identical to each other | exact_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/exact_duplicates.png) |
84+
| 2 | Near Duplicates | Images that are visually almost identical | near_duplicates | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/near_duplicates.png) |
85+
| 3 | Blurry | Images where details are fuzzy (out of focus) | blurry | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/blurry.png) |
86+
| 4 | Low Information | Images lacking content (little entropy in pixel values) | low_information | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/low_information.png) |
87+
| 5 | Dark | Irregularly dark images (*under*exposed) | dark | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/dark.jpg) |
88+
| 6 | Light | Irregularly bright images (*over*exposed) | light | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/light.jpg) |
89+
| 7 | Grayscale | Images lacking color | grayscale | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/grayscale.jpg) |
90+
| 8 | Odd Aspect Ratio | Images with an unusual aspect ratio (overly skinny/wide) | odd_aspect_ratio | ![](https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_aspect_ratio.jpg) |
91+
| 9 | Odd Size | Images which are n times larger or smaller than the median size | odd_size | <img src="https://raw.githubusercontent.com/cleanlab/assets/master/cleanvision/example_issue_images/odd_size.png" width=20% height=20%> |
9192

9293
This package is still a work in progress, so expect sharp edges.
9394
Feel free to submit any found bugs or desired functionality as an [issue][issue]!

src/cleanvision/imagelab.py

Lines changed: 1 addition & 0 deletions
@@ -162,6 +162,7 @@ def _set_default_config(self) -> Dict[str, Any]:
162162
IssueType.NEAR_DUPLICATES,
163163
IssueType.BLURRY,
164164
IssueType.GRAYSCALE,
165+
IssueType.ODD_SIZE,
165166
],
166167
}
167168

src/cleanvision/issue_managers/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ class IssueType(Enum):
1515
NEAR_DUPLICATES = "near_duplicates"
1616
BLURRY = "blurry"
1717
GRAYSCALE = "grayscale"
18+
ODD_SIZE = "odd_size"
1819

1920

2021
ISSUE_MANAGER_REGISTRY: Dict[str, Type[IssueManager]] = {}

src/cleanvision/issue_managers/image_property.py

Lines changed: 49 additions & 0 deletions
@@ -1,3 +1,4 @@
1+
import math
12
from abc import ABC, abstractmethod
23
from typing import List, Dict, Any, Union, overload
34

@@ -292,6 +293,11 @@ def calc_color_space(image: Image) -> str:
292293
return get_image_mode(image)
293294

294295

296+
def calc_image_area_sqrt(image: Image) -> float:
297+
size = image.size
298+
return math.sqrt(size[0] * size[1])
299+
300+
295301
class ColorSpaceProperty(ImageProperty):
296302
name = "color_space"
297303

@@ -329,6 +335,49 @@ def mark_issue(
329335
return is_issue
330336

331337

338+
class SizeProperty(ImageProperty):
339+
name = "size"
340+
341+
@property
342+
def score_columns(self) -> List[str]:
343+
return self._score_columns
344+
345+
def __init__(self) -> None:
346+
self._score_columns = [self.name]
347+
348+
def calculate(self, image: Image) -> Dict[str, Union[float, str]]:
349+
return {self.name: calc_image_area_sqrt(image)}
350+
351+
def get_scores(
352+
self,
353+
raw_scores: pd.DataFrame,
354+
issue_type: str,
355+
**kwargs: Any,
356+
) -> pd.DataFrame:
357+
super().get_scores(raw_scores, issue_type, **kwargs)
358+
assert raw_scores is not None
359+
scores = pd.DataFrame(index=raw_scores.index)
360+
scores[get_score_colname(issue_type)] = raw_scores[self.score_columns[0]].apply(
361+
lambda x: 1.0
362+
/ max(
363+
x / raw_scores[self.score_columns[0]].median(),
364+
raw_scores[self.score_columns[0]].median() / x,
365+
)
366+
)
367+
return scores
368+
369+
def mark_issue(
370+
self, scores: pd.DataFrame, threshold: float, issue_type: str
371+
) -> pd.DataFrame:
372+
is_issue = pd.DataFrame(index=scores.index)
373+
is_issue[get_is_issue_colname(issue_type)] = np.where(
374+
scores[get_score_colname(issue_type)] < 1.0 / threshold,
375+
True,
376+
False,
377+
)
378+
return is_issue
379+
380+
332381
def get_image_mode(image: Image) -> str:
333382
if image.mode:
334383
image_mode = image.mode
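
The scoring and flagging logic that `SizeProperty` adds can be sketched without pandas. This is a minimal standard-library illustration, not the project's code: the helper names `size_scores` and `flag_odd_sizes` and the sample dimensions are hypothetical, and the median is computed once up front rather than inside the per-row lambda.

```python
import math
from statistics import median
from typing import List, Tuple


def size_scores(dims: List[Tuple[int, int]]) -> List[float]:
    """Score each (width, height) pair in (0, 1]; 1.0 means exactly median-sized."""
    # Same raw value as calc_image_area_sqrt: the square root of the pixel area.
    roots = [math.sqrt(w * h) for w, h in dims]
    med = median(roots)  # computed once for the whole dataset
    # The score is the inverse of how many times larger or smaller
    # than the median an image is.
    return [1.0 / max(r / med, med / r) for r in roots]


def flag_odd_sizes(dims: List[Tuple[int, int]], threshold: float = 10.0) -> List[bool]:
    """Flag images whose score falls strictly below 1 / threshold."""
    return [score < 1.0 / threshold for score in size_scores(dims)]


# Hypothetical data: five 300 x 300 images plus one small and one large outlier.
dims = [(300, 300)] * 5 + [(29, 29), (3001, 3001)]
flags_default = flag_odd_sizes(dims)                  # threshold 10
flags_custom = flag_odd_sizes(dims, threshold=11.0)   # looser threshold
```

Because the comparison is strict, an image exactly `threshold` times larger or smaller than the median is not flagged; only images beyond that factor are.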

src/cleanvision/issue_managers/image_property_issue_manager.py

Lines changed: 3 additions & 0 deletions
@@ -13,6 +13,7 @@
1313
BlurrinessProperty,
1414
ColorSpaceProperty,
1515
ImageProperty,
16+
SizeProperty,
1617
)
1718
from cleanvision.utils.base_issue_manager import IssueManager
1819
from cleanvision.utils.constants import (
@@ -71,6 +72,7 @@ def get_default_params(self) -> Dict[str, Any]:
7172
"color_threshold": 0.18,
7273
},
7374
IssueType.GRAYSCALE.value: {},
75+
IssueType.ODD_SIZE.value: {"threshold": 10.0},
7476
}
7577

7678
def update_params(self, params: Dict[str, Any]) -> None:
@@ -88,6 +90,7 @@ def _get_image_properties(self) -> Dict[str, ImageProperty]:
8890
IssueType.LOW_INFORMATION.value: EntropyProperty(),
8991
IssueType.BLURRY.value: BlurrinessProperty(),
9092
IssueType.GRAYSCALE.value: ColorSpaceProperty(),
93+
IssueType.ODD_SIZE.value: SizeProperty(),
9194
}
9295

9396
def _get_defer_set(

src/cleanvision/utils/constants.py

Lines changed: 2 additions & 0 deletions
@@ -2,13 +2,15 @@
22

33
IMAGE_PROPERTY: str = "image_property"
44
DUPLICATE: str = "duplicate"
5+
56
IMAGE_PROPERTY_ISSUE_TYPES_LIST: List[str] = [
67
"dark",
78
"light",
89
"odd_aspect_ratio",
910
"low_information",
1011
"blurry",
1112
"grayscale",
13+
"odd_size",
1214
]
1315
DUPLICATE_ISSUE_TYPES_LIST: List[str] = ["exact_duplicates", "near_duplicates"]
1416
SETS: str = "sets"

tests/conftest.py

Lines changed: 13 additions & 3 deletions
@@ -35,9 +35,7 @@ def generate_single_image_file(tmpdir_factory, img_name="img.png", arr=None):
3535
return str(fn)
3636

3737

38-
@pytest.fixture(scope="session")
39-
def generate_local_dataset(tmp_path_factory, n_classes, images_per_class):
40-
"""Generates n temporary images for testing and returns dir of images"""
38+
def generate_local_dataset_base(tmp_path_factory, n_classes, images_per_class):
4139
tmp_image_dir = tmp_path_factory.mktemp("data")
4240
for i in range(n_classes):
4341
class_dir = tmp_image_dir / f"class_{i}"
@@ -48,3 +46,15 @@ def generate_local_dataset(tmp_path_factory, n_classes, images_per_class):
4846
fn = class_dir / img_name
4947
img.save(fn)
5048
return tmp_image_dir
49+
50+
51+
@pytest.fixture(scope="session")
52+
def generate_local_dataset(tmp_path_factory, n_classes, images_per_class):
53+
"""Generates n temporary images for testing and returns dir of images"""
54+
return generate_local_dataset_base(tmp_path_factory, n_classes, images_per_class)
55+
56+
57+
@pytest.fixture(scope="function")
58+
def generate_local_dataset_once(tmp_path_factory, n_classes, images_per_class):
59+
"""Generates n temporary images for testing and returns dir of images"""
60+
return generate_local_dataset_base(tmp_path_factory, n_classes, images_per_class)

tests/test_image_property_helpers.py

Lines changed: 8 additions & 0 deletions
@@ -4,13 +4,15 @@
44
from PIL import Image
55

66
import cleanvision
7+
import math
78
from cleanvision.issue_managers import IssueType
89
from cleanvision.issue_managers.image_property import (
910
BrightnessProperty,
1011
calculate_brightness,
1112
get_image_mode,
1213
calc_aspect_ratio,
1314
calc_entropy,
15+
calc_image_area_sqrt,
1416
calc_blurriness,
1517
)
1618
from cleanvision.utils.utils import get_is_issue_colname, get_score_colname
@@ -50,6 +52,12 @@ def test_calc_bluriness():
5052
assert blurriness == 0
5153

5254

55+
def test_calc_area():
56+
img = Image.new("RGB", (200, 200), (255, 0, 0))
57+
area = calc_image_area_sqrt(img) # sqrt(img.size[0] * img.size[1])
58+
assert area == math.sqrt(200 * 200)
59+
60+
5361
@pytest.mark.parametrize(
5462
"image,expected_mode",
5563
[

tests/test_run.py

Lines changed: 98 additions & 5 deletions
@@ -1,9 +1,10 @@
11
import os
2-
2+
import numpy as np
3+
from PIL import Image
34
import pytest
45
import torchvision
56
from datasets import load_dataset
6-
7+
from pathlib import Path
78
from cleanvision.dataset.folder_dataset import FolderDataset
89
from cleanvision import Imagelab
910
from cleanvision.issue_managers.image_property import BrightnessProperty
@@ -28,6 +29,7 @@ def test_example1(capsys, generate_local_dataset):
2829
"near_duplicates",
2930
"blurry",
3031
"grayscale",
32+
"odd_size",
3133
]
3234
captured = capsys.readouterr()
3335

@@ -171,7 +173,7 @@ def test_hf_dataset_run(generate_local_dataset, n_classes, images_per_class):
171173
imagelab = Imagelab(hf_dataset=hf_dataset, image_key="image")
172174
imagelab.find_issues()
173175
imagelab.report()
174-
assert len(imagelab.issues.columns) == 16
176+
assert len(imagelab.issues.columns) == 18
175177
assert len(imagelab.issues) == n_classes * images_per_class
176178

177179

@@ -181,7 +183,7 @@ def test_torch_dataset_run(generate_local_dataset, n_classes, images_per_class):
181183
imagelab = Imagelab(torchvision_dataset=torch_ds)
182184
imagelab.find_issues()
183185
imagelab.report()
184-
assert len(imagelab.issues.columns) == 16
186+
assert len(imagelab.issues.columns) == 18
185187
assert len(imagelab.issues) == n_classes * images_per_class
186188

187189

@@ -206,5 +208,96 @@ def test_filepath_dataset_run(generate_local_dataset, images_per_class):
206208
imagelab = Imagelab(filepaths=filepaths)
207209
imagelab.find_issues()
208210
imagelab.report()
209-
assert len(imagelab.issues.columns) == 16
211+
assert len(imagelab.issues.columns) == 18
210212
assert len(imagelab.issues) == images_per_class
213+
214+
215+
@pytest.mark.usefixtures("set_plt_show")
216+
def test_filepath_dataset_size_negative(generate_local_dataset_once, images_per_class):
217+
"""
218+
219+
All images are the same size, so no image should have a size issue.
220+
221+
"""
222+
files = os.listdir(generate_local_dataset_once / "class_0")
223+
filepaths = [
224+
os.path.join(generate_local_dataset_once / "class_0", f) for f in files
225+
]
226+
imagelab = Imagelab(filepaths=filepaths)
227+
imagelab.find_issues()
228+
assert len(imagelab.issues.columns) == 18
229+
assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 0
230+
231+
232+
@pytest.mark.usefixtures("set_plt_show")
233+
def test_filepath_dataset_size_to_large(generate_local_dataset_once, images_per_class):
234+
"""
235+
The size issue is based on the square root of an image's area. If sqrt(width * height) is larger than the median
236+
sqrt(width * height) * threshold (default 10), is_odd_size_issue is set to True. In this example, the median value is sqrt(300 * 300) = 300.
237+
A 3001 x 3001 image has a value of 3001, so it is more than 10x larger than the median and should be flagged.
238+
"""
239+
arr = np.random.randint(low=0, high=256, size=(3001, 3001, 3), dtype=np.uint8)
240+
img = Image.fromarray(arr, mode="RGB")
241+
img.save(Path(generate_local_dataset_once / "class_0" / "larger.png"))
242+
243+
files = os.listdir(generate_local_dataset_once / "class_0")
244+
filepaths = [
245+
os.path.join(generate_local_dataset_once / "class_0", f) for f in files
246+
]
247+
imagelab = Imagelab(filepaths=filepaths)
248+
imagelab.find_issues()
249+
assert len(imagelab.issues.columns) == 18
250+
assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 1
251+
252+
253+
@pytest.mark.usefixtures("set_plt_show")
254+
def test_filepath_dataset_size_to_small(generate_local_dataset_once, images_per_class):
255+
"""
256+
The size issue is based on the square root of an image's area. If sqrt(width * height) is smaller than the median
256+
sqrt(width * height) / threshold (default 10), is_odd_size_issue is set to True. In this example, the median value is sqrt(300 * 300) = 300.
257+
A 29 x 29 image has a value of 29, so it is more than 10x smaller than the median and should be flagged.
259+
"""
260+
arr = np.random.randint(
261+
low=0,
262+
high=256,
263+
size=(29, 29, 3),
264+
dtype=np.uint8, # 29 x 29 pixel image should be detected
265+
)
266+
img = Image.fromarray(arr, mode="RGB")
267+
img.save(Path(generate_local_dataset_once / "class_0" / "smaller.png"))
268+
269+
files = os.listdir(generate_local_dataset_once / "class_0")
270+
filepaths = [
271+
os.path.join(generate_local_dataset_once / "class_0", f) for f in files
272+
]
273+
imagelab = Imagelab(filepaths=filepaths)
274+
imagelab.find_issues()
275+
assert len(imagelab.issues.columns) == 18
276+
assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 1
277+
278+
279+
@pytest.mark.usefixtures("set_plt_show")
280+
def test_filepath_dataset_size_custom_threshold(
281+
generate_local_dataset_once, images_per_class
282+
):
283+
"""
284+
With the default threshold, the small image would be flagged (see test_filepath_dataset_size_to_small). However,
285+
with a custom threshold of 11 instead of 10, the image is within the allowed range and should not be flagged.
286+
"""
287+
arr = np.random.randint(
288+
low=0,
289+
high=256,
290+
size=(29, 29, 3),
291+
dtype=np.uint8, # 29 x 29 pixel image should not be detected with threshold 11
292+
)
293+
img = Image.fromarray(arr, mode="RGB")
294+
img.save(Path(generate_local_dataset_once / "class_0" / "smaller.png"))
295+
296+
files = os.listdir(generate_local_dataset_once / "class_0")
297+
filepaths = [
298+
os.path.join(generate_local_dataset_once / "class_0", f) for f in files
299+
]
300+
imagelab = Imagelab(filepaths=filepaths)
301+
imagelab.find_issues({"odd_size": {"threshold": 11.0}})
302+
assert len(imagelab.issues.columns) == 2 # Only size
303+
assert len(imagelab.issues[imagelab.issues["is_odd_size_issue"]]) == 0
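
The arithmetic behind the three size tests above can be written out explicitly. The symbols $r_i$, $m$, $s_i$ and $t$ are notation introduced here for illustration only; the formulas restate what `get_scores` and `mark_issue` compute in this commit, assuming the generated dataset images are 300 x 300 as the test docstrings state.

```latex
\begin{align*}
r_i &= \sqrt{w_i h_i}, \qquad m = \operatorname{median}_j (r_j), \qquad
s_i = \frac{1}{\max\left(r_i / m,\; m / r_i\right)}, \\
\text{flagged}_i &\iff s_i < \frac{1}{t}
  \iff \max\left(\frac{r_i}{m},\, \frac{m}{r_i}\right) > t, \\[4pt]
m = 300,\ r = 3001 &: \quad \tfrac{3001}{300} \approx 10.003 > 10
  \quad\Rightarrow\quad \text{flagged at } t = 10, \\
m = 300,\ r = 29 &: \quad \tfrac{300}{29} \approx 10.34 > 10
  \quad\Rightarrow\quad \text{flagged at } t = 10, \\
m = 300,\ r = 29 &: \quad \tfrac{300}{29} \approx 10.34 < 11
  \quad\Rightarrow\quad \text{not flagged at } t = 11.
\end{align*}
```

Since the inequality is strict, a 30 x 30 or 3000 x 3000 image (exactly a factor of 10 from the median) would sit on the boundary and not be flagged at the default threshold.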
