Added score for duplicate images #183

sanjanag · 2023-05-23T22:41:36Z

Added a score for the exact_duplicates and near_duplicates issue types.

score = 1 / num_images_in_duplicated_set
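For illustration, here is a minimal sketch of that formula in plain Python. The function name `duplicate_scores` and the use of a dict are hypothetical (the PR itself writes into a `score_df` DataFrame), and giving a score of 1.0 to images outside any duplicate set is an assumption, not something this description states.

```python
def duplicate_scores(image_paths, duplicate_sets):
    """Return {path: score}, where score = 1 / size of the image's duplicate set."""
    # Assumption: images that belong to no duplicate set keep a score of 1.0.
    scores = {path: 1.0 for path in image_paths}
    for s in duplicate_sets:
        score = 1.0 / len(s)  # larger duplicate sets get lower scores
        for path in s:
            scores[path] = score
    return scores


paths = ["a.png", "b.png", "c.png", "d.png", "e.png", "f.png"]
sets = [["a.png", "b.png"], ["c.png", "d.png", "e.png"]]
scores = duplicate_scores(paths, sets)
```

With this input, the pair scores 1/2 each, the triple scores 1/3 each, and the remaining image keeps 1.0.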

codecov · 2023-05-23T22:57:09Z

Codecov Report

Merging #183 (237c137) into main (de9ae33) will decrease coverage by 0.08%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##             main     #183      +/-   ##
==========================================
- Coverage   94.26%   94.18%   -0.08%     
==========================================
  Files          16       16              
  Lines         889      895       +6     
  Branches      164      164              
==========================================
+ Hits          838      843       +5     
- Misses         30       32       +2     
+ Partials       21       20       -1

Impacted Files	Coverage Δ
src/cleanvision/imagelab.py	`90.05% <66.66%> (+0.05%)`	⬆️
...anvision/issue_managers/duplicate_issue_manager.py	`97.50% <100.00%> (+1.84%)`	⬆️

... and 1 file with indirect coverage changes

jwmueller · 2023-05-23T23:20:02Z

src/cleanvision/issue_managers/duplicate_issue_manager.py

-                lambda x: True if x in duplicated_images else False
-            )
+                score = 1.0 / len(s)
+                score_df.loc[s, score_col] = score


Can you cover this line in a unit test, if easy?

Covered in unit test

jwmueller · 2023-05-23T23:21:34Z

tests/test_run.py

@@ -171,7 +171,7 @@ def test_hf_dataset_run(generate_local_dataset, n_classes, images_per_class):
    imagelab = Imagelab(hf_dataset=hf_dataset, image_key="image")
    imagelab.find_issues()
    imagelab.report()
-    assert len(imagelab.issues.columns) == 14
+    assert len(imagelab.issues.columns) == 16
    assert len(imagelab.issues) == n_classes * images_per_class


What is the most duplicated image in our testing dataset? Couldn't we easily add a test to verify that this image has the lowest score (tied with the other images in its duplicate set)? The current tests don't seem to test any of the logic at all.

Covered in unit test
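A rough sketch of the kind of check requested here, written against toy data rather than the project's fixtures. The file names, the helper `lowest_scoring_images`, and the test name are all hypothetical, and this is not the test that was actually added in the PR.

```python
def lowest_scoring_images(scores):
    """Return the sorted list of images that share the minimum score."""
    min_score = min(scores.values())
    return sorted(path for path, value in scores.items() if value == min_score)


def test_most_duplicated_set_has_lowest_score():
    # Toy data: one pair, one triple, and one image with no duplicates.
    duplicate_sets = [["a.png", "b.png"], ["c.png", "d.png", "e.png"]]
    scores = {path: 1.0 / len(s) for s in duplicate_sets for path in s}
    scores["f.png"] = 1.0  # assumption: non-duplicates score 1.0

    largest = max(duplicate_sets, key=len)
    # Every image in the largest set ties for the lowest score...
    assert lowest_scoring_images(scores) == sorted(largest)
    # ...and all members of that set share a single score value.
    assert len({scores[path] for path in largest}) == 1
```

The same assertions could be pointed at the score column of `imagelab.issues` once a dataset with known duplicate sets is available.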

src/cleanvision/issue_managers/duplicate_issue_manager.py

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

elisno

LGTM!

Cool that you're sorting the duplicate issues by the duplicate counts when visualizing.
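As a sketch of what sorting by duplicate count could look like, assuming the duplicate sets are available as lists and that the largest sets are shown first (the PR page does not state the direction). The function name `sort_sets_by_count` is hypothetical.

```python
def sort_sets_by_count(duplicate_sets):
    """Order duplicate sets so the ones with the most images come first."""
    # sorted() is stable, so sets of equal size keep their original order.
    return sorted(duplicate_sets, key=len, reverse=True)


ordered = sort_sets_by_count(
    [["a.png", "b.png"], ["c.png", "d.png", "e.png"], ["g.png", "h.png"]]
)
```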

sanjanag self-assigned this May 23, 2023

sanjanag requested a review from jwmueller May 23, 2023 22:43

sanjanag marked this pull request as ready for review May 23, 2023 22:43

jwmueller reviewed May 23, 2023

src/cleanvision/issue_managers/duplicate_issue_manager.py

jwmueller reviewed May 24, 2023

src/cleanvision/issue_managers/duplicate_issue_manager.py

sanjanag force-pushed the duplicate-score branch from 99b93db to 3b869f7 Compare May 24, 2023 16:05

sanjanag and others added 6 commits May 24, 2023 11:51

Added score for duplicate images

cc93ee4

Fixed tests

0f178d4

Added tests for duplicate score

492a356

Update src/cleanvision/issue_managers/duplicate_issue_manager.py

fbb4a9f

Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>

Black formatting

5775006

Fixed tests

237c137

sanjanag force-pushed the duplicate-score branch from 33b3686 to 237c137 Compare May 24, 2023 18:52

This was linked to issues May 24, 2023

Assign scores to near duplicates and compare different hash outputs #63

Closed

Write tests to check score for exact_duplicates and near_duplicates issues #184

Closed

sanjanag requested a review from elisno May 24, 2023 19:28

elisno approved these changes May 24, 2023

sanjanag merged commit 010284e into cleanlab:main May 24, 2023

sanjanag deleted the duplicate-score branch May 24, 2023 20:17
