Added score for duplicate images #183

Merged
merged 6 commits into from
May 24, 2023

Conversation

sanjanag
Contributor

@sanjanag sanjanag commented May 23, 2023

Added a score for the exact_duplicates and near_duplicates issue types.

score = 1 / num_images_in_duplicated_set
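As a rough illustration of the formula above, here is a minimal sketch in plain Python. The helper name `duplicate_scores`, the file names, and the default score of 1.0 for images outside any duplicate set are assumptions made for this example; the PR itself writes the scores into a DataFrame column.

```python
from typing import Dict, Hashable, Iterable, List


def duplicate_scores(
    all_images: Iterable[Hashable],
    duplicate_sets: List[List[Hashable]],
) -> Dict[Hashable, float]:
    """Hypothetical helper: map each image to a duplicate score in (0, 1]."""
    # Assumption: images outside every duplicate set keep a score of 1.0.
    scores = {image: 1.0 for image in all_images}
    for dup_set in duplicate_sets:
        # Every member of a set shares the same score: 1 / set size.
        score = 1.0 / len(dup_set)
        for image in dup_set:
            scores[image] = score
    return scores


# Made-up example: one pair, one triple, and one image with no duplicates.
scores = duplicate_scores(
    ["a.png", "b.png", "c.png", "d.png", "e.png", "f.png"],
    [["a.png", "b.png"], ["c.png", "d.png", "e.png"]],
)
```

With this definition, larger duplicate sets get lower scores, so the most duplicated images sort first when ordering by ascending score.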

@sanjanag sanjanag self-assigned this May 23, 2023
@sanjanag sanjanag requested a review from jwmueller May 23, 2023 22:43
@sanjanag sanjanag marked this pull request as ready for review May 23, 2023 22:43
@codecov

codecov bot commented May 23, 2023

Codecov Report

Merging #183 (237c137) into main (de9ae33) will decrease coverage by 0.08%.
The diff coverage is 91.66%.

@@            Coverage Diff             @@
##             main     #183      +/-   ##
==========================================
- Coverage   94.26%   94.18%   -0.08%     
==========================================
  Files          16       16              
  Lines         889      895       +6     
  Branches      164      164              
==========================================
+ Hits          838      843       +5     
- Misses         30       32       +2     
+ Partials       21       20       -1     
Impacted Files Coverage Δ
src/cleanvision/imagelab.py 90.05% <66.66%> (+0.05%) ⬆️
...anvision/issue_managers/duplicate_issue_manager.py 97.50% <100.00%> (+1.84%) ⬆️

... and 1 file with indirect coverage changes

lambda x: True if x in duplicated_images else False
)
score = 1.0 / len(s)
score_df.loc[s, score_col] = score
Member

Can you cover this line in a unit test, if it's easy?

Contributor Author

Covered in a unit test.

@@ -171,7 +171,7 @@ def test_hf_dataset_run(generate_local_dataset, n_classes, images_per_class):
     imagelab = Imagelab(hf_dataset=hf_dataset, image_key="image")
     imagelab.find_issues()
     imagelab.report()
-    assert len(imagelab.issues.columns) == 14
+    assert len(imagelab.issues.columns) == 16
     assert len(imagelab.issues) == n_classes * images_per_class
Member

What is the most duplicated image in our testing dataset? Couldn't we easily add a test to verify that this image has the lowest score (tied with the other images in its duplicate set)? The current tests don't seem to test any of the scoring logic at all.
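A test along these lines might look like the sketch below. It is not the test that was added in the PR: `score_duplicate_sets` is a hypothetical stand-in for the scoring logic, and the fixture of one pair and one triple is made up for illustration.

```python
def score_duplicate_sets(duplicate_sets):
    """Hypothetical stand-in for the scoring logic: 1 / size of each set."""
    return {
        image: 1.0 / len(dup_set)
        for dup_set in duplicate_sets
        for image in dup_set
    }


def test_most_duplicated_images_score_lowest():
    # Made-up fixture: one pair and one triple of duplicates.
    duplicate_sets = [["a.png", "b.png"], ["c.png", "d.png", "e.png"]]
    scores = score_duplicate_sets(duplicate_sets)

    largest_set = max(duplicate_sets, key=len)
    lowest = min(scores.values())

    # Every image in the largest set is tied at the lowest score.
    assert all(scores[image] == lowest for image in largest_set)
    # No image outside the largest set reaches that score.
    others = set(scores) - set(largest_set)
    assert all(scores[image] > lowest for image in others)


test_most_duplicated_images_score_lowest()
```

Checking both the tie inside the largest set and the strict inequality outside it exercises the formula itself, rather than only the number of columns in the issues table.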

Contributor Author

Covered in a unit test.

Member

@elisno elisno left a comment

LGTM!

Cool that you're sorting the duplicate issues by the duplicate counts when visualizing.
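The sorting code is not shown in this conversation, so the following is only a sketch of one way to order duplicate sets by their counts. The helper name `sort_sets_by_size` and the sample data are assumptions.

```python
def sort_sets_by_size(duplicate_sets):
    """Hypothetical helper: return duplicate sets, largest first."""
    # sorted() is stable, so equally sized sets keep their original order.
    return sorted(duplicate_sets, key=len, reverse=True)


ordered = sort_sets_by_size([["a", "b"], ["c", "d", "e"], ["f", "g"]])
```

Because the score is 1 / set size, sorting sets by descending size gives the same order as sorting them by ascending score.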

@sanjanag sanjanag merged commit 010284e into cleanlab:main May 24, 2023
@sanjanag sanjanag deleted the duplicate-score branch May 24, 2023 20:17
Development

Successfully merging this pull request may close these issues.

Write tests to check score for exact_duplicates and near_duplicates issues
Assign scores to near duplicates and compare different hash outputs
3 participants