Skip to content

Comments

Raise an exception on receiving duplicate filepaths#242

Merged
sanjanag merged 3 commits intocleanlab:mainfrom
sanjanag:catch-dup-files
Nov 22, 2023
Merged

Raise an exception on receiving duplicate filepaths#242
sanjanag merged 3 commits intocleanlab:mainfrom
sanjanag:catch-dup-files

Conversation

@sanjanag
Copy link
Contributor

Duplicate filepaths caused out of memory error when doing joins and merges in pandas dataframes.

@sanjanag sanjanag requested a review from elisno November 21, 2023 22:43
@codecov
Copy link

codecov bot commented Nov 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (f653207) 96.04% compared to head (7b43c65) 95.84%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #242      +/-   ##
==========================================
- Coverage   96.04%   95.84%   -0.20%     
==========================================
  Files          16       16              
  Lines         986      987       +1     
  Branches      194      194              
==========================================
- Hits          947      946       -1     
- Misses         20       21       +1     
- Partials       19       20       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@jwmueller jwmueller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, made a small suggestion you can consider, but feel free to merge.

Does this fix warrant releasing a new version of CleanVision? I'm not sure how you encountered the bug you faced that inspired this simple fix (why did you have duplicate filepaths in the first place before?)

@jwmueller
Copy link
Member

Fixes #236

sanjanag and others added 2 commits November 22, 2023 11:22
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
Co-authored-by: Jonas Mueller <1390638+jwmueller@users.noreply.github.com>
@sanjanag sanjanag linked an issue Nov 22, 2023 that may be closed by this pull request
@sanjanag
Copy link
Contributor Author

LGTM, made a small suggestion you can consider, but feel free to merge.

Does this fix warrant releasing a new version of CleanVision? I'm not sure how you encountered the bug you faced that inspired this simple fix (why did you have duplicate filepaths in the first place before?)

I would hold off on a new release, as this is an odd occurrence. Also this #222 is another bug I want to fix before doing it . I was working with Amazon Berkeley Objects dataset, and it has a separate file for product metadata and images separately. I wrote some code to map the two of them, and the duplicates filepaths got introduced in that (not because of some error in code, but how the metadata file is structured).

@sanjanag sanjanag merged commit d1334e0 into cleanlab:main Nov 22, 2023
@sanjanag sanjanag deleted the catch-dup-files branch November 22, 2023 19:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Catch duplicate entries in filepaths argument

3 participants