-
Notifications
You must be signed in to change notification settings - Fork 60
Implementation of local outlier factor #1825
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Implementation of local outlier factor #1825
Conversation
for more information, see https://pre-commit.ci
|
Thank you for the PR! |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1825 +/- ##
==========================================
- Coverage 91.73% 90.40% -1.33%
==========================================
Files 89 90 +1
Lines 13942 14848 +906
==========================================
+ Hits 12789 13423 +634
- Misses 1153 1425 +272
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Thank you for the PR! |
|
Thank you for the PR! |
|
Thank you for the PR! |
mrfh92
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comments so far are mostly on the docs
I now approve to let the CI matrix run
heat/spatial/distance.py
Outdated
| Define if the distances on each process are calculated iteratively. For example, if ``chunks=2``, the | ||
| each processes will first compute one half of the distance matrix and then the second half. | ||
| Returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, we only have Parameters, Attributes, Notes and References, but no Raises or Returns section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I have deleted them
| idx : DNDarray | ||
| The indices used for advanced indexing. | ||
| Returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually, we dont use Returns as section
(also at the other functions)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adapted this in all functions
| self.assertTrue(condition) | ||
|
|
||
| # test with memory-efficient implementation | ||
| lof = LocalOutlierFactor(n_neighbors=n_neighbors, fully_distributed=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could move the memory-efficient lof to a different test such that, in case it fails, its clear which configuration failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- lines 267-282 in _chunk_wise_topk (i.e. the whole "else-block" for chunks!=1) seem not to get tested.
- lines 266-277 in _map_idx_to_proc are not covered.
If these parts of the code are somehow core ideas of the changes (are they?), they should be covered to ensure that they work.
…Implementation_of_local_outlier_factor
…processes in cdist_small
|
Thank you for the PR! |
|
Thank you for the PR! |
|
Thank you for the PR! |
|
Thank you for the PR! |
|
Thank you for the PR! |
Description
Added an implementation of the local outlier factor (lof) used for outlier classification. The bottleneck to reduce memory consumption lies in the pairwise distance matrix. Memory consumption can be significantly reduced by taking only the n smallest elements in the distance matrix into account (only these are needed to compute the lof). Thus, a new distance matrix, called cdist_small, was established that combines the functionality of cdist and topk, but has the advantage that the smallest n distances are dynamically choosen during computation without evaluating the whole distance matrix cdist at once.
Issue/s resolved: #1758
Changes proposed:
Type of change
Does this change modify the behaviour of other functions? If so, which?
no