
[REVIEW]: rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams #1336


whedon commented Mar 19, 2019

Submitting author: @mdbartos (Matthew Bartos)
Repository: https://github.com/kLabUM/rrcf
Version: 0.3.1
Editor: @VivianePons
Reviewer: @vc1492a, @JustinShenk
Archive: 10.5281/zenodo.2613881

Status


Status badge code:

HTML: <a href="http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d"><img src="http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d/status.svg)](http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@vc1492a & @JustinShenk, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.theoj.org/about#reviewer_guidelines. If you have any questions or concerns, please let @VivianePons know.

Please try to complete your review in the next two weeks.

Review checklist for @vc1492a

Conflict of interest

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Version: 0.3.1
  • Authorship: Has the submitting author (@mdbartos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) contribute to the software, 2) report issues or problems with the software, and 3) seek support?

Software paper

  • Authors: Does the paper.md file include a list of authors with their affiliations?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @JustinShenk

Conflict of interest

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Version: 0.3.1
  • Authorship: Has the submitting author (@mdbartos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) contribute to the software, 2) report issues or problems with the software, and 3) seek support?

Software paper

  • Authors: Does the paper.md file include a list of authors with their affiliations?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

whedon commented Mar 19, 2019

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @vc1492a, it looks like you're currently assigned as the reviewer for this paper 🎉.

⭐ Important ⭐

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that under GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands


whedon commented Mar 19, 2019

Attempting PDF compilation. Reticulating splines etc...



vc1492a commented Mar 22, 2019

@VivianePons My review of this software is provided below.

Overview: Overall, I think the authors did a great job implementing the Robust Random Cut Forest algorithm as detailed in the original paper. Below are my comments on particular sections. When the authors strictly meet the requirements laid out in the review checklist, I have marked these items as complete. For those where the requirement is not strictly met, I have left the checklist entry empty. Either way, I have provided some comments below with actions that should be taken (in the case of unchecked entries) and actions that could be taken to improve the work further but that aren't strictly required for submission to JOSS (in the case of checked entries).

I think I will give this implementation a try in a few of my projects! This project is a great implementation of the original paper and could be further improved with a little additional elbow grease. Thoroughly enjoyed reviewing this work!

License: while a license is provided in the project repository (MIT), it may be beneficial to indicate the license more publicly in the readme on GitHub and/or in the documentation. A badge could be used, or a section added to the readme or documentation.

Functionality: I read the original paper in preparation for this review and believe the implementation here in rrcf is functionally correct. The software author is correctly accounting for duplicate values and using proper randomization when making cuts to the trees in the forest. The provided examples and unit tests help to verify the functionality of this software package.
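
To make the functional check concrete, here is a minimal sketch of the kind of sanity test I ran, assuming the RCTree/insert_point/codisp API described in the repository readme (illustrative only, not the project's canonical example):

    import numpy as np
    import rrcf

    # Small forest over 2-D Gaussian data with one planted outlier
    np.random.seed(0)
    points = np.random.randn(100, 2)
    points[-1] = [8.0, 8.0]  # obvious anomaly

    forest = [rrcf.RCTree() for _ in range(40)]
    for tree in forest:
        for i, point in enumerate(points):
            tree.insert_point(point, index=i)

    # Average collusive displacement (CoDisp) across trees; higher means more anomalous
    avg_codisp = {i: np.mean([tree.codisp(i) for tree in forest])
                  for i in range(len(points))}
    print(max(avg_codisp, key=avg_codisp.get))  # expect 99, the planted outlier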

However, when reviewing the source code I noticed a few TODO comments in rrcf.py that would be better placed in the dev branch; it's better practice to keep TODO comments off the master branch. Additionally, there are several areas in rrcf.py where any caught exception is re-raised as a KeyError. This may be misleading, as exceptions that are not in fact due to a KeyError will be presented as KeyErrors to users of rrcf.py. I noticed a similar pattern with some ValueErrors. I have opened a pull request that addresses this and a few other minor items, but I will leave it up to the authors to address, as it is not an issue with the algorithm implementation but rather a software-design decision that should be thought through from a user perspective.
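
To illustrate the error-handling concern, here is a generic sketch (not the actual rrcf.py code):

    # Misleading: any failure inside the try block surfaces as a KeyError,
    # hiding the true cause from the user
    try:
        leaf = tree.leaves[index]
    except Exception:
        raise KeyError('Index not found in leaves')

    # Clearer: translate only the exception you actually expect, and chain it
    # so the original traceback is preserved
    try:
        leaf = tree.leaves[index]
    except KeyError as e:
        raise KeyError('Index {} is not a key of tree.leaves'.format(index)) from e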

Version Matching: The version number submitted to JOSS matches the latest version of the source code in master on GitHub. It may be beneficial to tag the releases that are uploaded to PyPI on GitHub, so that users can download the software directly from GitHub via a release rather than having to clone or pull the repository.

Automated Tests: There are automated unit tests, which I ran using pytest and pytest-cov without issues. It looks like code coverage is currently at 70%. I did not see any explicit instructions on how to run the tests. While not strictly necessary, it may be beneficial to include such instructions for those interested in contributing (perhaps in the contributor guidelines). Additionally, if you're looking to have others contribute more easily in the future, consider setting up a continuous integration pipeline that automatically runs the tests and reports the results (e.g. using Travis CI) and coverage (e.g. using Coveralls) on any push to master or dev and on any pull request opened by open-source contributors. This pipeline would ease the burden of making sure new additions and changes work as intended, and would also allow maintainers to formally test against every version of Python 3 that should be supported.
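
For contributors, tests in this style are easy to add and run (a hypothetical example, assuming the RCTree, insert_point and forget_point API; the project's actual tests live in its test suite):

    import numpy as np
    import rrcf

    def test_insert_then_forget_restores_leaf_count():
        # Inserting a point and then forgetting it should restore the tree's size
        tree = rrcf.RCTree(np.random.randn(32, 2))
        n_leaves = len(tree.leaves)
        tree.insert_point(np.zeros(2), index=32)
        assert len(tree.leaves) == n_leaves + 1
        tree.forget_point(32)
        assert len(tree.leaves) == n_leaves

Running pytest --cov=rrcf then reports test results and coverage in a single pass.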

Example Usage: This software does a good job of providing examples for use, including implementations of the examples from the original paper. The notebooks are also helpful; however, I ran into a few issues that I believe should be addressed if the notebooks are to remain in the repository. First, some datasets used in the notebooks are either 1) not available (nuclear.mat) or 2) referenced in the documentation but not in the GitHub readme. Providing these datasets, with download links in the readme, would be beneficial. Second, dependency versions should be addressed: I was not able to run some of the notebook code with earlier versions of Matplotlib and Pandas, but succeeded after updating them. It would help new users of the software to have a list of required package versions for running the notebook examples. Lastly, the documentation on the website references the file ../resources/nyc_taxi.csv. It may be clearer to simply point to nyc_taxi.csv and mention that this file should sit in the directory in which the rrcf code is being executed.
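
For reference, the streaming pattern the notebooks demonstrate looks roughly like this (a sketch that assumes nyc_taxi.csv sits in the working directory with a 'value' column, as in the NAB dataset layout; both assumptions are mine):

    import numpy as np
    import pandas as pd
    import rrcf

    taxi = pd.read_csv('nyc_taxi.csv', index_col=0)
    stream = taxi['value'].values

    num_trees, tree_size = 40, 256
    forest = [rrcf.RCTree() for _ in range(num_trees)]
    avg_codisp = np.zeros(len(stream))

    for index, point in enumerate(stream):
        for tree in forest:
            # Sliding window: evict the oldest point once the tree is full
            if len(tree.leaves) > tree_size:
                tree.forget_point(index - tree_size)
            tree.insert_point(point, index=index)
            avg_codisp[index] += tree.codisp(index) / num_trees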

Installation Instructions:
It is not clear from the GitHub readme or the provided documentation that numpy is the sole software dependency; it may be beneficial to point this out somewhere. That said, install_requires in setup.py is being used properly to ensure numpy is installed if it is not already present in the user's environment. Lastly, the authors should consider moving the installation instructions towards the top of the documentation and the readme, since readers typically start at the top of a document and work their way down through the examples.
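
For readers unfamiliar with the mechanism, the relevant setuptools idiom looks roughly like this (a sketch, not a verbatim copy of the project's setup.py):

    from setuptools import setup, find_packages

    setup(
        name='rrcf',
        version='0.3.1',
        packages=find_packages(),
        # numpy is the sole hard dependency; pip installs it automatically
        # when it is absent from the user's environment
        install_requires=['numpy'],
    )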

Community Guidelines:
Minimal documentation for contributing to the repository exists, indicating that contributors should submit pull requests to the dev branch. The developers could improve this by adopting a tagging scheme such as the one used in scikit-learn, where pull requests that are a work in progress are tagged [WIP], indicating that someone is working on a fix or new feature and that the work shouldn't be duplicated by someone else. When ready for merging, a two-step review process is initiated with the tags [MRG+1] and [MRG+2]. See the section on contributing pull requests in the scikit-learn documentation for more information.

References:
Every reference mentioned in the paper is documented with a BibTeX entry. However, no DOI is listed for A comparative study of RNN for outlier detection in data mining or Support Vector Data Description, which have DOIs 10.1109/ICDM.2002.1184035 and 10.1023/B:MACH.0000008084.60811.49, respectively. There is no listed DOI for Robust random cut forest based anomaly detection on streams, but I did not locate a matching DOI online for that article either. I have included these updates as part of my previously mentioned pull request.

@VivianePons

Thank you very much @vc1492a for this thorough review. @mdbartos, I'll let you work on the comments.

@JustinShenk

@VivianePons My review of this software is provided below.

Overview: I think the authors did a great job implementing the Robust Random Cut Forest algorithm as a Python package. I am also looking forward to using it in a few of my projects. Following @vc1492a's thorough review, I have little to add, so my review will instead emphasize the points already raised that I think are most relevant to the authors of this project.

The code is well commented and described in sufficient detail, which supports future development.

License: +1 to adding a badge to the heading.

Functionality: Nothing to add.

Version Matching: +1 to tagging releases uploaded to GitHub.

Automated Tests: Nothing to add.

Example Usage: +1 to cleaning up the notebooks if they are to remain.

Installation Instructions: +1 to putting the installation instructions towards the top of the document.

Community Guidelines: Nothing to add.

References: Nothing to add.

@mdbartos

Hi @vc1492a and @JustinShenk, thanks for the excellent reviews! I will get to work on these issues as soon as possible.

@mdbartos

@whedon set 0.3 as version


whedon commented Mar 26, 2019

I'm sorry @mdbartos, I'm afraid I can't do that. That's something only editors are allowed to do.

@mdbartos

Greetings,

@vc1492a and @JustinShenk: Thank you again for the extremely relevant and constructive feedback. It would not be an exaggeration to say that this is one of the most helpful reviews I have received on an academic paper :)

A point-by-point summary of changes is listed below:

License

  • We have added the MIT license badge to the README header.

Functionality

  • @vc1492a's pull request has been merged into the master branch. This merge fixed the error-handling behavior. TODO comments have also been removed.

Version matching

  • Historical releases have been tagged.
  • A post-review release (0.3) has been tagged and added to PyPI.

Automated tests

  • Instructions for running unit tests have been added to the README under contributing.
  • Additional tests have been added, raising the overall coverage to 84%.
  • Tests have been added to verify proper handling of duplicates.
  • A coveralls badge has been added to the README to indicate coverage.
  • Travis CI and Coveralls builds are triggered for pull requests and pushes to master. I thought that pull requests to dev were also checked, but I will verify.

Example usage

  • To ease maintainability, notebooks have been folded into the documentation website, and redundant examples have been removed.
  • Suggested versions of dependencies needed to run the examples are now included in the README.
  • Datasets used in the examples have been added to the /data folder in the main repo, and instructions for using these datasets have been clarified.
  • We added a caveats section to the documentation about the scaling of dimensions (see the sketch below). This addition was posted in response to some questions I received about the repo in the past week from another user.
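
The caveat comes down to the fact that random cuts are chosen in proportion to each dimension's range, so a feature on a much larger scale attracts most of the cuts. A minimal illustration (my sketch, not taken verbatim from the docs):

    import numpy as np

    # Columns on wildly different scales: the second column would dominate the cuts
    X = np.random.randn(1000, 3) * np.array([1.0, 100.0, 0.01])

    # Standardize each dimension before building trees
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)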

Installation instructions

  • We now list numpy as the sole required dependency in the README.
  • In the README, we now also list optional dependencies (and versions) needed to run the examples in the documentation.
  • We have moved the installation instructions to the top of the README.

Community guidelines

  • We have expanded the community guidelines by adding suggested types of contributions, guidelines for maintaining code quality and instructions for running unit tests.
  • Given that the repo is fairly small and limited in scope (compared to scikit-learn), I have not included a tagging system for pull requests as of yet. However, I am open to implementing one in the future.

References

  • We have merged @vc1492a's pull request, and the requested DOIs are now included. Thanks for the help!

Please let me know if you have any other suggestions. Your time, effort and expertise are very much appreciated.

Thanks again!
MDB

@mdbartos

@VivianePons : Would it be possible to change the version to 0.3? I tried to do this using whedon but was denied.

Also, would it be possible to change the description on JOSS to:

  • rrcf is a Python implementation of the Robust Random Cut Forest algorithm: an unsupervised ensemble method for anomaly detection on streaming data

Thank you!
MDB


vc1492a commented Mar 26, 2019

@mdbartos thanks for the reply and for integrating my pull request! @VivianePons I have reviewed the changes @mdbartos put in place; they look great, and I've updated the review checklist to reflect the latest status. Please let me know if there's anything else I can take care of as part of the review process. Thanks!

@VivianePons

@whedon set 0.3 as version


whedon commented Mar 26, 2019

OK. 0.3 is the version.

@VivianePons

@whedon generate pdf


whedon commented Mar 26, 2019

Attempting PDF compilation. Reticulating splines etc...


@VivianePons

@mdbartos I have updated the version, but I'm not sure which description you're referring to.

@VivianePons

@JustinShenk Have you looked at the most recent changes? Do you give your green light for accepting the paper?

@mdbartos

Hi @VivianePons, I was just referring to the description on the JOSS website: https://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d

I wanted to add "rrcf is a ..." to the beginning of the description to make the description more in line with the style of other submissions. It's not a big issue though.

@VivianePons

@arfon do you know where this description is coming from: https://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d ?


arfon commented Mar 26, 2019

@arfon do you know where this description is coming from: https://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d ?

Yes, it's what the author included in the form when they originally submitted. Once accepted, this text is never public so we can mostly ignore it.

@mdbartos

@whedon generate pdf


whedon commented Mar 26, 2019

Attempting PDF compilation. Reticulating splines etc...


@mdbartos

ping @JustinShenk

@JustinShenk

@JustinShenk Have you looked at the most recent changes? Do you give your green light for accepting the paper?

@VivianePons Yes, I've reviewed the recent changes from @mdbartos and updated the checklist. Looks good to me!

@VivianePons

@whedon check references


whedon commented Mar 28, 2019

Attempting to check references...


whedon commented Mar 28, 2019


OK DOIs

- 10.1109/cifer.1997.618940 is OK
- 10.1145/335191.335388 is OK
- 10.1137/1.9781611972733.3 is OK
- 10.1145/2133360.2133363 is OK
- 10.1109/TII.2010.2060732 is OK
- 10.1145/997150.997156 is OK
- 10.1109/ICDM.2002.1184035 is OK
- 10.1023/B:MACH.0000008084.60811.49 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@VivianePons

This paper is good to go! :)

@mdbartos could you create an online archive (for example on Zenodo) and give the reference?

@mdbartos

Hi @VivianePons,

I created an archive with Zenodo using their Github integration service:
https://zenodo.org/record/2613881

Unfortunately, to get the integration to work I had to update the release to 0.3.1.

Would it be possible to update the version to 0.3.1 for consistency?

Thanks again,
MDB

@VivianePons

@whedon set 0.3.1 as version


whedon commented Mar 29, 2019

OK. 0.3.1 is the version.

@VivianePons

Can you update the Zenodo metadata so that the title matches the paper title?

Your name and Sara Troutman's name also differ slightly between the two (a missing middle initial), and the author order is different.

Thanks

Viviane

@mdbartos

Ok, it is done.

Thanks,
MDB

@VivianePons

@whedon set 10.5281/zenodo.2613881 as archive


whedon commented Mar 29, 2019

OK. 10.5281/zenodo.2613881 is the archive.

@VivianePons

This paper is now good to go, @openjournals/joss-eics! Thank you @mdbartos for the paper and @JustinShenk and @vc1492a for the reviews! :) :)

@danielskatz

And thanks @VivianePons for editing

@danielskatz

@whedon accept


whedon commented Mar 29, 2019

Attempting dry run of processing paper acceptance...


whedon commented Mar 29, 2019


OK DOIs

- 10.1109/cifer.1997.618940 is OK
- 10.1145/335191.335388 is OK
- 10.1137/1.9781611972733.3 is OK
- 10.1145/2133360.2133363 is OK
- 10.1109/TII.2010.2060732 is OK
- 10.1145/997150.997156 is OK
- 10.1109/ICDM.2002.1184035 is OK
- 10.1023/B:MACH.0000008084.60811.49 is OK

MISSING DOIs

- None

INVALID DOIs

- None


whedon commented Mar 29, 2019

Check final proof 👉 openjournals/joss-papers#587

If the paper PDF and Crossref deposit XML look good in openjournals/joss-papers#587, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

@danielskatz

@whedon accept deposit=true


whedon commented Mar 29, 2019

Doing it live! Attempting automated processing of paper acceptance...


whedon commented Mar 29, 2019

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited 👉 Creating pull request for 10.21105.joss.01336 joss-papers#588
  2. Wait a couple of minutes to verify that the paper DOI resolves https://doi.org/10.21105/joss.01336
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...


whedon commented Mar 29, 2019

🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](http://joss.theoj.org/papers/10.21105/joss.01336/status.svg)](https://doi.org/10.21105/joss.01336)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.01336">
  <img src="http://joss.theoj.org/papers/10.21105/joss.01336/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: http://joss.theoj.org/papers/10.21105/joss.01336/status.svg
   :target: https://doi.org/10.21105/joss.01336

This is how it will look in your documentation:

[DOI status badge]

We need your help!

Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us, please consider doing either one (or both) of the following:
