
[REVIEW]: rrcf: Implementation of the Robust Random Cut Forest algorithm for anomaly detection on streams #1336


whedon commented Mar 19, 2019

Submitting author: @mdbartos (Matthew Bartos)
Repository: https://github.com/kLabUM/rrcf
Version: 0.3.1
Editor: @VivianePons
Reviewer: @vc1492a, @JustinShenk
Archive: 10.5281/zenodo.2613881

Status


Status badge code:

HTML: <a href="http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d"><img src="http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d/status.svg"></a>
Markdown: [![status](http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d/status.svg)](http://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@vc1492a & @JustinShenk, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.theoj.org/about#reviewer_guidelines. If you have any questions or concerns, please let @VivianePons know.

Please try to complete your review in the next two weeks.

Review checklist for @vc1492a

Conflict of interest

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Version: 0.3.1
  • Authorship: Has the submitting author (@mdbartos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) contribute to the software, 2) report issues or problems with the software, and 3) seek support?

Software paper

  • Authors: Does the paper.md file include a list of authors with their affiliations?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

Review checklist for @JustinShenk

Conflict of interest

Code of Conduct

General checks

  • Repository: Is the source code for this software available at the repository url?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI approved software license?
  • Version: 0.3.1
  • Authorship: Has the submitting author (@mdbartos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) contribute to the software, 2) report issues or problems with the software, and 3) seek support?

Software paper

  • Authors: Does the paper.md file include a list of authors with their affiliations?
  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • References: Do all archival references that should have a DOI list one (e.g., papers, datasets, software)?

whedon commented Mar 19, 2019

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @vc1492a, it looks like you're currently assigned as the reviewer for this paper 🎉.

⭐ Important ⭐

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that under GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands


whedon commented Mar 19, 2019

Attempting PDF compilation. Reticulating splines etc...



vc1492a commented Mar 22, 2019

@VivianePons My review of this software is provided below.

Overview: Overall, I think the authors did a great job implementing the Robust Random Cut Forest algorithm as detailed in the original paper. Below are my comments on particular sections. When the authors strictly meet the requirements laid out in the review checklist, I have marked these items as complete. For those where the requirement is not strictly met, I have left the checklist entry empty. Either way, I have provided some comments below with actions that should be taken (in the case of unchecked entries) and actions that could be taken to improve the work further but that aren't strictly required for submission to JOSS (in the case of checked entries).

I think I will give this implementation a try in a few of my projects! This project is a great implementation of the original paper and could be further improved with a little additional elbow grease. Thoroughly enjoyed reviewing this work!

License: while a license is provided in the project repository (MIT), it may be beneficial to indicate the license more publicly in the readme on GitHub and/or in the documentation. A badge could be used, or a section added to the readme or documentation.

Functionality: I read the original paper in preparation for this review and believe the implementation here in rrcf is functionally correct. The software author is correctly accounting for duplicate values and using proper randomization when making cuts to the trees in the forest. The provided examples and unit tests help to verify the functionality of this software package.
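
To make the functional check concrete, here is a minimal sketch of the kind of sanity test I ran, assuming the RCTree/insert_point/codisp API described in the repository readme (illustrative only, not the project's canonical example):

    import numpy as np
    import rrcf

    # Small forest over 2-D Gaussian data with one planted outlier
    np.random.seed(0)
    points = np.random.randn(100, 2)
    points[-1] = [8.0, 8.0]  # obvious anomaly

    forest = [rrcf.RCTree() for _ in range(40)]
    for tree in forest:
        for i, point in enumerate(points):
            tree.insert_point(point, index=i)

    # Average collusive displacement (CoDisp) across trees; higher means more anomalous
    avg_codisp = {i: np.mean([tree.codisp(i) for tree in forest])
                  for i in range(len(points))}
    print(max(avg_codisp, key=avg_codisp.get))  # expect 99, the planted outlier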

However, when reviewing the source code I noticed a few TODO comments in rrcf.py that would be better placed in the dev branch; it's better practice to keep TODO comments off the master branch. Additionally, there are several areas in rrcf.py where any caught exception is re-raised as a KeyError. This may be misleading, as exceptions that are not in fact due to a KeyError will be presented as KeyErrors to users of rrcf.py. I noticed a similar pattern with some ValueErrors. I have opened a pull request that addresses this and a few other minor items, but I will leave it up to the authors to address, as it is not an issue with the algorithm implementation but rather a software-design decision that should be thought through from a user perspective.
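
To illustrate the error-handling concern, here is a generic sketch (not the actual rrcf.py code):

    # Misleading: any failure inside the try block surfaces as a KeyError,
    # hiding the true cause from the user
    try:
        leaf = tree.leaves[index]
    except Exception:
        raise KeyError('Index not found in leaves')

    # Clearer: translate only the exception you actually expect, and chain it
    # so the original traceback is preserved
    try:
        leaf = tree.leaves[index]
    except KeyError as e:
        raise KeyError('Index {} is not a key of tree.leaves'.format(index)) from e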

Version Matching: The version number submitted to JOSS matches the latest version of the source code in master on GitHub. It may be beneficial to tag the releases that are uploaded to PyPI on GitHub, so that users can download the software directly from GitHub via a release rather than having to clone or pull the repository.

Automated Tests: There are automated unit tests, which I ran using pytest and pytest-cov without issues. It looks like code coverage is currently at 70%. I did not see any explicit instructions on how to run the tests. While not strictly necessary, it may be beneficial to include such instructions for those interested in contributing (perhaps in the contributor guidelines). Additionally, if you're looking to have others contribute more easily in the future, consider setting up a continuous integration pipeline that automatically runs the tests and reports the results (e.g. using Travis CI) and coverage (e.g. using Coveralls) on any push to master or dev and on any pull request opened by open-source contributors. This pipeline would ease the burden of making sure new additions and changes work as intended, and would also allow maintainers to formally test against every version of Python 3 that should be supported.
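
For contributors, tests in this style are easy to add and run (a hypothetical example, assuming the RCTree, insert_point and forget_point API; the project's actual tests live in its test suite):

    import numpy as np
    import rrcf

    def test_insert_then_forget_restores_leaf_count():
        # Inserting a point and then forgetting it should restore the tree's size
        tree = rrcf.RCTree(np.random.randn(32, 2))
        n_leaves = len(tree.leaves)
        tree.insert_point(np.zeros(2), index=32)
        assert len(tree.leaves) == n_leaves + 1
        tree.forget_point(32)
        assert len(tree.leaves) == n_leaves

Running pytest --cov=rrcf then reports test results and coverage in a single pass.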

Example Usage: This software does a good job of providing examples for use, including implementations of the examples from the original paper. The notebooks are also helpful; however, I ran into a few issues that I believe should be addressed if the notebooks are to remain in the repository. First, some datasets used in the notebooks are either 1) not available (nuclear.mat) or 2) referenced in the documentation but not in the GitHub readme. Providing these datasets, with download links in the readme, would be beneficial. Second, dependency versions should be addressed: I was not able to run some of the notebook code with earlier versions of Matplotlib and Pandas, but succeeded after updating them. It would help new users of the software to have a list of required package versions for running the notebook examples. Lastly, the documentation on the website references the file ../resources/nyc_taxi.csv. It may be clearer to simply point to nyc_taxi.csv and mention that this file should sit in the directory in which the rrcf code is being executed.
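
For reference, the streaming pattern the notebooks demonstrate looks roughly like this (a sketch that assumes nyc_taxi.csv sits in the working directory with a 'value' column, as in the NAB dataset layout; both assumptions are mine):

    import numpy as np
    import pandas as pd
    import rrcf

    taxi = pd.read_csv('nyc_taxi.csv', index_col=0)
    stream = taxi['value'].values

    num_trees, tree_size = 40, 256
    forest = [rrcf.RCTree() for _ in range(num_trees)]
    avg_codisp = np.zeros(len(stream))

    for index, point in enumerate(stream):
        for tree in forest:
            # Sliding window: evict the oldest point once the tree is full
            if len(tree.leaves) > tree_size:
                tree.forget_point(index - tree_size)
            tree.insert_point(point, index=index)
            avg_codisp[index] += tree.codisp(index) / num_trees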

Installation Instructions:
It is not clear from the GitHub readme or the provided documentation that numpy is the sole software dependency; it may be beneficial to point this out somewhere. That said, install_requires in setup.py is being used properly to ensure numpy is installed if it is not already present in the user's environment. Lastly, the authors should consider moving the installation instructions towards the top of the documentation and the readme, since readers typically start at the top of a document and work their way down through the examples.
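
For readers unfamiliar with the mechanism, the relevant setuptools idiom looks roughly like this (a sketch, not a verbatim copy of the project's setup.py):

    from setuptools import setup, find_packages

    setup(
        name='rrcf',
        version='0.3.1',
        packages=find_packages(),
        # numpy is the sole hard dependency; pip installs it automatically
        # when it is absent from the user's environment
        install_requires=['numpy'],
    )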

Community Guidelines:
Minimal documentation for contributing to the repository exists, indicating that contributors should submit pull requests to the dev branch. The developers could improve this by adopting a tagging scheme such as the one used in scikit-learn, where pull requests that are a work in progress are tagged [WIP], indicating that someone is working on a fix or new feature and that the work shouldn't be duplicated by someone else. When ready for merging, a two-step review process is initiated with the tags [MRG+1] and [MRG+2]. See the section on contributing pull requests in the scikit-learn documentation for more information.

References:
Every reference mentioned in the paper is documented with a BibTeX entry. However, no DOI is listed for A comparative study of RNN for outlier detection in data mining or Support Vector Data Description, which have DOIs 10.1109/ICDM.2002.1184035 and 10.1023/B:MACH.0000008084.60811.49, respectively. There is no listed DOI for Robust random cut forest based anomaly detection on streams, but I did not locate a matching DOI online for that article either. I have included these updates as part of my previously mentioned pull request.

@VivianePons

Thank you very much @vc1492a for this thorough review. @mdbartos, I'll let you work on the comments.

@JustinShenk

@VivianePons My review of this software is provided below.

Overview: I think the authors did a great job implementing the Robust Random Cut Forest algorithm as a Python package. I am also looking forward to using it in a few of my projects. Following @vc1492a's thorough review, I have little to add, so my review will instead emphasize the points already raised that I think are most relevant to the authors of this project.

The code is well commented and described in sufficient detail, which supports future development.

License: +1 to adding a badge to the heading.

Functionality: Nothing to add.

Version Matching: +1 to tagging releases uploaded to GitHub.

Automated Tests: Nothing to add.

Example Usage: +1 to cleaning up the notebooks if they are to remain.

Installation Instructions: +1 to putting the installation instructions towards the top of the document.

Community Guidelines: Nothing to add.

References: Nothing to add.

@mdbartos

Hi @vc1492a and @JustinShenk, thanks for the excellent reviews! I will get to work on these issues as soon as possible.

@mdbartos

@whedon set 0.3 as version


whedon commented Mar 26, 2019

I'm sorry @mdbartos, I'm afraid I can't do that. That's something only editors are allowed to do.

@mdbartos

Greetings,

@vc1492a and @JustinShenk: Thank you again for the extremely relevant and constructive feedback. It would not be an exaggeration to say that this is one of the most helpful reviews I have received on an academic paper :)

A point-by-point summary of changes is listed below:

License

  • We have added the MIT license badge to the README header.

Functionality

  • @vc1492a's pull request has been merged into the master branch. This merge fixed the error-handling behavior. TODO comments have also been removed.

Version matching

  • Historical releases have been tagged.
  • A post-review release (0.3) has been tagged and added to PyPI.

Automated tests

  • Instructions for running unit tests have been added to the README under contributing.
  • Additional tests have been added, raising the overall coverage to 84%.
  • Tests have been added to verify proper handling of duplicates.
  • A coveralls badge has been added to the README to indicate coverage.
  • Travis CI and Coveralls builds are triggered for pull requests and pushes to master. I thought that pull requests to dev were also checked, but I will verify.

Example usage

  • To ease maintainability, notebooks have been folded into the documentation website, and redundant examples have been removed.
  • Suggested versions of dependencies needed to run the examples are now included in the README.
  • Datasets used in the examples have been added to the /data folder in the main repo, and instructions for using these datasets have been clarified.
  • We added a caveats section to the documentation about the scaling of dimensions (see the sketch below). This addition was posted in response to some questions I received about the repo in the past week from another user.
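
The caveat comes down to the fact that random cuts are chosen in proportion to each dimension's range, so a feature on a much larger scale attracts most of the cuts. A minimal illustration (my sketch, not taken verbatim from the docs):

    import numpy as np

    # Columns on wildly different scales: the second column would dominate the cuts
    X = np.random.randn(1000, 3) * np.array([1.0, 100.0, 0.01])

    # Standardize each dimension before building trees
    X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)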

Installation instructions

  • We now list numpy as the sole required dependency in the README.
  • In the README, we now also list optional dependencies (and versions) needed to run the examples in the documentation.
  • We have moved the installation instructions to the top of the README.

Community guidelines

  • We have expanded the community guidelines by adding suggested types of contributions, guidelines for maintaining code quality and instructions for running unit tests.
  • Given that the repo is fairly small and limited in scope (compared to scikit-learn), I have not included a tagging system for pull requests as of yet. However, I am open to implementing one in the future.

References

  • We have merged @vc1492a's pull request, and the requested DOIs are now included. Thanks for the help!

Please let me know if you have any other suggestions. Your time, effort and expertise are very much appreciated.

Thanks again!
MDB

@mdbartos

@VivianePons : Would it be possible to change the version to 0.3? I tried to do this using whedon but was denied.

Also, would it be possible to change the description on JOSS to:

  • rrcf is a Python implementation of the Robust Random Cut Forest algorithm: an unsupervised ensemble method for anomaly detection on streaming data

Thank you!
MDB


vc1492a commented Mar 26, 2019

@mdbartos thanks for the reply and for integrating my pull request! @VivianePons I have reviewed the changes @mdbartos put in place; they look great, and I've updated the review checklist to reflect the latest status. Please let me know if there's anything else I can take care of as part of the review process. Thanks!

@VivianePons

@whedon set 0.3 as version


whedon commented Mar 26, 2019

OK. 0.3 is the version.

@VivianePons

@whedon generate pdf


whedon commented Mar 26, 2019

Attempting PDF compilation. Reticulating splines etc...


@VivianePons

@mdbartos I have updated the version, but I'm not sure which description you're referring to.

@VivianePons

@JustinShenk Have you looked at the most recent changes? Do you give your green light for accepting the paper?

@mdbartos

Hi @VivianePons, I was just referring to the description on the JOSS website: https://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d

I wanted to add "rrcf is a ..." to the beginning of the description to make the description more in line with the style of other submissions. It's not a big issue though.

@VivianePons

@arfon do you know where this description is coming from: https://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d ?


arfon commented Mar 26, 2019

@arfon do you know where this description is coming from: https://joss.theoj.org/papers/f8c83c0b01a984d0dbf934939b53c96d ?

Yes, it's what the author included in the form when they originally submitted. Once accepted, this text is never public so we can mostly ignore it.

@mdbartos

@whedon generate pdf


whedon commented Mar 26, 2019

Attempting PDF compilation. Reticulating splines etc...


@mdbartos

ping @JustinShenk

@JustinShenk

@JustinShenk Have you looked at the most recent changes? Do you give your green light for accepting the paper?

@VivianePons Yes, I've reviewed the recent changes from @mdbartos and updated the checklist. Looks good to me!

@VivianePons

@whedon check references


whedon commented Mar 28, 2019

Attempting to check references...


whedon commented Mar 28, 2019


OK DOIs

- 10.1109/cifer.1997.618940 is OK
- 10.1145/335191.335388 is OK
- 10.1137/1.9781611972733.3 is OK
- 10.1145/2133360.2133363 is OK
- 10.1109/TII.2010.2060732 is OK
- 10.1145/997150.997156 is OK
- 10.1109/ICDM.2002.1184035 is OK
- 10.1023/B:MACH.0000008084.60811.49 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@VivianePons

This paper is good to go! :)

@mdbartos could you create an online archive (for example on Zenodo) and give the reference?

@mdbartos

Hi @VivianePons,

I created an archive with Zenodo using their Github integration service:
https://zenodo.org/record/2613881

Unfortunately, to get the integration to work I had to update the release to 0.3.1.

Would it be possible to update the version to 0.3.1 for consistency?

Thanks again,
MDB

@VivianePons

@whedon set 0.3.1 as version


whedon commented Mar 29, 2019

OK. 0.3.1 is the version.

@VivianePons

Can you update the Zenodo metadata so that the title matches the paper title?

Your name and Sara Troutman's name also differ slightly between the two (a missing middle initial), and the author order is different.

Thanks

Viviane

@mdbartos

Ok, it is done.

Thanks,
MDB

@VivianePons

@whedon set 10.5281/zenodo.2613881 as archive


whedon commented Mar 29, 2019

OK. 10.5281/zenodo.2613881 is the archive.

@VivianePons

This paper is now good to go, @openjournals/joss-eics! Thank you @mdbartos for the paper and @JustinShenk and @vc1492a for the reviews! :) :)

@danielskatz

And thanks @VivianePons for editing

@danielskatz

@whedon accept


whedon commented Mar 29, 2019

Attempting dry run of processing paper acceptance...


whedon commented Mar 29, 2019


OK DOIs

- 10.1109/cifer.1997.618940 is OK
- 10.1145/335191.335388 is OK
- 10.1137/1.9781611972733.3 is OK
- 10.1145/2133360.2133363 is OK
- 10.1109/TII.2010.2060732 is OK
- 10.1145/997150.997156 is OK
- 10.1109/ICDM.2002.1184035 is OK
- 10.1023/B:MACH.0000008084.60811.49 is OK

MISSING DOIs

- None

INVALID DOIs

- None


whedon commented Mar 29, 2019

Check final proof 👉 openjournals/joss-papers#587

If the paper PDF and Crossref deposit XML look good in openjournals/joss-papers#587, then you can now move forward with accepting the submission by compiling again with the flag deposit=true e.g.

@whedon accept deposit=true

@danielskatz

@whedon accept deposit=true


whedon commented Mar 29, 2019

Doing it live! Attempting automated processing of paper acceptance...


whedon commented Mar 29, 2019

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited 👉 Creating pull request for 10.21105.joss.01336 joss-papers#588
  2. Wait a couple of minutes to verify that the paper DOI resolves https://doi.org/10.21105/joss.01336
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...


whedon commented Mar 29, 2019

🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](http://joss.theoj.org/papers/10.21105/joss.01336/status.svg)](https://doi.org/10.21105/joss.01336)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.01336">
  <img src="http://joss.theoj.org/papers/10.21105/joss.01336/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: http://joss.theoj.org/papers/10.21105/joss.01336/status.svg
   :target: https://doi.org/10.21105/joss.01336

This is how it will look in your documentation:

[DOI status badge]

We need your help!

Journal of Open Source Software is a community-run journal and relies upon volunteer effort. If you'd like to support us, please consider doing either one (or both) of the following:
