
Added Statistical Hierarchical Clusterer #18

Open
wants to merge 1 commit into
base: master

@dkrleza (Contributor) commented Oct 11, 2021

I have added the Statistical Hierarchical Clusterer (SHC), which is also an outlier detector and a single-pass clusterer.
You can find it as DSC_SHC in the source code.

Dalibor.

@mhahsler (Owner)

Hi Dalibor,

thank you for all the work. Here are a few comments/thoughts:

  • Avoid requiring ggplot/tidyverse in the package. This makes it very hard to use on a lightweight installation.
  • Do you have a better idea for the registry? I think adding new variables is not the way to go long-term. Maybe the registry was a bad idea in the first place.
  • Create DSOutlier as a task similar to DSC instead of DSC_Outlier. Your clusterer then inherits from both tasks, DSC and DSOutlier.
  • Should we give DSC_SHC a second name like DSOutlier_SHC so people will find it? (The same goes for the other outlier detection algorithms.) Users could then type DSOutlier and the autocomplete in RStudio would show all outlier methods.

Thanks!

-Michael

@dkrleza (Contributor, Author) commented Oct 12, 2021

Gr8 gr8.

Anyway, we are not in a rush with this. I noticed that the SHC C++ code somehow doesn't compile cleanly with MinGW. Yet it compiles perfectly under Unix/Linux, which leaves me puzzled. I mean, MinGW should be the Linux compilers and libraries ported to Windows... I need to see what happens there. You guessed right, I work on Unix all the time, so I don't run into these issues.

I'll throw ggplot2 out, since it is used only to nicely plot the SHC clustering results. This is probably not needed.

About the DSC registry: I understand the need for a central point where users can query for clusterers based on their needs. For that reason, I think that classifying clusterers only into micro and macro categories is not sufficient. As a business user, I want to see which clusterers support my needs and what capabilities they have. I think the DSC registry should be more extensive and give users a way to select clusterers based on many attributes, such as:

  • Does it require a predefined number of concepts?
  • Does it have micro / macro levels?
  • Is it a single-stage or two-stage algorithm?
  • Is it capable of detecting and isolating outliers?
  • Is it capable of detecting and tracking concept drift?
  • Is it suitable for processing data streams?
  • What is its complexity?
  • What types of data can it process? (For example, discrete or categorical data)
  • etc.
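
A minimal sketch of what such an attribute-based registry could look like, in plain base R. The column names, the example entries, and the `find_clusterers()` helper are all hypothetical; nothing here exists in the stream package.

```r
# Hypothetical capability registry: one row per clusterer, one column per
# attribute. Names and values are made up for illustration only.
registry <- data.frame(
  name             = c("DSC_Example1", "DSC_Example2", "DSC_Example3"),
  needs_k          = c(TRUE, FALSE, FALSE),  # requires a predefined number of concepts?
  single_pass      = c(FALSE, FALSE, TRUE),
  detects_outliers = c(FALSE, TRUE, TRUE),
  tracks_drift     = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Return the names of all clusterers whose attributes match every
# requested attribute = value pair.
find_clusterers <- function(registry, ...) {
  wanted <- list(...)
  keep <- rep(TRUE, nrow(registry))
  for (attr in names(wanted)) {
    keep <- keep & registry[[attr]] == wanted[[attr]]
  }
  registry$name[keep]
}

# Which clusterers detect outliers in a single pass?
find_clusterers(registry, detects_outliers = TRUE, single_pass = TRUE)
```

A flat table like this keeps queries simple, but every new attribute becomes a new column that each clusterer has to fill in.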

SHC is equally a clusterer and an outlier detector. That is the reason for the name "DSC_Outlier". But this is an interesting debate. For example, COD and MCOD are ONLY outlier detectors (with some rudimentary clustering capabilities), while SHC is all-in-one. Maybe "DSCO_SHC" and "DS_Outlier" (the abstract class)?

We are definitely adding some new capabilities to the stream package, and I recognize that this should be carefully introduced. There will be no turning back later on.

@mhahsler (Owner)

OK. Let me know if the following is good for the next release:

  • I remove the registry so we do not need to worry about this. If you type in RStudio DSC_ then all the DSCs are coming up anyway.
  • I provide a DSOutlier class that will be the base class for all outlier detectors. A DSC-based one will simply also inherit from DSC and have both a DSC_ and a DSOutlier_ name.
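
As a rough illustration of the second point, here is how double inheritance and a second name could look with plain S3 classes. The constructor below is a hypothetical stand-in (the real DSC_SHC takes its own arguments), and the `_sketch` names are not part of the package.

```r
# Hypothetical stand-in constructor for a clusterer that is also an
# outlier detector.
DSC_SHC_sketch <- function() {
  structure(
    list(description = "SHC (sketch)"),
    # Most specific class first, then both task classes.
    class = c("DSC_SHC", "DSC", "DSOutlier")
  )
}

# Second name for the same constructor, so that typing DSOutlier_ in an
# editor with autocomplete also finds it.
DSOutlier_SHC_sketch <- DSC_SHC_sketch

shc <- DSOutlier_SHC_sketch()
inherits(shc, "DSC")        # TRUE
inherits(shc, "DSOutlier")  # TRUE
```

Because S3 classes are just a character vector, methods written for either DSC or DSOutlier would dispatch on such an object.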

Once that is done, then you can adapt your new code (should only be a few lines).

Let me know if you think this is a consistent way to deal with this and I will go ahead.

-Michael

@mhahsler (Owner) commented Oct 29, 2021

A few more questions:

  • Do your DSC_Outlier algorithms mark (micro) clusters as outliers, or can they look at a stream and identify outlier points? I think a DSOutlier detector should do the latter, while the former is a regular clusterer that can flag micro-clusters as outlier clusters.
  • What is a single-pass clusterer? The documentation is not very clear on that.

@dkrleza (Contributor, Author) commented Oct 30, 2021

So, in stream 4.0 we introduced the abstract class DSC_Outlier (which can be renamed as you suggested). It can be found in DSC.R and has the following methods:

  • clean_outliers - clean all outlier points and discard them
  • get_outlier_positions - get data point positions which are detected as outliers
  • recheck_outlier - as we pass through the input data stream, we sometimes need to recheck whether some outlier is still perceived as an outlier
  • noutliers - the number of detected outliers

Outliers are indeed NOT perceived as micro or macro clusters; they are a totally distinct category. We also need that for the outlier accuracy indices, which were added to the stream package in 4.0 as well.
I think SHC is actually the first clusterer/outlier detector that uses all these mechanisms.
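
To make the four methods concrete, here is a toy sketch using S3 generics. Only the generic names come from the list above; the argument lists, the `toy_detector()` constructor, and the `feed_point()` helper are assumptions for illustration and may differ from what is in DSC.R.

```r
# Generics named after the four methods; signatures are assumed.
noutliers             <- function(x, ...) UseMethod("noutliers")
get_outlier_positions <- function(x, ...) UseMethod("get_outlier_positions")
recheck_outlier       <- function(x, id, ...) UseMethod("recheck_outlier")
clean_outliers        <- function(x, ...) UseMethod("clean_outliers")

# Hypothetical toy detector: a 1-D point farther than `threshold` from
# `center` is an outlier. State lives in an environment so it can be
# updated in place.
toy_detector <- function(center = 0, threshold = 3) {
  state <- new.env()
  state$center <- center
  state$threshold <- threshold
  state$outliers <- numeric(0)
  structure(list(state = state), class = c("DSOutlier_Toy", "DSOutlier"))
}

# Hypothetical helper: feed one data point to the detector.
feed_point <- function(x, point) {
  s <- x$state
  if (abs(point - s$center) > s$threshold) s$outliers <- c(s$outliers, point)
  invisible(x)
}

noutliers.DSOutlier_Toy <- function(x, ...) length(x$state$outliers)
get_outlier_positions.DSOutlier_Toy <- function(x, ...) x$state$outliers
recheck_outlier.DSOutlier_Toy <- function(x, id, ...) {
  s <- x$state
  abs(s$outliers[id] - s$center) > s$threshold  # still an outlier?
}
clean_outliers.DSOutlier_Toy <- function(x, ...) {
  s <- x$state
  s$outliers <- numeric(0)  # discard all stored outliers
  invisible(x)
}

d <- toy_detector(center = 0, threshold = 3)
for (p in c(0.5, 10, -1, -7)) feed_point(d, p)
noutliers(d)              # 2
get_outlier_positions(d)  # 10 -7
d$state$center <- 8       # the model shifts; 10 is no longer far away
recheck_outlier(d, 1)     # FALSE
clean_outliers(d)
noutliers(d)              # 0
```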

DSC_SinglePass
Some clusterers do NOT perform stream clustering in two phases (a micro-clustering phase and a macro-clustering phase). I think that MCOD is such a clusterer. SHC certainly does NOT work in two phases. You retrieve a data point from the input data stream, call the single-pass clusterer/outlier detector, and immediately get the resulting classification back. That classification contains either micro- and macro-cluster identifiers or an outlier identifier. Before stream 4.0, the stream package's evaluation mechanism could not support this single-pass way of clustering. This abstract class was added mostly as a switch that signals to the evaluation mechanism that the resulting classification will be returned immediately by the underlying clusterer/outlier detector.
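
The single-pass flow described above can be sketched roughly as follows. This is a toy with fixed 1-D centers and no model update, not SHC itself; `single_pass_classifier()` and its parameters are hypothetical.

```r
# Hypothetical single-pass classifier over 1-D points: each call consumes
# one point and immediately returns its label, with no separate
# macro-clustering step afterwards.
single_pass_classifier <- function(centers, radius) {
  function(point) {
    d <- abs(point - centers)
    nearest <- which.min(d)
    if (d[nearest] <= radius) paste0("cluster_", nearest) else "outlier"
  }
}

classify <- single_pass_classifier(centers = c(0, 10), radius = 2)

# Pull points one at a time and get a classification back right away.
stream_points <- c(0.3, 9.1, 25, 1.9)
labels <- vapply(stream_points, classify, character(1))
labels
```

An evaluation routine can compare each returned label with the true label as soon as the point has been processed, which is the behaviour the DSC_SinglePass switch signals.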
