TICC fails with clusters with only one observation #66
Hi David
I used your algorithm for anomaly detection on airplane data as part of my master's thesis, and I found the following potential problem.
If the number of clusters is too high, TICC might fail because the result contains clusters with only one observation. TICC calculates the covariance matrix of each cluster using the unbiased covariance formula (with N-1 in the denominator), where N is the number of observations in the cluster.
For a cluster with one observation, this divides by 0, which results in NaN and causes the algorithm to fail. Clusters with only one observation are not typical, but they might be interesting for anomaly detection. TICC is based on the EM algorithm and therefore iterates, so clusters of size one can also appear temporarily during fitting. Hence, it would be handy if TICC could work with clusters of size one.
I therefore added an option to choose between the unbiased covariance and the biased covariance (which divides by N).
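To illustrate the failure mode, here is a minimal standalone sketch. It is not the TICC code itself: the function name `cluster_covariance` and its `biased` flag are hypothetical and only mirror the idea of the new option.

```python
import numpy as np

def cluster_covariance(points, biased=False):
    """Covariance of one cluster; rows are observations, columns are features.

    Hypothetical helper for illustration only, not part of TICC.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    centered = points - points.mean(axis=0)
    # Unbiased estimate divides by N - 1, biased estimate divides by N.
    denom = n if biased else n - 1
    # Silence NumPy's 0/0 warning so the NaN result can be inspected.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (centered.T @ centered) / denom

single = [[1.0, 2.0, 3.0]]  # a cluster with exactly one observation

unbiased_cov = cluster_covariance(single)             # every entry is 0/0
biased_cov = cluster_covariance(single, biased=True)  # every entry is 0/1
```

With a single observation the unbiased estimate is NaN everywhere, while the biased estimate is a well-defined all-zero matrix.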
I also added some tests, which reproduce the failure and show that both options result in the same cluster assignment. I also had a closer look at the biased and the unbiased covariance matrices. The differences between them are mostly very small. In one of my experiments, the differences are of magnitude 10e-2 or smaller.
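The small differences are expected, because the biased estimate is just the unbiased estimate scaled by (N-1)/N. A quick check on synthetic data (random numbers, not the airplane data from my experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=(50, 3))  # synthetic: 50 observations, 3 features
n = samples.shape[0]

unbiased = np.cov(samples, rowvar=False)           # divides by N - 1
biased = np.cov(samples, rowvar=False, bias=True)  # divides by N

# The two estimates differ only by the constant factor (N - 1) / N.
scale_matches = np.allclose(biased, unbiased * (n - 1) / n)

# Largest entry-wise gap between the two matrices.
max_abs_diff = np.abs(unbiased - biased).max()
```

The gap shrinks as N grows, so the choice only matters noticeably for very small clusters.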
Another option would be to use the biased covariance only when a cluster has a single observation, but I leave this up to you.
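A rough sketch of that alternative, again as a standalone illustration rather than TICC code (the name `covariance_with_fallback` is hypothetical):

```python
import numpy as np

def covariance_with_fallback(points):
    """Unbiased covariance, falling back to the biased one for a single row.

    Hypothetical helper for illustration only, not part of TICC.
    """
    points = np.asarray(points, dtype=float)
    n = points.shape[0]
    # ddof=1 gives the usual N - 1 denominator; with one observation
    # that would be 0, so use ddof=0 (divide by N) in that case only.
    ddof = 0 if n == 1 else 1
    centered = points - points.mean(axis=0)
    return (centered.T @ centered) / (n - ddof)

one_point = covariance_with_fallback([[1.0, 2.0]])
many_points = covariance_with_fallback([[1.0, 2.0], [3.0, 5.0], [4.0, 4.0]])
```

Note that the covariance of a single observation is the all-zero matrix, which is singular. I have not checked here whether every later step in TICC accepts that without extra regularization.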
Kind regards
Jordy Heusdens