@Heusdens97
Contributor

Hi David

I used your algorithm for anomaly detection on airplane data as part of my master's thesis, and I ran into the following potential problem.

If the number of clusters is too high, TICC might fail because some clusters end up containing only one observation. TICC calculates the covariance matrix of each cluster using the unbiased covariance formula (with N-1 in the denominator), where N is the number of observations in the cluster.

In this case, it divides by 0, which results in NaN and causes the algorithm to fail. Clusters with only one observation are not typical, but they might be interesting for anomaly detection. Because TICC is based on the EM algorithm and therefore iterates, it can also produce temporary clusters of size one. Hence, it would be handy if TICC could work with clusters of size one.
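The failure can be reproduced outside TICC with plain NumPy. This is a minimal sketch, not the TICC code itself: `single_point` is a made-up one-observation cluster, and the call layout (variables in rows, hence the transpose) is an assumption about how the covariance is computed.

```python
import warnings
import numpy as np

# A hypothetical cluster holding a single observation with three features.
single_point = np.array([[1.0, 2.0, 3.0]])

with warnings.catch_warnings():
    # NumPy emits RuntimeWarnings for the 0/0 division; silence them here.
    warnings.simplefilter("ignore", RuntimeWarning)
    # np.cov expects variables in rows by default, so pass the transpose.
    unbiased = np.cov(single_point.T)           # denominator N - 1 = 0
    biased = np.cov(single_point.T, bias=True)  # denominator N = 1

print("unbiased is all NaN:", bool(np.isnan(unbiased).all()))
print("biased is all zeros:", bool(np.allclose(biased, 0.0)))
```

With one observation, every deviation from the mean is zero, so the unbiased estimate evaluates 0/0 and yields NaN, while the biased estimate is a well-defined zero matrix.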

Thus, I added an option to choose between the unbiased and the biased covariance (the biased version divides by N).
I also added some tests, which illustrate the failure and show that both options result in the same cluster assignment. I also had a closer look at the biased and the unbiased covariance matrices. The differences between them are mostly very small. In one of my experiments, the differences are of magnitude 10e-2 or smaller.
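To illustrate why the two estimates stay close, here is a small sketch on synthetic data. The helper `cluster_covariance` and its `biased` flag are hypothetical names used only for this example, and the random data stands in for a real cluster. The two matrices differ only by the factor (N-1)/N, so the gap shrinks as the cluster grows.

```python
import numpy as np

def cluster_covariance(points, biased=False):
    """Hypothetical helper: covariance of an (N, d) array of observations."""
    # bias=True divides by N, bias=False divides by N - 1.
    return np.cov(points.T, bias=biased)

rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3))  # synthetic cluster: 500 observations, 3 features
n = points.shape[0]

unbiased = cluster_covariance(points)
biased = cluster_covariance(points, biased=True)

# The biased estimate is the unbiased one scaled by (N - 1) / N.
scaling_holds = np.allclose(biased, unbiased * (n - 1) / n)
max_diff = np.abs(unbiased - biased).max()
print("scaling holds:", bool(scaling_holds))
print("largest entrywise difference:", max_diff)
```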

Another option would be to use the biased covariance only when a cluster has a single observation, but I leave this up to you.
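That alternative could look roughly like the sketch below. The function name `covariance_with_fallback` is hypothetical and this is not what the pull request implements; it only shows the idea of switching to the biased formula when N is 1.

```python
import numpy as np

def covariance_with_fallback(points):
    """Hypothetical helper: unbiased covariance, biased only for a single observation."""
    points = np.atleast_2d(points)
    # With fewer than two observations, N - 1 would be 0, so divide by N instead.
    use_biased = points.shape[0] < 2
    return np.cov(points.T, bias=use_biased)

single = np.array([[1.0, 2.0, 3.0]])
many = np.array([[1.0, 2.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [0.0, 3.0, 5.0]])

cov_single = covariance_with_fallback(single)  # biased path, no NaN
cov_many = covariance_with_fallback(many)      # regular unbiased path
```

Note that the covariance of a single observation is the zero matrix, so any downstream step that needs an invertible matrix would still have to handle that case.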

Kind regards
Jordy Heusdens

@davidhallac davidhallac merged commit 85d45d1 into davidhallac:master Jun 14, 2020