[MNT, ENH, DOC] Rework similarity search #2473
Thank you very much for working on this. Some thoughts:
That is an interesting problem. Here is my view:
For whole series similarity search, fit requires a dataset of time series of equal length, and find_neighbors gets one or many query series of this length.
For subsequence similarity search, fit requires a single time series, and find_neighbors commonly gets a single query sequence whose length is shorter than that of the fitted series. It would be fine, however, to extend it to multiple short sequences.
There is only one paper on whole series consensus motif search, which would be the use case of whole matching and motif discovery. The input to fit would be the whole dataset, and find_motif has no input series X. I am not sure what an input series X should trigger.
Most papers solve the problem of motif discovery in a single long time series, with motifs defined as subsequences of that series. Here, fit gets a single series, and find_motif has no input series X.
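To make the subsequence case concrete, here is a minimal brute-force sketch in NumPy: one series plays the role of the input to fit, and a shorter query is compared against every subsequence of it. The function names `distance_profile` and `find_neighbors` and the squared Euclidean distance are assumptions chosen for illustration; this is not the aeon API.

```python
import numpy as np


def distance_profile(series, query):
    """Squared Euclidean distance from `query` to every subsequence of `series`.

    series : array of shape (n_channels, n_timepoints)
    query  : array of shape (n_channels, length), with length <= n_timepoints
    """
    length = query.shape[1]
    # windows has shape (n_channels, n_candidates, length)
    windows = np.lib.stride_tricks.sliding_window_view(series, length, axis=1)
    diff = windows - query[:, np.newaxis, :]
    return (diff ** 2).sum(axis=(0, 2))


def find_neighbors(series, query, k=1):
    """Return (start indexes, distances) of the k best-matching subsequences."""
    profile = distance_profile(series, query)
    order = np.argsort(profile, kind="stable")[:k]
    return order, profile[order]


series = np.array([[0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]])
query = np.array([[2.0, 3.0, 2.0]])
indexes, distances = find_neighbors(series, query, k=2)
```

The whole series case would look the same, except that the candidates are the cases of a collection rather than the sliding windows of one series.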
What is the difference between BaseMatrixProfile and STOMP? At least for Motiflets, we cannot use STOMP/MP, as it only gives a 1-NN profile, but we need k-NN profiles. The same problem arises if you want to solve k-nearest neighbors similarity search.
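The difference between a 1-NN profile and a k-NN profile can be sketched with a brute-force self-join. This is only an illustration under stated assumptions (univariate series, squared Euclidean distance, an exclusion zone equal to the subsequence length, and the hypothetical name `knn_profile`); it is neither STOMP nor the Motiflets algorithm, and it needs memory quadratic in the number of subsequences.

```python
import numpy as np


def knn_profile(series, length, k):
    """Brute-force k-NN profile of a univariate series.

    For every subsequence of the given length, return the distances to and
    the start indexes of its k nearest non-overlapping subsequences.
    """
    windows = np.lib.stride_tricks.sliding_window_view(series, length)
    n = windows.shape[0]
    # Pairwise squared Euclidean distances between all subsequences.
    diff = windows[:, np.newaxis, :] - windows[np.newaxis, :, :]
    dist = (diff ** 2).sum(axis=2)
    # Exclusion zone: ignore trivial matches that overlap the subsequence.
    idx = np.arange(n)
    overlap = np.abs(idx[:, np.newaxis] - idx[np.newaxis, :]) < length
    dist[overlap] = np.inf
    order = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, order, axis=1), order


series = np.array([0.0, 1.0, 0.0, 5.0, 0.0, 1.0, 0.0, 9.0])
knn_dist, knn_idx = knn_profile(series, length=2, k=2)
# Column 0 alone is the 1-NN profile; k-NN based methods need all k columns.
one_nn_dist = knn_dist[:, 0]
```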
I think that X is not meaningful for motif discovery.
Thanks for the inputs @patrickzib😄
Completely in line with this, but what about the case of unequal length series, with, for example, elastic distance measures? Wouldn't that be a plausible use case? (Not all whole series estimators would have to support it.)
For this case, I'm defining a length parameter during
This is the tricky one for me too. I'm not sure how giving
I've been kinda frustrated by this limitation for practical use cases. Wouldn't it be fine to loop over the series of a collection with the motif discovery methods and then merge the results? That's how I implemented STOMP for now, for example. For each subsequence in
As stated above, I already extended STOMP to support k-NN profiles for collections (multivariate and unequal length compatible). I suppose that in this context, motiflets would either inherit from
Note that it's possible to simply raise a NotImplementedError or something similar if an estimator only supports neighbors search or only motif search. My goal here is to find a base class structure that lets us move most of the common code there and focus on the computational optimisations of each method in the child classes.
In the context of motif search in a single series I agree, but wouldn't there be some interest when dealing with a collection? For example, finding motifs in the collection on the condition that they are similar to a subsequence in X? (This is pure speculation.)
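The idea of looping over the series of a collection and merging the results can be sketched for the neighbor search case as follows. This is a hedged illustration, not the STOMP implementation mentioned above: the function names, the univariate input, and the squared Euclidean distance are all assumptions.

```python
import numpy as np


def best_matches_in_series(series, query, k):
    """Top-k (start indexes, distances) of a 1D query in one 1D series."""
    windows = np.lib.stride_tricks.sliding_window_view(series, query.shape[0])
    profile = ((windows - query) ** 2).sum(axis=1)
    order = np.argsort(profile, kind="stable")[:k]
    return order, profile[order]


def best_matches_in_collection(collection, query, k):
    """Search each series independently, then merge into a global top-k.

    Returns an array of (series index, start index) pairs and their distances.
    Series may have different lengths, since each one is searched on its own.
    """
    candidates = []
    for i_series, series in enumerate(collection):
        starts, dists = best_matches_in_series(series, query, k)
        candidates.extend((d, i_series, s) for s, d in zip(starts, dists))
    # Ascending distance; ties are broken by series index, then start index.
    candidates.sort()
    top = candidates[:k]
    indexes = np.array([(i, s) for _, i, s in top], dtype=int)
    distances = np.array([d for d, _, _ in top])
    return indexes, distances


collection = [
    np.array([0.0, 1.0, 2.0, 1.0, 0.0]),
    np.array([5.0, 1.0, 2.0, 1.5, 4.0, 4.0, 4.0]),  # unequal length is fine
]
query = np.array([1.0, 2.0, 1.0])
indexes, distances = best_matches_in_collection(collection, query, k=2)
```

Keeping k candidates per series before merging is enough to guarantee a correct global top-k, because no series can contribute more than k entries to it.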
Sure. I did not think of this.
Simplicity :) But I agree that you could have multiple series in fit, too - this would mimic the Shapelet use case, I suppose?
Sorry, yes, that is what the authors refer to as consensus motif:
I see. I personally do not like to use the terms matrix-profile for simple k-NN distances or k-NN indices though. It was a brilliant re-framing of EK, such that all 1-NN algorithms are now suddenly an instance of matrix profile. Yet, the concept is much older.
Great.
I would not say that this is impossible, but I have not seen it. :)
I'm not 100% sure what you mean, but in a sense yes ? For example with a brute force neighbour search, just compute the distance of the subsequence given in
I'm not against the idea of a different naming, especially if methods labelled differently from MPs would fit in the base class without much change of parameter/interface. Would you have any proposal? Something like |
In sklearn it is simply |
…class-with-attimo-algorithm
I'll leave the implementation of |
Sorry, only a very brief review. I would have to test the code to give a better review, but I will be leaving for vacation...
from aeon.similarity_search._base import BaseSimilaritySearch


class BaseCollectionSimilaritySearch(BaseCollectionEstimator, BaseSimilaritySearch):
I am not sure if I get the difference between this base class and the other base class. What is the purpose of it?
The BaseCollectionSimilaritySearch is used for estimators taking collections of time series during fit/predict, while the BaseSeriesSimilaritySearch is for single series estimators (similar to the transformer module)
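As a rough illustration of that split, the sketch below shows how a shared base could hold the common fit logic while the series and collection bases only declare the input shape they accept. All class names carry a `Sketch` prefix to make clear they are hypothetical; the real aeon base classes also inherit from aeon's estimator bases and do considerably more.

```python
import numpy as np


class SketchBaseSimilaritySearch:
    """Hypothetical shared base: common validation and fit logic live here."""

    expected_ndim = None  # set by the series and collection bases below

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        if X.ndim != self.expected_ndim:
            raise ValueError(
                f"expected a {self.expected_ndim}D array, got {X.ndim}D"
            )
        self.X_ = X
        return self

    def predict(self, X):
        # Concrete estimators would implement the actual search.
        raise NotImplementedError


class SketchSeriesSimilaritySearch(SketchBaseSimilaritySearch):
    """Fitted on a single series of shape (n_channels, n_timepoints)."""

    expected_ndim = 2


class SketchCollectionSimilaritySearch(SketchBaseSimilaritySearch):
    """Fitted on a collection of shape (n_cases, n_channels, n_timepoints)."""

    expected_ndim = 3


series_search = SketchSeriesSimilaritySearch().fit(np.zeros((1, 10)))
collection_search = SketchCollectionSimilaritySearch().fit(np.zeros((4, 1, 10)))
```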
Parameters
----------
X : np.ndarray, shape = (n_cases, n_channels, n_tiempoints)
    Collections of series to predict on.
Spelling error: n_tiempoints
And why is the input different from fit?
Do you mean different from BaseSimilaritySearch fit?
I convert single series to 3D (1, n_channels, n_timepoints), but I haven't yet added support for unequal length as there was no estimator using it.
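The conversion described here amounts to adding a leading case axis. A minimal sketch, assuming NumPy input and a hypothetical helper name `to_collection` (the actual aeon code path may differ):

```python
import numpy as np


def to_collection(X):
    """Return X as a 3D array of shape (n_cases, n_channels, n_timepoints)."""
    X = np.asarray(X)
    if X.ndim == 2:
        # Single series (n_channels, n_timepoints): add a leading case axis.
        return X[np.newaxis, :, :]
    if X.ndim == 3:
        return X
    raise ValueError(f"expected a 2D or 3D array, got {X.ndim}D")


single_series = np.zeros((2, 50))
as_collection = to_collection(single_series)
```

Unequal length collections cannot be stored in a single 3D array, which is why supporting them would need a separate code path (for example a list of 2D arrays).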
@@ -0,0 +1 @@
"""Motif search for time series collection."""
Motif discovery, not search :)
What do you mean by "for time series collection"? Is this the same as Consensus Motifs? Then better call it that way?
I was aiming for something more generic than consensus motifs: simply motif discovery on a collection of time series instead of on a single series. I didn't want to bind the module to one terminology. But consensus motifs would be the first thing to implement in there, yes!
Some comments for the bits outside the actual module itself. I will have a look at the rest as well, but generally trust you and Patrick agree on something sensible together. Let me know if any of this contradicts a previous review.
- [**Similarity search**](api_reference/similarity_search), where the goal is to find
time series motifs or nearest neighbors in an efficient way for either single series
or collections.
There is an example on this page which is now outdated.
@@ -155,7 +155,7 @@ class : identifier for the base class of objects this tag applies to
        "values?",
    },
    "input_data_type": {
        "class": "transformer",
        "class": ["transformer", "similarity-search"],
I would update the description; this is needed for more than transformers now.
@@ -451,15 +454,56 @@ def get_subsequence_with_mean_std(
    return values, means, stds


@njit(cache=True, fastmath=True, parallel=True)
def compute_mean_stds_collection_parallel(X):
Thanks for updating the list at the top. I think this one is still missing though? Alternatively, it should be protected.
Not required here but helpful: updating the utils API page where relevant
@@ -92,6 +92,7 @@ def all_estimators(
# ignore test modules and base classes
"base",
"tests",
"similarity_search"
I wouldn't think we want to ignore these. All of these are essentially empty of estimators or for testing only.
@@ -87,7 +87,6 @@ Mock Estimators
MockUnivariateSeriesTransformer
MockMultivariateSeriesTransformer
MockSeriesTransformerNoFit
MockSimilaritySearch
This is missing the two new ones
# similarity search
"MockSimilaritySearch",
Missing the two new ones
Reference Issues/PRs
Fixes #2341, #2236, #2028, #2020, #1806, #2475, #2538
What does this implement/fix? Explain your changes.
The previous structure for similarity search was not in line with the structure we would expect from other aeon modules. The lack of distinct base classes for some tasks, together with the initial design choices (made without much practical experience of using and expanding the module), led to some really complex code when working on #2341 to make everything work together. Further expanding the module would have made things worse.
To make the module more flexible and comprehensible, the following rework is proposed in this PR (AEP to be updated accordingly):
The module structure is now :
Base classes are
BaseSimilaritySearch, BaseSeriesSimilaritySearch, BaseCollectionSimilaritySearch
Implemented estimators are :
The suffix of the estimators (SNN/ANN/Motifs) remains an open discussion; I am not sure it's the right way to go.
I removed the support for collections for Stomp and Mass for now to focus on the "expected and well known" use cases. I'll add collection support back in another PR.
All similarity search estimators now use a fit/predict interface, with predict returning two arrays (NN/Motifs indexes, and NN/Motifs distances).
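A minimal sketch of that contract, for the whole series case: fit stores the collection, and predict returns an array of indexes and an array of distances. The class `BruteForceWholeSeriesNN` is hypothetical and is not one of the estimators in this PR; it assumes equal length series and squared Euclidean distance.

```python
import numpy as np


class BruteForceWholeSeriesNN:
    """Hypothetical whole series nearest neighbour search (not an aeon class)."""

    def __init__(self, n_neighbors=1):
        self.n_neighbors = n_neighbors

    def fit(self, X):
        # X: (n_cases, n_channels, n_timepoints), equal length series.
        self.X_ = np.asarray(X, dtype=float)
        return self

    def predict(self, X):
        # X: one query series of shape (n_channels, n_timepoints).
        query = np.asarray(X, dtype=float)
        distances = ((self.X_ - query) ** 2).sum(axis=(1, 2))
        indexes = np.argsort(distances, kind="stable")[: self.n_neighbors]
        return indexes, distances[indexes]


X_fit = np.array([[[0.0, 0.0, 0.0]], [[1.0, 2.0, 3.0]], [[1.0, 2.0, 4.0]]])
searcher = BruteForceWholeSeriesNN(n_neighbors=2).fit(X_fit)
indexes, distances = searcher.predict(np.array([[1.0, 2.0, 3.0]]))
```

Returning indexes and distances as two parallel arrays keeps the same output shape for neighbor search and motif discovery, which is what lets both share one predict signature.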
Does your contribution introduce a new dependency? If yes, which one?
No.
Any other comments?
As this is still a WIP, I would love some input on the structure (notably from @patrickzib!) to make the module more future-proof and easier to use.
TODO list :
SubsequenceSearch part and fix them
BaseCollectionSimilaritySearch