WEASEL+MUSE with Samples of Different Lengths #81
Hi, thank you for your interest in pyts. Data sets of variable-length time series are unfortunately poorly supported at the moment, for several reasons.
That being said, there are still ways of dealing with variable-length time series data sets. Some naive approaches would be to truncate or pad (with a real number) the time series so that they form a fixed-length data set. WEASEL+MUSE is basically WEASEL applied independently to each feature and to the derivatives of the time series. So, if the lengths of the time series are identical for a given feature, you're good to go. I could give you a better answer and a code example if I knew why the time series have different lengths and the approximate size of your data set. Hope this helps a bit.
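The padding workaround mentioned above can be sketched with plain NumPy. This is a minimal illustration (the data and helper name are made up), showing how variable-length samples can be right-padded with a constant into the rectangular (n_samples, n_features, n_timestamps) array that pyts expects:

```python
import numpy as np

# Hypothetical variable-length data: 3 samples, 2 features each,
# stored as lists because n_timestamps differs per sample.
samples = [
    [np.array([1.0, 2.0, 3.0]),      np.array([0.5, 0.4, 0.3])],
    [np.array([2.0, 1.0, 0.0, 1.0]), np.array([0.1, 0.2, 0.3, 0.4])],
    [np.array([3.0, 1.0]),           np.array([0.9, 0.8])],
]

def pad_to_fixed_length(samples, fill_value=0.0):
    """Right-pad every series with a constant so that all samples share
    the same n_timestamps, yielding (n_samples, n_features, n_timestamps)."""
    n_features = len(samples[0])
    max_len = max(len(series) for sample in samples for series in sample)
    X = np.full((len(samples), n_features, max_len), fill_value)
    for i, sample in enumerate(samples):
        for j, series in enumerate(sample):
            X[i, j, :len(series)] = series
    return X

X = pad_to_fixed_length(samples)
print(X.shape)  # (3, 2, 4)
```

Note that padding with a constant can still produce near-constant windows at the end of short series, which is exactly what triggers the binning error discussed in this thread.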
Thanks for getting back to me. It's definitely understandable why the algorithms are the way they are; the problems have usually asked for it that way. The problem I'm pursuing: many features measured through time, going into and out of a target state, so only a binary classification (True or False). Each period the features are in state (True) is obviously a different length than the preceding out-of-state period (False). I had chopped up each time series into samples of True and False. For example: features A and B, measured every minute for a day. Let's say the system is in state from 07:00 to 10:00 and from 14:00 to 15:00. I want to transform this data and then build a classifier of True/False states. I have tried padding the time series with 0's to make them all the same length, but an exception is raised.
I hope that makes sense. The scale of the data is roughly ~600 samples, 50 features, with lengths ranging from 60 to 150 timestamps.
Thank you for your detailed answer. Concerning the error, I should get rid of it because it is too restrictive. Basically it occurs because the Fourier coefficients are all equal to 0, and binning a constant variable into more than 1 bin is a bit weird.

Maybe your example is a bit too simple, but if the features can go into the target state only every hour, and if you assume that going in and out of the target state depends only on the previous hour, you would end up with one sample per hour, and all the time series would have the same length.

One important issue with your formulation is that you use the target state to define the samples (i.e., how to split the whole time series into sub-series). When you apply the algorithm to new, unseen data, you won't be able to split the whole time series, because you can't use the labels of the test samples (that's "cheating").

To me, your task looks more like regression for binary time series: you have a target binary time series. I discussed this in a previous issue (#80); you may find some relevant information in that discussion. Let me know if this helps.
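The fixed-window reformulation suggested above can be sketched as follows. All names and the toy data are illustrative (one feature measured per minute, a synthetic binary state): each sample is one hour of measurements, and its label is whether the target state is active at any point in the following hour, so every sample has the same length and labels are never used to choose the split points.

```python
import numpy as np

rng = np.random.default_rng(0)
n_minutes = 24 * 60
signal = rng.normal(size=n_minutes)            # one feature, per-minute
state = (np.arange(n_minutes) // 60) % 3 == 0  # toy binary target state

window = 60                          # one hour of per-minute measurements
n_windows = n_minutes // window - 1  # last hour has no "next hour" label

# one fixed-length sample per hour
X = np.stack([signal[i * window:(i + 1) * window] for i in range(n_windows)])
# label = is the target state ever active during the following hour?
y = np.array([state[(i + 1) * window:(i + 2) * window].any()
              for i in range(n_windows)])

print(X.shape, y.shape)  # (23, 60) (23,)
```

For the multivariate case, the same windowing would be applied to every feature and the windows stacked into a (n_samples, n_features, window) array.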
I think my simple example didn't illustrate the problem; my apologies. Let me try again. I'm building a classifier on a set of time series, to classify whether a future, examined time is in a target state (True) or not (False). I thought about training this classifier (for example, scikit-learn's RandomForestClassifier) on the X generated by the WEASEL+MUSE transform. Once trained, the classifier could be applied to future data to give a probability of being in state or not (e.g. predict_proba from RandomForestClassifier). The time series were labeled as in-state or out-of-state for all time, so the boundaries are uneven (for example, in state from 01:34 to 03:15, not in state from 03:15 to 09:32, etc.). That was the original reason for samples with varying n_timestamps. I'm not trying to predict the actual values of the time series, just classify them. I thought the words generated by WEASEL+MUSE could be robust features for training this classifier. Do you have any experience using it for a problem like that? Do you even think WEASEL+MUSE is appropriate for it? Thanks.
As you said, your first example was maybe too simple, but from what I understood, your data set consists of a sequence of measurements together with a binary target state. You split the whole sequence of measurements based on the value of the target state and define the label of each sub-series as the future value of the target. Am I wrong? To me, your target state is a binary time series and you are trying to predict its future value.
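This reading of the splitting procedure can be illustrated with a small sketch (toy data, names made up): segments are the maximal runs of the binary target, and each segment's label is the target's value right after the segment ends.

```python
import numpy as np

state = np.array([0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0])  # toy binary target
measurements = np.arange(len(state), dtype=float)     # stand-in feature

# indices where the target changes value -> segment boundaries
boundaries = np.flatnonzero(np.diff(state)) + 1
segments = np.split(measurements, boundaries)

# label of each segment (except the last) = target value just after it
labels = state[boundaries]
print([len(s) for s in segments], labels)  # [3, 2, 2, 3, 1] [1 0 1 0]
```

Note that with this construction each label is simply the negation of the segment's own state value, which supports the point above: the real task is predicting the binary target series, not classifying independently drawn samples.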
No, your description of the problem is correct, and re-reading your last post I understand it better. Let me look more into the other post and that topic. Thank you so much for your time!
If each sample does not have the same length (n_timestamps), is it possible to perform WEASEL+MUSE on the dataset?
If I construct an array of (n_samples, n_features, n_timestamps) with n_timestamps that aren't equal, I get an exception from the validation of the array.
If I pad the samples with None or if I pad with some constant value, I get an exception (NaNs not allowed / quantiles equal).
Is there a solution/workaround to this problem, or am I chasing something that isn't allowed?
Thank you for a great package!