Dimensionality Reduction (approximation) along columns/time axis? #8

Open
legout opened this issue Aug 6, 2018 · 9 comments

Comments

@legout

legout commented Aug 6, 2018

Hi,

I wonder why the approximation functions PAA and DFT are applied to the rows? In my opinion based on what I found in the papers and dissertation of Patrick Schäfer, this should be applied to the columns (along the time axis). Am I wrong?

For example the code below returns an error:

import numpy as np
import pyts.approximation as pya

x = np.random.randn(100,2)

paa = pya.PAA(window_size=10)
x_paa = paa.fit_transform(x)

print('Shape of x {}'.format(x.shape))
print('Shape of x_paa {}'.format(x_paa.shape))

ValueError: 'window_size' must be lower or equal than the size of each time series.

However, what I was expecting is the following:

import numpy as np
import pyts.approximation as pya

x = np.random.randn(100,2)

paa = pya.PAA(window_size=10)
x_paa = paa.fit_transform(x)

print('Shape of x {}'.format(x.shape))
print('Shape of x_paa {}'.format(x_paa.shape))

Shape of x (100, 2)
Shape of x_paa (10, 2)

Regards,
legout

@johannfaouzi
Owner

Hi legout,

My convention is that each row is a time series, which means that the time axis is the second axis. For instance, x = np.random.randn(100,2) means that you have 100 time series of length 2. If your data is not in this format, you can just transpose your numpy array with the transpose method x.T.
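To make the convention concrete, here is a minimal NumPy sketch of piecewise aggregate approximation along the second axis. The helper `paa_rows` is hypothetical and is not the pyts implementation; it assumes non-overlapping windows and drops any trailing timestamps that do not fill a whole window. It only illustrates why data stored as (timestamps, variables) has to be transposed first.

```python
import numpy as np


def paa_rows(X, window_size):
    """Average non-overlapping windows along axis 1 (the time axis).

    Hypothetical helper for illustration; not the pyts implementation.
    Trailing timestamps that do not fill a whole window are dropped.
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_timestamps = X.shape
    if window_size > n_timestamps:
        raise ValueError(
            "window_size must not exceed the length of each time series"
        )
    n_windows = n_timestamps // window_size
    trimmed = X[:, : n_windows * window_size]
    return trimmed.reshape(n_samples, n_windows, window_size).mean(axis=2)


rng = np.random.default_rng(0)
x = rng.standard_normal((100, 2))  # 100 timestamps (rows), 2 variables (columns)

# Each row must be one time series, so transpose before and after.
x_paa = paa_rows(x.T, window_size=10).T

print("Shape of x {}".format(x.shape))
print("Shape of x_paa {}".format(x_paa.shape))
```

Calling `paa_rows(x, window_size=10)` without the transpose raises the `ValueError`, because each row is then read as a time series of length 2.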

If you're familiar with scikit-learn, you can think of the timestamps as the features of your data.

Best regards,
Johann

@legout
Author

legout commented Aug 7, 2018

Hi Johann,

My mistake was to think of every timestamp as a new sample and every feature as a different measurement (e.g. temperature and pressure). But this is only true if there is also one label/output (or multiple labels/outputs) at each timestamp. **

However, if I want to map one (multivariate) time series to one label/output (or multiple labels/outputs), every timestamp is a feature.

Btw, do you plan to implement WEASEL+MUSE into pyts?

Best regards,
Legout

**That was the case for me in every previous project.

@johannfaouzi
Owner

Hi legout,

Multivariate time series are currently not supported in pyts. Adding specific algorithms for multivariate time series would definitely be a great idea. However, pyts is not under very active development currently and I can't make any promise on a release date with such algorithms.

My on-the-fly thoughts for classification of multivariate time series would be to fit a classifier for each dimension and then use a voting classifier to predict one single label. The issue is that you lose the dependency between the dimensions though. You could also reduce the number of dimensions and use a single classifier, but it may be a bad idea if the time series are really different from each other in different dimensions.
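A rough sketch of the first idea (one classifier per dimension, then a majority vote) might look like the following. Everything here is an assumption made for illustration: the data layout `(n_samples, n_dims, n_timestamps)`, the toy 1-nearest-neighbour classifier standing in for any univariate time series classifier, and the helper names `fit_per_dimension` and `predict_by_vote`. None of it is pyts code.

```python
import numpy as np


class NearestNeighbor1D:
    """Toy 1-nearest-neighbour classifier (Euclidean distance).

    A stand-in for any univariate time series classifier.
    """

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        self.y_ = np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Squared distances between every test and every training series.
        d = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        return self.y_[d.argmin(axis=1)]


def fit_per_dimension(X, y):
    """Fit one classifier per dimension of X (n_samples, n_dims, n_timestamps)."""
    return [NearestNeighbor1D().fit(X[:, d, :], y) for d in range(X.shape[1])]


def predict_by_vote(models, X):
    """Majority vote over per-dimension predictions (ties: smallest label)."""
    votes = np.stack(
        [m.predict(X[:, d, :]) for d, m in enumerate(models)], axis=1
    )
    predictions = []
    for row in votes:
        labels, counts = np.unique(row, return_counts=True)
        predictions.append(labels[counts.argmax()])
    return np.array(predictions)


# Synthetic data: class 1 is shifted upwards in every dimension.
rng = np.random.default_rng(0)
y_train = np.array([0] * 20 + [1] * 20)
X_train = rng.standard_normal((40, 3, 30)) + 3.0 * y_train[:, None, None]
y_test = np.array([0] * 5 + [1] * 5)
X_test = rng.standard_normal((10, 3, 30)) + 3.0 * y_test[:, None, None]

models = fit_per_dimension(X_train, y_train)
y_pred = predict_by_vote(models, X_test)
print("accuracy:", (y_pred == y_test).mean())
```

Because each model only ever sees one dimension, any pattern that exists only in the interaction between dimensions is invisible to the vote, which is the loss of dependency mentioned above.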

Best regards,
Johann

@Sandy4321

It would be great to add support for multivariate time series, like
https://github.com/patrickzib/SFA
WEASEL+MUSE

@johannfaouzi
Owner

Tools for multivariate time series are provided in the pyts.multivariate module.
WEASEL+MUSE is implemented as pyts.multivariate.transformation.WEASELMUSE.

The literature for multivariate time series classification is quite shallow (probably due to the lack of datasets for a very long time). Nonetheless, if you are willing to treat each feature of a multivariate time series independently, you can use the utility classes pyts.multivariate.transformation.MultivariateTransformer and pyts.multivariate.classification.MultivariateClassifier to apply a univariate time series algorithm to each feature of a multivariate time series dataset.

Hope this helps you a little.

@Sandy4321

Really great news!
You are the first to implement multivariate time series classification in Python.
Only one important question:
Does your code support a mixture of categorical and continuous features?

@johannfaouzi
Owner

Do you mean time series with categorical values? I don't think that I have ever seen any algorithm in the time series classification literature that can deal with that. Maybe Markov chains would be more suited for such features.
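As one possible reading of the Markov chain suggestion, the sketch below estimates first-order transition probabilities from a single categorical sequence. The function name `transition_probabilities` and the example sequence are made up for illustration; this is not something pyts provides.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequence):
    """Estimate first-order transition probabilities from one categorical sequence."""
    counts = defaultdict(Counter)
    # Count how often each state is followed by each other state.
    for current, following in zip(sequence, sequence[1:]):
        counts[current][following] += 1
    # Normalise the counts of each state into probabilities.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


colors = ["red", "green", "green", "red", "green", "green", "green"]
probs = transition_probabilities(colors)
print(probs)
```

One could, for instance, estimate one such table per class and assign a new sequence to the class under which it is most likely, but that is only a sketch of an approach, not an established method from the time series classification literature.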

I think a few other Python packages like tslearn and sktime can also deal with multivariate time series.

@Sandy4321

They do not have it, for example:
sktime/sktime#235

What about when the data is a mixture of continuous and categorical variables at each time sample?
For example, the data samples are:
time t1: red , 0.4 , big , low, 234
time t2: green, 0.8, big, high, 12
time t3: green, 0.1, small, low, 34
etc

For example,
https://github.com/alan-turing-institute/sktime/blob/master/examples/03_classification_multivariate.ipynb
has simulated data with only continuous features.

The tslearn team, in
tslearn-team/tslearn#172,
does have an idea:
@Sandy4321 it's kind of a late reply, but is it possible to do some kind of initial preprocessing of your categorical variables? e.g. one hot encoding & apply the standard methods should be okay.
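The one-hot preprocessing step could be sketched as follows for the three example rows above. The helper `one_hot_series` and the choice of which columns are categorical are assumptions for illustration; in practice the category levels would be collected from the whole training set rather than from a single series.

```python
import numpy as np

# One multivariate series: rows are timestamps, columns are variables.
series = [
    ("red", 0.4, "big", "low", 234),
    ("green", 0.8, "big", "high", 12),
    ("green", 0.1, "small", "low", 34),
]
categorical_columns = [0, 2, 3]  # assumption: these columns hold categories


def one_hot_series(rows, categorical_columns):
    """Return (array, names); categorical columns become 0/1 indicator columns."""
    columns, names = [], []
    for j in range(len(rows[0])):
        values = [row[j] for row in rows]
        if j in categorical_columns:
            # One indicator column per level, in sorted order.
            for level in sorted(set(values)):
                columns.append([1.0 if v == level else 0.0 for v in values])
                names.append("col{}={}".format(j, level))
        else:
            columns.append([float(v) for v in values])
            names.append("col{}".format(j))
    return np.array(columns).T, names


encoded, names = one_hot_series(series, categorical_columns)
print(names)
print(encoded)
```

The result is a purely numeric array with one row per timestamp, so each indicator column can then be treated as one more dimension of the multivariate series.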

You can also apply one of the kernel methods & choose an appropriate kernel which can handle the categorical features...I think ARD kernel is one example, but I forget the details. You can see what the popular bayesian hyperparameter opt. packages do in this case

"ARD kernel is one example, but I forget the details."
Do you have any idea what they mean?

https://www.cs.toronto.edu/~duvenaud/cookbook/
Discrete Data
Kernels can be defined over all types of data structures: Text, images, matrices, and even kernels. Coming up with a kernel on a new type of data used to be an easy way to get a NIPS paper.
How to use categorical variables in a Gaussian Process regression
There is a simple way to do GP regression over categorical variables. Simply represent your categorical variable by a one-of-k encoding. This means that if your number ranges from 1 to 5, represent that as 5 different data dimensions, only one of which is on at a time.

Then, simply put a product of SE kernels on those dimensions. This is the same as putting one SE ARD kernel on all of them. The lengthscale hyperparameter will now encode whether, when that coding is active, the rest of the function changes. If you notice that the estimated lengthscales for your categorical variables is short, your model is saying that it's not sharing any information between data of different categories.
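The equivalence described in the quoted passage (a product of per-dimension SE kernels is the same as one SE ARD kernel) can be checked numerically. This is a small NumPy sketch; the function names and the lengthscale values are arbitrary choices for illustration, and no hyperparameters are learned here.

```python
import numpy as np


def one_of_k(values, k):
    """Encode integers in 1..k as rows containing a single 1."""
    encoded = np.zeros((len(values), k))
    encoded[np.arange(len(values)), np.asarray(values) - 1] = 1.0
    return encoded


def se_kernel_1d(a, b, lengthscale):
    """Squared-exponential (SE) kernel on a single dimension."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / lengthscale ** 2)


def se_ard_kernel(A, B, lengthscales):
    """SE kernel with one lengthscale per dimension (ARD)."""
    diff = (A[:, None, :] - B[None, :, :]) / lengthscales
    return np.exp(-0.5 * (diff ** 2).sum(axis=2))


categories = [1, 3, 3, 5, 2]  # a categorical variable ranging from 1 to 5
X = one_of_k(categories, k=5)
lengthscales = np.array([0.5, 1.0, 2.0, 1.5, 0.8])  # arbitrary values

K_ard = se_ard_kernel(X, X, lengthscales)

# Multiply the five one-dimensional SE kernels together.
K_product = np.ones_like(K_ard)
for d in range(X.shape[1]):
    K_product *= se_kernel_1d(X[:, d], X[:, d], lengthscales[d])

print(np.allclose(K_ard, K_product))
```

Two samples in the same category have identical encodings and therefore a kernel value of 1; for two different categories, the value depends only on the lengthscales of the two dimensions that differ.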

There is even code:
https://github.com/Lkxz/categorical-kernels
and a thesis:
https://upcommons.upc.edu/bitstream/handle/2099.1/24508/99930.pdf?sequence=1
or
https://www.researchgate.net/post/What_kernel_functions_can_be_applied_to_categorical_features
However, if you would like to use kernel function for categorical data, I think this package [1] might be helpful. In particular, for categorical data, you could use Aitchison-Aitken kernel [2].
[1] http://socserv.mcmaster.ca/racine/Rjournal.pdf
[2] http://biomet.oxfordjournals.org/content/63/3/413.abstract
or
https://academic.oup.com/biomet/article-abstract/63/3/413/270829
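For reference, the Aitchison-Aitken kernel for a single unordered categorical variable is simple enough to sketch directly. The function name and the parameter names `n_levels` and `lam` are chosen here for illustration and are not taken from the package mentioned above.

```python
def aitchison_aitken(x, y, n_levels, lam):
    """Aitchison-Aitken kernel for one unordered categorical variable.

    Returns 1 - lam if the two values match, lam / (n_levels - 1) otherwise.
    lam must lie in [0, (n_levels - 1) / n_levels].
    """
    if n_levels < 2:
        raise ValueError("n_levels must be at least 2")
    if not 0.0 <= lam <= (n_levels - 1) / n_levels:
        raise ValueError("lam must lie in [0, (n_levels - 1) / n_levels]")
    return 1.0 - lam if x == y else lam / (n_levels - 1)


# Three possible levels (e.g. red, green, blue) with smoothing parameter 0.3.
same = aitchison_aitken("red", "red", n_levels=3, lam=0.3)
different = aitchison_aitken("red", "green", n_levels=3, lam=0.3)
print(same, different)
```

With `lam = 0` the kernel reduces to an exact-match indicator, and at the upper bound `(n_levels - 1) / n_levels` every pair of levels gets the same weight, so the variable is effectively ignored.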

@johannfaouzi
Owner

I'm a bit annoyed by the lack of literature on this topic, but at this time there is no real way to deal with categorical time series in pyts.

I will consider adding a kernel module in a future release. It would contain popular kernels for continuous time series such as GAK, and it would be the opportunity to add kernels for categorical time series.
