Ambiguity between n_samples and n_timestamps #74

Open
NervousEnergy1979 opened this issue Jun 15, 2020 · 5 comments

Comments

@NervousEnergy1979

Hi, thanks for this great library.

I am trying to create recurrence plots for a time series of geometric Brownian motion, but however I set the parameters in RecurrencePlot, I keep getting errors.

import numpy as np
from pyts.image import RecurrencePlot
from sklearn.preprocessing import MinMaxScaler

# create a univariate (one feature) time series
x = np.random.normal(0, 0.01, 10000).cumsum()
x = MinMaxScaler().fit_transform(x.reshape(-1, 1))
print(x.shape)
# Output: (10000, 1)

rp = RecurrencePlot(dimension=20, threshold=1.)
rp.fit_transform(x)
# Raises: ValueError: If 'dimension' is an integer, it must be greater than or equal to 1 and lower than or equal to n_timestamps (got 20).

I have dimension as an integer, it is set to 20, so it is greater than or equal to 1. What I don't understand is how 20 isn't lower than n_timestamps, because I do not fully understand what n_timestamps is.

As far as I understand my data x, it is shaped (10000, 1) which is (n_samples, n_features) i.e. each row is a unique 'sample' ordered chronologically, and it only has one column (one 'feature') as the time series is univariate. In addition, the documentation for RecurrencePlot.fit_transform states that the input X must have shape [n_samples, n_features], which as far as I understand, my data does have that shape.

What am I doing wrong here? What is the difference between n_samples and n_timestamps? Thanks in advance

@johannfaouzi
Owner

Hi,

Thank you for your interest in pyts. The usual input is a set of time series, represented as a 2D-array with shape (n_samples, n_timestamps), that is:

  • the first axis corresponds to the time series;
  • the second axis corresponds to the time.

If I understood correctly, you have a single time series of geometric Brownian motion with 10,000 time points, so the expected shape is (1, 10000). If you had 10 time series of geometric Brownian motion with 10,000 time points each, the expected shape would be (10, 10000). The transformation would be applied to each time series independently. Having several samples (time series) is the most common case, which is why the expected input is always a set of time series.
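As a minimal sketch of the shape difference (NumPy only, reusing the series from the original post; the names X_column, X_row and image_size are illustrative, not part of pyts):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.01, 10000).cumsum()

# Shape from the original post: read as 10000 time series of length 1.
X_column = x.reshape(-1, 1)
# Expected shape: 1 time series with 10000 time points.
X_row = x.reshape(1, -1)

dimension, time_delay = 20, 1
n_samples, n_timestamps = X_row.shape

# The constraint quoted in the error message now holds.
assert 1 <= dimension <= n_timestamps

# Each recurrence plot has one row/column per trajectory.
image_size = n_timestamps - (dimension - 1) * time_delay
print(X_column.shape, X_row.shape, image_size)
```

Passing X_row instead of X_column to RecurrencePlot.fit_transform satisfies the check. Note that with 10,000 time points the resulting image is 9981 × 9981 pixels, so a shorter series may be more practical for a first test.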

Regarding the documentation of the fit_transform method, we use the sklearn.base.TransformerMixin class, which automatically creates a fit_transform method using fit and transform. The downside is that we also inherit the documentation from scikit-learn, in which the standard input is (n_samples, n_features), that is, a dataset of n_samples samples with n_features features. We should probably rewrite the method (or override its documentation) so that it is less misleading for users.

I hope this answers your question!

@NervousEnergy1979
Author

Thanks for the quick reply! Ok, I think I understand now. So the library is equipped to handle multivariate time series of shape (number of time series, number of time stamps), in accordance with your example for 10 GBM time series each of length 10,000. Then the output of RecurrencePlot on such a dataset would have 10 'plots' (10 matrices, one for each time series) and a new number of features for the second axis.

@johannfaouzi
Owner

Yes and no ^^

A dataset of multivariate time series is represented as a 3D array with shape (n_samples, n_features, n_timestamps), where:

  • the first axis corresponds to the time series;
  • the second axis corresponds to the features of a multivariate time series (the 2 coordinates of a GPS position from a sensor for instance);
  • the third axis corresponds to the time.
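To make the three axes concrete, here is a small NumPy sketch with made-up data (the sizes and variable names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features, n_timestamps = 10, 2, 1000

# 10 multivariate time series, each with 2 features (e.g. two GPS
# coordinates) observed at 1000 time points.
X_multi = rng.normal(size=(n_samples, n_features, n_timestamps)).cumsum(axis=-1)

first_series = X_multi[0]     # one multivariate time series: (2, 1000)
first_feature = X_multi[0, 0]  # one feature of that series: (1000,)

# Keeping a single feature gives a univariate dataset of shape
# (n_samples, n_timestamps).
X_uni = X_multi[:, 0, :]
print(X_multi.shape, first_series.shape, first_feature.shape, X_uni.shape)
```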

The library is mainly focused on time series classification. To train an algorithm, you usually need several samples (time series) from each class, which is why we always consider a set of time series and not a single time series. The corresponding axis is always the first axis.

We consider that a multivariate time series is different from several univariate time series because a multivariate time series has a single class, while several univariate time series have a class for each time series. But in the case of recurrence plots, we don't use the classes to perform the transformation.

For multivariate time series, you can have a look at Cross and Joint Recurrence Plots. pyts provides only an implementation of joint recurrence plots: pyts.multivariate.image.JointRecurrencePlot

@jc4000

jc4000 commented Jan 18, 2023

I have a dataset of multivariate time series, and I know that when the values of some of the features exceed a threshold, the series is strongly associated with a specific class. I am trying to incorporate that information into the threshold and percentage arguments of JointRecurrencePlot(), but I do not clearly understand what the distance and the dimension really mean.

@johannfaouzi
Owner

There is a small description (for the univariate case, but the idea is similar in the multivariate case) of a recurrence plot in the user guide (https://pyts.readthedocs.io/en/stable/modules/image.html#recurrence-plot).

The idea of a recurrence plot is to compare trajectories in a time series. A trajectory is defined by:

  • its dimension (i.e., its size, the number of time points),
  • the time delay (i.e., the time gap between two back-to-back points in a trajectory).

If you only want to compare single time points, you just have to set the dimension to 1 (which is the default value).
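A rough NumPy sketch of this idea, assuming Euclidean distance between trajectories (the helper name trajectories is hypothetical and this is not pyts's own implementation):

```python
import numpy as np

def trajectories(x, dimension=1, time_delay=1):
    """Stack the trajectories (x[i], x[i + delay], ...) of a 1D series as rows."""
    n_trajectories = len(x) - (dimension - 1) * time_delay
    # Row i holds the indices i, i + delay, ..., i + (dimension - 1) * delay.
    idx = np.arange(n_trajectories)[:, None] + time_delay * np.arange(dimension)[None, :]
    return x[idx]

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# Two trajectories of three points each, spaced two time steps apart.
T = trajectories(x, dimension=3, time_delay=2)

# Pairwise Euclidean distances between trajectories (before any thresholding).
dist = np.linalg.norm(T[:, None, :] - T[None, :, :], axis=-1)
print(T)
print(dist)
```

With dimension=1, each trajectory is a single time point, so the distance matrix reduces to the absolute differences between values.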

Regarding the threshold used to binarize the image, you can set the value (by providing a float) or you can automatically compute the threshold given a strategy. For instance:

  • If you want to have 20% of black points in the recurrence plot, you would set threshold='point' and percentage=20.
  • If you want to define the threshold as a percentage of the maximum distance (among all the pairwise distances between the trajectories), you would set threshold='distance'. For instance, if the maximum pairwise distance is 80 and you set percentage=25, then the threshold is 20 (25% of 80).
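The two strategies can be sketched in NumPy as follows. This is only an illustration on a random walk with dimension 1, and it assumes that a pair of points is marked when its distance is strictly below the threshold; it is not the library's own code.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50).cumsum()

# Pairwise distances between single time points (dimension = 1).
dist = np.abs(x[:, None] - x[None, :])

# Threshold as a percentage of the maximum pairwise distance.
percentage_distance = 25
threshold_distance = percentage_distance / 100 * dist.max()
rp_distance = dist < threshold_distance

# Threshold chosen so that about 20% of the points are marked.
percentage_point = 20
threshold_point = np.percentile(dist, percentage_point)
rp_point = dist < threshold_point

print(threshold_distance, threshold_point, rp_point.mean())
```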

Finally, a joint recurrence plot is simply the Hadamard (element-wise) product between all the recurrence plots (one recurrence plot for each feature). You can set different values for the threshold and percentage parameters for the different features (by providing lists). This is described in the documentation (https://pyts.readthedocs.io/en/stable/generated/pyts.multivariate.image.JointRecurrencePlot.html#pyts.multivariate.image.JointRecurrencePlot).
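As a sketch of the element-wise product for a single multivariate sample with two features (NumPy only; the helper binary_recurrence_plot and the threshold values are made up for illustration, with dimension 1 and one fixed threshold per feature):

```python
import numpy as np

def binary_recurrence_plot(x, threshold):
    """Binarized recurrence plot of a 1D series, comparing single time points."""
    dist = np.abs(x[:, None] - x[None, :])
    return (dist < threshold).astype(int)

rng = np.random.default_rng(0)
sample = rng.normal(size=(2, 100)).cumsum(axis=1)  # (n_features, n_timestamps)
thresholds = [1.0, 2.0]                            # one threshold per feature

# One recurrence plot per feature, stacked along the first axis.
plots = np.stack([binary_recurrence_plot(feature, t)
                  for feature, t in zip(sample, thresholds)])

# Hadamard product: a point is kept only if it recurs in every feature.
joint = np.prod(plots, axis=0)
print(plots.shape, joint.shape, joint.sum())
```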

Let me know if this is clearer now!

3 participants