PyTorch data sets for supervised time series classification and prediction problems, including:
- All UEA/UCR classification repository data sets
- PhysioNet Challenge 2012 (in-hospital mortality)
- PhysioNet Challenge 2019 (sepsis prediction)
- A binary prediction variant of the 2019 PhysioNet Challenge
- Saves time. You don't have to write your own PyTorch data classes.
- Better research. Use common, reproducible implementations of data sets for a level playing field when evaluating models.
Install PyTorch followed by `torchtime`:

$ pip install torchtime

or

$ conda install torchtime -c conda-forge

There is currently no Windows build for conda. Feedback is welcome from conda users in particular.
Data classes have a common API. The `split` argument determines whether training ("train"), validation ("val") or test ("test") data are returned. The size of the splits is controlled with the `train_prop` and (optional) `val_prop` arguments.
Three PhysioNet data sets are currently supported:

- `torchtime.data.PhysioNet2012` returns the 2012 challenge (in-hospital mortality) [link].
- `torchtime.data.PhysioNet2019` returns the 2019 challenge (sepsis prediction) [link].
- `torchtime.data.PhysioNet2019Binary` returns a binary prediction variant of the 2019 challenge.
For example, to load training data for the 2012 challenge with a 70/30% training/validation split and create a DataLoader for model training:
from torch.utils.data import DataLoader
from torchtime.data import PhysioNet2012
physionet2012 = PhysioNet2012(
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(physionet2012, batch_size=32)
The `torchtime.data.UEA` class returns the UEA/UCR repository data set specified by the `dataset` argument, for example:
from torch.utils.data import DataLoader
from torchtime.data import UEA
arrowhead = UEA(
    dataset="ArrowHead",
    split="train",
    train_prop=0.7,
)
dataloader = DataLoader(arrowhead, batch_size=32)
Batches are dictionaries of tensors `X`, `y` and `length`:

- `X` are the time series data. The package follows the batch-first convention, therefore `X` has shape (n, s, c), where n is the batch size, s is the (longest) trajectory length and c is the number of channels. By default, the first channel is a time stamp.
- `y` are one-hot encoded labels of shape (n, l), where l is the number of classes.
- `length` are the lengths of each trajectory (before padding if sequences are of irregular length), i.e. a tensor of shape (n).
For example, ArrowHead is a univariate time series, therefore `X` has two channels: the time stamp followed by the time series itself (c = 2). Each series has 251 observations (s = 251) and there are three classes (l = 3). For a batch size of 32:
next_batch = next(iter(dataloader))
next_batch["X"].shape # torch.Size([32, 251, 2])
next_batch["y"].shape # torch.Size([32, 3])
next_batch["length"].shape # torch.Size([32])
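The `length` tensor is what you need to handle padding downstream, for instance with `torch.nn.utils.rnn.pack_padded_sequence`. A minimal sketch using dummy data in place of a real batch (the shapes and values here are illustrative, not taken from a torchtime data set):

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

# Dummy padded batch: n = 4 sequences, s = 10 steps, c = 2 channels,
# sorted by decreasing length as pack_padded_sequence expects by default.
X = torch.randn(4, 10, 2)
length = torch.tensor([10, 7, 5, 3])

# Zero out the padded positions, mimicking a batch of irregular series.
for i, l in enumerate(length):
    X[i, l:] = 0.0

# Pack the batch so a recurrent model skips the padding entirely.
packed = pack_padded_sequence(X, length, batch_first=True, enforce_sorted=True)

# The packed object stores only the 10 + 7 + 5 + 3 = 25 real observations.
print(packed.data.shape)  # torch.Size([25, 2])
```

Note `batch_first=True` matches the package's batch-first convention for `X`.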
See Using DataLoaders for more information.
- Missing data can be imputed by setting `impute` to "mean" (replace with training data channel means) or "forward" (replace with the previous observation). Alternatively, a custom imputation function can be passed to the `impute` argument.
- A time stamp (added by default), missing data mask and the time since the previous observation can be appended with the boolean arguments `time`, `mask` and `delta` respectively.
- Time series data are standardised using the `standardise` boolean argument.
- The location of cached data can be changed with the `path` argument, for example to share a single cache location across projects.
- For reproducibility, an optional random `seed` can be specified.
- Missing data can be simulated using the `missing` argument to drop data at random from UEA/UCR data sets.
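To illustrate the idea behind the "mean" and "forward" imputation options, here is a hand-rolled sketch on a toy series with NaNs marking missing values. This is not torchtime's internal implementation (in particular, torchtime uses *training data* channel means, whereas this toy uses the series' own mean):

```python
import torch

# Toy univariate series with NaNs marking missing observations.
x = torch.tensor([1.0, float("nan"), float("nan"), 4.0, float("nan")])

# "mean" imputation: replace missing values with the channel mean.
mean_filled = torch.where(torch.isnan(x), x[~torch.isnan(x)].mean(), x)
print(mean_filled)  # tensor([1.0000, 2.5000, 2.5000, 4.0000, 2.5000])

# "forward" imputation: carry the previous observation forward.
forward_filled = x.clone()
for t in range(1, len(forward_filled)):
    if torch.isnan(forward_filled[t]):
        forward_filled[t] = forward_filled[t - 1]
print(forward_filled)  # tensor([1., 1., 1., 4., 4.])
```

Note that forward imputation leaves a leading missing value unfilled, which is one reason a custom imputation function can be useful.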
See the tutorials and API for more information.
If you're looking for the TensorFlow equivalent for PhysioNet data sets, try medical_ts_datasets.
`torchtime` uses some of the data processing ideas in Kidger et al, 2020 [1] and Che et al, 2018 [2].
This work is supported by the Engineering and Physical Sciences Research Council, Centre for Doctoral Training in Cloud Computing for Big Data, Newcastle University (grant number EP/L015358/1).
If you use this software, please cite the paper:
@software{darke_torchtime_2022,
  author = {Darke, Philip and Missier, Paolo and Bacardit, Jaume},
  title = {Benchmark time series data sets for {PyTorch} - the torchtime package},
  month = {July},
  year = {2022},
  publisher = {arXiv},
  doi = {10.48550/arXiv.2207.12503},
  url = {https://doi.org/10.48550/arXiv.2207.12503},
}
DOIs are also available for each version of the package here.
1. Kidger, P, Morrill, J, Foster, J, et al. Neural Controlled Differential Equations for Irregular Time Series. arXiv 2005.08926 (2020). [arXiv]
2. Che, Z, Purushotham, S, Cho, K, et al. Recurrent Neural Networks for Multivariate Time Series with Missing Values. Sci Rep 8, 6085 (2018). [doi]
3. Silva, I, Moody, G, Scott, DJ, et al. Predicting In-Hospital Mortality of ICU Patients: The PhysioNet/Computing in Cardiology Challenge 2012. Comput Cardiol 2012;39:245-248 (2010). [hdl]
4. Reyna, M, Josef, C, Jeter, R, et al. Early Prediction of Sepsis From Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019. Critical Care Medicine 48 2: 210-217 (2019). [doi]
5. Reyna, M, Josef, C, Jeter, R, et al. Early Prediction of Sepsis from Clinical Data: The PhysioNet/Computing in Cardiology Challenge 2019 (version 1.0.0). PhysioNet (2019). [doi]
6. Goldberger, A, Amaral, L, Glass, L, et al. PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation 101 (23), pp. e215–e220 (2000). [doi]
7. Löning, M, Bagnall, A, Ganesh, S, et al. sktime: A Unified Interface for Machine Learning with Time Series. Workshop on Systems for ML at NeurIPS 2019 (2019). [doi]
8. Löning, M, Bagnall, A, Middlehurst, M, et al. alan-turing-institute/sktime: v0.10.1 (v0.10.1). Zenodo (2022). [doi]
Released under the MIT license.