Functional-Data-Clustering/Functional-Data

Functional Data for Cluster Analysis

This repository lists open-access functional datasets from different fields of application. We collect only data that can be used for cluster analysis. The main objective is to make it easier to compare against existing clustering methods for functional data and to evaluate new ones. A recent comprehensive review of clustering methods for functional data is available here. Our team is actively developing functional data clustering methods tailored to various data types and application domains; the software tools we have developed can be accessed here.

For datasets where the linked data needs further processing, a copy is available in the Data folder. (This is an ongoing project and progresses slowly because of the contributor's other commitments.)

One-dimensional Functional Data

| Name | Available at | Field | Task | Size | Length | Missing Values |
|---|---|---|---|---|---|---|
| ARC_Mobile | Publisher | Health | Clustering | 125 | 30/40 | Yes |
| ArrowHead | UEA & UCR Time Series Classification Repository | Computer Vision | Classification | 211 | 251 | No |
| BirdChicken | UEA & UCR Time Series Classification Repository | Computer Vision | Classification | 40 | 512 | No |
| BTH_PM25 | Publisher | Environment | Clustering | 73 | 48 | Yes |
| China_PM25 | Publisher | Environment | Clustering | 338 | 731 | Yes |
| DiatomSizeReduction | UEA & UCR Time Series Classification Repository | Bioinformatics | Classification | 322 | 345 | No |
| ECG200 | UEA & UCR Time Series Classification Repository | ECG | Classification | 200 | 96 | No |
| FaceFour | UEA & UCR Time Series Classification Repository | Computer Vision | Classification | 112 | 350 | No |
| Flour | R (cfda) | Food | Classification | 115 | 241 | No |
| Fungi | UEA & UCR ... | Bioinformatics | Classification | 204 | 201 | No |
| GunPoint | UEA & UCR Time Series Classification Repository | Motion | Classification | 200 | 150 | No |
| Meat | UEA & UCR Time Series Classification Repository | Food | Classification | 120 | 448 | No |
| Plane | UEA & UCR ... | Shape | Classification | 210 | 144 | No |
| Phoneme | e-Book (ElemStatLearn) | Speech | Classification | 4K+ | 256 | No |
| Strawberry | UEA & UCR Time Series Classification Repository | Food | Classification | 983 | 235 | No |
| Symbols | UEA & UCR Time Series Classification Repository | Computer Vision | Classification | 1K+ | 398 | No |
| Tecator | CMU StatLib | Food | Classification | 240 | 100 | No |

Multi-dimensional Functional Data

| Name | Available at | Field | Task | Size | Length | Dimension |
|---|---|---|---|---|---|---|
| BasicMotions | UEA & UCR Time Series Classification Repository | Motion | Classification | 80 | 100 | 6 |
| Blink | UEA & UCR ... | EEG | Classification | 950 | 510 | 4 |
| ECG_Arrhythmia | Publisher | ECG | Classification | 10K+ | 5000 | 12 |
| EEG_Full | UCI Machine Learning Repository | EEG | Classification | 122 | 256 | 64 |
| Epilepsy | UEA & UCR ... | Motion | Classification | 275 | 207 | 3 |
| ERing | UEA & UCR ... | Gesture | Classification | 300 | 65 | 4 |
| EyesOpenShut | UEA & UCR ... | EEG | Classification | 98 | 128 | 14 |
| FingerMovements | UEA & UCR ... | EEG | Classification | 416 | 50 | 28 |
| Japanese_Vowels | UCI Machine Learning Repository | Speech | Classification | 640 | 29 | 12 |
| UWaveGestureLibrary | UEA & UCR ... | Gesture | Classification | 4K+ | 315 | 3 |

Manifold-valued Functional Data

We provide a Python generator for manifold-valued functional data. It can simulate five families of trajectories:

  • Hypersphere (unit sphere trajectories)
  • Hyperbolic (Poincaré ball model)
  • Swiss roll (Swiss-roll curves, up to 3D)
  • Lorenz (chaotic attractor, up to 3D)
  • Pendulum (simple pendulum dynamics, up to 3D)

Each dataset is a collection of multi-dimensional functions that evolve along a specified manifold or dynamical system. The generator script lives in the Data/Manifold/ directory as manifold_valued_data_generator.py. You can import it or run it directly. This generator was used in our NeurIPS work to evaluate FAEclust.
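
To make the idea of a function evolving along a dynamical system concrete, here is a minimal sketch that integrates a single Lorenz trajectory with a fixed-step fourth-order Runge-Kutta scheme. It is not the code in manifold_valued_data_generator.py: the function names `lorenz_rhs` and `lorenz_trajectory`, the step size, and the starting point are assumptions made for illustration. Only the classic Lorenz parameters (sigma = 10, rho = 28, beta = 8/3) are standard.

```python
import numpy as np

def lorenz_rhs(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Right-hand side of the Lorenz system with the classic parameters."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def lorenz_trajectory(n_steps=500, dt=0.01, start=(1.0, 1.0, 1.0)):
    """Hypothetical helper: one trajectory of shape (3, n_steps)."""
    traj = np.empty((3, n_steps))
    state = np.asarray(start, dtype=float)
    for i in range(n_steps):
        traj[:, i] = state  # store the current point, then advance one step
        k1 = lorenz_rhs(state)
        k2 = lorenz_rhs(state + 0.5 * dt * k1)
        k3 = lorenz_rhs(state + 0.5 * dt * k2)
        k4 = lorenz_rhs(state + dt * k3)
        state = state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)
    return traj

traj = lorenz_trajectory()
```

Each column of `traj` is one time point, so a single trajectory already has the [feature, time] layout of one sample.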

Outputs & Shapes

  • X.shape = (n_samples, n_features, n_steps): multivariate time series laid out as [sample, feature, time].
  • y.shape = (n_samples,): integer labels (0 … n_clusters-1) for cluster/dynamics identity.
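
Many clustering libraries expect a different axis order, so it can help to see how the [sample, feature, time] layout is rearranged. The sketch below uses a random placeholder array in place of real generator output; the variable names other than `X` and `y` are illustrative.

```python
import numpy as np

# Placeholder arrays with the documented shapes (not real generator output).
rng = np.random.default_rng(0)
n_samples, n_features, n_steps, n_clusters = 6, 3, 50, 2
X = rng.standard_normal((n_samples, n_features, n_steps))
y = rng.integers(0, n_clusters, size=n_samples)

# Swap the last two axes to get [sample, time, feature].
X_time_major = np.transpose(X, (0, 2, 1))

# Flatten each function into one row, e.g. for a vector-based baseline.
X_flat = X.reshape(n_samples, -1)

# Extract the first coordinate function of every sample: shape (n_samples, n_steps).
first_coordinate = X[:, 0, :]
```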

Key Parameters

  • n_samples: number of time series (functions) to generate.
  • n_features: dimensionality per time step (e.g., 2D, 3D coordinates).
  • n_steps: length (number of time points) in each trajectory.
  • n_clusters: number of distinct clusters/dynamics per dataset.
  • base_noise (optional): small perturbations; useful for realism.
  • seed (optional): random seed for reproducibility.
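
As a rough illustration of how these parameters interact, the following toy generator produces noisy great-circle curves on the unit hypersphere, with the cluster label controlling the angular speed. It is a sketch under stated assumptions, not the repository's generator: the function name `toy_sphere_dataset`, the great-circle construction, and the way `base_noise` is applied are all hypothetical.

```python
import numpy as np

def toy_sphere_dataset(n_samples=30, n_features=3, n_steps=100,
                       n_clusters=3, base_noise=0.01, seed=None):
    """Hypothetical toy generator; requires n_features >= 2."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, n_clusters, size=n_samples)
    t = np.linspace(0.0, 1.0, n_steps)
    X = np.empty((n_samples, n_features, n_steps))
    for i, label in enumerate(y):
        # Two random orthonormal directions span the plane of a great circle.
        q, _ = np.linalg.qr(rng.standard_normal((n_features, 2)))
        # Assumed cluster rule: cluster k completes k + 1 revolutions.
        angle = 2.0 * np.pi * (label + 1) * t
        curve = np.outer(q[:, 0], np.cos(angle)) + np.outer(q[:, 1], np.sin(angle))
        # Perturb, then project back onto the unit sphere.
        curve = curve + base_noise * rng.standard_normal(curve.shape)
        X[i] = curve / np.linalg.norm(curve, axis=0, keepdims=True)
    return X, y

X, y = toy_sphere_dataset(n_samples=10, n_features=4, n_steps=60, n_clusters=2, seed=42)
```

Passing the same `seed` twice yields identical arrays, which is what makes benchmark comparisons repeatable.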

To change the size of a dataset (e.g., to generate more functions), edit the corresponding tuple in specs; no other code changes are needed.
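
The snippet below sketches what such an edit might look like. The actual structure of `specs` is not shown here, so the dictionary keys, the tuple order (n_samples, n_features, n_steps, n_clusters), and all numbers are assumptions for illustration only; check manifold_valued_data_generator.py for the real layout before editing.

```python
# Hypothetical layout: dataset name -> (n_samples, n_features, n_steps, n_clusters).
# The real `specs` in manifold_valued_data_generator.py may differ.
specs = {
    "hypersphere": (200, 3, 100, 4),
    "lorenz": (200, 3, 100, 4),
}

# Double the number of functions in one dataset, leaving the other fields unchanged.
name = "lorenz"
n_samples, n_features, n_steps, n_clusters = specs[name]
specs[name] = (2 * n_samples, n_features, n_steps, n_clusters)
```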