
# InstructTime

Advancing Time Series Classification with Multimodal Language Modeling

ACM WSDM 2025



🙋 Please let us know if you find a mistake or have any suggestions!

🌟 If you find this resource helpful, please consider starring this repository and citing our research.

## Citation

```bibtex
@inproceedings{cheng2025instructtime,
  title={Instructime: Advancing time series classification with multimodal language modeling},
  author={Cheng, Mingyue and Chen, Yiheng and Liu, Qi and Liu, Zhiding and Luo, Yucong and Chen, Enhong},
  booktitle={Proceedings of the Eighteenth ACM International Conference on Web Search and Data Mining},
  pages={792--800},
  year={2025}
}
```

## Overview

*Figure: InstructTime architecture.*

InstructTime is a multimodal language model for time series classification that bridges the gap between time series data and natural language understanding.
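The end-to-end idea can be sketched as follows — an illustrative toy, not the repository's actual API: `quantize` and `build_prompt` below are invented stand-ins for the TStokenizer and the prompt builder, and the hash-based "codebook" merely mimics discrete token ids.

```python
# Illustrative sketch of the InstructTime pipeline (NOT the repo's API):
# a time series is discretized into codebook token ids, spliced into a
# text prompt between <BET>/<EET> markers, and the language model then
# completes the classification in natural language.

def quantize(series, n_embed=256, wave_length=25):
    """Toy stand-in for the VQ-VAE tokenizer: one code id per patch."""
    patches = [tuple(series[i:i + wave_length])
               for i in range(0, len(series) - wave_length + 1, wave_length)]
    return [hash(p) % n_embed for p in patches]

def build_prompt(series):
    """Splice the discrete time-series tokens into a text instruction."""
    ts_span = "<BET>" + "".join(f"<TS{c}>" for c in quantize(series)) + "<EET>"
    return ("You will be receiving electroencephalogram(EEG) related signals.\n"
            f"Electroencephalogram signals: {ts_span}\n"
            "Select one of the sleep patterns and report on the person's sleep.")

prompt = build_prompt([0.0] * 100)  # 100 samples, patch size 25 -> 4 tokens
```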

## Resources

| Resource | Link |
|---|---|
| 🤗 Dataset | zhjai/InstructTime |
| 🤗 Base Model | openai-community/gpt2 |
| 📄 Paper | ACM Digital Library |

## Dataset Name Mapping

The following table maps the dataset names used in the code to their corresponding domains:

| Code Name | Domain | Description |
|---|---|---|
| sleep | EEG | Electroencephalogram (sleep stage) |
| geo / ecg | ECG | Electrocardiogram |
| dev | FD | Fault Detection (industrial equipment) |
| har | HAR | Human Activity Recognition |
| whale | RWC | Real World Computing (whale sound) |
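For scripting convenience, the same mapping as a Python dict (names taken verbatim from the table above):

```python
# --dataset code names mapped to their domains, exactly as listed above.
DATASET_DOMAIN = {
    "sleep": "EEG",  # Electroencephalogram (sleep stage)
    "geo":   "ECG",  # Electrocardiogram
    "ecg":   "ECG",
    "dev":   "FD",   # Fault Detection (industrial equipment)
    "har":   "HAR",  # Human Activity Recognition
    "whale": "RWC",  # Real World Computing (whale sound)
}

print(DATASET_DOMAIN["geo"])  # ECG
```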

## Installation

### Requirements

- Python 3.9+
- PyTorch 2.1+
- CUDA-capable GPU (recommended)

### Setup

```shell
# Clone the repository
git clone https://github.com/your-repo/InstructTime.git
cd InstructTime

# Install dependencies
pip install -r requirements.txt

# Download the GPT-2 base model from Hugging Face (required)
# Option 1: using huggingface-cli
huggingface-cli download openai-community/gpt2 --local-dir ./gpt2

# Option 2: using git lfs
git lfs install
git clone https://huggingface.co/openai-community/gpt2 ./gpt2
```

## Usage

### Step 1: Train the TStokenizer (Time Series Tokenizer)

First, train the VQ-VAE-based time series tokenizer for each domain.

Each dataset uses its own tokenizer hyperparameters (`d_model`, `n_embed`, `wave_length`):

| Dataset | d_model | n_embed | wave_length |
|---|---|---|---|
| ECG (geo) | 64 | 128 | 40 |
| EEG (sleep) | 64 | 256 | 25 |
| FD (dev) | 64 | 512 | 40 |
| HAR | 64 | 256 | 1 |
| RWC (whale) | 64 | 384 | 32 |

```shell
cd TStokenizer

# Train tokenizer for HAR dataset
python main.py \
    --save_path ../vqvae/HAR \
    --dataset har \
    --data_path ../datasets/HAR \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 256 \
    --wave_length 1

# Train tokenizer for EEG (sleep) dataset
python main.py \
    --save_path ../vqvae/EEG \
    --dataset sleep \
    --data_path ../datasets/EEG \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 256 \
    --wave_length 25

# Train tokenizer for ECG (geo) dataset
python main.py \
    --save_path ../vqvae/ECG \
    --dataset geo \
    --data_path ../datasets/ECG \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 128 \
    --wave_length 40

# Train tokenizer for FD (dev) dataset
python main.py \
    --save_path ../vqvae/FD \
    --dataset dev \
    --data_path ../datasets/FD \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 512 \
    --wave_length 40

# Train tokenizer for RWC (whale) dataset
python main.py \
    --save_path ../vqvae/RWC \
    --dataset whale \
    --data_path ../datasets/RWC \
    --device cuda:0 \
    --d_model 64 \
    --n_embed 384 \
    --wave_length 32
```
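As a rule of thumb — an inference from the hyperparameters above, not a documented guarantee — `wave_length` acts as the patch size, so a series of length `L` yields on the order of `L // wave_length` discrete tokens drawn from a codebook of `n_embed` entries:

```python
# Rough token-count estimate, assuming wave_length is the patch size
# used by the tokenizer (an assumption, not stated by the repo).
def approx_num_tokens(series_len: int, wave_length: int) -> int:
    return series_len // wave_length

# e.g. a hypothetical 3000-sample recording with the EEG setting:
print(approx_num_tokens(3000, 25))  # 120
```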

### Step 2: Train InstructTime

#### Universal Training (Train from Scratch)

```shell
cd ..  # Back to project root

# Single-domain training (e.g., HAR)
python run_truth_loss.py \
    --dataset har \
    --model_path ./gptmodel \
    --data_root ./datasets \
    --vqvae_root ./vqvae \
    --device cuda:0 \
    --epochs 15 \
    --batch_size 16 \
    --lr 5e-5

# Multi-domain training
python run_truth_loss.py \
    --dataset mix \
    --model_path ./gptmodel \
    --data_root ./datasets \
    --vqvae_root ./vqvae \
    --device cuda:0 \
    --epochs 15 \
    --batch_size 16 \
    --lr 5e-5
```

#### Adaptation Training (Fine-tune from a Pretrained Model)

```shell
python run_truth_loss.py \
    --dataset har \
    --model_path ./gptmodel/har \
    --load_model_path ./gptmodel/no_frozen/run_0/best_model \
    --data_root ./datasets \
    --vqvae_root ./vqvae \
    --device cuda:0 \
    --lr 1e-5 \
    --adapt
```

### Prompt Example

```text
You will be receiving electroencephalogram(EEG) related signals.
Electroencephalogram signals: <BET><TS Tokens><EET>
The sleep patterns include waking up, rapid eye movement sleep, and sleep stages one through four, as well as periods of movement and unidentified stages.
Select one of the eight previously mentioned sleep patterns and report on the person's sleep using the provided information.
The person's sleep pattern is waking up
```
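Since the model answers in free text, one plausible way to recover the predicted class is substring matching against the known label set. The sketch below is a hypothetical post-processing step (the repository's `metrics.py` may do this differently), and `SLEEP_LABELS` is abridged to a few of the eight patterns named in the prompt:

```python
# Match the longest label first so that overlapping label names
# (e.g. "movement" inside "rapid eye movement sleep") cannot shadow
# each other; returns None when no known label appears in the output.
SLEEP_LABELS = [
    "waking up",
    "rapid eye movement sleep",
    "movement",
]

def extract_label(generated: str, labels=SLEEP_LABELS):
    for lab in sorted(labels, key=len, reverse=True):
        if lab in generated.lower():
            return lab
    return None

print(extract_label("The person's sleep pattern is waking up"))  # waking up
```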

## Project Structure

```text
InstructTime/
├── TStokenizer/          # Time Series Tokenizer (VQ-VAE)
│   ├── main.py           # Tokenizer training script
│   ├── model.py          # VQ-VAE model
│   └── ...
├── datasets/             # Dataset directory
├── vqvae/                # Trained tokenizer checkpoints
├── gpt2/                 # GPT-2 base model
├── run_truth_loss.py     # Main training script
├── multidataset.py       # Dataset processing
├── multimodel.py         # Model definition
├── args.py               # Argument parser
├── metrics.py            # Evaluation metrics
└── requirements.txt      # Dependencies
```

## License

This project is for research purposes only.
