SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting
Shuhao Mei1,2,7, Yongchao Long2, Shan Cao3, Xiaobo Han4, Shijia Geng5, Jinbo Sun1,*, Yuxi Zhou2,6,*, Shenda Hong7,*
1Xidian University 2Tianjin University of Technology 3The Second Hospital of Tianjin Medical University 4Chinese PLA General Hospital 5HeartVoice Medical Technology 6Tsinghua University 7Peking University
*Corresponding Author
SpiroLLM is the first multimodal large language model specifically designed to interpret spirogram time-series data, providing diagnostic support for Chronic Obstructive Pulmonary Disease (COPD). By integrating raw spirometry signals with demographic information, SpiroLLM generates comprehensive and clinically relevant diagnostic reports.
If you find SpiroLLM useful in your work, please consider citing our paper:
@misc{mei2025spirollmfinetuningpretrainedllms,
  title={SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting},
  author={Shuhao Mei and Yongchao Long and Shan Cao and Xiaobo Han and Shijia Geng and Jinbo Sun and Yuxi Zhou and Shenda Hong},
  year={2025},
  eprint={2507.16145},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2507.16145},
}
First, create and activate a Conda virtual environment, then install the required dependencies.
# Create and activate the environment
conda create -n SpiroLLM python=3.11 -y
conda activate SpiroLLM
# Install all dependencies
pip install -r requirements.txt

Run the provided script to automatically download the example spirometry data from the UK Biobank website. The data will be saved to the `data/` directory.
python generate_ukbb_demo_data.py

Once the environment is set up and the data is downloaded, run the main inference script with the patient's information.
python main.py \
--csv_path ./data/example.csv \
--age 69 \
--sex Male \
--height_cm 176.0 \
--is_smoker

The generated report will be printed to the console and saved to the output file specified in your `config.yaml`.
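To run inference for several patients, the command above can be assembled programmatically. The following is a minimal sketch that uses only the flags documented in this README; the `build_command` helper and the keys of the patient dictionary are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_command(patient, script="main.py"):
    """Build the argument list for one call to main.py from a patient dict.

    The dict keys used here are an assumption made for this example.
    """
    cmd = [
        sys.executable, script,
        "--csv_path", str(patient["csv_path"]),
        "--age", str(patient["age"]),
        "--sex", patient["sex"],
        "--height_cm", str(patient["height_cm"]),
    ]
    # --is_smoker is a bare flag, so it is appended only for smokers.
    if patient.get("is_smoker", False):
        cmd.append("--is_smoker")
    return cmd


example = {
    "csv_path": "./data/example.csv",
    "age": 69,
    "sex": "Male",
    "height_cm": 176.0,
    "is_smoker": True,
}
command = build_command(example)
print(" ".join(command[1:]))

# To actually launch inference (requires the environment and data set up above):
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when a file path contains spaces.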
- Python: 3.11
- PyTorch: >= 2.0
- GPU: A CUDA-enabled GPU with at least 16 GB of VRAM is required for the model to run properly.
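These requirements can be checked before running inference. The following is a minimal sketch, not part of the repository: the `check_environment` helper is hypothetical, it inspects only the first CUDA device, and it measures VRAM in decimal gigabytes (an assumption, since cards sold as 16 GB usually report slightly less than 16 GiB).

```python
import sys

MIN_VRAM_GB = 16  # threshold taken from the requirements list above


def check_environment():
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []

    # The README pins Python 3.11.
    if sys.version_info[:2] != (3, 11):
        problems.append(
            f"Python 3.11 expected, found {sys.version_info.major}.{sys.version_info.minor}"
        )

    try:
        import torch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems

    # Only the major version is compared: the README asks for PyTorch >= 2.0.
    major = int(torch.__version__.split(".")[0])
    if major < 2:
        problems.append(f"PyTorch >= 2.0 expected, found {torch.__version__}")

    if not torch.cuda.is_available():
        problems.append("No CUDA-enabled GPU detected")
    else:
        # total_memory is reported in bytes; convert to decimal gigabytes.
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        if total_gb < MIN_VRAM_GB:
            problems.append(
                f"GPU has {total_gb:.1f} GB of VRAM, at least {MIN_VRAM_GB} GB expected"
            )

    return problems


for problem in check_environment():
    print("WARNING:", problem)
```

The function collects warnings instead of raising, so every unmet requirement is reported in a single run.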
The `main.py` script is the primary entry point for running inference. It accepts the following command-line arguments:
| Argument | Type | Description | Required |
|---|---|---|---|
| `--csv_path` | `str` | Path to the patient's raw spirometry data file. | Yes |
| `--age` | `int` | The age of the patient in years. | Yes |
| `--sex` | `str` | The sex of the patient (`Male` or `Female`). | Yes |
| `--height_cm` | `float` | The height of the patient in centimeters. | Yes |
| `--is_smoker` | flag | Include this flag if the patient is a smoker. | No |
| `--ethnicity` | `str` | Patient's ethnicity. Defaults to `Caucasian`. | No |
| `--config` | `str` | Path to the configuration YAML file. | No |
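
The arguments above map naturally onto Python's standard `argparse` module. The sketch below illustrates such a parser; it is not the actual contents of `main.py`. The `build_parser` name is hypothetical, restricting `--sex` to two choices is an assumption based on the description, and `--config` is given no default because none is stated here.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the documented arguments (not the real main.py)."""
    parser = argparse.ArgumentParser(description="SpiroLLM inference arguments (sketch)")
    parser.add_argument("--csv_path", type=str, required=True,
                        help="Path to the patient's raw spirometry data file.")
    parser.add_argument("--age", type=int, required=True,
                        help="The age of the patient in years.")
    # Assumption: only the two documented values are accepted.
    parser.add_argument("--sex", type=str, required=True, choices=["Male", "Female"],
                        help="The sex of the patient.")
    parser.add_argument("--height_cm", type=float, required=True,
                        help="The height of the patient in centimeters.")
    # A flag: present means True, absent means False.
    parser.add_argument("--is_smoker", action="store_true",
                        help="Include this flag if the patient is a smoker.")
    parser.add_argument("--ethnicity", type=str, default="Caucasian",
                        help="Patient's ethnicity.")
    # Assumption: no default path is documented, so None is used here.
    parser.add_argument("--config", type=str, default=None,
                        help="Path to the configuration YAML file.")
    return parser


args = build_parser().parse_args([
    "--csv_path", "./data/example.csv",
    "--age", "69",
    "--sex", "Male",
    "--height_cm", "176.0",
    "--is_smoker",
])
print(args.age, args.sex, args.height_cm, args.is_smoker, args.ethnicity)
```

Omitting a required argument makes `argparse` exit with a usage message, which matches the Required column.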
The data used in this project is sourced from the UK Biobank, a large-scale biomedical database and research resource. Access to the data is available to approved researchers upon application. For more information, please visit the UK Biobank website.
The DeepSpiro feature extractor, a key component of this project, is based on our prior work published in npj Systems Biology and Applications:
Mei S, Li X, Zhou Y, et al. Deep learning for detecting and early predicting chronic obstructive pulmonary disease from spirogram time series[J]. npj Systems Biology and Applications, 2025, 11(1): 18.
The original implementation is available at the COPD-Early-Prediction GitHub repository.
This project is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/ or see the LICENSE file.
