PULSAR: Pre-training with Extracted Gap Healthcare Terms for Summarizing Patients’ Problems and Data Augmentation with Black-box Large Language Models (BioNLP@ACL2023)
Code repository for PULASR, including
- Pre-training T5 with the generation of the gap healthcare text spans as the pre-training objective
- Fine-tuning the pre-trained model with data augmentation on the downstream summarization tasks, i.e., BioNLP Workshop 2023 Shared Task 1A: Problem List Summarization and ImageCLEF MEDIQA-Sum-2023 Subtask B and C
- PULSAR: Pre-training with Extracted Gap Healthcare Terms for Summarizing Patients’ Problems and Data Augmentation with Black-box Large Language Models
- Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records
See pre-train/README.md
for details.
See data-augmentation/README.md
for details.
See aug-openai/README.md
for details.
See fine-tune/README.md
for details.
See stanford_alpaca/README.md
for details, e.g., requirements.
# train
CUDA_LAUNCH_BLOCKING=1 python stanford_alpaca/train.py --model_name_or_path PATH/TO/llama/13B_convert --data_path PATH/TO/CLEF_TaskB/trainset.jsonl --bf16 True --output_dir output_llama --num_train_epochs 3 --per_device_train_batch_size 8 --per_device_eval_batch_size 4 --gradient_accumulation_steps 1 --evaluation_strategy "no" --save_strategy "steps" --save_steps 2000 --save_total_limit 1 --learning_rate 3e-4 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --tf32 True
# inference
CUDA_LAUNCH_BLOCKING=1 python fine-tune/inference_llama.py --base_model PATH/TO/llama/13B_convert --adapter_path output_llama --test_dataset PATH/TO/CLEF_TaskB/taskB_testset4participants_inputHeadersAndConversations.csv --is_causal --load_in_8bit --max_new_tokens 240
If you make use of the code in this repository or of any follow up works, please cite the following paper(s):
@article{li2023pulsar,
title={PULSAR: Pre-training with Extracted Healthcare Terms for Summarising Patients' Problems and Data Augmentation with Black-box Large Language Models},
author={Li, Hao and Wu, Yuping and Schlegel, Viktor and Batista-Navarro, Riza and Nguyen, Thanh-Tung and Kashyap, Abhinav Ramesh and Zeng, Xiaojun and Beck, Daniel and Winkler, Stefan and Nenadic, Goran},
journal={arXiv preprint arXiv:2306.02754},
year={2023}
}
@article{schlegel2023pulsar,
title={PULSAR at MEDIQA-Sum 2023: Large Language Models Augmented by Synthetic Dialogue Convert Patient Dialogues to Medical Records},
author={Schlegel, Viktor and Li, Hao and Wu, Yuping and Subramanian, Anand and Nguyen, Thanh-Tung and Kashyap, Abhinav Ramesh and Beck, Daniel and Zeng, Xiaojun and Batista-Navarro, Riza Theresa and Winkler, Stefan and others},
journal={arXiv preprint arXiv:2307.02006},
year={2023}
}