This is the repository for the paper "LLaSA: Large Language and Structured Data Assistant".
In this documentation, we detail how to construct the pretraining datasets and how to train the LLaSA model.
Requirements:
- Python 3.10
- Linux
- CUDA 12.4 support

Install the dependencies with:

pip install -r requirements.txt
If you encounter any issues while installing torch-geometric, please refer to the torch-geometric documentation for manual installation.
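A manual install via PyG's prebuilt wheel index often resolves build failures. This is a generic sketch, not this repo's tested procedure: the `torch-2.4.0+cu124` tag is an assumption matching the CUDA 12.4 requirement above, so adjust it to your installed torch and CUDA versions.

```shell
# Install torch-geometric plus its compiled extensions from the official
# prebuilt wheel index (replace torch-2.4.0+cu124 with your torch/CUDA combo).
pip install torch_geometric
pip install torch_scatter torch_sparse \
    -f https://data.pyg.org/whl/torch-2.4.0+cu124.html
```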
You can also download our pretraining checkpoint and skip the pretraining process.
The weight file was accidentally deleted; we will re-release the weights as soon as retraining and validation are complete.
# download pretraining data
git clone https://github.com/YaooXu/TaBERT.git
cd TaBERT
python -m spacy download en_core_web_sm
bash get_pretrain_data.sh
python preprocess/construct_pretrain_data.py
# pretrain the G-Former
bash pretrain_gformer.sh
# download and process data
python preprocess/construct_sft_data.py
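For orientation, this is a minimal sketch of the kind of record an SFT-data construction step produces: a linearized table plus a question as the prompt, and the answer as the target. The field names (`prompt`, `response`) and the linearization format are assumptions for illustration, not the actual output of `construct_sft_data.py`.

```python
import json

def table_to_prompt(question, table):
    """Linearize a table (header + rows) and prepend the question.

    Illustrative pipe-separated linearization; the repo's actual
    serialization may differ.
    """
    header = " | ".join(table["header"])
    rows = "\n".join(" | ".join(map(str, r)) for r in table["rows"])
    return f"Question: {question}\nTable:\n{header}\n{rows}"

def build_sft_record(question, table, answer):
    # One supervised fine-tuning example: instruction-style prompt + target.
    return {"prompt": table_to_prompt(question, table), "response": answer}

example = build_sft_record(
    "Which city has the largest population?",
    {"header": ["city", "population"],
     "rows": [["Tokyo", 37400000], ["Delhi", 31200000]]},
    "Tokyo",
)
print(json.dumps(example, indent=2))
```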
# convert all data to hypergraph
python preprocess/convert_table_to_graph_hytrel.py
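The conversion step above treats a table as a hypergraph in the HyTrel style: each cell is a node, and every row, every column, and the table as a whole become hyperedges over their cells. The sketch below shows that structure only; the node features and identifiers actually used by `convert_table_to_graph_hytrel.py` are assumptions here.

```python
def table_to_hypergraph(header, rows):
    """Build a HyTrel-style hypergraph from a table.

    Returns (nodes, hyperedges): nodes is the flat list of cell values,
    hyperedges maps a hyperedge name to the node indices it contains.
    """
    n_rows, n_cols = len(rows), len(header)

    def node_id(i, j):
        # Cells are numbered row-major: node_id(i, j) indexes cell (i, j).
        return i * n_cols + j

    nodes = [rows[i][j] for i in range(n_rows) for j in range(n_cols)]
    hyperedges = {}
    for i in range(n_rows):                        # one hyperedge per row
        hyperedges[f"row_{i}"] = [node_id(i, j) for j in range(n_cols)]
    for j, col in enumerate(header):               # one hyperedge per column
        hyperedges[col] = [node_id(i, j) for i in range(n_rows)]
    hyperedges["table"] = list(range(len(nodes)))  # whole-table hyperedge
    return nodes, hyperedges

nodes, edges = table_to_hypergraph(
    ["city", "population"], [["Tokyo", 37.4], ["Delhi", 31.2]])
```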
# supervised fine-tuning
bash ./train_llasa.sh
# inference
bash ./predict.sh