A model for Column Type Annotation (CTA) with BERT, trained on the RWT-RuTaBERT dataset. RWT-RuTaBERT contains 1 441 349 columns from Russian-language Wikipedia tables, with headers matched to 170 DBpedia semantic types. It has a fixed train / test split:
| Split | Columns | Tables | Avg. columns per table |
|---|---|---|---|
| Test | 115 448 | 55 080 | 2.096 |
| Train | 1 325 901 | 633 426 | 2.093 |
We trained RuTaBERT with two table serialization strategies:
- Neighboring column serialization;
- Multi-column serialization (based on Doduo's approach).
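For illustration, here is a minimal sketch of the two strategies. The function names and exact token layout are assumptions made for readability, not the repository's implementation:

```python
# Illustrative sketch of the two serialization strategies; the function
# names and exact token layout are assumptions, not the repository's code.
from typing import List

def serialize_neighboring(table: List[List[str]], target: int) -> str:
    """Neighboring column ("column_wise"): the target column is serialized
    together with its immediate neighbors for context."""
    cols = [" ".join(col) for col in table]
    left = cols[target - 1] if target > 0 else ""
    right = cols[target + 1] if target + 1 < len(cols) else ""
    return f"[CLS] {cols[target]} [SEP] {left} {right} [SEP]"

def serialize_multi_column(table: List[List[str]]) -> str:
    """Multi-column ("table_wise"), following Doduo's idea: all columns are
    concatenated into one sequence with one [CLS] per column, so every
    column in the table is classified in a single forward pass."""
    return " ".join(f"[CLS] {' '.join(col)}" for col in table) + " [SEP]"

# A tiny two-column table (each column is a list of cell values).
table = [["Москва", "Париж"], ["Россия", "Франция"]]
print(serialize_multi_column(table))
# -> [CLS] Москва Париж [CLS] Россия Франция [SEP]
```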
Benchmark results on the RWT-RuTaBERT dataset:

| Serialization strategy | micro-F1 | macro-F1 | weighted-F1 |
|---|---|---|---|
| Multi-column | 0.962 | 0.891 | 0.9621 |
| Neighboring column | 0.964 | 0.904 | 0.9639 |
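As a reminder of how the three averages differ, they can be reproduced with scikit-learn's `f1_score` (toy labels below; the real evaluation runs over the 170 semantic types):

```python
# micro-F1 aggregates over all predictions, macro-F1 averages per-type
# scores equally, weighted-F1 weights each type by its support.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]  # toy gold column types
y_pred = [0, 0, 1, 0, 2, 1]  # toy predictions

print(f1_score(y_true, y_pred, average="micro"))
print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="weighted"))
```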
Training parameters:

| Parameter | Value |
|---|---|
| Batch size | 32 |
| Epochs | 30 |
| Loss function | Cross-entropy |
| Optimizer | AdamW (lr=5e-5, eps=1e-8) |
| GPUs | 4 × NVIDIA A100 (80 GB) |
| Random seed | 2024 |
| Validation split | 5% |
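These hyperparameters correspond to a standard PyTorch fine-tuning setup. The sketch below shows them in context, with a placeholder linear classifier and dummy batches standing in for BERT and the real dataloader (illustration only, not the repository's trainer):

```python
# Hedged sketch: the hyperparameters from the table above in a standard
# PyTorch loop. The model and batches are placeholders, not RuTaBERT code.
import torch
from torch import nn
from torch.optim import AdamW

torch.manual_seed(2024)            # random seed from the table

num_labels = 170                   # 170 DBpedia semantic types
model = nn.Linear(768, num_labels) # placeholder standing in for BERT

criterion = nn.CrossEntropyLoss()  # loss function
optimizer = AdamW(model.parameters(), lr=5e-5, eps=1e-8)

batch_size, num_epochs = 32, 30
for epoch in range(num_epochs):
    # Dummy batch standing in for serialized, tokenized columns.
    features = torch.randn(batch_size, 768)
    labels = torch.randint(0, num_labels, (batch_size,))
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```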
📦RuTaBERT
┣ 📂checkpoints
┃ ┗ Saved PyTorch models `.pt`
┣ 📂data
┃ ┣ 📂inference
┃ ┃ ┗ Tables for inference `.csv`
┃ ┣ 📂test
┃ ┃ ┗ Test dataset files `.csv`
┃ ┣ 📂train
┃ ┃ ┗ Train dataset files `.csv`
┃ ┗ Directory for storing dataset files.
┣ 📂dataset
┃ ┗ Dataset wrapper classes, dataloaders
┣ 📂logs
┃ ┗ Log files (train / test / error)
┣ 📂model
┃ ┗ Model and metrics
┣ 📂trainer
┃ ┗ Trainer
┣ 📂utils
┃ ┗ Helper functions
┗ Entry points (train.py, test.py, inference.py), configuration, etc.
The model configuration can be found in the `config.json` file. The configuration parameters are listed below:
| Argument | Description |
|---|---|
| num_labels | Number of labels used for classification |
| num_gpu | Number of GPUs to use |
| save_period_in_epochs | How often a checkpoint is saved, in epochs |
| metrics | Classification metrics to compute |
| pretrained_model_name | BERT shortcut name from HuggingFace |
| table_serialization_type | Method of serializing a table into a sequence |
| batch_size | Batch size |
| num_epochs | Number of training epochs |
| random_seed | Random seed |
| logs_dir | Directory for logging |
| train_log_filename | File name for train logging |
| test_log_filename | File name for test logging |
| start_from_checkpoint | Flag to resume training from a checkpoint |
| checkpoint_dir | Directory for storing model checkpoints |
| checkpoint_name | File name of a checkpoint (model state) |
| inference_model_name | File name of the model used for inference |
| inference_dir | Directory for storing inference tables (`.csv`) |
| dataloader.valid_split | Size of the validation split |
| dataloader.num_workers | Number of dataloader workers |
| dataset.num_rows | Number of rows to read per dataset file; if null, all rows are read |
| dataset.data_dir | Directory for storing train / test / inference files |
| dataset.train_path | Directory for storing train dataset files (`.csv`) |
| dataset.test_path | Directory for storing test dataset files (`.csv`) |
We recommend changing ONLY these parameters (a sketch of a minimal configuration follows the list):

- `num_gpu`: any positive integer, or 0; 0 stands for training / testing on the CPU.
- `save_period_in_epochs`: any positive integer, measured in epochs.
- `table_serialization_type`: "column_wise" or "table_wise".
- `pretrained_model_name`: a BERT shortcut name from the HuggingFace PyTorch pretrained models.
- `batch_size`: any positive integer.
- `num_epochs`: any positive integer.
- `random_seed`: any integer.
- `start_from_checkpoint`: "true" or "false".
- `checkpoint_name`: the name of any model saved in the `checkpoints` directory.
- `inference_model_name`: the name of any model saved in the `checkpoints` directory; we recommend the best models: `model_best_f1_weighted.pt`, `model_best_f1_macro.pt`, `model_best_f1_micro.pt`.
- `dataloader.valid_split`: a real number in [0.0, 1.0] (the fraction of the train subset held out for validation, e.g. 0.5 stands for 50 %), or a positive integer (a fixed number of validation samples).
- `dataset.num_rows`: "null" reads all lines in the dataset files; a positive integer limits how many lines are read.
Before training / testing the model you need to:
- Download the dataset repository into the same directory as RuTaBERT; an example source directory structure:

```
├── src
│   ├── RuTaBERT
│   ├── RuTaBERT-Dataset
│   │   ├── move_dataset.sh
```

- Run the `move_dataset.sh` script from the dataset repository to move the dataset files into RuTaBERT's `data` directory:

```
RuTaBERT-Dataset$ ./move_dataset.sh
```

- Configure the `config.json` file before training.
RuTaBERT supports training / testing locally, inside a Docker container, and via the Slurm workload manager.
To train and test locally:

- Create a virtual environment:

```
RuTaBERT$ virtualenv venv
```

or

```
RuTaBERT$ python -m virtualenv venv
```

- Install the requirements, then start training and testing:

```
RuTaBERT$ source venv/bin/activate &&\
pip install -r requirements.txt &&\
python3 train.py 2> logs/error_train.log &&\
python3 test.py 2> logs/error_test.log
```
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`training_results.csv`, `train.log`, `test.log`, `error_train.log`, `error_test.log`).
To run inside a Docker container, the requirements are:

- Docker (installation guide for Ubuntu);
- the NVIDIA driver;
- the NVIDIA Container Toolkit (installation guide for Ubuntu).

Make sure all dependencies are installed, then:

- Build the image:

```
RuTaBERT$ sudo docker build -t rutabert .
```

- Run the image:

```
RuTaBERT$ sudo docker run -d --runtime=nvidia --gpus=all \
--mount source=rutabert_logs,target=/app/rutabert/logs \
--mount source=rutabert_checkpoints,target=/app/rutabert/checkpoints \
rutabert
```
- Move the models and logs out of the container after training / testing:

```
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_checkpoints/_data ./checkpoints
RuTaBERT$ sudo cp -r /var/lib/docker/volumes/rutabert_logs/_data ./logs
```

- Don't forget to remove the volumes after training! Docker won't do it for you.
- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`training_results.csv`, `train.log`, `test.log`, `error_train.log`, `error_test.log`).
To run with the Slurm workload manager:

- Create a virtual environment:

```
RuTaBERT$ virtualenv venv
```

or

```
RuTaBERT$ python -m virtualenv venv
```

- Run the Slurm script:

```
RuTaBERT$ sbatch run.slurm
```

- Check the job status:

```
RuTaBERT$ squeue
```

- Models will be saved in the `checkpoints` directory.
- Output will be in the `logs/` directory (`train.log`, `test.log`, `error_train.log`, `error_test.log`).
To test the model:

- Make sure the data is placed in the `data/test` directory.
- (Optional) Download pre-trained models:

```
RuTaBERT$ ./download.sh table_wise
```

or

```
RuTaBERT$ ./download.sh column_wise
```

- Configure which model to test in `config.json`.
- Run:

```
RuTaBERT$ source venv/bin/activate &&\
pip install -r requirements.txt &&\
python3 test.py 2> logs/error_test.log
```

- Output will be in the `logs/` directory (`test.log`, `error_test.log`).
To run inference:

- Make sure the data is placed in the `data/inference` directory.
- (Optional) Download pre-trained models:

```
RuTaBERT$ ./download.sh table_wise
```

or

```
RuTaBERT$ ./download.sh column_wise
```

- Configure which model to use for inference in `config.json`.
- Run:

```
RuTaBERT$ source venv/bin/activate &&\
pip install -r requirements.txt &&\
python3 inference.py
```

- The predicted labels will be written to `data/inference/result.csv`; a quick way to inspect them is sketched below.
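For example, the output file can be inspected with pandas. The exact column schema of result.csv is defined by inference.py, so treat this as a sketch:

```python
# Hedged sketch: peeking at the inference output. The schema of
# result.csv (column names, ordering) is defined by inference.py,
# so inspect the header before relying on specific columns.
import pandas as pd

results = pd.read_csv("data/inference/result.csv")
print(results.shape)   # number of annotated columns x number of fields
print(results.head())  # first few predicted labels
```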