After fine-tuning your own model, you need to convert the resulting checkpoint to a format supported by PyTorch. The Transformers library1 provides a conversion script. Since we are not using bert-base-uncased as the pretrained model, we need to change:
model = DPRContextEncoder(DPRConfig(**BertConfig.get_config_dict("bert-base-uncased")[0]))
with:
model = DPRContextEncoder(DPRConfig(**BertConfig.get_config_dict("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")[0]))
Similarly, we need to change the code for DPRQuestionEncoder to:
model = DPRQuestionEncoder(DPRConfig(**BertConfig.get_config_dict("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext")[0]))
and for DPRReader to:
model = DPRReader(DPRConfig(**BertConfig.get_config_dict("gdario/biobert_bioasq")[0]))
The convert_dpr_original_checkpoint_to_pytorch.py script has already been modified in the above-mentioned way. With the edited script in place, we run it three times: first to convert the ctx_encoder:
python convert_dpr_original_checkpoint_to_pytorch.py --type ctx_encoder --src pipeline1_baseline/cp_models/dpr_biencoder.29 --dest SleepQA/models/pytorch/ctx_encoder
then question_encoder:
python convert_dpr_original_checkpoint_to_pytorch.py --type question_encoder --src pipeline1_baseline/cp_models/dpr_biencoder.29 --dest SleepQA/models/pytorch/question_encoder
and finally reader:
python convert_dpr_original_checkpoint_to_pytorch.py --type reader --src pipeline1_baseline/cp_models/dpr_extractive_reader.1.250 --dest SleepQA/models/pytorch/reader
After running the three conversions above, we need to download the tokenizer_config.json and vocab.txt files from the respective Hugging Face repositories: PubMedBERT2 for the ctx_encoder and question_encoder, and BioBERT BioASQ3 for the reader.
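These files can also be fetched programmatically. The sketch below builds the standard https://huggingface.co/&lt;repo&gt;/resolve/main/&lt;file&gt; download URLs and retrieves both files for each model using only the Python standard library; the destination directories mirror the --dest paths used above, and the function names are illustrative, not part of the repository.

```python
import os
import urllib.request

# Repositories holding the tokenizer files for each converted model.
REPOS = {
    "ctx_encoder": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "question_encoder": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
    "reader": "gdario/biobert_bioasq",
}
FILES = ["tokenizer_config.json", "vocab.txt"]

def file_url(repo_id, filename):
    """Build the raw-file download URL used by the Hugging Face Hub."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"

def fetch_tokenizer_files(dest_root="SleepQA/models/pytorch"):
    """Download tokenizer files next to each converted model."""
    for model, repo_id in REPOS.items():
        dest_dir = os.path.join(dest_root, model)
        os.makedirs(dest_dir, exist_ok=True)
        for filename in FILES:
            urllib.request.urlretrieve(file_url(repo_id, filename),
                                       os.path.join(dest_dir, filename))

if __name__ == "__main__":
    fetch_tokenizer_files()
```

Alternatively, the files can be downloaded manually from the model pages on the Hugging Face Hub and placed in the same directories as the converted encoders.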
The qa_system.py script allows us to use the fine-tuned models in a QA pipeline:
- the generate_dense_encodings function generates encodings for the text corpus,
- the dense_retriever function retrieves the most relevant passage for a given question, and
- the extractive_reader function extracts the most relevant answer span for a given question.
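Conceptually, the retrieval step reduces to a maximum inner product search over the corpus encodings: the passage whose dense encoding scores highest against the question encoding is returned. A minimal, dependency-free sketch of this scoring (with toy vectors standing in for the DPR encoder outputs; the function names are illustrative and not taken from qa_system.py):

```python
def dot(u, v):
    """Inner product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_best_passage(question_vec, passage_vecs):
    """Return the index of the passage whose encoding has the highest
    inner product with the question encoding (DPR's relevance score)."""
    scores = [dot(question_vec, p) for p in passage_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy 3-dimensional encodings standing in for DPR outputs.
passages = [
    [0.1, 0.9, 0.0],   # passage 0
    [0.8, 0.1, 0.1],   # passage 1
    [0.2, 0.2, 0.9],   # passage 2
]
question = [0.9, 0.0, 0.1]  # most similar to passage 1

best = retrieve_best_passage(question, passages)  # -> 1
```

In the actual pipeline, the passage vectors come from generate_dense_encodings and the question vector from the fine-tuned question encoder; at corpus scale, this search is typically done with an index such as FAISS rather than a Python loop.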