A dual-model system for answering questions based on business context, designed to scale across lengthy contexts.
The system works as follows:
- Business context is first categorized into feature capabilities as sentences, labelled along the lines of "feature X can do ..." and "feature X can't do ...".
- A classifier model is trained on this dataset and learns to predict the category ("feature X") and its possibility ("can do"/"can't do").
- The predicted category is then used to load the matching feature capability sentences.
- The capability sentences and the question are passed to DeBERTa v3 to extract the relevant answer.
- Finally, the possibility (Yes/No) and the extracted answer are combined.
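A minimal sketch of this flow (purely illustrative; the classifier interface and the context store below are assumptions, not the project's actual API):

```python
from transformers import pipeline

# Minimal sketch of the two-stage flow above (not the project's actual code).
# `classifier` stands in for the trained model in /model/classifier_model.py;
# its .predict() interface and the context_store dict are assumptions.
qa = pipeline("question-answering", model="deepset/deberta-v3-base-squad2")

def answer(question: str, classifier, context_store: dict[str, list[str]]) -> str:
    category, possibility = classifier.predict(question)   # e.g. ("feature_x", "can do")
    capability_sentences = context_store[category]          # load only the matching context
    result = qa(question=question, context=" ".join(capability_sentences))
    return f"{possibility}: {result['answer']}"             # possibility + extracted answer
```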
Benefits of this approach:
- Large contexts are broken into smaller categorized contexts, which improves scalability and understanding of the data.
- The classifier's category prediction lets DeBERTa v3 quickly load only the relevant context data.
- The trained categorized context is reused for question answering.
- The classifier is trained on context only, so incoming questions do not need to be trained on separately.
- The system is not fully dependent on DeBERTa v3; another answer extraction model such as RoBERTa can be used instead.
- Questions are sanitized so that they use only the trained vocabulary.
Refer to the Model Simplified Flowchart for more details on how the two models work together.
This project contains the training dataset at /data/cat2context_train.json; it can be modified to fit your business context.
The test dataset is at /data/cat2context_test.json and contains questions used to check how well the model has learned the categories and possibilities.
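As a rough, hypothetical illustration (the actual field names are defined by /data/dataset_loader.py and the JSON files themselves), a training entry pairs a capability sentence with its category and possibility:

```python
import json

# Hypothetical entry shape -- check /data/cat2context_train.json for the real schema.
example_entry = {
    "category": "feature_x",                              # which feature the sentence describes
    "possibility": "can do",                              # "can do" / "can't do"
    "sentence": "Feature X can export reports as PDF.",   # capability sentence
}

# Load the bundled training data (replace or extend it to fit your business context).
with open("data/cat2context_train.json", encoding="utf-8") as f:
    train_data = json.load(f)
```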
This is the project structure:
├───api
│ model_blueprint.py
│ model_init.py
│ model_schema.py
│
├───data
│ cat2context_test.json
│ cat2context_train.json
│ context_mapper.json
│ dataset_loader.py
│ deberta_train.json
│ inference_test.json
│ word_alias_mapper.json
│
├───docs
├───fine_tuned_models
├───huggingface_model
│
├───model
│ classifier_model.py
│
├───output
│ └───models
│ │ classifier_591_68_75.pt
│ │ classifier_597_63_63.pt
│ │ english_dictionary_591_68_75.json
│ │ english_dictionary_597_63_63.json
│ │ english_list_591_68_75.json
│ │ english_list_597_63_63.json
│ │ training_curves_591_68_75.json
│ │ training_curves_597_63_63.json
│ │
│ ├───onnx
│ │ .gitignore
│ │ classifier_597_63_63.onnx
│ │
│ └───tensorrt
│ classifier_597_63_63-4.trt
│ classifier_597_63_63-5.trt
│
├───scripts
│ add_path.py
│ analyze_classifier.py
│ classifier_to_onnx.py
│ classifier_to_trt.py
│ context_loader.py
│ context_loader_trt.py
│ deberta_finetune.py
│ dev_api.py
│ inference.py
│ inference_onnx.py
│ inference_trt.py
│ qa_to_onnx.py
│ qa_to_trt.py
│ train_classifier.py
│
└───util
helper.py
Based on Python version 3.13.2
This repository includes a backend API for continuous inference and testing. After steps 1 to 4 below are completed, skip to the Dev API Inference section.
Packages used: PyTorch v2.7 with CUDA 12.8, matplotlib, Hugging Face transformers, Hugging Face datasets, Flask, and Flask-smorest for the Swagger UI.
It is recommended to use a virtual Python environment, which can be created with:
python -m venv dl
On Windows, the virtual environment can then be activated with:
.\dl\Scripts\activate
Afterwards, install the packages from requirements.txt:
pip install -r <project-dir>/requirements.txt
After installing the packages, download the appropriate TensorRT SDK from https://developer.nvidia.com/tensorrt. This project was built with TensorRT 10.11.0.33.
After downloading, copy the contents to a folder accessible to the Python environment, then install the tensorrt-*-cp*-none-*.whl wheel from the python folder inside the extracted contents. For instance,
pip install TensorRT-10.11.0.33/python/tensorrt-10.11.0.33-cp313-none-win_amd64.whl
Other wheel variants such as the dispatch runtime can also be used if the goal is only to run inference and not to build or optimize the engine.
python <project-dir>/scripts/deberta_finetune.py
This will download the DeBERTa v3 base SQuAD2 model trained by deepset from Hugging Face, save it locally, and then fine-tune it. The fine-tuning uses the dataset in /data/deberta_train.json.
- Note: Any pretrained answer extraction model from Hugging Face (such as RoBERTa) can be used by changing the `model_name` variable in `/scripts/deberta_finetune.py`.
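For example, to fine-tune RoBERTa instead (illustrative excerpt; the surrounding code in the script may differ):

```python
# scripts/deberta_finetune.py (excerpt, illustrative)
# model_name = "deepset/deberta-v3-base-squad2"   # default DeBERTa v3 base SQuAD2 checkpoint
model_name = "deepset/roberta-base-squad2"        # any extractive QA checkpoint can be used
```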
This run will save the Hugging Face model under the /huggingface_model directory and the fine-tuned model under /fine_tuned_models; both are needed for later steps.
Now that we have the Hugging Face model, and with a trained classifier already provided in this repository under /output/models/, we can see how the classifier and the Hugging Face model work together.
Run
python <project-dir>/scripts/analyze_classifier.py
This will load both the DeBERTa v3 and classifier models, run inference against the test dataset, and at the end show a series of graphs indicating how the trained classifier performs on unseen test data. These graphs are also included under /docs in this repository:
- Training loss curve, across epochs
- Test dataset accuracy, across epochs
- Category softmax curves
- Possibility softmax curves
After analyzing the classifier and DeBERTa functionality, run the command below to test how the fine-tuned model performs together with the classifier.
python <project-dir>/scripts/inference.py
The questions included in this file can be modified as needed; the more failure conditions you can trigger, the better your understanding of the system's limits.
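For instance, the question list could look like this (hypothetical questions following the "feature X can/can't do" framing; the actual list lives in scripts/inference.py):

```python
# Hypothetical zero-shot questions -- edit the list in scripts/inference.py
# to match your own business context and probe for failure conditions.
questions = [
    "Can feature X export reports as PDF?",
    "Does feature X support offline mode?",
    "Can feature X be used without an admin account?",
]
```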
The API can be hosted with:
python <project-dir>/scripts/dev_api.py
By default, port 5000 is used and the server can be accessed via http://127.0.0.1:5000/api/docs.
The api folder in the repository holds the blueprint that interacts with the models, and the marshmallow schemas for input validation and output structuring.
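Once the server is up, requests can be sent from any HTTP client. The endpoint path and payload below are assumptions for illustration only; the real routes and schemas are listed in the Swagger UI at /api/docs:

```python
import requests

# Hypothetical endpoint and payload -- consult http://127.0.0.1:5000/api/docs
# for the actual routes and marshmallow schemas.
resp = requests.post(
    "http://127.0.0.1:5000/api/answer",
    json={"question": "Can feature X export reports as PDF?"},
)
print(resp.json())
```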
The code for this is in scripts/train_classifier.py. Here the train and test datasets are loaded, the training dataset is split into 80 batches, and the model weights are updated iteratively (over epochs) via the AdamW optimizer.
This process is repeated multiple times to pick the best model; the number of repetitions can be adjusted via the num_of_runs variable on line 148. The learning rate, epochs, and other parameters can be modified as well; the defaults are the best values found so far.
The Classifier model is instantiated on line 163; the context label mapper can be created by uncommenting lines 44 to 47 (in this case category_count is 6):
model = Classifier(max_vocab_size=len(word_list), embed_dim=512, category_count=len(context_label_mapper), possibility_count=2).to(device)  # copy model to torch device
Training can be done by executing,
python <project-dir>/scripts/train_classifier.py
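In outline, each run looks roughly like this (a simplified sketch with assumed names; the real script also tracks test accuracy per epoch and keeps the best-performing run):

```python
import torch
from torch import nn

def train_one_run(model: nn.Module, train_batches, epochs: int, learn_rate: float) -> nn.Module:
    """Simplified sketch of one run in scripts/train_classifier.py (names assumed)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=learn_rate)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for tokens, category_labels, possibility_labels in train_batches:
            optimizer.zero_grad()
            category_logits, possibility_logits = model(tokens)   # two heads: category + possibility
            loss = loss_fn(category_logits, category_labels) + loss_fn(possibility_logits, possibility_labels)
            loss.backward()
            optimizer.step()
    return model
```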
The process here is: torch model → ONNX model → TensorRT engine.
| Files in /scripts | Purpose |
|---|---|
| classifier_to_onnx.py | Loads the torch classifier model, converts it to ONNX, then asserts that ONNX model inference matches the torch model. |
| classifier_to_trt.py | Loads the ONNX classifier model and converts it to a TensorRT engine with optimization profile 5 and max shape (8, 64), i.e. a maximum batch size of 8 with a maximum sequence length of 64. |
| qa_to_onnx.py | Loads the fine-tuned DeBERTa v3 model, converts it to ONNX, then runs inference on ONNX Runtime to ensure it functions. |
| qa_to_trt.py | This will be released after TensorRT Issue#4288 is resolved. |
| inference_onnx.py | Tests ONNX performance. Combines the ONNX classifier model and the ONNX fine-tuned DeBERTa v3 for hybrid inference. |
| inference_trt.py | Tests TensorRT performance. Combines the TensorRT classifier engine and the torch (CUDA) fine-tuned DeBERTa v3 for hybrid inference. |
| inference.py | Tests raw performance. Combines the torch (CUDA) classifier model and the torch (CUDA) fine-tuned DeBERTa v3 for hybrid inference. |
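For reference, the ONNX export step amounts to something like the sketch below (input/output names and shapes are assumptions; classifier_to_onnx.py additionally verifies the ONNX outputs against the torch model):

```python
import torch
from torch import nn

def export_classifier(model: nn.Module, onnx_path: str) -> None:
    """Sketch of the classifier_to_onnx.py step (names and shapes assumed)."""
    dummy_tokens = torch.zeros((1, 64), dtype=torch.long)            # (batch, sequence) example input
    torch.onnx.export(
        model,
        dummy_tokens,
        onnx_path,
        input_names=["tokens"],
        output_names=["category_logits", "possibility_logits"],
        dynamic_axes={"tokens": {0: "batch", 1: "sequence"}},        # TRT profile later caps this at (8, 64)
    )
```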
Execution Computer specifications:
| Component | Specification |
|---|---|
| CPU | i5 12600K |
| RAM | 32GB DDR4 |
| Storage | NVMe SSD |
| GPU | RTX 4080 Super |
Classifier average inference time per question: 4.4 ms
DeBERTa v3 average inference time in the pipeline: 943.8 ms
Total inference time for 15 zero-shot questions:
| File Reference | Run 1 (s) | Run 2 (s) | Run 3 (s) |
|---|---|---|---|
| inference.py | 14.425 | 14.527 | 14.408 |
| inference_onnx.py | 28.767 | 28.518 | 28.990 |
| inference_trt.py | 14.252 | 14.363 | 14.331 |
This shows that the classifier TensorRT engine runs 70 to 170 ms faster than the torch (CUDA) classifier, which matters when combined with the Flask API, especially for batch inference.
More updates and benchmarks coming soon...
- PyTorch Documentation
- Hugging Face DeBERTa v3 base squad2 from deepset
- Hugging Face Trainer Documentation
- matplotlib Documentation
- Inspired by LSTMs from the MIT xPRO course: Deep Learning – Mastering Neural Networks (2025 Feb Cohort)