Create a new conda environment called `data_decomposer`:

```bash
conda create --name data_decomposer python=3.10
```

Activate the environment:

```bash
conda activate data_decomposer
```

Install requirements:

```bash
pip install -r requirements.txt
```
The repository is organized into the following main directories:

- `core/`: Core pipeline interfaces and system setup
  - `base_implementation.py`: Abstract base class for all implementations
  - `config.py`: Configuration management
  - `factory.py`: Factory pattern for creating implementation instances
- `data/`: Data storage for input datasets
- `data_processing/`: Data processing and question generation scripts
  - Jupyter notebooks for generating questions from different data types (passages, tables, etc.)
- `implementations/`: Contains the different system implementations
  - `symphony/`: Symphony implementation with data decomposition and execution
  - `ReSP/`: Retrieval-enhanced Structured Processing implementation
  - `XMODE/`: Cross-modal data handling implementation
  - `baseline/`: Baseline implementation for comparison
- `results/`: Results storage for evaluation outputs
- `results_v2/`: Extended results storage with additional metrics
- `scripts/`: Command-line tools and utilities
  - `auto_extract_embeddings.py`: Extract embeddings from data using a GPT embedding model
  - `build_index.py`: Build search indices for data retrieval
  - `run_query.py`: Run queries against the system
  - `train.py`: Train a T5-based autoencoder model
  - `extract_embeddings.py`: Extract embeddings from the trained T5-based autoencoder model
  - `passage_embedd_and_index.py`: Process and index passage data
  - `build_representation_index.py`: Build indices for cross-modal representations
  - `csv_to_sqlite.py`: Convert CSV data to SQLite database format
- `tests/`: Test suite for validating system functionality
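The abstract-base-class plus factory layout in `core/` can be sketched roughly as follows. The class names, registry dictionary, and `create_implementation` helper below are illustrative assumptions for this README, not the actual API of `base_implementation.py` or `factory.py`:

```python
# Minimal sketch of the base-class + factory pattern used in core/.
# All names here are hypothetical; see core/factory.py for the real interface.
from abc import ABC, abstractmethod


class BaseImplementation(ABC):
    """Analogous to core/base_implementation.py: shared interface for all systems."""

    @abstractmethod
    def run(self, query: str) -> str:
        """Process a query and return an answer."""


class BaselineImplementation(BaseImplementation):
    def run(self, query: str) -> str:
        return f"baseline answer for: {query}"


# Registry mapping implementation names (e.g. from config.yaml) to classes.
IMPLEMENTATIONS = {"baseline": BaselineImplementation}


def create_implementation(name: str) -> BaseImplementation:
    """Factory: instantiate an implementation by its configured name."""
    try:
        return IMPLEMENTATIONS[name]()
    except KeyError:
        raise ValueError(f"Unknown implementation: {name}")


impl = create_implementation("baseline")
print(impl.run("Your query here"))
```

The point of the pattern is that `main.py` can select Symphony, ReSP, XMODE, or the baseline purely from configuration, without branching on concrete classes.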
To process a query with a ground truth answer for source relevance scoring:

```bash
python main.py --config config.yaml --ground-truth-answer "Ground truth answer text" "Your query here"
```

To evaluate the system against a dataset of queries and ground truth answers:

```bash
python evaluate_qa.py --config config.yaml --gt-file path/to/groundtruth.csv --output results.json
```

The ground truth file should be a CSV with the columns `question`, `answer`, `text`, and `table`, where:

- `question`: The query to process
- `answer`: The ground truth answer
- `text`: Comma-separated list of expected text source files
- `table`: Comma-separated list of expected table source files
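As a quick sanity check of the format, the column semantics can be read with Python's standard `csv` module. This is only an illustrative sketch of the schema, not how `evaluate_qa.py` actually parses the file:

```python
# Hypothetical reader for the ground-truth CSV schema described above.
import csv
import io

sample = '''"question","answer","text","table"
"Q1?","A1.","passage-a,passage-b","None"
'''

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # "text" and "table" hold comma-separated source files; "None" means no sources.
    texts = [] if row["text"] == "None" else row["text"].split(",")
    tables = [] if row["table"] == "None" else row["table"].split(",")
    print(row["question"], texts, tables)
```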
Example:

```csv
"question","answer","text","table"
"What is the mechanism of action for Cetuximab?","Cetuximab is an EGFR binding FAB, targeting the EGFR in humans.","None","drugbank-targets"
```

While building the benchmark and implementing the three methods, I used GitHub Copilot as an assistive tool. I primarily used Copilot to help write boilerplate code for functions that I planned, designed, and architected myself.