SciDaEx is an open-source system for extracting and structuring data (as data tables) from scientific literature using Large Language Models (LLMs). It integrates a computational backend with an interactive user interface to facilitate efficient data extraction, structuring, and refinement for evidence synthesis in scientific research.
- Automated data extraction from scientific papers (text, tables, and figures)
- Structured data table output in standardized formats
- Interactive user interface for data validation and refinement
- Retrieval-augmented generation (RAG) for enhanced accuracy and speed
- Quality evaluation metrics for extracted data
- Support for both technical and non-technical users
# Clone the repository
git clone https://github.com/xingbow/SciDaEx.git
cd SciDaEx
# Set up a virtual environment
python -m venv venv
source venv/bin/activate # On Windows, use `venv\Scripts\activate`
# Install backend dependencies (Python 3.10)
pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"
# Install frontend dependencies
cd frontend
npm install
- Backend configuration
  - Create a `config.yml` file in the `backend/app/dataService` directory.
  - Update the `config.yml` file with the required configurations:

        api_keys:
          openai: your_openai_api_key
        adobe_credentials:
          client_id: your_adobe_client_id
          client_secret: your_adobe_client_secret
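Placeholder or missing credentials are an easy mistake at this step, so a quick check before starting the backend can help. The following is a minimal sketch, not part of SciDaEx: the helper name `missing_config_keys` is hypothetical, and it assumes `api_keys` and `adobe_credentials` are both top-level sections of `config.yml`. It works on an already-parsed dictionary, which in practice you would obtain from a YAML parser.

```python
from collections.abc import Mapping
from typing import Any

# Assumed layout of config.yml; adjust if your file nests keys differently.
REQUIRED_KEYS = (
    ("api_keys", "openai"),
    ("adobe_credentials", "client_id"),
    ("adobe_credentials", "client_secret"),
)


def missing_config_keys(config: Mapping[str, Any]) -> list[str]:
    """Return dotted names of required keys that are absent, empty, or placeholders."""
    missing = []
    for section, key in REQUIRED_KEYS:
        section_value = config.get(section)
        # A section that is missing or not a mapping cannot contain the key.
        value = section_value.get(key) if isinstance(section_value, Mapping) else None
        # Values still starting with "your_" are treated as unfilled placeholders.
        if not value or str(value).startswith("your_"):
            missing.append(f"{section}.{key}")
    return missing


# Hypothetical parsed configuration with one placeholder and one absent key.
example = {
    "api_keys": {"openai": "sk-example"},
    "adobe_credentials": {"client_id": "your_adobe_client_id"},
}
print(missing_config_keys(example))
```

An empty list means every expected key is present and filled in; anything else names the entries that still need attention.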
- Place your PDF documents in the `backend/app/dataService/data` directory.
- Run the preprocessing script, which extracts tables, figures, and metadata from the PDFs and stores them in the respective directories:

      cd backend/app/dataService
      python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta

  For details, please refer to the preprocessing documentation.
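If you prefer to drive preprocessing from Python, the command can be assembled programmatically. This is only a sketch under stated assumptions: the helper names `list_pdfs` and `build_preprocess_command` are hypothetical, and the only flags used are the four that appear in the command in this README (`--pdf_dir`, `--table_dir`, `--figure_dir`, `--meta_dir`).

```python
from pathlib import Path


def list_pdfs(pdf_dir: str) -> list[Path]:
    """Return the PDF files directly inside pdf_dir, sorted by path."""
    return sorted(
        p for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_preprocess_command(pdf_dir: str = "data") -> list[str]:
    """Build the preprocess.py invocation with output folders under pdf_dir."""
    base = Path(pdf_dir)
    return [
        "python", "preprocess.py",
        "--pdf_dir", base.as_posix(),
        "--table_dir", (base / "table").as_posix(),
        "--figure_dir", (base / "figure").as_posix(),
        "--meta_dir", (base / "meta").as_posix(),
    ]


# Show the command that would be run from backend/app/dataService.
print(" ".join(build_preprocess_command()))
```

The resulting list can be passed to `subprocess.run(..., cwd="backend/app/dataService", check=True)`, and `list_pdfs` gives a quick way to confirm that the input directory actually contains PDFs before starting a long run.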
- Start the backend server:

      cd backend
      python run-data-backend.py

- Start the frontend server:

      cd frontend
      npm run serve

- Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.
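The frontend dev server can take a little while to come up, so it may be convenient to poll the address before opening the browser. The sketch below is an illustration using only the Python standard library; `is_reachable` and `wait_until_reachable` are hypothetical helper names, not part of SciDaEx, and the retry count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if an HTTP request to url receives any response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even though the status was an error code.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return False


def wait_until_reachable(
    url: str,
    attempts: int = 30,
    delay: float = 1.0,
    probe: Callable[[str], bool] = is_reachable,
) -> bool:
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For example, `wait_until_reachable("http://localhost:8080")` returns `True` as soon as the interface responds and `False` if it never does within the allotted attempts. The `probe` parameter exists so the retry logic can be exercised without a running server.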
Contributors to the project (development version) are listed below (data as of 2024-08-06):
Xingbo Wang: wangxbzb@foxmail.com
- Total Commits: 63
- Total Additions: 37,992
- Total Deletions: 17,417
Rui Sheng: rshengac@connect.ust.hk
- Total Commits: 14
- Total Additions: 339
- Total Deletions: 173
Winston Tsui: wt285@cornell.edu
- Total Commits: 2
- Total Additions: 208
- Total Deletions: 102