SciDaEx: Scientific Data Extraction and Structuring System

SciDaEx is an open-source system that uses Large Language Models (LLMs) to extract data from scientific literature and structure it as data tables. It combines a computational backend with an interactive user interface so that users can extract, structure, and refine data efficiently for evidence synthesis in scientific research.

Features

- Automated data extraction from scientific papers (text, tables, and figures)
- Structured data table output in standardized formats
- Interactive user interface for data validation and refinement
- Retrieval-augmented generation (RAG) for enhanced accuracy and speed
- Quality evaluation metrics for extracted data
- Support for both technical and non-technical users

Installation

# Clone the repository
git clone https://github.com/xingbow/SciDaEx.git
cd SciDaEx

# Set up a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install backend dependencies (Python 3.10)
pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"

# Install frontend dependencies
cd frontend
npm install
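
Because the backend dependencies are listed for Python 3.10, it can help to confirm the interpreter version inside the virtual environment before running `pip install`. The following is a minimal sketch, not part of SciDaEx; the function name `python_version_ok` is hypothetical, and whether other Python versions also work is not stated here.

```python
import sys

# Version named in the install comment above.
REQUIRED = (3, 10)


def python_version_ok(version_info=sys.version_info, required=REQUIRED):
    """Return True when the major.minor version matches `required`."""
    return tuple(version_info[:2]) == tuple(required)


if __name__ == "__main__":
    if not python_version_ok():
        found = f"{sys.version_info[0]}.{sys.version_info[1]}"
        wanted = f"{REQUIRED[0]}.{REQUIRED[1]}"
        print(f"Warning: found Python {found}, but the backend targets {wanted}.")
```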

Configuration

Backend configuration
- Create a config.yml file in the backend/app/dataService directory.
- Add the required settings to config.yml:
  - An Adobe PDF Services API client ID and client secret, issued by Adobe
  - An OpenAI API key, issued by OpenAI
```
openai_key: your_openai_api_key

adobe_credentials:
   client_id: your_adobe_client_id
   client_secret: your_adobe_client_secret
```
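
A quick check that no required entry is missing or left empty can save a failed start-up. The sketch below is an illustration under assumptions, not SciDaEx code: it takes the configuration as an already-parsed dictionary (for example, the result of PyYAML's `yaml.safe_load`), and the helper name `missing_config_keys` is hypothetical. The key names are the ones shown in the config.yml example above.

```python
# Required entries, taken from the config.yml example above.
# None means a plain value; a tuple lists required nested keys.
REQUIRED_KEYS = {
    "openai_key": None,
    "adobe_credentials": ("client_id", "client_secret"),
}


def missing_config_keys(config):
    """Return dotted names of required entries that are absent or empty."""
    missing = []
    for key, subkeys in REQUIRED_KEYS.items():
        value = config.get(key)
        if subkeys is None:
            if not value:
                missing.append(key)
            continue
        # Treat a missing or non-mapping section as an empty one.
        section = value if isinstance(value, dict) else {}
        for sub in subkeys:
            if not section.get(sub):
                missing.append(f"{key}.{sub}")
    return missing


# Example: the OpenAI key is set but the Adobe section is absent.
print(missing_config_keys({"openai_key": "your_openai_api_key"}))
```

An empty list means every required entry is present.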

Usage

Preprocess documents

Place your PDF documents in the backend/app/dataService/data directory.
Run the preprocessing script:
```
cd backend/app/dataService
python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta
```
This script will extract tables, figures, and metadata from the PDFs and store them in the respective directories.
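
If you preprocess several document collections, the same command can be assembled programmatically. This is a minimal sketch that only builds the argument list using the four flags shown above; the helper name `build_preprocess_command` and the layout of output folders under the PDF directory are assumptions that mirror the example command, not a documented API.

```python
import sys
from pathlib import Path


def build_preprocess_command(pdf_dir="data"):
    """Build the preprocess.py argument list for one PDF directory."""
    base = Path(pdf_dir)
    return [
        sys.executable,
        "preprocess.py",
        "--pdf_dir", base.as_posix(),
        "--table_dir", (base / "table").as_posix(),
        "--figure_dir", (base / "figure").as_posix(),
        "--meta_dir", (base / "meta").as_posix(),
    ]


cmd = build_preprocess_command("data")
print(" ".join(cmd[1:]))

# To actually run it, execute from backend/app/dataService, for example:
#   import subprocess
#   subprocess.run(cmd, cwd="backend/app/dataService", check=True)
```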

For details, please refer to the preprocessing documentation.

Running the web application

Start the backend server
```
cd backend
python run-data-backend.py
```
Start the frontend server
```
cd frontend
npm run serve
```
Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.
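
If the page does not load, a first thing to check is whether the frontend server is listening. The sketch below is a generic TCP reachability check using only the standard library; the function name `is_port_open` is hypothetical, and port 8080 is the frontend address given above (the backend's port is not stated here, so it is not checked).

```python
import socket


def is_port_open(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


if __name__ == "__main__":
    if is_port_open():
        print("Frontend is reachable at http://localhost:8080")
    else:
        print("Nothing is listening on port 8080; is `npm run serve` running?")
```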

Contributors

Contributors to the project (development version) are listed below (data as of 2024-08-06):

Xingbo Wang: wangxbzb@foxmail.com

Total Commits: 63
Total Additions: 37,992
Total Deletions: 17,417

Rui Sheng: rshengac@connect.ust.hk

Total Commits: 14
Total Additions: 339
Total Deletions: 173

Winston Tsui: wt285@cornell.edu

Total Commits: 2
Total Additions: 208
Total Deletions: 102

Contact

Xingbo Wang - xiw4011@med.cornell.edu
