xingbow/SciDaEx

SciDaEx: Scientific Data Extraction and Structuring System

SciDaEx is an open-source system that uses Large Language Models (LLMs) to extract data from scientific literature and structure it as data tables. It combines a computational backend with an interactive user interface to support efficient data extraction, structuring, and refinement for evidence synthesis in scientific research.

Features

  • Automated data extraction from scientific papers (text, tables, and figures)
  • Structured data table output in standardized formats
  • Interactive user interface for data validation and refinement
  • Retrieval-augmented generation (RAG) for enhanced accuracy and speed
  • Quality evaluation metrics for extracted data
  • Support for both technical and non-technical users

Installation

# Clone the repository
git clone https://github.com/xingbow/SciDaEx.git
cd SciDaEx

# Set up a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install backend dependencies (Python 3.10)
pip install -r requirements.txt && pip install "pdfservices-sdk==2.3.0"

# Install frontend dependencies
cd frontend
npm install

Configuration

  1. Backend configuration
    • Create a config.yml file in the backend/app/dataService directory
    • Update the config.yml file with the required configurations:
      • Adobe service API credentials (client ID and client secret) are issued by Adobe
      • An OpenAI API key is issued by OpenAI
    api_keys:
       openai: your_openai_api_key
    
    adobe_credentials:
       client_id: your_adobe_client_id
       client_secret: your_adobe_client_secret
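
Before starting the backend, it can help to confirm that config.yml contains every key shown above and that no placeholder was left in place. The following is a minimal sketch, not part of SciDaEx: the function name `find_config_problems` and the rule that any value starting with `your_` counts as a placeholder are assumptions made for illustration.

```python
from typing import Any

# Sections and keys taken from the config.yml layout shown above.
REQUIRED_KEYS = {
    "api_keys": ["openai"],
    "adobe_credentials": ["client_id", "client_secret"],
}


def find_config_problems(config: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks complete."""
    problems = []
    for section, keys in REQUIRED_KEYS.items():
        values = config.get(section)
        if not isinstance(values, dict):
            problems.append(f"missing section: {section}")
            continue
        for key in keys:
            value = values.get(key)
            # Assumption: values that still start with "your_" are unedited placeholders.
            if not value or str(value).startswith("your_"):
                problems.append(f"missing or placeholder value: {section}.{key}")
    return problems


if __name__ == "__main__":
    example = {"api_keys": {"openai": "your_openai_api_key"}}
    for problem in find_config_problems(example):
        print(problem)
```

The function takes an already parsed dictionary, so it can be fed the result of `yaml.safe_load` (from PyYAML) applied to backend/app/dataService/config.yml.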

Usage

Preprocess documents

  1. Place your PDF documents in the backend/app/dataService/data directory.
  2. Run the preprocessing script:
    cd backend/app/dataService
    python preprocess.py --pdf_dir data --table_dir data/table --figure_dir data/figure --meta_dir data/meta
    This script will extract tables, figures, and metadata from the PDFs and store them in the respective directories.

For details, please refer to the preprocessing documentation.
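
As a quick check before running the preprocessing command above, the sketch below lists the PDFs that would be picked up and creates the three output directories. It is not part of SciDaEx: `prepare_dirs` is a hypothetical helper, and whether preprocess.py creates missing output directories on its own is not stated here, so creating them up front is only a precaution.

```python
from pathlib import Path


def prepare_dirs(pdf_dir: str = "data") -> list[Path]:
    """Create the table/figure/meta output directories and return the PDFs found."""
    root = Path(pdf_dir)
    # Directory names mirror the --table_dir, --figure_dir and --meta_dir arguments above.
    for sub in ("table", "figure", "meta"):
        (root / sub).mkdir(parents=True, exist_ok=True)
    # Only top-level *.pdf files are listed; subdirectories are not searched.
    return sorted(root.glob("*.pdf"))


if __name__ == "__main__":
    for pdf in prepare_dirs("data"):
        print(pdf.name)
```

Run it from backend/app/dataService so that the relative path `data` matches the one passed to preprocess.py.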

Running the web application

  1. Start the backend server

    cd backend
    python run-data-backend.py
  2. Start the frontend server

    cd frontend
    npm run serve
  3. Open your browser and navigate to http://localhost:8080 to access the SciDaEx interface.
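
If the page does not load, a first thing to check is whether anything is listening on the frontend port. The sketch below is a generic TCP probe, not a SciDaEx utility; `is_port_open` is a hypothetical name, and port 8080 is taken from the URL above.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 8080, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    state = "reachable" if is_port_open() else "not reachable"
    print(f"Frontend at localhost:8080 is {state}")
```

A successful connection only shows that some process accepts connections on that port; it does not confirm that the process is the SciDaEx frontend.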

Contributors

Contributors to the project (development version) are listed below (data as of 2024-08-06):

Xingbo Wang: wangxbzb@foxmail.com
  • Total Commits: 63
  • Total Additions: 37,992
  • Total Deletions: 17,417
Rui Sheng: rshengac@connect.ust.hk
  • Total Commits: 14
  • Total Additions: 339
  • Total Deletions: 173
Winston Tsui: wt285@cornell.edu
  • Total Commits: 2
  • Total Additions: 208
  • Total Deletions: 102

Contact

Xingbo Wang - xiw4011@med.cornell.edu