
📄 S1-Parser: Efficient Multimodal Document Parsing via Dynamic Structure-Semantics Fusion

🔗 Codebase  

S1-Parser is an efficient multimodal document parsing tool designed for accurate parsing of complex documents. Rather than relying solely on static fine-tuning or single-stage optimization, it adopts a two-stage strategy of Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL), tuning the model on critical aspects such as formula syntax correctness, symbol integrity, and structural soundness, and balancing parsing precision and efficiency across diverse document types.

*Figure: framework overview of S1-Parser.*


📰 News

  • [2025/10/28] We release the Code for S1-Parser.

🚀 Features

  • 🧩 Supervised Fine-Tuning with task-oriented prompts (e.g., `[Parse Target: Scientific Equations]`) to sharpen domain adaptation.
  • 🎯 Multi-stage RL to refine, stabilize, and accelerate the learning of strategic behaviors.
  • 📊 Benchmarked on the scientific literature dataset SCI_LLM.
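
The exact SFT prompt template is not given in this README; the following sketch simply shows how a task-oriented tag like the `[Parse Target: ...]` example above might be assembled (the `<image>` placeholder and instruction text are assumptions for illustration):

```python
def build_sft_prompt(image_ref: str, target: str = "Scientific Equations") -> str:
    """Hypothetical task-oriented SFT prompt; only the [Parse Target: ...]
    tag mirrors the README, the rest of the template is assumed."""
    return (
        f"[Parse Target: {target}]\n"
        f"<image>{image_ref}</image>\n"
        "Transcribe the content of the image as LaTeX."
    )
```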

🛢️ Benchmarks


⚙️ Environment Setup (Recommended)

We recommend using Python 3.10 and PyTorch ≥ 2.7.

Install the environment:

```shell
# Recommended: Python 3.10.18
git clone https://github.com/ScienceOne-AI/S1-Parser.git
cd S1-Parser
pip install -r requirements.txt
```
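
After installing, a quick sanity check like the sketch below can confirm the versions meet the recommendations. It is a stdlib-only helper; in practice you would pass it `sys.version_info` and `torch.__version__`:

```python
def meets_recommendation(python_version: tuple, torch_version: str) -> bool:
    """True when Python >= 3.10 and PyTorch >= 2.7, per the README."""
    py = tuple(python_version[:2])
    # Version strings like "2.7.1+cu118" reduce to their (major, minor) pair.
    torch_mm = tuple(int(p) for p in torch_version.split("+")[0].split(".")[:2])
    return py >= (3, 10) and torch_mm >= (2, 7)
```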

🏋️ Training

S1-Parser training proceeds in two stages with different designs:

```shell
# Stage 1: Supervised Fine-Tuning (SFT) to acquire fundamental LaTeX OCR capability.
bash scripts/run_train_ocr_sft_model.sh

# Stage 2: GRPO training to optimize LaTeX formula syntax, symbols, and structure.
bash scripts/run_train_ocr_grpo_model.sh
```

Make sure to configure your model paths and data in `scripts/run_train_ocr_*.sh`.


🔍 Acknowledgements

We build on and reference the following open-source projects, and we thank them for their contributions to the open-source community:
