🔗 Codebase
S1-Parser is a highly efficient multimodal text parsing tool designed for accurate and efficient parsing of complex documents. Rather than relying solely on static fine-tuning or single-stage optimization, it follows a two-stage strategy: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). This fine-tunes the model on critical aspects such as formula syntax correctness, symbol integrity, and structural soundness, balancing parsing precision and efficiency across diverse document types.
- [2025/10/28] We release the Code for S1-Parser.
- 🧩 Supervised Fine-Tuning with a task-oriented tag (`[Parse Target: Scientific Equations]`) to sharpen domain adaptation.
- 🎯 Multi-stage RL to refine, stabilize, and accelerate the learning of strategic behaviors.
- 📊 Benchmarked on Scientific Literature Dataset: SCI_LLM
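The task-oriented tag above can be pictured as a prefix prepended to each SFT instruction. Below is a minimal sketch of such a sample builder; the tag usage, field names (`image`, `prompt`, `target`), and instruction wording are illustrative assumptions, not the repository's actual data schema.

```python
# Hypothetical SFT sample construction with a task-oriented parse tag.
# Field names and instruction text are assumptions for illustration only.

PARSE_TAG = "[Parse Target: Scientific Equations]"

def build_sft_sample(image_path: str, latex_target: str) -> dict:
    """Pair a document image with a tagged instruction and its LaTeX label."""
    return {
        "image": image_path,
        "prompt": f"{PARSE_TAG} Transcribe the formula in this image to LaTeX.",
        "target": latex_target,
    }

sample = build_sft_sample("figs/eq_001.png", r"E = mc^2")
```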
We recommend using Python 3.10 and PyTorch ≥ 2.7.
Install the environment:
```shell
# Recommend Python 3.10.18
git clone https://github.com/ScienceOne-AI/S1-Parser.git
cd S1-Parser
pip install -r requirements.txt
```

S1-Parser training proceeds in two stages with different designs:
```shell
# Stage 1: Supervised Fine-Tuning (SFT) to acquire fundamental LaTeX OCR ability.
bash scripts/run_train_ocr_sft_model.sh

# Stage 2: GRPO training to optimize LaTeX formula syntax, symbols, and structure.
bash scripts/run_train_ocr_grpo_model.sh
```
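The GRPO stage optimizes formula syntax, symbol integrity, and structure, which suggests a rule-based reward over the generated LaTeX. The sketch below shows what such a reward *could* look like; the specific checks, regexes, and 0.5/0.5 weighting are illustrative assumptions and not the repository's actual reward function.

```python
import re

def brackets_balanced(latex: str) -> bool:
    """Check that curly braces and \\left/\\right pairs are balanced."""
    depth = 0
    for ch in latex:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:  # closing brace with no opener
                return False
    lefts = len(re.findall(r"\\left\b", latex))
    rights = len(re.findall(r"\\right\b", latex))
    return depth == 0 and lefts == rights

def formula_reward(pred: str, ref: str) -> float:
    """Toy reward: syntax validity plus LaTeX-command overlap with the reference.

    Weights and checks are hypothetical placeholders.
    """
    syntax = 1.0 if brackets_balanced(pred) else 0.0
    pred_cmds = set(re.findall(r"\\[A-Za-z]+", pred))
    ref_cmds = set(re.findall(r"\\[A-Za-z]+", ref))
    overlap = len(pred_cmds & ref_cmds) / len(ref_cmds) if ref_cmds else 1.0
    return 0.5 * syntax + 0.5 * overlap
```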
Make sure to configure your model paths and data in `scripts/run_train_ocr_*.sh`.
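As a sketch, the paths at the top of each training script might be set like this; the variable names below are illustrative assumptions, so check the actual scripts for the real ones.

```shell
# Illustrative configuration only; the real variable names may differ.
MODEL_PATH=/path/to/base_model        # base multimodal checkpoint
TRAIN_DATA=/path/to/train_data.jsonl  # training set for this stage
OUTPUT_DIR=./checkpoints/sft          # where checkpoints are written
```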
We build on and reference the following open-source projects, and thank their authors for their contributions to the open-source community:
