A two-stage RL-enhanced framework that equips small language models (SLMs) for high-accuracy long-document QA.
- [2026-01-26] Our LiteCoST is accepted to ICLR'26.
Pillar 1: Chain-of-Structured-Thought (CoST) uses a high-capability LLM purely as a trace generator: it proposes a minimal structure, executes a step-wise, structure-guided trace over the documents, serializes the result, and verifies/refines it (optionally with an LLM-as-judge).
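The four stages of Pillar 1 can be sketched as a simple pipeline. This is a minimal illustration only: `call_llm` stands in for any chat-completion client, and all function names and prompts below are hypothetical placeholders, not the repository's actual API.

```python
# Minimal sketch of the CoST trace-generation pipeline (Pillar 1).
# `call_llm` stands in for any chat-completion client; every name and
# prompt here is illustrative, not the repository's actual API.

def generate_cost_trace(call_llm, question: str, documents: list[str]) -> dict:
    # 1. Structure analysis: let the LLM propose a minimal structure
    #    (table / graph / description) suited to the question.
    structure = call_llm(f"Choose a minimal structure (table/graph/description) "
                         f"for answering: {question}")

    # 2. Trace generation: execute a step-wise, structure-guided trace
    #    over each document, filling in the chosen structure.
    steps = [call_llm(f"Using a {structure}, extract evidence for "
                      f"'{question}' from:\n{doc}") for doc in documents]

    # 3. Serialization: merge the per-document results into one
    #    serialized structured output (SSO).
    sso = call_llm("Merge these partial structures into one serialized "
                   "structured output:\n" + "\n".join(steps))

    # 4. Verification / refinement: an LLM-as-judge checks the trace
    #    and refines it if inconsistencies are found.
    verdict = call_llm(f"Judge whether this trace supports the answer:\n{sso}")
    if "inconsistent" in verdict.lower():
        sso = call_llm(f"Refine the trace to fix the issues:\n{sso}\n{verdict}")

    return {"structure": structure, "trace": sso}
```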
Pillar 2: SLM fine-tuning (SFT → GRPO) trains an SLM on the CoST supervision in two phases: Supervised Fine-Tuning to learn structural patterns, formatting rules, and reasoning steps, followed by Group Relative Policy Optimization with dual signals that reward both answer/format quality and step/process consistency, transferring structure-first behavior to an efficient SLM for low-latency deployment.
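The dual-signal reward of the GRPO phase can be sketched as below. The tag format, weights, and scoring thresholds are assumptions for illustration; the actual logic lives in `src/reward.py`.

```python
# Hypothetical sketch of the dual-signal GRPO reward: one signal for
# answer/format quality, one for step/process consistency.  The tag
# format, weights, and thresholds are assumptions, not src/reward.py.
import re

def dual_reward(completion: str, gold_answer: str,
                w_outcome: float = 0.7, w_process: float = 0.3) -> float:
    # Outcome signal: a correctly formatted and correct final answer.
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    format_ok = m is not None
    answer_ok = format_ok and m.group(1).strip() == gold_answer.strip()
    outcome = 1.0 if answer_ok else (0.2 if format_ok else 0.0)

    # Process signal: non-empty reasoning steps are present
    # (here modeled as tagged <step> blocks, capped at 3 steps).
    steps = re.findall(r"<step>(.*?)</step>", completion, re.DOTALL)
    process = min(len([s for s in steps if s.strip()]) / 3.0, 1.0)

    return w_outcome * outcome + w_process * process
```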
- Structure Analysis
- Trace Generation
- Data Verification
- Data Refinement
- Supervised Fine-Tuning (SFT)
- Group Relative Policy Optimization (GRPO)
The core execution of LiteCoST is implemented in the `src` directory (see GRPO in `verl/`):
```text
src
├── convert_func.py           # Conversion function module
├── data_refinement.py        # Data refinement module
├── data_verification.py      # Data verification module
├── extract/                  # Extraction module
│   ├── graph.py              # Graph class
│   ├── main.py               # Main program
│   ├── table.py              # Table class
│   ├── to_desc.py            # Convert to description
│   ├── to_graph.py           # Convert to graph
│   └── to_table.py           # Convert to table
├── sft.py                    # SFT module
├── prompt.py                 # Prompt template module
├── reasoner.py               # Reasoning module
├── reward.py                 # Reward module
├── structure_analysis/       # Structure analysis module
│   ├── query2schema.py       # Schema construction
│   └── structure_decision.py # Structure decision
├── cal_latenct.py            # Calculate latency
└── utils.py                  # Utility functions module
```
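For intuition, serializing an extracted table structure into a textual SSO could look like the sketch below, in the spirit of `extract/to_table.py`. The function name and Markdown rendering are illustrative assumptions; the real module's API may differ.

```python
# Illustrative sketch of serializing an extracted table into a textual
# SSO, in the spirit of extract/to_table.py.  The name and the Markdown
# rendering are assumptions, not the real module's API.

def serialize_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render a table as a Markdown string usable inside a reasoning trace."""
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```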
- Generate the Serialized Structured Output

```bash
python main.py --model gpt-4o --dataset Loong --structured --document
```

- Conduct Data Verification and Refinement

```bash
cd src
python data_verification.py
python data_refinement.py
```

- Conduct SFT Training

```bash
python -m src.convert_func   # data format conversion
python -m src.sft
```

- Conduct GRPO Optimization

```bash
cd verl
bash scripts/run_grpo_cost.sh

# merge model
python scripts/model_merger.py merge --backend fsdp --local_dir checkpoints/cost-sft/cost-sft-llama3.2-3b-ins/global_step_1566/actor --target_dir merged/cost-grpo/llama3.2-3b-ins
```

1. Quick Deployment

```bash
cd Loong/src
bash vllm_example.sh
```

2. Run the pipeline

```bash
python main.py --model deployed_model --dataset Loong --structured --document
```

Efficacy of Chain-of-Structured-Thought (CoST).
Effectiveness: How good is LiteCoST for SSO Generation?
We implement our reinforcement learning algorithm by extending the veRL framework. For efficient inference, we leverage vLLM, and we develop evaluation scripts based on the Loong datasets. We sincerely thank these communities for their valuable contributions!




