This is the official repository for the paper "CALM BEFORE THE STORM: UNLOCKING NATIVE REASONING FOR OPTIMIZATION MODELING".
STORM (Smart Thinking Optimization Reasoning Model) is an advanced Large Language Model designed for automating Operations Research (OR) and optimization modeling tasks. Traditional domain adaptation methods often force models into a rigid, non-reflective generation pattern, which suppresses the powerful, native multi-step reasoning abilities of modern Large Reasoning Models (LRMs).
To address this, we introduce CALM (Corrective Adaptation with Lightweight Modification). CALM utilizes lightweight, expert-aligned hints to dynamically correct and guide a model's reasoning trajectories, rather than overwriting them. This approach generates high-quality training data that mirrors an expert's thought process.
Building on CALM, we transform a 4B parameter base model into STORM through a two-stage training pipeline: Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL).
- 🚀 SOTA Performance with High Efficiency: STORM, with only 4B parameters, achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks. Its performance matches or surpasses that of a 671B parameter model, demonstrating exceptional parameter efficiency.
- 🧠 Preserving and Enhancing Native Reasoning: Our CALM framework preserves and amplifies the model's inherent multi-step, iterative reasoning abilities through 'lightweight correction' rather than 'forced instruction,' allowing it to reason more like a true domain expert.
- 🛠️ Powerful Code-Integrated Reasoning: STORM can autonomously leverage a wide range of scientific computing libraries (e.g., pulp, sympy, numpy) during inference to aid its modeling and solving process, showcasing strong tool-use capabilities.
- 💡 Emergent Abilities: After reinforcement learning, STORM demonstrates the ability to use novel tools not seen during its training (like using rdkit for chemistry problems) to solve complex tasks, indicating powerful generalization and autonomous learning.
We highly recommend using Conda to manage your Python environment.
conda create -n storm python=3.10
conda activate storm
For high-performance inference, we support vLLM and SGLang. Please choose one to install based on your preference and environment.
Option 1: vLLM (Recommended)
pip install "vllm>=0.8.5.post1"
Option 2: SGLang
pip install "sglang>=0.4.6.post1"
These are the essential Python packages required to run this project.
pip install math_verify transformers datasets pebble
STORM's power lies in its ability to dynamically call external Python libraries to solve problems. To unlock its full potential, ensure the following common scientific computing packages are installed in your environment.
# Operations Research & Optimization Solvers
pip install pulp gurobipy cvxpy pyomo osqp scikit-optimize optuna hyperopt ortools
# Scientific Computing & Data Analysis
pip install numpy scipy sympy pandas matplotlib scikit-learn statsmodels networkx autograd torch
# Other Specialized Libraries (Optional, depending on your tasks)
pip install pymc3 pydstool shapely pygeos seaborn plotly mpmath
Important Note: The model's tool-use capabilities are open-ended. We found that when faced with specialized problems (e.g., GPQA Diamond Chemistry), STORM attempts to use more specific libraries like rdkit. Therefore, we encourage you to install other relevant scientific packages based on your application domain to further enhance the model's capabilities.
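Because the model may try to import any of these libraries at inference time, it can help to check up front which ones are actually importable in the active environment. The following is a minimal sketch using only the standard library; the `IMPORT_NAMES` mapping and the `missing_packages` helper are illustrative names (not part of this repository), and the mapping covers only a subset of the packages listed above. Note that the pip name and the import name can differ (for example, scikit-optimize is imported as skopt).

```python
from importlib.util import find_spec

# Hypothetical mapping of pip package name -> top-level import name.
# Only a subset of the packages listed above; extend it for your own domain.
IMPORT_NAMES = {
    "pulp": "pulp",
    "gurobipy": "gurobipy",
    "cvxpy": "cvxpy",
    "pyomo": "pyomo",
    "ortools": "ortools",
    "scikit-optimize": "skopt",
    "numpy": "numpy",
    "scipy": "scipy",
    "sympy": "sympy",
    "networkx": "networkx",
    "scikit-learn": "sklearn",
}


def missing_packages(import_names=IMPORT_NAMES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in import_names.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Consider installing:", " ".join(missing))
    else:
        print("All checked packages are importable.")
```

The check only confirms that a module can be located, not that it works; solvers such as gurobipy may additionally need a license.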
If you wish to create an environment identical to the one used in our experiments, you can install all dependencies from the requirements.txt file. Please be aware that this list is very extensive and includes many task-specific packages.
pip install -r requirements.txt
We have open-sourced the STORM-Qwen3-4B model weights. You can download them from either source:
- Hugging Face: tangzhy/STORM-Qwen3-4B
- ModelScope: tangzhy/STORM-Qwen3-4B
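As one possible way to fetch the Hugging Face copy programmatically, the sketch below wraps `snapshot_download` from the `huggingface_hub` package (normally installed alongside `transformers`). The `download_storm` helper and the `LOCAL_DIR` default are illustrative choices rather than part of this repository; any local directory works.

```python
REPO_ID = "tangzhy/STORM-Qwen3-4B"      # repository ID from the list above
LOCAL_DIR = "./models/STORM-Qwen3-4B"   # illustrative target directory


def download_storm(repo_id=REPO_ID, local_dir=LOCAL_DIR):
    """Download the model snapshot and return the local directory path."""
    # Imported lazily so this module loads even if huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)


# Usage (requires network access):
#     path = download_storm()
#     print("Model downloaded to", path)
```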
We provide a convenient script to reproduce the evaluation results from our paper.
The run_inference.sh script accepts three arguments:
- MODEL_NAME_OR_PATH: The local path to your downloaded model weights.
- TEST_SET_NAME: The name of the benchmark to evaluate. Options include: nl4opt, mamo_easy, mamo_complex, industryor, OptMath.
- GPU_ID: The ID of the GPU device you wish to use (e.g., 0).
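To evaluate all five benchmarks in one go, the three arguments can be assembled in a small driver. This is a hedged sketch, not a script shipped with the repository: `build_command` is a hypothetical helper that validates the test set name and returns the argument list for `run_inference.sh`, which you could then pass to `subprocess.run`.

```python
import shlex

# Benchmark names accepted as TEST_SET_NAME, as listed above.
TEST_SETS = ("nl4opt", "mamo_easy", "mamo_complex", "industryor", "OptMath")


def build_command(model_path, test_set, gpu_id=0):
    """Return the argv list for one run_inference.sh invocation."""
    if test_set not in TEST_SETS:
        raise ValueError(
            f"unknown test set {test_set!r}; expected one of {TEST_SETS}"
        )
    return ["bash", "run_inference.sh", str(model_path), test_set, str(gpu_id)]


if __name__ == "__main__":
    # Print one command per benchmark; the model path here is illustrative.
    for name in TEST_SETS:
        print(shlex.join(build_command("./models/STORM-Qwen3-4B", name)))
```

Building the command as a list rather than a single string avoids quoting problems if the model path contains spaces.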
The script (run_inference.sh):
#!/bin/bash
# $1: Local model path, e.g., /path/to/your/STORM-Qwen3-4B
# $2: Test set name, e.g., nl4opt
# $3: GPU ID to use, e.g., 0
MODEL_NAME_OR_PATH=$1
TEST_SET=test.tir_prompt.$2
INPUT_FILE="data/$TEST_SET.jsonl"
OUTPUT_TAG="STORM_infer_outputs/$TEST_SET"
MODEL_OUTPUT_DIR=$MODEL_NAME_OR_PATH/$OUTPUT_TAG
CUDA_VISIBLE_DEVICES=$3 TOKENIZERS_PARALLELISM=false python -m infer.inference_and_eval \
--input_file $INPUT_FILE \
--output_dir $MODEL_OUTPUT_DIR \
--model_name_or_path $MODEL_NAME_OR_PATH \
--engine "vllm" \
--tensor_parallel_size 1
You can switch between vllm and sglang by modifying the --engine parameter.
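The script derives its input and output locations from the first two arguments, and the results are written underneath the model directory itself. The sketch below mirrors that path logic in Python so you can see where to look for outputs; `derive_paths` is a hypothetical helper written for illustration and is not part of the repository.

```python
def derive_paths(model_path, test_set_name):
    """Mirror the path variables computed in run_inference.sh."""
    test_set = f"test.tir_prompt.{test_set_name}"       # TEST_SET
    input_file = f"data/{test_set}.jsonl"               # INPUT_FILE
    output_tag = f"STORM_infer_outputs/{test_set}"      # OUTPUT_TAG
    output_dir = f"{model_path}/{output_tag}"           # MODEL_OUTPUT_DIR
    return input_file, output_dir


if __name__ == "__main__":
    input_file, output_dir = derive_paths("./models/STORM-Qwen3-4B", "nl4opt")
    print(input_file)   # data/test.tir_prompt.nl4opt.jsonl
    print(output_dir)   # ./models/STORM-Qwen3-4B/STORM_infer_outputs/test.tir_prompt.nl4opt
```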
Assuming you have downloaded the model to ./models/STORM-Qwen3-4B and want to evaluate it on the nl4opt test set using GPU 0, run the following command:
bash run_inference.sh ./models/STORM-Qwen3-4B nl4opt 0
For our Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline, we would like to thank and reference the work presented in the CoRT paper. Their official repository can be found here: CoRT GitHub.
If you find our work helpful for your research, please consider citing our paper:
@misc{tang2025calmstormunlockingnative,
title={CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling},
author={Zhengyang Tang and Zihan Ye and Chenyu Huang and Xuhan Huang and Chengpeng Li and Sihang Li and Guanhua Chen and Ming Yan and Zizhuo Wang and Hongyuan Zha and Dayiheng Liu and Benyou Wang},
year={2025},
eprint={2510.04204},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2510.04204},
}