📚️ RIMO

This repository contains two complementary tasks:

RIMO‑N: single-step math problems scored by a numeric final answer.
RIMO‑P: multi‑part proof problems solved step‑by‑step and strictly evaluated per sub‑solution.

Datasets live under RIMO/ and runnable scripts live under code/.

Requirements

Python 3.10+
Windows users: run commands in Command Prompt (cmd) from the repo root, e.g. D:\RIMO.

Install only what you need:

pip install pandas openai transformers torch google-genai

Note

For local Transformers models, a GPU with sufficient VRAM is strongly recommended. For API usage, set provider keys as environment variables (see below). The RIMO‑P examples use the OpenAI‑compatible endpoint of Aliyun DashScope (Bailian); see the Aliyun Bailian (DashScope) console.

Repository layout

RIMO/                   # datasets (CSV/JSONL)
  RIMO-N.csv            # numeric problems
  RIMO-P.csv            # proof problems

code/                   # scripts
  RIMO_N_API.py                         # RIMO‑N via API
  RIMO_N_Open_Source.py                 # RIMO‑N via local Transformers
  RIMO_P_solve_subproblems_api.py       # RIMO‑P solver (API, sequential)
  RIMO_P_solve_subproblems_local.py     # RIMO‑P solver (local, sequential)
  RIMO_P_evaluate_solutions_deepseek_r1.py  # RIMO‑P judge via DeepSeek‑R1 (API)

🔥 RIMO‑N: Numeric final‑answer evaluation

Two options are provided: API or local model.

API (`code/RIMO_N_API.py`)

Reads RIMO/RIMO-N.csv (columns: problem_id, problem, answer).
Calls an OpenAI‑compatible chat completions API and extracts the final value inside \boxed{}.
Appends results to code/answer_qwq.csv (resumable by problem_id).

Run:

python code\RIMO_N_API.py

Configure inside the script: api_key, base_url, MODEL_NAME, CSV_OUT.

Local (`code/RIMO_N_Open_Source.py`)

Loads a local Hugging Face model (default Qwen/Qwen3-8B).
Writes code/proof_answer_qwen3_8b.csv with columns problem_id, correct_answer, llm_answer.
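
A results file with these columns could be summarised roughly as follows. This is only a sketch: the helper name `accuracy` is hypothetical, and comparing answers by exact string match after trimming whitespace is an assumption, not necessarily how the benchmark normalises numeric answers.

```python
import csv
import io


def accuracy(csv_text: str) -> float:
    """Fraction of rows whose llm_answer equals correct_answer (exact match, assumed)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    correct = sum(
        1
        for row in rows
        if (row["llm_answer"] or "").strip() == (row["correct_answer"] or "").strip()
    )
    return correct / len(rows)


sample = "problem_id,correct_answer,llm_answer\n1,42,42\n2,7,8\n"
print(accuracy(sample))
```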

Run:

python code\RIMO_N_Open_Source.py

🔥 RIMO‑P: Sub‑problem solving and strict evaluation

RIMO‑P runs in two phases: (1) produce per‑part solutions, then (2) evaluate them strictly, in order.

What RIMO‑P measures (from the paper)

A proof problem is split into up to 4 ordered sub‑problems. A model must solve them sequentially.
For problem i, let parts ∈ {1..4} be its total number of sub‑problems, and let S_i be the number of consecutive correct sub‑solutions counted from part 1.
The per‑problem score is score_i = S_i / parts. The benchmark performance P is the mean of score_i over all problems.
Evaluation is strict: the judge checks only the current step’s sub‑solution; any error, gap, or unjustified claim causes failure at that step and halts further credit for that problem.
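
The scoring rule above can be sketched in a few lines. The function names are hypothetical, and each problem is represented here as a list of per-part booleans (True for a correct sub-solution), which is an assumption about the data shape rather than the evaluator's actual format.

```python
def consecutive_correct(verdicts: list[bool]) -> int:
    """S_i: number of correct sub-solutions before the first failure."""
    count = 0
    for ok in verdicts:
        if not ok:
            break  # credit halts at the first incorrect step
        count += 1
    return count


def problem_score(verdicts: list[bool]) -> float:
    """score_i = S_i / parts, where parts is the number of sub-problems."""
    return consecutive_correct(verdicts) / len(verdicts)


def benchmark_performance(all_verdicts: list[list[bool]]) -> float:
    """P: mean of score_i over all problems."""
    scores = [problem_score(v) for v in all_verdicts]
    return sum(scores) / len(scores)


# Part 4 is correct here, but it earns no credit because part 3 failed.
print(problem_score([True, True, False, True]))
```

Note that a correct later part never compensates for an earlier failure, which is what makes the metric strict.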

Phase 1 — Produce sub‑solutions (sequential)

Key idea: solve one sub‑problem at a time and pass the previous solution forward as a proved statement.
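
A rough sketch of this chaining idea follows. The function `solve_sequentially`, the `ask_model` callable, and the prompt wording are all hypothetical; the repository's solvers may build their prompts differently. The sketch forwards only the immediately preceding sub-problem and its solution, which is one reading of "the previous solution".

```python
from typing import Callable


def solve_sequentially(
    problem: str,
    sub_problems: list[str],
    ask_model: Callable[[str], str],
) -> list[str]:
    """Solve sub-problems one per call, forwarding the previous result."""
    solutions: list[str] = []
    for index, sub_problem in enumerate(sub_problems):
        lines = [f"Problem: {problem}"]
        if index > 0:
            # Assumption: the previous part is presented as already proved.
            lines.append("Already proved statement:")
            lines.append(sub_problems[index - 1])
            lines.append(solutions[index - 1])
        lines.append(f"Now prove: {sub_problem}")
        solutions.append(ask_model("\n".join(lines)))
    return solutions


# Stand-in model for demonstration; a real run would call an API or local model.
print(solve_sequentially("P", ["part 1", "part 2"], lambda prompt: "solution text"))
```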

Inputs: RIMO/RIMO-P.csv with columns:

problem_id, problem, number_of_parts (1..4)
sub-problem1..4, sub-solution1..4 (official)

Output format (both API and local solvers):

CSV columns problem_id, parts, sub-problem1..4, llm_solution1..4.
Non‑existing parts are kept as None (sub‑problem) and N/A (solution).

Environment for API (DashScope/Bailian):

set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_KEY
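
On the Python side, a script can check for this variable before making any request. The helper below is a hypothetical sketch, not code from the repository; it only shows failing early with a clear message when the key is missing.

```python
import os
from collections.abc import Mapping


def load_dashscope_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the DashScope key from the environment, or raise if it is missing."""
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "DASHSCOPE_API_KEY is not set; run `set DASHSCOPE_API_KEY=...` first."
        )
    return key
```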

API solver (uses Qwen3 via DashScope; thinking disabled):

python code\RIMO_P_solve_subproblems_api.py RIMO\RIMO-P.csv RIMO\RIMO-P_solutions_qwen3_sequential.csv qwen3-8b 0.25

Local solver (Transformers):

python code\RIMO_P_solve_subproblems_local.py RIMO\RIMO-P.csv RIMO\RIMO-P_solutions_local_sequential.csv mistralai/Mathstral-7B-v0.1 0.25 1024

Notes:

Both solvers include a post‑processing step that fixes problem_id placeholders like row_00001 by re‑reading the original input (handles BOM‑prefixed headers such as \ufeffproblem_id).
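
For readers writing their own tooling around these CSVs, one common way to avoid the \ufeffproblem_id header is to decode with Python's `utf-8-sig` codec, which drops a leading BOM if present. This is a sketch of that general technique, with a hypothetical `read_rows` helper; it is not necessarily how the solvers implement their fix.

```python
import csv
import io


def read_rows(raw: bytes) -> list[dict[str, str]]:
    """Parse CSV bytes, stripping a UTF-8 BOM if one is present."""
    text = raw.decode("utf-8-sig")  # also works for files without a BOM
    return list(csv.DictReader(io.StringIO(text, newline="")))


with_bom = "\ufeffproblem_id,problem\nP1,Show that x = 1\n".encode("utf-8")
rows = read_rows(with_bom)
print(list(rows[0].keys()))
```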

Phase 2 — Strict evaluation with DeepSeek‑R1

Evaluator: code/RIMO_P_evaluate_solutions_deepseek_r1.py.

Loads references from RIMO/RIMO-P.csv and candidates from your produced solutions CSV.
Queries deepseek-r1 via an OpenAI‑compatible client (DashScope/Bailian).
Sequential grading: stops at first incorrect step. Computes per‑problem S_i / parts and overall performance P.
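
The sequential grading described above can be sketched as a loop that stops querying the judge after the first failure. The names `grade_problem` and `judge` are hypothetical; in the real evaluator the judge is a deepseek-r1 call, whereas here it is any callable returning True or False.

```python
from typing import Callable


def grade_problem(
    references: list[str],
    candidates: list[str],
    judge: Callable[[str, str], bool],
) -> tuple[int, float]:
    """Return (S_i, score_i) for one problem, halting at the first incorrect step."""
    verdicts: list[bool] = []
    for reference, candidate in zip(references, candidates):
        ok = judge(reference, candidate)
        verdicts.append(ok)
        if not ok:
            break  # later parts are not judged and earn no credit
    s_i = sum(verdicts)  # every verdict before a failure is True
    return s_i, s_i / len(references)


# Toy judge for demonstration only: exact string equality.
print(grade_problem(["a", "b", "c"], ["a", "x", "c"], lambda ref, cand: ref == cand))
```

Stopping early also saves judge calls, since parts after the first failure cannot change the score.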

Run:

set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_KEY
python code\RIMO_P_evaluate_solutions_deepseek_r1.py RIMO\RIMO-P.csv RIMO\RIMO-P_solutions_qwen3_sequential.csv deepseek-r1 0.25

Outputs a judged CSV next to your solutions file:

Columns: problem_id, parts, S_i, score_i, verdict1..4, reason1..4.
Prints overall P across all evaluated problems.

Recommended usage protocol (from the paper)

Do not send all sub‑problems at once to a model. Solve exactly one sub‑problem per inference and chain the previous answer forward as an already‑proved statement.
The evaluator grades strictly step‑by‑step and stops at the first error, which encourages precise, incremental reasoning.

🧩 Quickstart (API path)

Install dependencies:

pip install pandas openai transformers torch

Set DashScope (Bailian) key:

set DASHSCOPE_API_KEY=YOUR_DASHSCOPE_KEY

Solve RIMO‑P sequentially:

python code\RIMO_P_solve_subproblems_api.py RIMO\RIMO-P.csv RIMO\RIMO-P_solutions_qwen3_sequential.csv qwen3-8b 0.25

Evaluate strictly with DeepSeek‑R1:

python code\RIMO_P_evaluate_solutions_deepseek_r1.py RIMO\RIMO-P.csv RIMO\RIMO-P_solutions_qwen3_sequential.csv deepseek-r1 0.25

🤔 Troubleshooting

If you see “No problems evaluated (no references matched)”, your CSV may contain a BOM. The evaluator and solvers are designed to resolve keys like problem_id even when stored as \ufeffproblem_id; if the message persists, check that the problem_id values in the reference CSV and the solutions CSV match.
If API calls fail, check DASHSCOPE_API_KEY and connectivity to the OpenAI‑compatible endpoint (Bailian DashScope); see the Aliyun Bailian (DashScope) console.
For local models, reduce model size or max_new_tokens if you run out of memory.

📎 Citation

If you use RIMO in your work, please cite the following paper:

@misc{chen2025rimoeasytoevaluatehardtosolveolympiad,
      title={RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning}, 
      author={Ziye Chen and Chengwei Qin and Yao Shu},
      year={2025},
      eprint={2509.07711},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.07711}, 
}
