merge changes from source by rohan-rao7 · Pull Request #3 · Specter-Co/VLMEvalKit

rohan-rao7 · 2025-12-31T03:23:37Z

No description provided.

* Update run.py
* Update __init__.py
* Update image_vqa.py
* Update image_vqa.py
* Add files via upload
* Update image_vqa.py
* Update image_vqa.py
* Update mmreason.py
* Update image_vqa.py
* Update internvl_chat.py

---------

Co-authored-by: Xinyu Fang <fangxinyutju202009@126.com>

Co-authored-by: Ma Zerun <mzr1996@163.com>

* Initial commit for HiPhO
* update
* update
* refactor: simplify HiPhO dataset logging system
  - Remove custom LogBuffer class and thread-safe logging
  - Replace safe_print with standard print statements
  - Remove threading and datetime imports
  - Simplify build_prompt function by removing verbose debug output
  - Update dataset URL from haiyuanwan/HiPhO to HY-Wan/HiPhO
  - Reduce code from 899 to 803 lines (10.7% reduction)
  - Maintain all core functionality: evaluation logic, prompt building, hipho_verifier integration
* refactor: remove parallel evaluation framework from HiPhO dataset
  - Remove complex parallel evaluation using track_progress_rich
  - Simplify to sequential evaluation for better stability and debugging
  - Remove multiprocessing and parallel task management dependencies
  - Rename functions to remove '_with_buffer' suffix and log_buffer parameters
  - Remove nproc parameter handling and temporary file management
  - Reduce code from 803 to 774 lines (additional 3.6% reduction)
  - Maintain all core evaluation logic: fine/coarse-grained scoring, hipho_verifier integration
  - Sequential evaluation is sufficient for physics olympiad problem counts
* refactor: major simplification of HiPhO dataset implementation

  Major improvements:
  - Remove 6 unnecessary try-except blocks that were hiding errors
  - Standardize judge model initialization to follow VLMEvalKit conventions
  - Move all prompt templates to utils/prompt_inference.py for better organization
  - Remove redundant count statistics (fine_grained_count, coarse_grained_count, total_count)
  - Remove unused fallback functions (_simple_answer_matching, _extract_prediction_for_display)
  - Fix multi-image base64 processing bug
  - Correct dataset name display in summary output
  - Remove verbose debugging output and unnecessary comments

  Code reduction: 899 → 604 lines (32.8% reduction)

  Eliminated potential bugs and improved maintainability while preserving all core functionality
* Improve HiPhO dataset: translate comments to English and enhance configuration
  - Translate all Chinese comments to English in hipho.py, hipho_verifier.py, and prompt_inference.py
  - Simplify comments while maintaining technical accuracy
  - Replace hardcoded verifier model configuration with environment variables
  - Use VLMEvalKit standard environment variable approach for better flexibility
  - Add support for HIPHO_VERIFIER_* environment variables for model configuration
  - Improve code maintainability and international accessibility
* Add new dependencies for HiPhO dataset functionality
  - Add datasets: for HuggingFace dataset loading
  - Add scikit-learn: for machine learning utilities
  - Add pylatexenc==2.10: for LaTeX text processing
  - Add math-verify: for mathematical answer verification

  These dependencies are required for the HiPhO physics olympiad dataset evaluation and verification functionality.
* Refactor HiPhO dataset: clean up debug info, add language auto-detection, and improve code structure
  - Remove all debug print statements for cleaner output
  - Add automatic language detection for PanMechanics datasets (Chinese)
  - Translate all Chinese comments to English
  - Extract judge model configuration to top-level constants
  - Fix empty if/else blocks from debug cleanup
  - Improve code readability and maintainability
  - Update hipho_verifier to use judge_model directly instead of OpenAI client
* Fix flake8 linting errors in hipho dataset files
* Rename file from prompt_inference.py to hipho_prompt_inference.py

---------

Co-authored-by: Haodong Duan <dhd@pku.edu.cn>
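For readers unfamiliar with the pattern, the switch from a hardcoded verifier configuration to `HIPHO_VERIFIER_*` environment variables might look roughly like the sketch below. This is not the code from the commit: the specific variable names (`HIPHO_VERIFIER_MODEL`, `HIPHO_VERIFIER_API_BASE`, `HIPHO_VERIFIER_TEMPERATURE`), the `VerifierConfig` class, the `load_verifier_config` helper, and all default values are hypothetical and chosen only for illustration. Only the `HIPHO_VERIFIER_` prefix comes from the commit message.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class VerifierConfig:
    """Hypothetical container for verifier settings (not from the PR)."""
    model: str
    api_base: str
    temperature: float


def load_verifier_config(env: Optional[Mapping[str, str]] = None) -> VerifierConfig:
    """Read verifier settings from HIPHO_VERIFIER_* variables.

    The variable names and defaults below are assumptions; the commit
    message only states that variables with this prefix are supported.
    """
    env = os.environ if env is None else env
    return VerifierConfig(
        model=env.get("HIPHO_VERIFIER_MODEL", "example-judge-model"),
        api_base=env.get("HIPHO_VERIFIER_API_BASE", "http://localhost:8000/v1"),
        temperature=float(env.get("HIPHO_VERIFIER_TEMPERATURE", "0.0")),
    )


if __name__ == "__main__":
    # Falls back to the placeholder defaults unless the variables are set.
    print(load_verifier_config())
```

Passing a plain mapping instead of `os.environ` keeps the helper easy to test without touching the process environment.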


HJYao00 and others added 22 commits December 8, 2025 22:45

Add UniSVG dataset support (#1349)

5f7aa1a

Co-authored-by: Ma Zerun <mzr1996@163.com>

Add support for SArena_MINI (#1353)

18ce87c

Co-authored-by: Ma Zerun <mzr1996@163.com>

[FIX BUG] Fix bug for SArena-MINI support (#1360)

2be212c

* fix add support SArena_MINI * fix add support SArena_MINI * fix add support SArena_MINI

[Fix] Fix MMVP metric (#1369)

b77b16f

[Feat] Add telemm2.0 (#1365)

db11569

Support MMSI-Video-Bench (#1368)

240f1d7

* Support MMSI-Video-Bench * Support MMSI-Video-Bench * Support MMSI-Video-Bench * Support MMSI-Video-Bench

[Fix] Fix SArena evaluation and parallelize UniSVG evaluation. (#1374)

63fa14e

* [Fix] Fix SArena evaluation and parallelize UniSVG evaluation * Fix requirements.

update

837ae0a

Merge pull request #1376 from zhulinJulia24/fix_timeout

3fb8347

[ci] change task timeout from 30mins to 120mins

[Fix] Remove judge model restriction and add proxy support for GPT4V model (#1377)

418054f

[Fix] Fix evaluation of Physics and MM-IFEval (#1378)

a039b67

* [Fix] Fix Physics evaluation timeout * [Fix] Fix MM-IFEval evaluation.

[Fix] Avoid deciding whether to use the judge model based on OPENAI_API_KEY (#1379)

f438f87

Update pr-run-test.yml

26aee58

Update pr-run-test.yml

03c76c0

Update pr-run-test.yml

a42d914

Update pr-run-test.yml

3d2438f

Change pip install to use --user flag

70af89b

Update pr-run-test.yml

4b9ee53

Merge pull request #1387 from zhulinJulia24/add_param

242f993

[ci] use default result exact_matching

[Feat] Add tele2thinking (#1375)

8573857

rtong-3 approved these changes Dec 31, 2025


rohan-rao7 merged commit 4864dfe into Specter-Co:main Dec 31, 2025

rohan-rao7 merged 22 commits into Specter-Co:main from open-compass:main

10 participants
