
merge changes from source#3

Merged
rohan-rao7 merged 22 commits into Specter-Co:main from open-compass:main
Dec 31, 2025

Conversation

@rohan-rao7

No description provided.

HJYao00 and others added 22 commits December 8, 2025 22:45
* Update run.py

* Update __init__.py

* Update image_vqa.py

* Update image_vqa.py

* Add files via upload

* Update image_vqa.py

* Update image_vqa.py

* Update mmreason.py

* Update image_vqa.py

* Update internvl_chat.py

---------

Co-authored-by: Xinyu Fang <fangxinyutju202009@126.com>
Co-authored-by: Ma Zerun <mzr1996@163.com>
* Initial commit for HiPhO

* update

* update

* refactor: simplify HiPhO dataset logging system

- Remove custom LogBuffer class and thread-safe logging
- Replace safe_print with standard print statements
- Remove threading and datetime imports
- Simplify build_prompt function by removing verbose debug output
- Update dataset URL from haiyuanwan/HiPhO to HY-Wan/HiPhO
- Reduce code from 899 to 803 lines (10.7% reduction)
- Maintain all core functionality: evaluation logic, prompt building, hipho_verifier integration

* refactor: remove parallel evaluation framework from HiPhO dataset

- Remove complex parallel evaluation using track_progress_rich
- Simplify to sequential evaluation for better stability and debugging
- Remove multiprocessing and parallel task management dependencies
- Rename functions to remove '_with_buffer' suffix and log_buffer parameters
- Remove nproc parameter handling and temporary file management
- Reduce code from 803 to 774 lines (additional 3.6% reduction)
- Maintain all core evaluation logic: fine/coarse-grained scoring, hipho_verifier integration
- Sequential evaluation is sufficient for physics olympiad problem counts

* refactor: major simplification of HiPhO dataset implementation

Major improvements:
- Remove 6 unnecessary try-except blocks that were hiding errors
- Standardize judge model initialization to follow VLMEvalKit conventions
- Move all prompt templates to utils/prompt_inference.py for better organization
- Remove redundant count statistics (fine_grained_count, coarse_grained_count, total_count)
- Remove unused fallback functions (_simple_answer_matching, _extract_prediction_for_display)
- Fix multi-image base64 processing bug
- Correct dataset name display in summary output
- Remove verbose debugging output and unnecessary comments

Code reduction: 899 → 604 lines (32.8% reduction)
Eliminated potential bugs and improved maintainability while preserving all core functionality

* Improve HiPhO dataset: translate comments to English and enhance configuration

- Translate all Chinese comments to English in hipho.py, hipho_verifier.py, and prompt_inference.py
- Simplify comments while maintaining technical accuracy
- Replace hardcoded verifier model configuration with environment variables
- Use VLMEvalKit standard environment variable approach for better flexibility
- Add support for HIPHO_VERIFIER_* environment variables for model configuration
- Improve code maintainability and international accessibility
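The environment-variable configuration described in this commit could look roughly like the sketch below. Only the `HIPHO_VERIFIER_*` prefix comes from the commit message; the function name `load_verifier_config`, the suffixes `MODEL` and `RETRY`, and the default values are hypothetical and are not taken from hipho.py.

```python
import os

# Only the prefix is stated in the commit message; the suffixes used in the
# example at the bottom (MODEL, RETRY) are hypothetical.
PREFIX = "HIPHO_VERIFIER_"


def load_verifier_config(env=None, defaults=None):
    """Collect HIPHO_VERIFIER_* variables into a dict with lowercase keys.

    Values found in the environment override the given defaults; empty
    values are ignored so that an unset-but-exported variable keeps the default.
    """
    env = os.environ if env is None else env
    config = dict(defaults or {})
    for key, value in env.items():
        if key.startswith(PREFIX) and value:
            config[key[len(PREFIX):].lower()] = value
    return config


if __name__ == "__main__":
    # Hypothetical defaults, shown only to illustrate the override behaviour.
    print(load_verifier_config(defaults={"model": "some-judge-model", "retry": "3"}))
```

Reading the variables in one place keeps model names and credentials out of the source file, which is the flexibility the commit message refers to.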

* Add new dependencies for HiPhO dataset functionality

- Add datasets: for HuggingFace dataset loading
- Add scikit-learn: for machine learning utilities
- Add pylatexenc==2.10: for LaTeX text processing
- Add math-verify: for mathematical answer verification

These dependencies are required for the HiPhO physics olympiad dataset
evaluation and verification functionality.
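One way to confirm that the four packages listed in this commit are installed before running the HiPhO evaluation is sketched below. The distribution names are copied from the commit message; the helper `missing_distributions` is hypothetical and is not part of VLMEvalKit.

```python
from importlib import metadata

# Distribution names as listed in the commit message above.
HIPHO_DEPENDENCIES = ["datasets", "scikit-learn", "pylatexenc", "math-verify"]


def missing_distributions(names):
    """Return the names for which no installed distribution is found."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_distributions(HIPHO_DEPENDENCIES)
    if absent:
        print("Missing packages:", ", ".join(absent))
    else:
        print("All HiPhO dependencies are installed.")
```

The check reports only presence, not versions, so the `pylatexenc==2.10` pin from the commit would still need to be verified separately.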

* Refactor HiPhO dataset: clean up debug info, add language auto-detection, and improve code structure

- Remove all debug print statements for cleaner output
- Add automatic language detection for PanMechanics datasets (Chinese)
- Translate all Chinese comments to English
- Extract judge model configuration to top-level constants
- Fix empty if/else blocks from debug cleanup
- Improve code readability and maintainability
- Update hipho_verifier to use judge_model directly instead of OpenAI client
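The language auto-detection mentioned in this commit might be as simple as a check on the dataset name. The sketch below assumes that PanMechanics subsets are Chinese and every other subset is English; the function name, the `"zh"`/`"en"` codes, and the example dataset names are hypothetical and may differ from the actual implementation.

```python
def detect_prompt_language(dataset_name: str) -> str:
    """Pick a prompt language from the dataset name.

    Assumption: any dataset whose name contains "PanMechanics" is Chinese,
    as the commit message states; all other datasets default to English.
    """
    return "zh" if "panmechanics" in dataset_name.lower() else "en"


if __name__ == "__main__":
    # Hypothetical dataset names, used only for illustration.
    for name in ["PanMechanics_2025", "IPhO_2025"]:
        print(name, detect_prompt_language(name))
```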

* Fix flake8 linting errors in hipho dataset files

* Rename file from prompt_inference.py to hipho_prompt_inference.py

---------

Co-authored-by: Haodong Duan <dhd@pku.edu.cn>
* fix add support SArena_MINI

* fix add support SArena_MINI

* fix add support SArena_MINI
* Support MMSI-Video-Bench

* Support MMSI-Video-Bench

* Support MMSI-Video-Bench

* Support MMSI-Video-Bench
* [Fix] Fix SArena evaluation and parallelize UniSVG evaluation

* Fix requirements.
[ci] change task timeout from 30mins to 120mins
* [Fix] Fix Physics evaluation timeout

* [Fix] Fix MM-IFEval evaluation.
[ci] use default result exact_matching
rohan-rao7 merged commit 4864dfe into Specter-Co:main on Dec 31, 2025
10 participants