Merged
* Update run.py
* Update __init__.py
* Update image_vqa.py
* Update image_vqa.py
* Add files via upload
* Update image_vqa.py
* Update image_vqa.py
* Update mmreason.py
* Update image_vqa.py
* Update internvl_chat.py

---------

Co-authored-by: Xinyu Fang <fangxinyutju202009@126.com>
Co-authored-by: Ma Zerun <mzr1996@163.com>
* Initial commit for HiPhO
* update
* update
* refactor: simplify HiPhO dataset logging system
  - Remove custom LogBuffer class and thread-safe logging
  - Replace safe_print with standard print statements
  - Remove threading and datetime imports
  - Simplify build_prompt function by removing verbose debug output
  - Update dataset URL from haiyuanwan/HiPhO to HY-Wan/HiPhO
  - Reduce code from 899 to 803 lines (10.7% reduction)
  - Maintain all core functionality: evaluation logic, prompt building, hipho_verifier integration
* refactor: remove parallel evaluation framework from HiPhO dataset
  - Remove complex parallel evaluation using track_progress_rich
  - Simplify to sequential evaluation for better stability and debugging
  - Remove multiprocessing and parallel task management dependencies
  - Rename functions to remove '_with_buffer' suffix and log_buffer parameters
  - Remove nproc parameter handling and temporary file management
  - Reduce code from 803 to 774 lines (additional 3.6% reduction)
  - Maintain all core evaluation logic: fine/coarse-grained scoring, hipho_verifier integration
  - Sequential evaluation is sufficient for physics olympiad problem counts
* refactor: major simplification of HiPhO dataset implementation
  Major improvements:
  - Remove 6 unnecessary try-except blocks that were hiding errors
  - Standardize judge model initialization to follow VLMEvalKit conventions
  - Move all prompt templates to utils/prompt_inference.py for better organization
  - Remove redundant count statistics (fine_grained_count, coarse_grained_count, total_count)
  - Remove unused fallback functions (_simple_answer_matching, _extract_prediction_for_display)
  - Fix multi-image base64 processing bug
  - Correct dataset name display in summary output
  - Remove verbose debugging output and unnecessary comments
  Code reduction: 899 → 604 lines (32.8% reduction)
  Eliminated potential bugs and improved maintainability while preserving all core functionality
* Improve HiPhO dataset: translate comments to English and enhance configuration
  - Translate all Chinese comments to English in hipho.py, hipho_verifier.py, and prompt_inference.py
  - Simplify comments while maintaining technical accuracy
  - Replace hardcoded verifier model configuration with environment variables
  - Use VLMEvalKit standard environment variable approach for better flexibility
  - Add support for HIPHO_VERIFIER_* environment variables for model configuration
  - Improve code maintainability and international accessibility
* Add new dependencies for HiPhO dataset functionality
  - Add datasets: for HuggingFace dataset loading
  - Add scikit-learn: for machine learning utilities
  - Add pylatexenc==2.10: for LaTeX text processing
  - Add math-verify: for mathematical answer verification
  These dependencies are required for the HiPhO physics olympiad dataset evaluation and verification functionality.
* Refactor HiPhO dataset: clean up debug info, add language auto-detection, and improve code structure
  - Remove all debug print statements for cleaner output
  - Add automatic language detection for PanMechanics datasets (Chinese)
  - Translate all Chinese comments to English
  - Extract judge model configuration to top-level constants
  - Fix empty if/else blocks from debug cleanup
  - Improve code readability and maintainability
  - Update hipho_verifier to use judge_model directly instead of OpenAI client
* Fix flake8 linting errors in hipho dataset files
* Rename file from prompt_inference.py to hipho_prompt_inference.py

---------

Co-authored-by: Haodong Duan <dhd@pku.edu.cn>
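For reviewers unfamiliar with the environment-variable approach mentioned in the HiPhO commits, here is a minimal sketch of how settings sharing the `HIPHO_VERIFIER_` prefix could be gathered into a dictionary. It is not the code from this PR: the helper name `collect_verifier_settings` and the variable `HIPHO_VERIFIER_EXAMPLE` are hypothetical, and only the prefix itself comes from the commit message.

```python
import os
from typing import Dict, Mapping, Optional

# Prefix taken from the commit message; everything after it is not specified there.
PREFIX = "HIPHO_VERIFIER_"


def collect_verifier_settings(
    environ: Optional[Mapping[str, str]] = None,
    prefix: str = PREFIX,
) -> Dict[str, str]:
    """Return {lowercased_suffix: value} for every variable that starts with `prefix`.

    Hypothetical helper for illustration only; the actual HiPhO code may read
    its configuration differently.
    """
    env = os.environ if environ is None else environ
    return {
        key[len(prefix):].lower(): value
        for key, value in env.items()
        # Skip a variable whose name is exactly the bare prefix (empty suffix).
        if key.startswith(prefix) and len(key) > len(prefix)
    }


if __name__ == "__main__":
    # HIPHO_VERIFIER_EXAMPLE is a made-up name used only for this demo.
    demo_env = {"HIPHO_VERIFIER_EXAMPLE": "value", "PATH": "/usr/bin"}
    print(collect_verifier_settings(demo_env))
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the helper easy to test without touching the real process environment.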
* fix add support SArena_MINI
* fix add support SArena_MINI
* fix add support SArena_MINI
* Support MMSI-Video-Bench
* Support MMSI-Video-Bench
* Support MMSI-Video-Bench
* Support MMSI-Video-Bench
* [Fix] Fix SArena evaluation and parallelize UniSVG evaluation
* Fix requirements.
[ci] change task timeout from 30mins to 120mins
* [Fix] Fix Physics evaluation timeout
* [Fix] Fix MM-IFEval evaluation.
[ci] use default result exact_matching
rtong-3 approved these changes on Dec 31, 2025