A comprehensiveness-aware benchmark for repository-level CR evaluation.
Code review (CR) is the practice of having other developers on a team examine code written by a colleague. It aims to improve code quality and uncover defects, and it plays an important role in maintaining software quality. Prior research has proposed several CR benchmarks and automatic CR approaches, but they lack comprehensiveness and do not reflect real-world review scenarios. The rapid growth of Large Language Model (LLM) capabilities has made comprehensive CR feasible. To evaluate LLMs' performance on comprehensive CR, we construct a comprehensiveness-aware CR dataset in Python, namely SWE-CARE. The dataset is categorized into nine types, and each instance covers the full code review process and includes repository-level context. Based on this dataset, we design a framework to evaluate LLMs' performance on CR.
Follow these steps to set up the project locally.
- Clone the repository:

  git clone https://github.com/your-username/SWE-CARE.git
  cd SWE-CARE

- Install dependencies: This project uses uv for package management. Make sure you have Python 3.10 or higher.

  pip install uv
  uv sync

  Alternatively, you can use pip:

  pip install -e .

- Set up pre-commit hooks (for development): This project uses ruff for linting and formatting. The pre-commit hooks will run these checks automatically before each commit.

  pre-commit install
For a streamlined evaluation workflow, use the bootstrap script in scripts/run_eval_pipeline.py:
# Set up environment variables
export OPENAI_API_KEY="your-openai-api-key"
export LLM_EVALUATOR_OPENAI_API_KEY="your-o3-evaluation-api-key"
# Run the complete pipeline (uses default Hugging Face dataset)
python scripts/run_eval_pipeline.py \
--output-dir results/pipeline_output \
--model gpt-4o \
--model-provider openai \
--file-source oracle
# Run with local dataset file
python scripts/run_eval_pipeline.py \
--dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
--output-dir results/pipeline_output \
--model gpt-4o \
--model-provider openai \
--file-source oracle
# Use skeleton stubs for Python files (optional)
python scripts/run_eval_pipeline.py \
--dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
--output-dir results/pipeline_output \
--model gpt-4o \
--model-provider openai \
--file-source bm25 \
--k 10 \
--retrieval-output-dir results/retrieval_output \
--use-skeleton

This script automates the entire evaluation process: text generation → inference → evaluation. See scripts/README.md for detailed usage.
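Our reading of the --use-skeleton option is that it replaces full Python file contents with signature-only stubs to save context. A minimal, hypothetical sketch of how such stubs could be produced with the standard ast module (not the pipeline's actual implementation):

```python
import ast

def make_skeleton(source: str) -> str:
    """Return a signature-only stub of a Python module (illustrative sketch only)."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Keep the docstring (if present) and replace the rest of the body with `...`.
            keep = [node.body[0]] if ast.get_docstring(node) else []
            node.body = keep + [ast.Expr(value=ast.Constant(value=...))]
    return ast.unparse(tree)

print(make_skeleton('def add(a, b):\n    """Add two numbers."""\n    return a + b\n'))
```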
After running evaluations, you can generate comprehensive analysis reports:
# Generate evaluation report from pipeline results
python scripts/eval_report.py \
--dataset-name-or-path results/dataset/code_review_task_instances.jsonl \
--eval-output-dir results/pipeline_output/evaluation \
--report-output-file results/evaluation_report.json
# Or use default Hugging Face dataset
python scripts/eval_report.py \
--eval-output-dir results/pipeline_output/evaluation \
--report-output-file results/evaluation_report.json

This generates detailed statistics including:
- Model performance across different file source settings (none, oracle, bm25 with k)
- Performance breakdown by evaluator type (RuleBasedEvaluator, LLMEvaluator)
- Performance analysis by metadata categories (problem domain, difficulty, estimated review effort)
- Ranking of all model-setting configurations by average score
- Identification of missing instances (assigned score of 0 for fair comparison)
The output is a comprehensive JSON report that can be used for further analysis and visualization.
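For a quick look at the report, it can be loaded with a few lines of Python; the "ranking" key below is a hypothetical example, so check the generated file for the exact schema:

```python
import json

# Load the JSON report produced by scripts/eval_report.py.
with open("results/evaluation_report.json") as f:
    report = json.load(f)

# List the top-level sections first; the schema may differ between versions.
print("Report sections:", list(report.keys()))

# Hypothetical: print a ranking section if the report provides one.
for entry in report.get("ranking", []):
    print(entry)
```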
The data collection process involves several steps to gather and process data from GitHub. The main scripts for this process are located in src/swe_care/collect.
Here's an example of the command-line usage for each step:
- Get Top Repositories: Find the most starred repositories for a given language.

  python -m swe_care.collect get_top_repos \
    --language "Python" \
    --top-n 100 \
    --output-dir "results/top_repos" \
    --tokens "your_github_pat"

- Get Pull Request Data: Fetch PR data from a specific repository using the GitHub GraphQL API.

  python -m swe_care.collect get_graphql_prs_data \
    --repo "<repo_owner>/<repo_name>" \
    --output-dir "results/graphql_prs_data" \
    --tokens "your_github_pat" \
    --max-number 20

- Classify PRs Data: Analyze and classify PR data by evaluating commits and labeling review comments.

  Single file processing:

  python -m swe_care.collect classify_prs_data \
    --graphql-prs-data-file "results/graphql_prs_data/<repo_owner>__<repo_name>_graphql_prs_data.jsonl" \
    --output-dir "./results/classify_prs_data" \
    --tokens "your_github_pat"

  Batch processing (multiple repositories):

  python -m swe_care.collect classify_prs_data \
    --graphql-prs-data-file "results/graphql_prs_data/" \
    --output-dir "./results/classify_prs_data" \
    --tokens "your_github_pat" \
    --jobs 4

  This step combines two important analyses:
  - Commit Evaluation: Uses heuristic rules to score commits based on quality indicators (message clarity, size, review activity, etc.); a toy sketch of such a heuristic appears at the end of this section.
  - Review Comment Classification: Extracts and labels review comments according to whether the referenced lines were actually changed in the merged commit and whether the review thread is resolved, outdated, or collapsed.
- Build Code Review Dataset: Build the final dataset for the code review task. This step requires an LLM to classify metadata such as problem domain, difficulty, and review effort for each task instance.

  Single file processing:

  # Example with OpenAI GPT-4o
  export OPENAI_API_KEY=<your_openai_api_key>
  python -m swe_care.collect build_code_review_dataset \
    --graphql-prs-data-file "results/graphql_prs_data/<repo_owner>__<repo_name>_graphql_prs_data.jsonl" \
    --pr-classification-file "results/classify_prs_data/<repo_owner>__<repo_name>_pr_classification.jsonl" \
    --model "gpt-4o" \
    --model-provider "openai" \
    --model-args "temperature=0.7,top_p=0.9" \
    --output-dir "./results/dataset" \
    --tokens "your_github_pat"

  # Example with Anthropic Claude
  export ANTHROPIC_API_KEY=<your_anthropic_api_key>
  python -m swe_care.collect build_code_review_dataset \
    --graphql-prs-data-file "results/graphql_prs_data/<repo_owner>__<repo_name>_graphql_prs_data.jsonl" \
    --pr-classification-file "results/classify_prs_data/<repo_owner>__<repo_name>_pr_classification.jsonl" \
    --model "claude-3-5-sonnet-20241022" \
    --model-provider "anthropic" \
    --model-args "temperature=0.5,max_tokens=4096" \
    --output-dir "./results/dataset" \
    --tokens "your_github_pat"

  Batch processing (multiple repositories):

  export OPENAI_API_KEY=<your_openai_api_key>
  python -m swe_care.collect build_code_review_dataset \
    --graphql-prs-data-file "results/graphql_prs_data/" \
    --pr-classification-file "results/classify_prs_data/" \
    --model "gpt-4o" \
    --model-provider "openai" \
    --model-args "temperature=0.7" \
    --output-dir "./results/dataset" \
    --tokens "your_github_pat" \
    --jobs 4
Note: When using directory inputs, the tool will automatically:
- Recursively find all *_graphql_prs_data.jsonl files in the specified directory
- Match them with corresponding *_pr_classification.jsonl files
- Process multiple file pairs concurrently using the specified number of jobs
You can find more details about the arguments for each script by running python -m swe_care.collect <subcommand> -h.
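To make the commit-evaluation step concrete (see the Classify PRs Data step above), here is a toy heuristic over the indicators it mentions: message clarity, change size, and review activity. The rules and weights are invented for illustration and do not reproduce the scoring used in swe_care.collect:

```python
def score_commit(message: str, files_changed: int, lines_changed: int,
                 review_comments: int) -> float:
    """Toy commit-quality heuristic: higher is better. Thresholds are illustrative only."""
    score = 0.0
    # Message clarity: reward a non-trivial, descriptive commit message.
    if len(message.split()) >= 5 and not message.lower().startswith(("wip", "fixup")):
        score += 1.0
    # Change size: prefer focused commits over sweeping ones.
    if files_changed <= 10 and lines_changed <= 500:
        score += 1.0
    # Review activity: commits that attracted review discussion carry more signal.
    score += 0.2 * min(review_comments, 5)
    return score

print(score_commit("Fix race condition in connection pool shutdown", 2, 40, 3))
```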
This is an additional processing step that converts PR classification data to reward model training samples, separate from the main data collection pipeline.
Single file processing:
python -m swe_care.collect convert_to_rm_samples \
--graphql-prs-data-file "results/graphql_prs_data/<repo_owner>__<repo_name>_graphql_prs_data.jsonl" \
--pr-classification-file "results/classify_prs_data/<repo_owner>__<repo_name>_pr_classification.jsonl" \
--output-dir "./results/rm_samples" \
--file-source "none"

Batch processing (multiple repositories):
python -m swe_care.collect convert_to_rm_samples \
--graphql-prs-data-file "results/graphql_prs_data/" \
--pr-classification-file "results/classify_prs_data/" \
--output-dir "./results/rm_samples" \
--file-source "base_changed_files" \
--jobs 4

Using retrieval-based file sources:
# Example with retrieved_all_files (requires --retrieval-output-dir)
python -m swe_care.collect convert_to_rm_samples \
--graphql-prs-data-file "results/graphql_prs_data/" \
--pr-classification-file "results/classify_prs_data/" \
--output-dir "./results/rm_samples" \
--file-source "retrieved_all_files" \
--retrieval-output-dir "./results/retrieval_output" \
--retrieval-max-files 10 \
--jobs 2

This step converts classified PR data into training samples for reward models. Each sample contains:
- Problem Statement: Extracted from closing issues or the PR description using the extract_problem_statement utility
- Patch to Review: The actual code changes (patch) from the commit
- Positive Reviews: Review comments where referenced lines were changed in the merged commit AND the review thread is resolved
- Negative Reviews: All other review comments
- Metadata: Repository info, PR number, commit SHA, PR URL, and file source for traceability
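Conceptually, each sample bundles the fields above. The dataclass below mirrors that structure for illustration; the field names are assumptions, and the authoritative RewardModelTrainingSample definition lives in the SWE-CARE source:

```python
from dataclasses import dataclass, field

@dataclass
class RewardModelTrainingSampleSketch:
    """Illustrative stand-in for RewardModelTrainingSample; field names are assumptions."""
    problem_statement: str        # from closing issues or the PR description
    patch_to_review: str          # code changes (patch) of the commit under review
    positive_reviews: list[str]   # comments on changed lines with resolved threads
    negative_reviews: list[str]   # all other review comments
    metadata: dict = field(default_factory=dict)  # repo, PR number, commit SHA, PR URL, file source
```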
The --file-source parameter controls how file content is included in the review samples:
- none (default): Uses the default sample format without including changed file content
- base_changed_files: Includes the content of the changed files from the patch between the base commit and the commit to review in the review comment sample
- reviewed_file: Includes the content of the changed file that the review comment applies to
- retrieved_base_changed_files: Uses BM25 to retrieve relevant files from the changed files based on the diff_hunk content
- retrieved_all_files: Uses BM25 to retrieve relevant files from the entire repository based on the diff_hunk content
When --file-source is set to any option other than none, review comments will include a <code> section containing the relevant file content, providing more context for training. The retrieval-based options (retrieved_base_changed_files and retrieved_all_files) use BM25 similarity to select the most relevant files based on the review comment's diff_hunk.
Note: When using --file-source retrieved_all_files, you must also specify --retrieval-output-dir to set the directory where retrieval operations will be performed and temporary files will be stored.
Important: The retrieved_all_files file source strategy uses Pyserini for BM25 retrieval, which requires Java 21. Make sure Java 21 is installed on your system before using this option. See Pyserini installation guide for details.
The output files follow the naming pattern <repo_owner>__<repo_name>_rm_samples.jsonl and contain RewardModelTrainingSample objects with comprehensive metadata for each training instance.
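The retrieval-based file sources rank candidate files by BM25 similarity to the review comment's diff_hunk (Pyserini in the retrieved_all_files case). The sketch below shows the same idea with the rank_bm25 package and naive whitespace tokenization, purely as an illustration of the retrieval step:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def top_k_files(diff_hunk: str, files: dict[str, str], k: int = 10) -> list[str]:
    """Rank candidate files by BM25 similarity to a review comment's diff_hunk."""
    paths = list(files)
    corpus = [files[p].split() for p in paths]          # naive whitespace tokenization
    scores = BM25Okapi(corpus).get_scores(diff_hunk.split())
    ranked = sorted(zip(paths, scores), key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in ranked[:k]]

files = {"pool.py": "def acquire(self, timeout): ...", "README.md": "installation notes"}
print(top_k_files("def acquire(self, timeout=None):", files, k=1))
```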
The inference module provides two main functionalities: generating text datasets and running LLM inference on code review tasks.
Before running evaluation, you can generate text datasets from the collected SWE-CARE data with different context strategies. This creates datasets in the format required for LLM evaluation.
# Example with no file context (using default Hugging Face dataset)
python -m swe_care.inference create_code_review_text \
--output-dir "results/code_review_text" \
--file-source "none"
# Example with local dataset file
python -m swe_care.inference create_code_review_text \
--dataset-name-or-path "results/dataset/code_review_task_instances.jsonl" \
--output-dir "results/code_review_text" \
--file-source "none"
# Example with oracle file source
python -m swe_care.inference create_code_review_text \
--dataset-name-or-path "results/dataset/code_review_task_instances.jsonl" \
--output-dir "results/code_review_text" \
--file-source "oracle" \
--tokens "your_github_pat"
# Example with BM25 retrieval
python -m swe_care.inference create_code_review_text \
--dataset-name-or-path "results/dataset/code_review_task_instances.jsonl" \
--output-dir "results/code_review_text" \
--file-source "bm25" \
--k 10 \
--retrieval-output-dir "results/retrieval_output" \
--tokens "your_github_pat" \
--jobs 4
# Example with all files
python -m swe_care.inference create_code_review_text \
--dataset-name-or-path "results/dataset/code_review_task_instances.jsonl" \
--output-dir "results/code_review_text" \
--file-source "all" \
--k 20 \
--retrieval-output-dir "results/retrieval_output" \
--tokens "your_github_pat" \
--jobs 4

The --file-source parameter supports different strategies for selecting context files:
- none: No file context, only uses problem statement and patch
- oracle: Uses ground truth files (files that were actually changed in both the review commit and merged commit)
- bm25: Uses BM25 retrieval to select relevant files based on the problem statement (requires --k and --retrieval-output-dir)
- all: Uses all available files from the repository up to a specified limit (requires --k and --retrieval-output-dir)
Additional options:

- --k: Maximum number of files to include (required for the bm25 and all strategies)
- --retrieval-output-dir: Directory for retrieval operations and git repositories (required for the bm25 and all strategies)
- --tokens: GitHub Personal Access Token(s) for API access
- --jobs: Number of parallel jobs for processing (default: 2). Uses ProcessPoolExecutor for the bm25/all strategies for better parallelism
- --skip-existing: Skip existing instances in the output file based on instance_id (default: False). When specified, the tool will append to the existing output file instead of overwriting it
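Before spending inference budget, it can help to sanity-check the size of the generated prompts. The snippet below assumes each JSONL record has a "text" field holding the prompt (a guess about CodeReviewInferenceInstance; adjust to the actual field name):

```python
import json

# Rough prompt-size check over a generated text dataset.
with open("results/code_review_text/dataset__oracle.jsonl") as f:
    lengths = [len(json.loads(line).get("text", "")) for line in f]

print(f"{len(lengths)} instances, longest prompt ~{max(lengths, default=0):,} characters")
```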
After generating text datasets, you can run inference using various LLM APIs to generate code review predictions.
# Example with OpenAI GPT-4o
export OPENAI_API_KEY=<your_openai_api_key>
python -m swe_care.inference run_api \
--dataset-file "results/code_review_text/dataset__oracle.jsonl" \
--model "gpt-4o" \
--model-provider "openai" \
--model-args "temperature=0.7,top_p=0.9" \
--output-dir "results/predictions" \
--jobs 4 \
--skip-existing
# Example with Anthropic Claude
export ANTHROPIC_API_KEY=<your_anthropic_api_key>
python -m swe_care.inference run_api \
--dataset-file "results/code_review_text/dataset__oracle.jsonl" \
--model "claude-3-5-sonnet-20241022" \
--model-provider "anthropic" \
--model-args "temperature=0.5,max_tokens=4096" \
--output-dir "results/predictions" \
--jobs 2
# Example with DeepSeek
export OPENAI_API_KEY=<your_deepseek_api_key>
python -m swe_care.inference run_api \
--dataset-file "results/code_review_text/dataset__oracle.jsonl" \
--model "deepseek-chat" \
--model-provider "deepseek" \
--output-dir "results/predictions" \
--jobs 1

See python -m swe_care.inference run_api --help for the supported model providers and models.
If you are using an API provider other than the built-in ones, set export OPENAI_BASE_URL=<your_openai_base_url> or export ANTHROPIC_BASE_URL=<your_anthropic_base_url> to specify the base URL for the API before running inference.
Arguments:

- --dataset-file: Path to the text dataset file (CodeReviewInferenceInstance objects)
- --model: Model name to use for inference
- --model-provider: Model provider (openai, anthropic, deepseek, qwen, moonshot, gemini)
- --model-args: Comma-separated model arguments (e.g., temperature=0.7,top_p=0.9)
- --output-dir: Directory to save generated predictions
- --jobs: Number of parallel threads for inference (default: 2)
- --skip-existing: Skip instances that already have predictions (flag, default: False)
The generated predictions will be saved as JSONL files containing CodeReviewPrediction objects, which can then be used for evaluation.
You can find more details about the arguments for each script by running python -m swe_care.inference <subcommand> -h.
The evaluation harness is used to assess model predictions on the code review task. The main script is src/swe_care/harness/code_review_eval.py.
- LLM Evaluator (llm_evaluator): Evaluates code review quality along multiple dimensions (functionality, quality, style, documentation).
- Rule-based Evaluator (rule_based_evaluator): Extracts defects from the review text and compares them with reference defects.
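To illustrate the rule-based idea: defects are extracted from the generated review text and compared against the instance's reference defects. The extraction pattern and matching rule below are simplified placeholders, not the evaluator's actual logic:

```python
import re

def extract_defects(review_text: str) -> list[str]:
    """Placeholder extraction: treat each bullet line of the review as one defect."""
    return [m.group(1).strip().lower()
            for m in re.finditer(r"^[-*]\s+(.*)$", review_text, flags=re.MULTILINE)]

def defect_recall(predicted: list[str], reference: list[str]) -> float:
    """Fraction of reference defects mentioned (by loose substring match) in the prediction."""
    if not reference:
        return 1.0
    hits = sum(any(ref in p or p in ref for p in predicted) for ref in reference)
    return hits / len(reference)

predicted = extract_defects("- missing null check in acquire()\n- typo in docstring")
print(defect_recall(predicted, ["missing null check in acquire()"]))
```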
Note: The reward model evaluator described in the paper is still a work in progress; the reward model will be open-sourced soon.
See python -m swe_care.harness code_review_eval --help for the supported evaluators, and for the supported LLM models if you plan to use LLM-based evaluation.
export OPENAI_API_KEY=<your_openai_api_key>
python -m swe_care.harness code_review_eval \
--dataset-name-or-path "results/code_review_task_instances.jsonl" \
--predictions-path "results/predictions/dataset__gpt-4o.jsonl" \
--output-dir "./results/evaluation" \
--evaluator "llm_evaluator" \
--model "gpt-4o" \
--model-provider "openai" \
--model-args "temperature=0.0" \
--jobs 4
# Or use default Hugging Face dataset
python -m swe_care.harness code_review_eval \
--predictions-path "results/predictions/dataset__gpt-4o.jsonl" \
--output-dir "./results/evaluation" \
--evaluator "llm_evaluator" \
--model "gpt-4o" \
--model-provider "openai" \
--model-args "temperature=0.0" \
--jobs 4

Rule-based evaluation:

python -m swe_care.harness code_review_eval \
--dataset-name-or-path "results/code_review_task_instances.jsonl" \
--predictions-path "results/predictions/dataset__gpt-4o.jsonl" \
--output-dir "./results/evaluation" \
--evaluator "rule_based_evaluator" \
--jobs 4

Combined LLM-based and rule-based evaluation:

export OPENAI_API_KEY=<your_openai_api_key>
python -m swe_care.harness code_review_eval \
--dataset-name-or-path "results/code_review_task_instances.jsonl" \
--predictions-path "results/predictions/dataset__gpt-4o.jsonl" \
--output-dir "./results/evaluation" \
--evaluator "llm_evaluator" "rule_based_evaluator" \
--model "gpt-4o" \
--model-provider "openai" \
--jobs 4

Arguments:

- --dataset-name-or-path: Path to the original dataset file or Hugging Face dataset name (default: inclusionAI/SWE-CARE)
- --predictions-path: Path to the predictions file (CodeReviewPrediction objects)
- --output-dir: Directory where evaluation results will be saved
- --evaluator: One or more evaluator types to use (llm_evaluator, rule_based_evaluator)
- --model: Model name for LLM evaluation (required if using LLM-based evaluators)
- --model-provider: Model provider for LLM evaluation (required if using LLM-based evaluators)
- --model-args: Comma-separated model arguments for LLM evaluation
- --evaluator-args: Evaluator-specific arguments in the format evaluator1:arg1=value1,arg2=value2;evaluator2:arg1=value1
- --jobs: Number of parallel threads for evaluation (default: 2)
The evaluation results are saved as a JSONL file (final_report.jsonl) containing CodeReviewEvaluationResult objects with detailed metrics for each instance.
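As a convenience, the per-instance results can be averaged per evaluator with a short script. The "evaluator" and "score" field names are assumptions about CodeReviewEvaluationResult, so adjust them to the actual schema:

```python
import json
from collections import defaultdict

scores_by_evaluator: dict[str, list[float]] = defaultdict(list)
with open("results/evaluation/final_report.jsonl") as f:
    for line in f:
        result = json.loads(line)
        # "evaluator" and "score" are assumed field names; check the actual schema.
        scores_by_evaluator[result.get("evaluator", "unknown")].append(float(result.get("score", 0.0)))

for evaluator, scores in scores_by_evaluator.items():
    print(f"{evaluator}: mean score {sum(scores) / len(scores):.3f} over {len(scores)} instances")
```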
You can find more details about the arguments for each script by running python -m swe_care.harness <subcommand> -h.
If you find our work helpful, please use the following citation:
@misc{guo2025codefusecrbenchcomprehensivenessawarebenchmarkendtoend,
title={CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects},
author={Hanyang Guo and Xunjin Zheng and Zihan Liao and Hang Yu and Peng DI and Ziyin Zhang and Hong-Ning Dai},
year={2025},
eprint={2509.14856},
archivePrefix={arXiv},
primaryClass={cs.SE},
url={https://arxiv.org/abs/2509.14856},
}