Commit 7945c08

Merge branch 'main' into main

2 parents 64bb3c8 + 7268a4a

4 files changed: +95 -47 lines changed

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@ repos:
       # - id: trailing-whitespace

   - repo: https://github.com/astral-sh/ruff-pre-commit
-    rev: v0.5.5
+    rev: v0.6.4
     hooks:
       # Run the linter.
       - id: ruff

README.md

Lines changed: 34 additions & 16 deletions
@@ -6,11 +6,19 @@
 This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"

 ## 🔔News
-- **[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**
-- **[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**
+
+**[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**
+
+**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
+
+**[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**
+
+**[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.**
+
+**[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**

 ## Introduction
-SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only **4.6%** of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
+SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only **7.7%** of the problems in the most realistic setting. Broadly, SciCode reflects the realistic everyday workflow of scientists: identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards becoming helpful assistants for scientists but also helps shed light on future building and evaluation of scientific AI.


@@ -19,19 +27,24 @@ SciCode sources challenging and realistic research-level coding problems across

 ## 🏆 Leaderboard

-| Model                  | Subproblem | Main Problem |
-|------------------------|------------|--------------|
-| Claude3.5-Sonnet       | **26**     | **4.6**      |
-| GPT-4o                 | 25         | 1.5          |
-| GPT-4-Turbo            | 22.9       | 1.5          |
-| Gemini 1.5 Pro         | 21.9       | 1.5          |
-| Claude3-Opus           | 21.5       | 1.5          |
-| Deepseek-Coder-v2      | 21.2       | 3.1          |
-| Claude3-Sonnet         | 17         | 1.5          |
-| Qwen2-72B-Instruct     | 17         | 1.5          |
-| Llama-3.1-70B-Instruct | 16.3       | 1.5          |
-| Mixtral-8x22B-Instruct | 16.3       | 0            |
-| Llama-3-70B-Chat       | 14.6       | 0            |
+| Models                    | Main Problem Resolve Rate | <span style="color:grey">Subproblem</span> |
+|---------------------------|---------------------------|--------------------------------------------|
+| 🥇 OpenAI o1-preview      | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
+| 🥈 Claude3.5-Sonnet       | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
+| 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
+| Deepseek-Coder-v2         | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
+| GPT-4o                    | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
+| GPT-4-Turbo               | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
+| OpenAI o1-mini            | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.2</div> |
+| Gemini 1.5 Pro            | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.9</div> |
+| Claude3-Opus              | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.5</div> |
+| Llama-3.1-405B-Chat       | <div align="center">**1.5**</div> | <div align="center" style="color:grey">19.8</div> |
+| Claude3-Sonnet            | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
+| Qwen2-72B-Instruct        | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
+| Llama-3.1-70B-Chat        | <div align="center">**0.0**</div> | <div align="center" style="color:grey">17.0</div> |
+| Mixtral-8x22B-Instruct    | <div align="center">**0.0**</div> | <div align="center" style="color:grey">16.3</div> |
+| Llama-3-70B-Chat          | <div align="center">**0.0**</div> | <div align="center" style="color:grey">14.6</div> |
+

 ## Instructions to evaluate a new model

@@ -41,6 +54,11 @@ SciCode sources challenging and realistic research-level coding problems across
 4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
 5. Run `eval/scripts/test_generated_code.py` to evaluate the unittests

+## More information and FAQ
+
+More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
+If you have trouble reaching the website, please find the markdown source in its [GitHub repository](https://github.com/scicode-bench/scicode-bench.github.io/tree/main/docs).
+
 ## Contact
 - Minyang Tian: mtian8@illinois.edu
 - Eliu Huerta: elihu@anl.gov

eval/scripts/README.md

Lines changed: 47 additions & 17 deletions
@@ -1,42 +1,72 @@
-## **Generate LLM code**
-
-Your first need to set up your API keys. For this, create a `keys.cfg` file at the root of the repository
-and add keys as follows:
+## **Generating Code with LLMs**
+
+### 1. Set Up Your API Keys
+
+First, create a `keys.cfg` file at the root of the repository and add your API keys for the different providers as follows:

 ```
 OPENAI_KEY = 'your_api_key'
 ANTHROPIC_KEY = 'your_api_key'
-GOOGLE_KEY = 'your_api_key' 
+GOOGLE_KEY = 'your_api_key'
+```
+
+If you're using **litellm**, which supports a variety of providers including **vllm**, **Hugging Face**, and **Together AI**, make sure to include the relevant API key in the `keys.cfg` file. Please refer to the docs [here](https://docs.litellm.ai/docs/providers). Then, use `litellm/*` as the model name when running the command.
+
+For example, to use **Together AI**'s models, you'll need to add the following to your `keys.cfg`:
+
+```
+TOGETHERAI_API_KEY = 'your_api_key'
 ```
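
The `keys.cfg` entries above are flat `KEY = 'value'` lines. For illustration only, here is a minimal Python sketch of how such a file could be parsed into a dictionary; the helper name and the parsing rules are assumptions made for this example, not the loader the eval scripts actually use:

```python
from pathlib import Path


def read_keys_cfg(path: str = "keys.cfg") -> dict:
    """Parse flat KEY = 'value' lines into a dict (illustrative sketch only)."""
    keys = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blank lines, comments, and anything malformed
        name, _, value = line.partition("=")
        keys[name.strip()] = value.strip().strip("'\"")
    return keys


# Usage: read_keys_cfg().get("OPENAI_KEY")
```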

-For example, to create model results with `gpt-4o` and the default settings, go to the root of this repo and run
+### 2. Generating Code
+
+To generate code using the **Together AI** model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:
+
+```bash
+python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
+```
+
+To generate code using **GPT-4o** (with default settings), go to the root of this repo and run:

 ```bash
 python eval/scripts/gencode_json.py --model gpt-4o
 ```

-For results with scientist-annotated background, run
+If you want to include **scientist-annotated background** in the prompts, use the `--with-background` flag:

 ```bash
 python eval/scripts/gencode_json.py --model gpt-4o --with-background
 ```

+Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
+
+In this repository, **we only support evaluating with previously generated code for each step.**
 ### Command-Line Arguments

-- `--model` - Specifies the model name used for generating responses.
-- `--output-dir` - Directory to store the generated code outputs (Default: `eval_results/generated_code`).
-- `--input-path` - Directory containing the JSON files describing the problems (Default: `eval/data/problems_all.jsonl`).
-- `--prompt-dir` - Directory where prompt files are saved (Default: `eval_results/prompt`).
-- `--with-background` - Include problem background if enabled.
-- `--temperature` - Controls the randomness of the generation (Default: 0).
-
-## **Evaluate generated code**
+When running the `gencode_json.py` script, you can use the following options:
+
+- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
+- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
+- `--input-path`: Path to the JSONL file describing the problems. Default is `eval/data/problems_all.jsonl`.
+- `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
+- `--with-background`: If enabled, includes the scientist-annotated background in the prompts.
+- `--temperature`: Controls the randomness of the output. Default is 0.
+
+---
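
These flags compose freely on the command line. As a convenience, a small driver can sweep several models with the same settings by shelling out to `gencode_json.py`; the sketch below is a hypothetical helper (not part of the repository) that uses only the flags documented above:

```python
import subprocess

# Hypothetical batch driver; run from the repository root.
MODELS = [
    "gpt-4o",
    "litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
]

for model in MODELS:
    cmd = [
        "python", "eval/scripts/gencode_json.py",
        "--model", model,
        "--output-dir", "eval_results/generated_code",  # default shown above
        "--temperature", "0",                           # default shown above
        "--with-background",                            # drop this flag for the w/o-background setup
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```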

-Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
+## **Evaluating the Generated Code**

-To run the script, go to the root of this repo and use the following command:
+### 1. Download Numeric Test Data
+
+Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save the file as `eval/data/test_data.h5`.
+
+### 2. Run the Evaluation
+
+To evaluate the generated code using a specific model, go to the root of this repo and use the following command:

 ```bash
 python eval/scripts/test_generated_code.py --model "model_name"
 ```
+
+Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code was generated with **scientist-annotated background**.
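
Before launching a long evaluation run, it can be worth confirming that the downloaded HDF5 file is readable. A minimal sanity check, assuming the `h5py` package is available (the file's internal layout is not documented here, so this only lists the top-level entries):

```python
import h5py

# Verify the numeric test data landed at the expected path and opens cleanly.
with h5py.File("eval/data/test_data.h5", "r") as f:
    names = list(f.keys())
    print(f"{len(names)} top-level entries, e.g. {names[:5]}")
```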

eval/scripts/gencode_json.py

Lines changed: 13 additions & 13 deletions
@@ -8,10 +8,10 @@
 )
 from scicode.gen.models import extract_python_script, get_model_function

-
 DEFAULT_PROMPT_TEMPLATE = Path("eval", "data", "background_comment_template.txt").read_text()
 BACKGOUND_PROMPT_TEMPLATE = Path("eval", "data", "multistep_template.txt").read_text()

+
 class Gencode:
     def __init__(self, model: str, output_dir: Path,
                  prompt_dir: Path, with_background: bool, temperature: float):
@@ -57,6 +57,10 @@ def generate_response_with_steps(
             save (bool, optional): Save propmt and model response. Defaults to True.
         """
         prob_id = prob_data["problem_id"]
+        output_file_path = (
+            self.output_dir / Path(self.model).parts[-1] / self._get_background_dir()
+            / f"{prob_id}.{num_steps}.py"
+        )
         if num_steps == 1:
             self.previous_llm_code = [None] * tot_steps
         else:
@@ -69,8 +73,7 @@ def generate_response_with_steps(
                     prev_file_path = Path("eval", "data", f"{prob_id}.{prev_step+1}.txt")
                 else:
                     prev_file_path = (
-                        self.output_dir
-                        / model
+                        self.output_dir / Path(self.model).parts[-1] / self._get_background_dir()
                         / f"{prob_id}.{prev_step + 1}.py"
                     )
                 if prev_file_path.is_file():
@@ -80,6 +83,9 @@ def generate_response_with_steps(
                     self.previous_llm_code[prev_step] = function_code
                 else:
                     raise Exception(f'Generating {prob_id} step {num_steps} ahead of step {prev_step + 1}.')
+
+        if output_file_path.exists():
+            return
         prompt, previous_code = self.generate_prompt_with_steps(prob_data, num_steps, prompt_template)
         if save:
             self.save_prompt_with_steps(prob_data, prompt, num_steps)
@@ -89,16 +95,10 @@ def generate_response_with_steps(
         model_kwargs["max_tokens"] = 4096
         model_kwargs["temperature"] = self.temperature
         # write the response to a file if it doesn't exist
-        output_file_path = (
-            self.output_dir
-            / model
-            / f"{prob_id}.{num_steps}.py"
-        )
-        if not output_file_path.exists():
-            model_fct = get_model_function(model, **model_kwargs)
-            response_from_llm = model_fct(prompt)
-            self.previous_llm_code[num_steps - 1] = extract_python_script(response_from_llm)
-            self.save_response_with_steps(prob_data, response_from_llm, previous_code, num_steps)
+        model_fct = get_model_function(model, **model_kwargs)
+        response_from_llm = model_fct(prompt)
+        self.previous_llm_code[num_steps - 1] = extract_python_script(response_from_llm)
+        self.save_response_with_steps(prob_data, response_from_llm, previous_code, num_steps)

     @staticmethod
     def process_problem_code(prob_data: dict, num_steps: int) -> str:
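
Taken together, these hunks move the output-path computation to the top of `generate_response_with_steps`: the path is now built from the last component of the model name plus the background subdirectory, and the whole generation step is skipped early when that file already exists, rather than checking only around the final write. A simplified, standalone sketch of that resume-friendly pattern (the function names and the placeholder response are illustrative, not the repository's actual code):

```python
from pathlib import Path


def output_path(output_dir: Path, model: str, background_dir: str,
                prob_id: str, num_steps: int) -> Path:
    # Mirrors the layout in the diff: <output_dir>/<model last part>/<background dir>/<id>.<step>.py
    return output_dir / Path(model).parts[-1] / background_dir / f"{prob_id}.{num_steps}.py"


def generate_step(output_dir: Path, model: str, background_dir: str,
                  prob_id: str, num_steps: int) -> None:
    out = output_path(output_dir, model, background_dir, prob_id, num_steps)
    if out.exists():
        return  # already generated on a previous run; skip the expensive LLM call
    out.parent.mkdir(parents=True, exist_ok=True)
    response = "# ... model response would go here ..."  # placeholder; no LLM call in this sketch
    out.write_text(response)
```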
