
RepoQA: Evaluating Long-Context Code Understanding

🚀 Installation • 🏁 Search Needle Function • 📚 Read More

🚀 Installation

# without vLLM (can run openai, anthropic, and huggingface backends)
pip install --upgrade repoqa
# with vLLM
pip install --upgrade "repoqa[vllm]"
⏬ Install nightly version
pip install --upgrade "git+https://github.com/evalplus/repoqa.git"                 # without vLLM
pip install --upgrade "repoqa[vllm] @ git+https://github.com/evalplus/repoqa@main" # with vLLM
⏬ Using RepoQA as a local repo?
git clone https://github.com/evalplus/repoqa.git
cd repoqa
export PYTHONPATH=$PYTHONPATH:$(pwd)
pip install -r requirements.txt
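With the repo on your PYTHONPATH rather than installed, the repoqa.* commands are not registered; a minimal sketch, assuming each entry point maps to a module of the same name that can be run directly:

python -m repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" \
                                        --caching --backend vllm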

🏁 Search Needle Function

Inference with OpenAI-Compatible Servers

repoqa.search_needle_function --model "gpt4-turbo" --caching --backend openai
# 💡 If you use customized server such vLLM:
# repoqa.search_needle_function --base-url "http://url.to.vllm.server/v1" \
#                               --model "gpt4-turbo" --caching --backend openai
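The hosted OpenAI API reads its key from the environment (the official openai client checks OPENAI_API_KEY by default). For a self-hosted endpoint, one way to bring one up is vLLM's OpenAI-compatible server; the model, port, and URL below are illustrative:

export OPENAI_API_KEY="..."   # hosted API only; the openai client reads this by default
python -m vllm.entrypoints.openai.api_server --model "Qwen/CodeQwen1.5-7B-Chat" --port 8000
repoqa.search_needle_function --base-url "http://localhost:8000/v1" \
                              --model "Qwen/CodeQwen1.5-7B-Chat" --caching --backend openai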

Inference with Anthropic-Compatible Servers

repoqa.search_needle_function --model "claude-3-haiku-20240307" --caching --backend anthropic

Inference with vLLM

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" \
                              --caching --backend vllm
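For models too large for a single GPU, the --tensor-parallel-size flag (documented under Usage below) shards the model across devices:

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" \
                              --caching --backend vllm --tensor-parallel-size 2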

Inference with Hugging Face transformers

repoqa.search_needle_function --model "gpt2" "Qwen/CodeQwen1.5-7B-Chat" \
                              --caching --backend hf --trust-remote-code

Usage

Tip

  • Input (a combined example follows this list):
    • --model: Hugging Face model ID, such as ise-uiuc/Magicoder-S-DS-6.7B
    • --backend: vllm (default), openai, anthropic, or hf
    • --base-url: OpenAI API base URL (for OpenAI-compatible servers)
    • --code-context-size (default: 16384): Number of code tokens (counted with the DeepSeekCoder tokenizer) in the long context
    • --caching (default: False): if enabled, tokenization and chunking results are cached to accelerate subsequent runs
    • --max-new-tokens (default: 1024): Maximum number of new tokens to generate
    • --system-message (default: None): if given, the model is prompted with a system message (note that some models do not support system messages)
    • --tensor-parallel-size: Degree of tensor parallelism (vLLM only)
    • --languages (default: None): List of languages to evaluate (None means all)
    • --result-dir (default: "results"): Directory to save the model outputs and evaluation results
  • Output:
    • results/ntoken_{code-context-size}/{model}.jsonl: Model generated outputs
    • results/ntoken_{code-context-size}/{model}-SCORES.json: Evaluation scores (also see Compute Scores)
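Putting the flags together, a run restricted to one language with a larger context might look like this (model and values are illustrative; exact list-flag syntax follows the CLI's parser):

repoqa.search_needle_function --model "Qwen/CodeQwen1.5-7B-Chat" \
                              --backend vllm --caching \
                              --code-context-size 32768 \
                              --languages python
# outputs land in results/ntoken_32768/ per the layout above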

Compute Scores

By default, the repoqa.search_needle_function command will also compute scores after producing model outputs. However, you can also compute scores separately using the following command:

repoqa.compute_score --model-output-path={model-output}.jsonl

Tip

  • Input: Path to the model-generated outputs (.jsonl).
  • Output: The evaluation scores are stored in {model-output}-SCORES.json
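To score every output file under a result directory in one pass (the path pattern follows the default layout above):

for f in results/ntoken_16384/*.jsonl; do
  repoqa.compute_score --model-output-path "$f"
done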

📚 Read More
