Very low pass@1 #13

Closed
@marianna13

Issue

Hey everyone,

I was trying to evaluate some models on BigCodeBench, but I get a very low pass@1 (much lower than what has been reported for this model), along with this warning:

BigCodeBench-Complete-calibrated
Groundtruth pass rate: 0.000
Please be cautious!
pass@1: 0.033
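
For reference, with one greedy sample per task, pass@1 reduces to the fraction of tasks whose single sample passes all tests. Below is a minimal sketch of that arithmetic. The helper name `pass_at_1` is hypothetical, and the counts (38 of 1140) are illustrative values chosen to match the 0.033 above, not numbers taken from the evaluator output.

```shell
# Hypothetical helper: pass@1 when each task has exactly one sample.
pass_at_1() {
    # $1 = number of tasks whose single sample passed, $2 = total number of tasks
    awk -v p="$1" -v t="$2" 'BEGIN { printf "%.3f\n", p / t }'
}

pass_at_1 38 1140   # illustrative counts only
```

Read this way, a score of 0.033 means only about 3 in 100 tasks passed.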

For reproduction

I used granite-3b-code-base in this setup, but the result was the same for the other models I tried (stablelm-1.6b, granite-8b-code-base).
For both Apptainer images I used the Docker images mentioned in this repo, both at their latest versions.

My evaluation command:

IMAGE="/p/scratch/ccstdl/marianna/bigcodebench-evaluate_latest.sif"
SUBSET="complete"
SAVE_PATH="/p/scratch/ccstdl/marianna/bigcodebench_results/ibm-granite/granite-3b-code-base_bigcodebench_complete_0.0_1_vllm-sanitized-calibrated.jsonl"

CMD="apptainer -v run --bind $CONTAINER_HOME:/app,/tmp $IMAGE \
    --subset $SUBSET \
    --max-data-limit 16000 \
    --samples $SAVE_PATH "

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD
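
Before running the evaluation, a quick check can confirm that the file passed to `--samples` exists and is non-empty. This is a minimal sketch that assumes the samples file is JSONL with one sample per line. The helper name `check_samples` is hypothetical and not part of BigCodeBench.

```shell
# Hypothetical helper (not part of BigCodeBench): report whether a samples
# file exists and is non-empty, and how many lines it contains.
check_samples() {
    local path="$1"
    if [ ! -s "$path" ]; then
        echo "missing or empty: $path"
        return 1
    fi
    # grep -c '' counts lines without the padding some wc implementations add
    echo "$(grep -c '' "$path") lines in $path"
}

# Usage, assuming SAVE_PATH is set as in the command above:
# check_samples "$SAVE_PATH"
```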

My generation command:

IMAGE="/p/scratch/ccstdl/marianna/bigcodebench-generate_latest.sif"
MODEL="ibm-granite/granite-3b-code-base"
MODELS_DIR="/marianna/models/"
SUBSET="complete"
BS=1
TEMPERATURE=0.0
N_SAMPLES=1
NUM_GPUS=4
SAVE_DIR="/p/scratch/ccstdl/marianna/bigcodebench_results"
BACKEND="vllm"
SAVE_PATH="${SAVE_DIR}/${MODEL}_bigcodebench_${SUBSET}_${TEMPERATURE}_${N_SAMPLES}_${BACKEND}.jsonl"


CMD="apptainer -v run --nv --bind $(pwd):/app $IMAGE \
        --subset $SUBSET \
        --model $MODELS_DIR/$MODEL \
        --greedy \
        --temperature $TEMPERATURE \
        --n_samples $N_SAMPLES \
        --backend $BACKEND \
        --tp $NUM_GPUS \
        --trust_remote_code \
        --resume \
        --save_path $SAVE_PATH"

srun --cpus-per-task=$SLURM_CPUS_PER_TASK $CMD

Please let me know if it's an issue on my side or what I can do to solve it! Thanks in advance!
