Cannot reproduce the result for bert-base-uncased, avg_first_last setting #285

Closed
Description

@kuriyan1204

@gaotianyu1350
Hi, thank you for the great work and for publishing such clean code!
I have a question about reproducing the STS results for pre-trained BERT models.

When I run the following command in my environment, I get higher STS scores compared to the results reported in your paper.
Do you have any idea what might be causing the discrepancy?

Code executed

python evaluation.py \
    --model_name_or_path bert-base-uncased \
    --pooler avg_first_last \
    --task_set sts \
    --mode test
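For reference, here is my understanding of what the `avg_first_last` pooler computes (a minimal numpy sketch under that assumption, not the repository's actual code): average the embedding-layer and last-layer hidden states per token, then mean-pool over the non-padding tokens.

```python
import numpy as np

def avg_first_last(first_hidden, last_hidden, attention_mask):
    """Sketch of avg_first_last pooling.

    first_hidden, last_hidden: (batch, seq_len, hidden) hidden states
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Returns (batch, hidden) sentence embeddings.
    """
    avg = (first_hidden + last_hidden) / 2.0   # average the two layers per token
    mask = attention_mask[..., None]           # (batch, seq_len, 1) for broadcasting
    summed = (avg * mask).sum(axis=1)          # sum over real tokens only
    counts = mask.sum(axis=1)                  # number of real tokens per sentence
    return summed / counts

# toy example: batch of 1, 3 tokens (the last is padding), hidden size 2
first = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
last = np.array([[[5.0, 6.0], [7.0, 8.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(avg_first_last(first, last, mask))  # [[4. 5.]]
```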

Results

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 45.09 | 64.30 | 54.56 | 70.52 | 67.87 | 59.05        | 63.75           | 60.73 |

Expected results (scores shown in your paper)

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 39.70 | 59.38 | 49.67 | 66.03 | 66.19 | 53.87        | 62.06           | 56.70 |
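For context on how I read these numbers: as far as I understand, the evaluation reports Spearman's rank correlation (x100) between the cosine similarities of the sentence-embedding pairs and the gold scores. A minimal sketch of that metric, ignoring ties and the per-subset aggregation that the real SentEval code performs:

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (n, d) embedding matrices."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# toy sentence pairs whose similarities are perfectly monotone with the gold scores
a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
b = np.array([[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]])
gold = np.array([5.0, 4.0, 0.0])
print(round(spearman(cosine_sim(a, b), gold), 2))  # → 1.0
```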

Strangely, I can fully reproduce the scores for the SimCSE models with the following command:

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test
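For comparison, the `cls` pooler (as I understand it) simply takes the last-layer hidden state of the first token ([CLS]) as the sentence embedding; a sketch:

```python
import numpy as np

def cls_pool(last_hidden):
    """Sketch of the cls pooler: the [CLS] (first) token's last-layer state."""
    return last_hidden[:, 0, :]

last = np.array([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]])  # (batch=1, seq=3, hidden=2)
print(cls_pool(last))  # [[0.1 0.2]]
```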

Here is the output of pip freeze; I am using a single NVIDIA RTX 6000 Ada GPU.
Thank you very much for your help!

pip freeze result
  aiofiles==23.2.1
  aiohappyeyeballs==2.4.3
  aiohttp==3.10.10
  aiosignal==1.3.1
  annotated-types==0.7.0
  anyio==4.5.0
  async-timeout==4.0.3
  attrs==24.2.0
  certifi==2024.8.30
  charset-normalizer==3.4.0
  click==8.1.7
  contourpy==1.1.1
  cycler==0.12.1
  datasets==3.0.1
  dill==0.3.8
  exceptiongroup==1.2.2
  fastapi==0.115.2
  ffmpy==0.4.0
  filelock==3.16.1
  fonttools==4.54.1
  frozenlist==1.4.1
  fsspec==2024.6.1
  gradio==4.44.1
  gradio-client==1.3.0
  h11==0.14.0
  httpcore==1.0.6
  httpx==0.27.2
  huggingface-hub==0.25.2
  idna==3.10
  importlib-resources==6.4.5
  jinja2==3.1.4
  joblib==1.4.2
  kiwisolver==1.4.7
  markdown-it-py==3.0.0
  MarkupSafe==2.1.5
  matplotlib==3.7.5
  mdurl==0.1.2
  multidict==6.1.0
  multiprocess==0.70.17
  numpy==1.24.4
  orjson==3.10.7
  packaging==24.1
  pandas==2.0.3
  pillow==10.4.0
  prettytable==3.11.0
  propcache==0.2.0
  pyarrow==17.0.0
  pydantic==2.9.2
  pydantic-core==2.23.4
  pydub==0.25.1
  pygments==2.18.0
  pyparsing==3.1.4
  python-dateutil==2.9.0.post0
  python-multipart==0.0.12
  pytz==2024.2
  PyYAML==6.0.2
  regex==2024.9.11
  requests==2.32.3
  rich==13.9.2
  ruff==0.6.9
  sacremoses==0.1.1
  safetensors==0.4.5
  scikit-learn==1.3.2
  scipy==1.10.1
  semantic-version==2.10.0
  shellingham==1.5.4
  six==1.16.0
  sniffio==1.3.1
  starlette==0.39.2
  threadpoolctl==3.5.0
  tokenizers==0.9.4
  tomlkit==0.12.0
  torch==1.7.1+cu110
  torchtyping==0.1.5
  tqdm==4.66.5
  transformers==4.2.1
  typeguard==2.13.3
  typer==0.12.5
  typing-extensions==4.12.2
  tzdata==2024.2
  urllib3==2.2.3
  uvicorn==0.31.1
  wcwidth==0.2.13
  websockets==12.0
  xxhash==3.5.0
  yarl==1.15.1
  zipp==3.20.2
