Cannot reproduce the result for bert-base-uncased, avg_first_last setting #285

Closed
Description

@kuriyan1204

@gaotianyu1350
Hi, thank you for the great work and for publishing such clean code!
I have a question about reproducing the STS results for pre-trained BERT models.

When I run the following command in my environment, I get higher STS scores compared to the results reported in your paper.
Do you have any idea what might be causing the discrepancy?

Code executed

python evaluation.py \
    --model_name_or_path bert-base-uncased \
    --pooler avg_first_last \
    --task_set sts \
    --mode test
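For reference, here is my understanding of what the `avg_first_last` pooler computes (a minimal numpy sketch under that assumption, not the repository's actual code): average the embedding-layer and last-layer hidden states per token, then mean-pool over the non-padding tokens.

```python
import numpy as np

def avg_first_last(first_hidden, last_hidden, attention_mask):
    """Sketch of avg_first_last pooling.

    first_hidden, last_hidden: (batch, seq_len, hidden) hidden states
    attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    Returns (batch, hidden) sentence embeddings.
    """
    avg = (first_hidden + last_hidden) / 2.0   # average the two layers per token
    mask = attention_mask[..., None]           # (batch, seq_len, 1) for broadcasting
    summed = (avg * mask).sum(axis=1)          # sum over real tokens only
    counts = mask.sum(axis=1)                  # number of real tokens per sentence
    return summed / counts

# toy example: batch of 1, 3 tokens (the last is padding), hidden size 2
first = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
last = np.array([[[5.0, 6.0], [7.0, 8.0], [9.0, 9.0]]])
mask = np.array([[1, 1, 0]])
print(avg_first_last(first, last, mask))  # [[4. 5.]]
```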

Results

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 45.09 | 64.30 | 54.56 | 70.52 | 67.87 | 59.05        | 63.75           | 60.73 |

Expected results (scores shown in your paper)

| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness | Avg.  |
|-------|-------|-------|-------|-------|--------------|-----------------|-------|
| 39.70 | 59.38 | 49.67 | 66.03 | 66.19 | 53.87        | 62.06           | 56.70 |
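For context on how I read these numbers: as far as I understand, the evaluation reports Spearman's rank correlation (x100) between the cosine similarities of the sentence-embedding pairs and the gold scores. A minimal sketch of that metric, ignoring ties and the per-subset aggregation that the real SentEval code performs:

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise cosine similarity between two (n, d) embedding matrices."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks (no ties assumed)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx * ry).sum() / np.sqrt((rx ** 2).sum() * (ry ** 2).sum()))

# toy sentence pairs whose similarities are perfectly monotone with the gold scores
a = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
b = np.array([[1.0, 0.1], [1.0, 0.0], [1.0, 0.0]])
gold = np.array([5.0, 4.0, 0.0])
print(round(spearman(cosine_sim(a, b), gold), 2))  # → 1.0
```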

Strangely, I can fully reproduce the scores for the SimCSE models with the following command:

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test
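For comparison, the `cls` pooler (as I understand it) simply takes the last-layer hidden state of the first token ([CLS]) as the sentence embedding; a sketch:

```python
import numpy as np

def cls_pool(last_hidden):
    """Sketch of the cls pooler: the [CLS] (first) token's last-layer state."""
    return last_hidden[:, 0, :]

last = np.array([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]])  # (batch=1, seq=3, hidden=2)
print(cls_pool(last))  # [[0.1 0.2]]
```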

Here is the output of pip freeze; I am using a single NVIDIA RTX 6000 Ada GPU.
Thank you very much for your help!

pip freeze result
  aiofiles==23.2.1
  aiohappyeyeballs==2.4.3
  aiohttp==3.10.10
  aiosignal==1.3.1
  annotated-types==0.7.0
  anyio==4.5.0
  async-timeout==4.0.3
  attrs==24.2.0
  certifi==2024.8.30
  charset-normalizer==3.4.0
  click==8.1.7
  contourpy==1.1.1
  cycler==0.12.1
  datasets==3.0.1
  dill==0.3.8
  exceptiongroup==1.2.2
  fastapi==0.115.2
  ffmpy==0.4.0
  filelock==3.16.1
  fonttools==4.54.1
  frozenlist==1.4.1
  fsspec==2024.6.1
  gradio==4.44.1
  gradio-client==1.3.0
  h11==0.14.0
  httpcore==1.0.6
  httpx==0.27.2
  huggingface-hub==0.25.2
  idna==3.10
  importlib-resources==6.4.5
  jinja2==3.1.4
  joblib==1.4.2
  kiwisolver==1.4.7
  markdown-it-py==3.0.0
  MarkupSafe==2.1.5
  matplotlib==3.7.5
  mdurl==0.1.2
  multidict==6.1.0
  multiprocess==0.70.17
  numpy==1.24.4
  orjson==3.10.7
  packaging==24.1
  pandas==2.0.3
  pillow==10.4.0
  prettytable==3.11.0
  propcache==0.2.0
  pyarrow==17.0.0
  pydantic==2.9.2
  pydantic-core==2.23.4
  pydub==0.25.1
  pygments==2.18.0
  pyparsing==3.1.4
  python-dateutil==2.9.0.post0
  python-multipart==0.0.12
  pytz==2024.2
  PyYAML==6.0.2
  regex==2024.9.11
  requests==2.32.3
  rich==13.9.2
  ruff==0.6.9
  sacremoses==0.1.1
  safetensors==0.4.5
  scikit-learn==1.3.2
  scipy==1.10.1
  semantic-version==2.10.0
  shellingham==1.5.4
  six==1.16.0
  sniffio==1.3.1
  starlette==0.39.2
  threadpoolctl==3.5.0
  tokenizers==0.9.4
  tomlkit==0.12.0
  torch==1.7.1+cu110
  torchtyping==0.1.5
  tqdm==4.66.5
  transformers==4.2.1
  typeguard==2.13.3
  typer==0.12.5
  typing-extensions==4.12.2
  tzdata==2024.2
  urllib3==2.2.3
  uvicorn==0.31.1
  wcwidth==0.2.13
  websockets==12.0
  xxhash==3.5.0
  yarl==1.15.1
  zipp==3.20.2
