Potential Bug: Improvement in 3-Digit Addition Baseline by Adjusting Prompt Formatting #1

Open
@james016

Hello there,

First off, thank you for the amazing work on "Teaching Arithmetic to Small Transformers" and for sharing the code. I'm a researcher who's been exploring your methods, and I find them quite enlightening.

However, while running your baseline tests and doing some bad-case analysis, I believe I've come across a potential bug that might affect the results presented in your paper.

The Main Issue:

My primary observation is that the lower accuracy rate for the plain baseline could be attributed to the prompt formatting used during the testing phase. After a small adjustment, the accuracy jumps from 87.27% to 95.58% without retraining the model. I suspect that if the model is retrained and the best-performing model on the validation set is selected, the accuracy could go up to around 97%, which is comparable to the 'plain2' model's results.

Explanation:

The change is quite simple: prepend a newline character (\n) to the existing prompt at test time. I found this after noticing that, even on the training dataset, the plain method reaches only about 90% accuracy, which seemed odd to me.

Upon further analysis, I found that the issue mostly occurs with prompts of the form A2A1+C3C2C1= or A1+C3C2C1=, that is, where the first operand has fewer than three digits. GPT, being a next-token predictor, can sometimes match the input to a similar-looking equation whose first operand has extra leading digits. For example, if the test prompt is 1+234= (correct answer 235) and the training dataset contains \n21+234=255, the model may incorrectly produce 1+234=255.
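To make the collision concrete, here is a minimal sketch of how one might list the training equations that a prompt can be confused with. The helper name `colliding_examples` and the tiny training string are made up for illustration; they are not taken from the repository.

```python
def colliding_examples(prompt, training_text):
    # Return training equations whose left-hand side ends with `prompt`
    # but is not identical to it, e.g. "21+234=" ends with "1+234=".
    hits = []
    for line in training_text.split("\n"):
        if "=" not in line:
            continue  # skip empty lines
        lhs = line[: line.index("=") + 1]  # everything up to and including "="
        if lhs.endswith(prompt) and lhs != prompt:
            hits.append(line)
    return hits


# Hypothetical newline-separated training data (illustration only).
training_text = "\n21+234=255\n7+8=15\n121+234=355\n1+234=235"
print(colliding_examples("1+234=", training_text))
# ['21+234=255', '121+234=355']
```

Every equation returned here is a place where the un-prefixed prompt appears in the training text followed by a different answer.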

Adding a newline character at the beginning, as in \n1+234=, prevents this issue. The newline marks the start of an equation, so the model can no longer match \n1+234= against the tail of \n21+234=255, which substantially improves accuracy.
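A small sketch of the fix, assuming training examples are newline-separated as described above. The names `with_leading_newline` and `continuations` and the sample text are hypothetical, and this only checks raw substring matches rather than running the model.

```python
import re


def with_leading_newline(prompt):
    # Prepend a newline unless the prompt already starts with one.
    return prompt if prompt.startswith("\n") else "\n" + prompt


def continuations(prompt, training_text):
    # Collect every answer that follows `prompt` as a raw substring
    # of the training text.
    pattern = re.escape(prompt) + r"(\d+)"
    return set(re.findall(pattern, training_text))


# Hypothetical training text (illustration only).
training_text = "\n21+234=255\n1+234=235\n"

# Without the newline, the prompt also matches inside "21+234=255".
print(sorted(continuations("1+234=", training_text)))
# ['235', '255']

# With the newline, only the intended equation matches.
print(sorted(continuations(with_leading_newline("1+234="), training_text)))
# ['235']
```

With the prefix, the prompt has a single continuation in the training text, which is consistent with the accuracy gain reported above.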

I hope this observation is useful and I would love to know your thoughts on it.

Best regards,
Su
