Description
Hello there,
First off, thank you for the amazing work on "Teaching Arithmetic to Small Transformers" and for sharing the code. I'm a researcher who's been exploring your methods, and I find them quite enlightening.
However, while running your baseline tests and doing some failure-case analysis, I believe I've come across a potential bug that might affect the results presented in your paper.
The Main Issue:
My primary observation is that the lower accuracy of the plain baseline appears to be caused by the prompt formatting used during testing. After a small adjustment, the accuracy jumps from 87.27% to 95.58% without retraining the model. I suspect that if the model is retrained and the best-performing checkpoint on the validation set is selected, the accuracy could reach around 97%, which is comparable to the 'plain2' model's results.
Explanation:
The change is quite simple: just prepend a newline character (`\n`) to the existing prompt during testing. I discovered this after noticing that even on the training dataset, the plain method only achieved around 90% accuracy, which seemed odd to me.
Upon further analysis, I found that the issue mostly occurs with arithmetic tasks of the form `A2A1+C3C2C1=` or `A1+C3C2C1=`, where GPT, being a next-token predictor, can match the input to an incorrect but similar-looking arithmetic equation from the training data. For example, if the test prompt is `1+234=` (correct answer 235) and the training dataset contains `\n21+234=255`, the model may incorrectly produce `1+234=255`.
Adding a newline character at the beginning, as in `\n1+234=`, prevents this issue: the model can no longer match `\n1+234=` against `\n21+234=255`, which substantially improves accuracy.
I hope this observation is useful and I would love to know your thoughts on it.
Best regards,
Su