Description
Hello again,
I recently submitted an issue titled "Potential Bug: Improvement in 3-Digit Addition Baseline by Adjusting Prompt Formatting" and would like to follow up with another query related to your "Teaching Arithmetic to Small Transformers" work.
Concerns:
I have some questions about the principal argument in your paper, which suggests that producing the most significant digit first requires a more global approach, making the task significantly harder to learn.
When examining a 3-digit addition task like (A3A2A1 + B3B2B1 = C3C2C1), the paper claims that (C3) would require comprehensive, global information. However, in most instances, (C3) can be computed using only (A3), (B3), and the carry from (A2 + B2). Information from all digits is needed only when (A2 + B2 = 9), because only then does the carry out of the tens column depend on (A1 + B1).
For the reverse method (A3A2A1 + B3B2B1 = C1C2C3), the computation for (C3) seems similarly dependent on carries from (B2), (A2), and (C2). Therefore, it's unclear to me why there would be a substantial difference in complexity between the plain and reverse methods for calculating (C3).
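To make the counting behind this argument concrete, here is a minimal sketch. It is not taken from the paper's code, and the function names are my own illustrative choices. It assumes operands are written with leading zeros, so every digit ranges over 0 to 9, and it checks for which tens-digit pairs the carry into the hundreds column changes with the units digits.

```python
from itertools import product


def carry_into_hundreds(a2, a1, b2, b1):
    """Return True if adding the two low-order digit pairs carries into the hundreds column."""
    return (10 * a2 + a1) + (10 * b2 + b1) >= 100


def needs_units_digits(a2, b2):
    """Return True if the carry into the hundreds column depends on (A1, B1)."""
    outcomes = {
        carry_into_hundreds(a2, a1, b2, b1)
        for a1, b1 in product(range(10), repeat=2)
    }
    # Two distinct outcomes mean the units digits decide the carry.
    return len(outcomes) > 1


# Tens-digit pairs for which C3 needs information from every column.
global_pairs = [
    (a2, b2)
    for a2, b2 in product(range(10), repeat=2)
    if needs_units_digits(a2, b2)
]
fraction = len(global_pairs) / 100

print(global_pairs)
print(fraction)
```

Under these assumptions the check flags only the 10 pairs with A2 + B2 = 9, which is 10% of all tens-digit combinations. In the remaining 90%, the carry into the hundreds column is fixed by (A2) and (B2) alone.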
Additional Evidence:
In my earlier investigation, I observed that bad cases from the plain2 method rarely included situations where (A2 + B2 = 9). This leads me to wonder if the primary reason the reverse method performs better might differ from what is discussed in the paper.
I'm eager to hear your insights on this matter.
Best regards,
Su Wang