Skip to content

Fix parsing of integer literals with base prefix #106

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

wnienhaus
Copy link
Collaborator

MicroPython 1.25.0 introduced a breaking change, aligning the behaviour of the int() function closer to the behaviour of CPython (something along the lines of: strings are assumed to represent a decimal number, unless a base is specified. if a base of 0 is specified, is the base is inferred from the string)

This broke our parsing logic, which relied on the previous behaviour of the int() function to automatically determine the base of the string literal, based on a base prefix present in the string. Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

Additionally, we never actually parsed octal in the format 0100 correctly - even before this PR; that number would have been interpreted as 100 rather than 64.

So, to fix this, and to ensure our parsing matches the GNU assembler, this PR implements a custom parse_int() function, using the base prefix in a string to determine the correct base to pass to int(). The following are supported:

  • 0x -> treated as hex
  • 0b -> treated as binary
  • 0... -> treated as octal
  • 0o -> treated as octal
  • anything else parsed as decimal

The parse_int method also supports the negative prefix operator for all of the above cases.

This change also ensures .int, .long, .word directives correctly handle the above mentioned formats. This fixes the issue described in #104.

Note: GNU as does not actually accept the octal prefix 0o..., but we accept it as a convenience, as this is accepted in Python code. This means however, that our assembler accepts code which GNU as does not accept. But the other way around, we still accept all code that GNU as accepts, which was one of our goals.

@wnienhaus wnienhaus self-assigned this Jun 19, 2025
@wnienhaus wnienhaus requested a review from ThomasWaldmann June 19, 2025 20:25
@wnienhaus wnienhaus removed their assignment Jun 19, 2025
@wnienhaus wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from 9452423 to 23f8ab4 Compare June 19, 2025 20:42
@wnienhaus
Copy link
Collaborator Author

After merging #107 the tests now pass.

Copy link
Member

@dpgeorge dpgeorge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

I guess the simplest fix here would be to just replace int(x) with int(x, 0). That should restore the existing behaviour. But it looks like you want to improve things further, which is great!

@mjaspers2mtu
Copy link

Hey @wnienhaus , tested with the ulp programs on my s2, and had no issues 👍

@wnienhaus
Copy link
Collaborator Author

wnienhaus commented Jun 30, 2025

Specifying base 0 was not a solution, as this resulted in parsing behaviour different from GNU as.

I guess the simplest fix here would be to just replace int(x) with int(x, 0). That should restore the existing behaviour. But it looks like you want to improve things further, which is great!

Yes, int(x, 0) would have restored the previous behaviour, but it wasn't the behaviour we needed. Of course it wasn't the behaviour we needed even before, so this PR technically fixes 2 things - adapt to the new MicroPython behaviour and fix parsing behaviour to match GNU as.

That said, since int(x, 0) exists, and because it's really just octal parsing we need extra, I just tried this simpler approach:

def parse_int(literal):
    if len(literal) > 2:
        prefix_start = 1 if literal[0] == '-' else 0  # skip over negative sign if present
        if literal[prefix_start] == '0' and literal[prefix_start+1] in '123456789':
            return int(literal, 8)

    return int(literal, 0)

and all tests still pass.

So it's really just the octal case that's different (and theoretically we should disallow python style octal (0b..), but I had already decided to support it).

Now I am starting the overthink this:

  • what is better for clarity and/or long-term stability? To handle all cases we support explicitly (as I am doing now)? Or to just handle the extra octal case we need?
  • I see very little performance difference, and the shorter code saves perhaps a few bytes of memory, but it's probably not worth quibbling over.

I think I'll keep the current approach, as it's very explicit about what we support (including explicitly supporting python style octal). I'll just remove the comment about legacy octal format, because from the GNU as perspective, it's the currently valid and only possible octal format.

(Happy to get feedback on my chosen approach)

@wnienhaus
Copy link
Collaborator Author

Ok. Fixes pushed. Will squash-merge this once approved.

@wnienhaus wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from e4a7e33 to f7dfddc Compare June 30, 2025 11:09
remove duplication
@wnienhaus wnienhaus force-pushed the fix-int-parsing-with-base-prefix branch from f7dfddc to 5c84d08 Compare June 30, 2025 11:12
@dpgeorge
Copy link
Member

dpgeorge commented Jul 2, 2025

That said, since int(x, 0) exists, and because it's really just octal parsing we need extra, I just tried this simpler approach:

IMO that's quite a bit simpler and easier to read/understand. I would vote for this approach.

Or a little simpler still:

def parse_int(literal):
    if len(literal) >= 2 and literal.startswith(("0", "-0")) and literal.lstrip("-0").isdigt():
        return int(literal, 8)
    return int(literal, 0)

Note: I think you need the >= 2 to cover cases like 07.

Simply treat the special octal case and let Python handle the rest.
Also add extra tests, to show additional edge cases that are handled
correctly.
@wnienhaus
Copy link
Collaborator Author

Thanks for that feedback. Shorter is nice, so let's go with that. MicroPython's startswith does not support multiple values (tuple), so I expanded that into two separate conditions.

The >=2 is indeed needed in the shorter version of the code (both yours and also my earlier test code - well spotted). In the previous longer version all <=2 cases were also valid decimal, so they worked fine via the fall-through case.

I added some extra tests to show more cases that (should) work as expected, e.g. 07 as you mentioned. I also noticed, my short version did not handle 000010 correctly, i.e. octal with extra zero padding. GNU as understands that as decimal 8. My long parsing code worked correctly for that case, but my short version did not.

Anyway, your proposed code handles all cases correctly, so I used it now. And now there are tests to verify they all work as intended.

Thanks.

@dpgeorge
Copy link
Member

dpgeorge commented Jul 3, 2025

MicroPython's startswith does not support multiple values (tuple), so I expanded that into two separate conditions.

Ah, it will in the next release! But in the interest of backwards compatibility, best not to rely on that yet.

Anyway, your proposed code handles all cases correctly, so I used it now. And now there are tests to verify they all work as intended.

Very good!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants