Skip to content

Allow underscores in numbers. Closes #1677#3120

Closed
Balajiganapathi wants to merge 0 commit intoargotorg:developfrom
Balajiganapathi:develop
Closed

Allow underscores in numbers. Closes #1677#3120
Balajiganapathi wants to merge 0 commit intoargotorg:developfrom
Balajiganapathi:develop

Conversation

@Balajiganapathi
Copy link
Contributor

@Balajiganapathi Balajiganapathi commented Oct 20, 2017

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't prevent the leading underscore which seems to be weird in hex (0x_1234)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that as python allows it:
(an example from https://www.python.org/dev/peps/pep-0515/)
flags = 0b_0011_1111_0100_1110

@axic
Copy link
Contributor

axic commented Oct 23, 2017

Thanks! Can you please add tests that (assuming we strictly follow PEP-0515):

  • trailing underscores are not allowed (in both decimal, hex and exponential)
  • underscores within the exponent part is allowed (1e3_00_0)
  • leading underscores in decimals (disallowed by PEP)
  • consecutive underscores are not allowed in any of the places where underscores are allowed

Deviating from the PEP I'd argue that leading underscores in hex are a nuisance.

Also please update the documentation (search for "rational and integer literals").

@axic
Copy link
Contributor

axic commented Oct 23, 2017

Also there's this rule in the PEP:

For the b, x and o format specifiers, _ will be allowed and group by 4 digits.

e.g. 0x1_2_3 is not valid, but 0x1234_5678 is

@Balajiganapathi
Copy link
Contributor Author

I agree with you that 0x_abc looks weird and not consistent with disallowing leading underscore for decimal numbers.

Also there's this rule in the PEP:
For the b, x and o format specifiers, _ will be allowed and group by 4 digits.
e.g. 0x1_2_3 is not valid, but 0x1234_5678 is

My understanding is that this is for the output format specifier, when converting from number to string. This is not a constraint on the syntax of a numeric literal. The grammar given does not specify this constraint.

I am working on adding the tests and disallowing leading underscore. Is this additional grouping constraint really needed?

@Balajiganapathi
Copy link
Contributor Author

Balajiganapathi commented Oct 23, 2017

@axic I have added more tests and docs.

I also tried in python3.6 and it does allow arbitrary grouping of digits.

e.g. x = 0xab_cd_e_f is valid

@pirapira
Copy link
Contributor

@axic do you still want precisely sized groupings?

@axic
Copy link
Contributor

axic commented Oct 24, 2017

Even the PEP introductory section suggests that. I think it definitely is a better idea to enforce that (we can always lax the rules, but harder to make them strict once out in a lax version).

@Balajiganapathi
Copy link
Contributor Author

@axic Ok will do this then.

Just to be clear the following will be invalid:

0x_abcd_cdef (leading underscore)
0xab_cd (not grouped by 4)

What about the following?
0x12345678_9abc - one group of 4 has underscore but other doesn't.
0x123_4567 - The total number of digits is not divisible by 4.

@axic
Copy link
Contributor

axic commented Oct 24, 2017

Good questions!

0x12345678_9abc - one group of 4 has underscore but other doesn't.

I guess this is fine too (I would personally not allow it, but I guess could be used to show where the fixed point would be after conversion? Tiny use case though). Perhaps this is the point where lax rules gain ground.

0x123_4567 - The total number of digits is not divisible by 4.

I think this is fine.

@Balajiganapathi
Copy link
Contributor Author

Balajiganapathi commented Oct 24, 2017

@axic Thanks One more question :)

Since we are allowing 0x123_4567, should we allow 0x1234_567 too? I think allowing only one of these 2 makes sense (and will make implementation slightly easier).

@axic
Copy link
Contributor

axic commented Oct 24, 2017

I wouldn't allow 0x1234_567.

@Balajiganapathi
Copy link
Contributor Author

Hi, I am thinking about the best way to implement this. I think this rule will cover all the cases. Or do you want more restrictive ones?

  • The number of digits to the right of each underscore should be a multiple of 4.
  • And ofcourse, no leading, trailing or double underscores

This will take care of all the above cases. But it will also allow such cases:
0x12341234_123456781234_abcd

@chriseth
Copy link
Contributor

Due to our ambiguous handling of literals, we should allow both 0x123_4567 and 0x1234_567. The first makes sense if used in a number context, the second if used in a bytes32 context.

Ok, perhaps to complicate things even further:

If the literal ends in 4 hex digits without separator, the length of the first element is irrelevant. If it starts with four hex digits without separator, the last element has to have an even length. So 0x1234_567 would be disallowed, but 0x1234_5670 would not be.

@axic
Copy link
Contributor

axic commented Oct 24, 2017

I'd argue that in a number context one could say leaving the leading 0 nibble out is acceptable (as it is a number), while in a bytes32 context it isn't since that refers to actual bytes.

0x1234_567 is very misleading, because even in a bytes32 context it will be assigned as 0x01234567...0.

@chriseth
Copy link
Contributor

Exactly! Although the "context" should not influence whether the literal is valid or not, it should be visible from looking at the literal itself.

@axic
Copy link
Contributor

axic commented Oct 24, 2017

I can't follow you, you say 0x1234_567 should be allowed because it makes sense for bytes32. I do not think it makes sense for it at all.

Looking at bytes32 x = 0x1234_567 I'd get the impression that it will be 0x12345670...0, but it will be 0x01234567...0.

@chriseth
Copy link
Contributor

Ok, sorry, let me clarify: 0x1234_567 should not be allowed, but 0x123_4567 should be.

@axic
Copy link
Contributor

axic commented Oct 24, 2017

Also I'd say the parser should not enforce any rules (perhaps apart from stray leading, trailing underscores), but rather have it in the SyntaxChecker.

Reason:

  • better error reporting
  • perhaps less code (could use a regular expression there).
  • and easier to describe it in grammar.txt

@Balajiganapathi
Copy link
Contributor Author

Balajiganapathi commented Oct 24, 2017

@chriseth @axic , is the following rule fine?

  • If there is a semicolon in hex numeric literal, then it should be before every 4th digit from right.

but rather have it in the SyntaxChecker.

How about in RationalNumberType::isValidLiteral in Types.cpp file?

and easier to describe it in grammar.txt

I could not find this file in this repo :(

@Balajiganapathi
Copy link
Contributor Author

Also I am assuming this does not apply to base 10 numerals as the rules of separating are different in different countries.

@chriseth
Copy link
Contributor

@axic great idea about syntaxchecker! This would also automatically allow the parsing to continue in case of an error.

@Balajiganapathi I would prefer the following rule:

If there is no separator at all, it is fine.
If there is a separator, the number of hex digits between separators has to be exactly 4 and special rules about the part before the first separator and after the last one are:

If the literal ends in 4 hex digits without separator, the length of the first element can be arbitrary. If it starts with four hex digits without separator, the last element has to have an even length. If neither the first nor the last has 4 hex digits without separator, the literal is invalid.

@Balajiganapathi
Copy link
Contributor Author

Not able to find out why ci build is failing. It passed in first commit. In next 2 commits I only made some cosmetic changes, yet the ci is showing build failed. I looked at build logs, it looks like it is randomly failing some.

@axic
Copy link
Contributor

axic commented Oct 26, 2017

Downloading version soljson-v0.4.18+commit.9cf6e910.js
Hash mismatch: 0x0478b43de978b1af1d6d6d8c09e84cdb2cc8ed76218d38f17b841b6e539742f0 vs 0xcecc3e8b4d1a9bb6ceadae7f681def7982755cbe97a4d28c043d3136ccbe7df1

This failure is down to Travis, you cannot do much against it.

@Balajiganapathi
Copy link
Contributor Author

@axic / @chriseth can you please review the changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add an assert that the first two characters are 0x.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you make the error message a little more specific? Something like
Invalid use of underscores in hex literal. Found inner part of length 2 (has to be 4 characters).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also make this error message more specific.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think control flow would be clearer if you move the rollback(1) before the if statement (and of course save m_char in a local variable).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, reading that again - why do you advance if you rollback later anyway?

Copy link
Contributor

@axic axic Nov 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Uses length and digits. I think it should just use one term and digits seems to be fine. This applies to the other messages too.

@pirapira
Copy link
Contributor

pirapira commented Dec 8, 2017

Please rebase.

@Balajiganapathi
Copy link
Contributor Author

@pirapira Rebased.

@chriseth
Copy link
Contributor

@axic please merge if you are fine with it.

Changelog.md Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this entry to 0.4.20.

@chriseth
Copy link
Contributor

Rebased.

@chriseth
Copy link
Contributor

chriseth commented Mar 6, 2018

Moved to 0.5.0

pirapira
pirapira previously approved these changes Mar 6, 2018
@chriseth
Copy link
Contributor

chriseth commented Mar 9, 2018

@ekpyron @bit-shift does one of you want to continue this? There are still some unaddressed comments by @axic.

@Balajiganapathi
Copy link
Contributor Author

@chriseth I will fix those this weekend.

@axic
Copy link
Contributor

axic commented Apr 11, 2018

Rebased.

@christianparpart
Copy link
Contributor

not sure how I closed this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants

Comments