Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-46258: Streamline isqrt fast path #30333

Merged
merged 6 commits into from
Jan 15, 2022

Conversation

mdickinson
Copy link
Member

@mdickinson mdickinson commented Jan 2, 2022

This PR makes some minor simplifications to the implementation of math.isqrt. Those simplifications improve the speed of math.isqrt for inputs smaller than 2**64 by around 20% on my machine.

In detail:

  • In _approximate_sqrt, use a lookup table based on the topmost 8 bits to replace the first two Newton iterations. This reduces the total number of divisions needed from 4 to 2.
  • Change the return type of _approximate_sqrt from uint64_t to uint32_t.
  • Replace the u * u - 1U >= m test with the simpler but equivalent test u * u > m.
  • Add casts to make it clear to the compiler that a 32-bit-by-32-bit division is enough for the first of the two divisions in _approximate_sqrt (though I'd expect most compilers not to need this).
  • Suggest to the compiler that _approximate_sqrt be inlined.
  • Simplify the shift computation.

On my Intel x64 machine, the assembly produced by Clang involves one divl instruction and one divq instruction.

There's one subtle but important change here: in the previous implementation of _approximate_sqrt, it was possible for the returned value to be exactly 2**32 (but no larger), so we couldn't assume that it would fit in a uint32_t. With the new code, the returned value is always < 2**32. It's this fact that makes it possible to change the return type, and to replace the u*u - 1 >= m test with u*u > m. To prove this bound: if we ignore overflow for the moment, since the result of _approximate_sqrt is always within 1 of the true square root (following the proof already outlined in the comments), the only way that the result of the final addition can be 2**32 (overflowing to 0) is if the input n exceeds (2**32 - 1)**2. But in that case we know most of the top bits of n: the top 32 bits are either 0xfffffffe or 0xffffffff, and so we can trace through the first steps of _approximate_sqrt - the value of u retrieved from the lookup table is 255, then the value assigned to u in the following line is 65536. Then it's easy to see that if u = 65536, (n >> 17 / u) < 2**31 and u << 15 = 2**31, so the result of the addition fits in a uint32_t.

https://bugs.python.org/issue46258

@mdickinson mdickinson changed the title Streamline isqrt fast path bpo-46258: Streamline isqrt fast path Jan 4, 2022
@mdickinson mdickinson marked this pull request as draft January 4, 2022 20:39
@mdickinson
Copy link
Member Author

Turning this into a draft: I think we can go one better and remove another division, at the expense of expanding the lookup table from 12 bytes to 192 bytes. I may not have time to make those changes before this weekend, though.

@mdickinson
Copy link
Member Author

I think we can go one better and remove another division, at the expense of expanding the lookup table from 12 bytes to 192 bytes.

Done. The speedup on my machine for a random selection of 64-bit integers is now over 20%.

@mdickinson mdickinson marked this pull request as ready for review January 5, 2022 17:36
@mdickinson
Copy link
Member Author

Merging after self-review.

@mdickinson mdickinson merged commit d02c5e9 into python:main Jan 15, 2022
@bedevere-bot
Copy link

@mdickinson: Please replace # with GH- in the commit message next time. Thanks!

@mdickinson mdickinson deleted the streamline-isqrt-fast-path branch January 15, 2022 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants