-
-
Notifications
You must be signed in to change notification settings - Fork 31.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-46258: Streamline isqrt fast path #30333
bpo-46258: Streamline isqrt fast path #30333
Conversation
Turning this into a draft: I think we can go one better and remove another division, at the expense of expanding the lookup table from 12 bytes to 192 bytes. I may not have time to make those changes before this weekend, though. |
Done. The speedup on my machine for a random selection of 64-bit integers is now over 20%. |
Merging after self-review. |
@mdickinson: Please replace |
This PR makes some minor simplifications to the implementation of
math.isqrt
. Those simplifications improve the speed ofmath.isqrt
for inputs smaller than2**64
by around 20% on my machine.In detail:
_approximate_sqrt
, use a lookup table based on the topmost 8 bits to replace the first two Newton iterations. This reduces the total number of divisions needed from 4 to 2._approximate_sqrt
fromuint64_t
touint32_t
.u * u - 1U >= m
test with the simpler but equivalent testu * u > m
._approximate_sqrt
(though I'd expect most compilers not to need this)._approximate_sqrt
be inlined.On my Intel x64 machine, the assembly produced by Clang involves one
divl
instruction and onedivq
instruction.There's one subtle but important change here: in the previous implementation of
_approximate_sqrt
, it was possible for the returned value to be exactly2**32
(but no larger), so we couldn't assume that it would fit in auint32_t
. With the new code, the returned value is always< 2**32
. It's this fact that makes it possible to change the return type, and to replace theu*u - 1 >= m
test withu*u > m
. To prove this bound: if we ignore overflow for the moment, since the result of_approximate_sqrt
is always within1
of the true square root (following the proof already outlined in the comments), the only way that the result of the final addition can be2**32
(overflowing to0
) is if the inputn
exceeds(2**32 - 1)**2
. But in that case we know most of the top bits ofn
: the top 32 bits are either0xfffffffe
or0xffffffff
, and so we can trace through the first steps of_approximate_sqrt
- the value ofu
retrieved from the lookup table is255
, then the value assigned tou
in the following line is65536
. Then it's easy to see that ifu = 65536
,(n >> 17 / u) < 2**31
andu << 15 = 2**31
, so the result of the addition fits in auint32_t
.https://bugs.python.org/issue46258