-
Notifications
You must be signed in to change notification settings - Fork 921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
For powers of 10, replace ipow with switch #15353
For powers of 10, replace ipow with switch #15353
Conversation
/ok to test |
are there benchmarks that show the performance impact? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. One suggestion to simplify.
// Stop at 10^18 to speed up compilation; literals can be used for smaller powers of 10. | ||
static_assert(Exp10 >= 18); | ||
if constexpr (Exp10 == 18) | ||
return __int128_t(1000000000000000000LL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use separators for readability.
return __int128_t(1000000000000000000LL); | |
return __int128_t(1'000'000'000'000'000'000LL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The jitify compile cannot handle separators when combined with the LL at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you opened a bug report for that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, should I open an issue here on the github repo, or a bug through the nvidia bugs system?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean an issue on jitify's GitHub repo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, done.
CUDF_HOST_DEVICE inline T divide_power10_64bit(T value, int exp10) | ||
{ | ||
// See comments in divide_power10_32bit() for discussion. | ||
// clang-format off |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to leave clang-format on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turning it off makes it line-up as:
case 0: return value;
case 1: return value / 10;
case 2: return value / 100;
case 3: return value / 1000;
case 4: return value / 10000;
case 5: return value / 100000;
case 6: return value / 1000000;
case 7: return value / 10000000;
case 8: return value / 100000000;
case 9: return value / 1000000000;
case 10: return value / 10000000000LL;
case 11: return value / 100000000000LL;
case 12: return value / 1000000000000LL;
case 13: return value / 10000000000000LL;
case 14: return value / 100000000000000LL;
case 15: return value / 1000000000000000LL;
case 16: return value / 10000000000000000LL;
case 17: return value / 100000000000000000LL;
case 18: return value / 1000000000000000000LL;
case 19: return value / 10000000000000000000ULL; // 10^19 only fits if unsigned!
default: return 0;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which I think is less readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t see the problem. Let’s turn it on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With clang-format off it's much easier to see that everything is increasing by a power of 10 when they're all lined up at the same position. It's not that important to me though, so I'll re-enable clang-format here.
// If you do, prefer calling the bit-size-specific versions | ||
if constexpr (sizeof(Rep) <= 4) { | ||
return multiply_power10_32bit(value, exp10); | ||
} else if constexpr (sizeof(Rep) == 8) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the above using <= 4
but this is using == 8
? Shouldn't both be <=
or ==
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could call it with a type where sizeof is 1 or 2, hence the <= 4. I could change the other to <= 8, but I'm not anticipating us having any types with size in between.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let’s be consistent and use <=.
…idia/cudf into actual_ipow_improve
…arking issues. 128bit functions can now be inlined.
…d. Code comments updated.
When the full suite of benchmarks were run, it turns out that the GroupBy benchmarks were too adversely affected by this change to go in as it was. There was too much register pressure using these inlined functions for fixed-point comparison operations (rescaling) that was blowing up the benchmark. Therefore this code will now only be used for the decimal <--> floating conversion, and has been moved to a new header file dedicated for that (upcoming) code. Since it's been removed from fixed_point.hpp, we no longer need to modify the jitify compilation flags either for 128bit integer types. Finally, since the decimal <--> floating conversion only uses these functions for unsigned integers, the code has been updated for those as well. |
LGTM with all the latest changes! |
/merge |
This PR contains the main algorithm for the new decimal <--> floating conversion code. This algorithm was written to address the precision issues described [here](#14169). ### Summary * The new algorithm is more accurate than the previous code, but it is also far more complex. * It can perform conversions that were not even possible in the old code due to overflow (decimal32/64/128 conversions only worked for scale factors up to 10^9/18/38, respectively). Now the entire floating-point range is convertible, including denormals. * This new algorithm is significantly faster in some parts of the conversion phase-space, and in some parts slightly slower. ### Previous PR's These contain the supporting parts of this work: * [Explicit conversion PR](#15438) * [Benchmarking PR](#15334) * [Powers-of-10 PR](#15353) * [Utilities PR](#15359). These utilities are updated here to support denormals. ### Algorithm Outline We convert floating -> (integer) decimal by: * Extract the floating-point mantissa (converted to integer) and power-of-2 * For float we use a uint64 to contain our data during the below shifting/scaling, for double uint128_t * In this shifting integer, we alternately apply the extracted powers-of-2 (bit-shifts, until they're all used) and scale-factor powers-of-10 (multiply/divide) as needed to reach the desired scale factor. Decimal -> floating is just the reverse operation. ### Supplemental Changes * Testing: Add decimal128, add precise-conversion tests. Remove kludges due to inaccurate conversions. Add test for zeroes. * Benchmarking: Enable regions of conversion phase-space for benchmarking that were not possible in the old algorithm. * Unary: Cleanup by using CUDF_ENABLE_IF. Call new conversion code for base-10 fixed-point. ### Performance for various conversions/input-ranges * Note: F32/F64 is float/double New algorithm is **FASTER** by: * F64 --> decimal64: 60% for E8 --> E15 * F64 --> decimal128: 13% for E-8 --> E-15 * F64 --> decimal128: 22% for E8 --> E15 * F64 --> decimal128: 27% for E31 --> E38 * decimal32 --> F64: 18% for E-3 --> E4 * decimal64 --> F64: 27% for E-14 --> E-7 * decimal64 --> F64: 17% for E-3 --> E4 * decimal128 --> F64: 21% for E-14 --> E-7 * decimal128 --> F64: 11% for E-3 --> E4 * decimal128 --> F64: 13% for E31 --> E38 New algorithm is **SLOWER** by: * F32 --> decimal32: 3% for E-3 --> E4 * F32 --> decimal64: 2% for E-14 --> E14 * F64 --> decimal32: 3% for E-3 --> E4 * decimal32 --> F32: 5% for E-3 --> E4 * decimal128 --> F64: 36% for E-37 --> E-30 Other kernels: * The PYMOD binary-op benchmark is 7% slower. ### Performance discussion * Many conversions have identical speed, indicating these algorithms are often fast and we are instead bottlenecked on overheads such as getting the input to the gpu in the first place. * F64 conversions are often much faster than the old algorithm as the new algorithm completely avoids the FP64 pipeline. Other than the cast to double itself, all of the operations are on integers. Thus we don't have threads competing with each other and taking turns for access to the floating-point cores. * The conversions are slightly slower for floats with powers-of-10 near zero. Presumably this is due to code overhead for e.g., handling a large range of inputs, UB-checks for bit shifts, branches for denormals, etc. * The conversion is slower for decimal128 conversions with very small exponents, which requires several large divisions (128bit divided by 64bit). * The PYMOD kernel is slower due to register pressure from the introduction of the new division routines in the earlier PR. Even though this benchmark does not perform decimal <--> floating conversions, it gets hit because of inlined template code in the kernel increasing the code/register pressure. Authors: - Paul Mattione (https://github.com/pmattione-nvidia) Approvers: - Jason Lowe (https://github.com/jlowe) - Bradley Dice (https://github.com/bdice) - Mike Wilson (https://github.com/hyperbolic2346) URL: #15905
This adds a new runtime calculation of the power-of-10 needed for applying decimal scale factors with a switch statement. This provides the fastest way of applying the scale. Note that the multiply and divide operations are performed within the switch itself, so that the compiler sees the full instruction to optimize assembly code gen. See code comments for details.
This cannot be used within fixed_point (e.g. for comparison operators and rescaling) as it introduced too much register pressure to unrelated benchmarks. It will only be used for the decimal <--> floating conversion, so it has been moved there to be in a new header file where that code will reside (in an upcoming PR). This is part of a larger change to change the algorithm for decimal <--> floating conversion to a more accurate one that is forthcoming soon.
Checklist