Audio: Use generic saturation logic for improved efficiency #9170
Conversation
Force-pushed from 1ec3a14 to 375e417 (compare)
There are two commits from your previous PRs included in this; only the third commit is relevant. I don't think there's any dependency, so you can leave them out of this.
src/audio/tdfb/tdfb_direction.c (outdated)
@@ -285,9 +284,10 @@ static void level_update(struct tdfb_comp_data *cd, int frames, int ch_count, in
	/* Calculate mean square level */
	for (n = 0; n < frames; n++) {
		s = *p;
		tmp += ((int64_t)s * s);
s is 16 bits, so casting it to 32 bits should be enough and should make this calculation faster. And yes, the external parentheses are redundant. And it actually doesn't matter where this calculation is made - here or where it was originally, right? So in its present form this change isn't improving anything.
/* Apply masks to selectively replace x with min or max values, or keep x as is. */
x = (x & ~mask_overflow) | (max_val & mask_overflow);
x = (x & ~mask_underflow) | (min_val & mask_underflow);
hm, I'm counting 12 64-bit operations here. Is this really faster than 2 ifs?
@ShriramShastry sorry, I think there's a mistake in your analysis, at least in the first screenshot. In it you show the execution of the sat_int32() function, which of course is much smaller now. But we need to look at the sat_generic() function, which I'm pretty sure will be larger and possibly slower than the present version.
I think running a pipeline on real HW before and after will show the change in an E2E use case and show us an overall change.
@ShriramShastry sorry, I don't need the full simulator logs, just an abbreviated clip that shows before and after (as I can't tell what numbers I should be looking at above).
OK,
Previously, the original (Before) function in GitHub used the branching if-else approach to calculate saturation. The modified (After) function under review calculates saturation using bitwise arithmetic.

Observation: The saturation function limits the input value to a certain range, specific to the bit width of the output (32-bit, 24-bit, 16-bit, 8-bit). The Cycles (Before) and Cycles (After) columns indicate the number of cycles it took to execute the function before and after the modification, respectively.

Here's a summary of the cycles before and after the modification:
sat_int32: no change (20 cycles both before and after)
sat_int24: increased from 16 to 21 cycles
sat_int16: increased from 13 to 21 cycles
sat_int8: increased from 13 to 20 cycles

In summary, the modification resulted in an increase in the number of cycles for the sat_int24, sat_int16, and sat_int8 functions, while the sat_int32 function's performance remained unchanged in terms of cycles.
[WIP]: The increased cycle counts observed for the 24-bit, 16-bit, and 8-bit saturation functions suggest there is room for further optimization. Employing 31- and 63-bit shifts to create masks enables us to identify overflow and underflow scenarios without the need for branching logic.
(1) Generic_HiFi5
(2) Generic_HiFi4
(3) Generic_HiFi3
(4) ACE_3
(5) ACE_1
In summary, the modification resulted in an increase in the number of cycles for sat_int24, sat_int16, and sat_int8 functions, while the sat_int32 function's performance remained unchanged in terms of cycles.
Thank you, this matches my understanding, that those 64-bit calculations cannot result in better performance than a single comparison. Maybe this PR should be closed as a failed optimisation attempt
Force-pushed from 375e417 to 15e57ee (compare)
Thanks for your review. I have taken care of your suggestion tmp += ((int32_t)s * s); in #9169.
/* Apply masks to selectively replace x with min or max values, or keep x as is. */
x = (x & ~mask_overflow) | (max_val & mask_overflow);
x = (x & ~mask_underflow) | (min_val & mask_underflow);
optimisation attempt seems to have failed
Force-pushed from 15e57ee to 06939dd (compare)
The performance gap has been addressed with the most recent modification. Here's a summary of the cycles before and after the modification: sat_int32: no change in cycle count before and after changes, resulting in 19 cycles. (Results attached for ACE-3, Generic HiFi5, Generic HiFi4, Generic HiFi3.)
This check-in refactors saturation functions to use bitwise operations for handling overflow and underflow, improving efficiency.

- Replaced if-else checks with bitwise masks to handle overflow and underflow for int32, int24, int16, and int8 saturation functions.
- For each bit width, created masks by shifting the results of comparisons (x - MIN_VAL) and (MAX_VAL - x) to the respective overflow/underflow detection bits.
- Mask >> 31 or >> 63 results in either 0 (no overflow/underflow) or -1 (overflow/underflow occurred).
- Applied bitwise operations to conditionally replace original values with boundary values (MIN or MAX) based on the computed masks.
- This approach avoids branching, improves efficiency, and ensures accurate saturation within specified bit-width limits.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
@ShriramShastry this is the sort of data we need for every change, as it's not really clear from all the tables (there is too much data). We just need to know the before and after in a simplified form, like your comment above.
Ok, so this looks like a small gain for int32, but a bigger loss for 24, 16 and 8, which means the PR should not be merged until we can show no loss for any format. Do you have a plan to rework?
I have reworked the sat24, sat16 and sat8 code and addressed the performance loss. With the latest check-in there is no performance difference between the original and the modified code. I've added the latest results here: #9170 (comment). Please take a look. Thank you.
Force-pushed from 06939dd to db47d6e (compare)
Force-pushed from db47d6e to cc0ef9a (compare)
so, there's no need for this PR then. We don't make changes just because they don't make things worse. We only make changes if they make something better.
Thank you for your feedback on the recent changes. I would like to address your concerns regarding performance.

Summary of Changes

Benefits of the Optimization
Consistent Performance: There is no performance degradation; the cycle counts before and after the optimization remain the same across all tested values.
Branch Prediction: The use of bitwise operations eliminates branches, which can reduce the risk of branch mispredictions and enhance predictability in pipeline performance.
Code Clarity and Maintenance: The bitwise approach makes the code more compact and potentially less error-prone, improving readability and future maintainability.
Future-proofing: Branch-free code is often favored in modern compiler and processor optimizations, which might yield further advantages on future architectures.

Conclusion
Then you need to update the commit title and commit message. As they stand:
Was there some accuracy issue that this is fixing? This seems missing from the commit message too:
If this is just a code clean-up then say "more readable code" and don't mention performance at all. If it's more than a clean-up then you need some data or reference. EDIT: it's not a clean-up at all, the new code is much harder to follow.
else
	return (int32_t)x;
int64_t mask_overflow = (INT32_MAX - x) >> 63;
int64_t mask_underflow = (x - INT32_MIN) >> 63;
When x is big enough then (x - INT32_MIN) overflows, which is undefined behavior because it's signed. Same overflow issue above when x is negative enough.
Signed overflow is undefined behavior:
https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensure+that+operations+on+signed+integers+do+not+result+in+overflow
Code like this is being banned from the Linux kernel right now for security reasons:
https://lwn.net/Articles/979747/
There are compiler flags to define it but that would introduce a very subtle and hard to track dependency. For what benefit?
https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Signed-Overflow.html
Even ignoring undefined behavior, this code is much, much harder to follow than the previous one.
Can branch prediction ever matter for code that small?
Can branch prediction ever matter for code that small?
Yes - please take a look at the low-level impacts through assembly comparison.
The bitwise approach uses fewer branching instructions, making it more efficient, since branch prediction may fail and cause pipeline stalls.
Bitwise Approach: Uses fewer branches, reducing the risk of pipeline stalls and improving efficiency.
Complexity: Despite appearing complex, bitwise operations provide stability and avoid undefined behavior from signed overflows.
Branch Prediction Impact: Even small code sections can significantly benefit from reduced branching.
@ShriramShastry agree this should improve branch prediction and instruction pipeline throughput, but I think it's best to show a real world test, i.e. take a piece of FW code that is a heavy user of these APIs and place some timestamps around usage. The timestamps can then be measured before/after the changes with a real workload to show improvement.
Sure
Complexity: Despite appearing complex, bitwise operations provide stability and avoid undefined behavior from signed overflows.
I don't think you understand what "undefined behavior" means. This is not a problem that shows up in assembly code.
Thank you for your feedback, Marc.
I appreciate your attention to the complexity and stability provided by bitwise operations. However, I believe there might be a misunderstanding regarding the definition and implications of "undefined behavior" (UB) in the context of signed integer overflow.
Explanation:
Undefined Behavior (UB):
I went over the links shared above. In C and C++, "undefined behavior" refers to a situation where the language specification does not define what should happen. This can lead to unpredictable results as the compiler is free to generate any code for UB constructs.
A classic case of UB in C is signed integer overflow. For example, adding two large int32_t values that exceed the range of int32_t can result in UB because the C standard does not define the outcome of overflow for signed integers. https://wiki.sei.cmu.edu/confluence/display/c/INT32-C.+Ensure+that+operations+on+signed+integers+do+not+result+in+overflow#NoncompliantCodeExample
Bitwise Operations:
The bitwise operations used in the provided code ensure that values are clamped/bounded within their respective ranges without causing overflow, thereby ensuring predictable behavior.
For example, the code for sat_int32:

static inline int32_t sat_int32(int64_t x) {
	int64_t mask_overflow = (INT32_MAX - x) >> 63;
	int64_t mask_underflow = (x - INT32_MIN) >> 63;
	x = (x & ~mask_overflow) | (INT32_MAX & mask_overflow);
	x = (x & ~mask_underflow) | (INT32_MIN & mask_underflow);
	return (int32_t)x;
}
Here, bitwise operations are used to safely bound int64_t values to the int32_t range, avoiding the issues of signed overflow.
Understanding in Assembly:
While undefined behavior often has unpredictable and compiler-dependent outcomes, bitwise operations tend to translate directly to assembly without causing UB.
Ensuring signed overflows are handled in C code avoids reliance on compiler-specific behavior and promotes consistent, predictable execution across different platforms.
Summary:
Bitwise operations provide a robust way to handle out-of-range values, preventing undefined behavior stemming from signed integer overflow in C. This ensures predictable, stable, and portable code execution.
If there is an aspect I might have overlooked regarding the specific implications on assembly code or if you meant something different by undefined behavior, please let me know, and I can address it further.
The bitwise operations used in the provided code ensure that values are clamped/bounded within their respective ranges without causing overflow, thereby ensuring predictable behavior.
static inline int32_t sat_int32(int64_t x) {
int64_t mask_overflow = (INT32_MAX - x) >> 63;
This is not true because the shift happens after the undefined behavior.
Can branch prediction ever matter for code that small?
Yes- Please take a look at the low-level impacts through assembly comparison
Assembly differences are interesting but they're not answering the question. The question was "Does this matter? (and how much)". For this sort of optimization to matter you need:
- A branch predictor
- A branch prediction which is difficult to make and often enough wrong
- A long enough pipeline
- Code in the critical path
Most performance optimizations tend to be very "cruel" because they're correct in theory and they don't matter in practice. That's what Donald Knuth meant when he wrote "premature optimization is the root of all evil" https://en.wikiquote.org/wiki/Donald_Knuth
Real world workload would show performance change and settle discussion.
else
	return (int32_t)x;
int64_t mask_overflow = (INT32_MAX - x) >> 63;
int64_t mask_underflow = (x - INT32_MIN) >> 63;
@ShriramShastry agree this should improve branch prediction and instruction pipeline throughput, but I think its best to show a real world test i.e. take a piece of FW code that is a heavy user of these APIs and place some timestamps around usage. The timestamps can then be measured before/after the changes with a real workload to show improvement.
I did a performance comparison of git main vs. main + #9170 with a TGL build with gcc and xcc. The gcc build currently can't run with DRC enabled (reason unknown; a regression or just too much load) so I used different configurations for hw:0,0 playback on sof-hda-generic-4ch.tplg.

GCC (gain 1.1 44, gain 2.1 44, EQIIR eq_iir_flat.txt, EQFIR eq_fir_flat.txt, DRC passthrough.txt off)
XCC (gain 1.1 44, gain 2.1 44, EQIIR eq_iif_spk.txt, EQFIR eq_fir_spk.txt, DRC threshold_-25_knee_15_ratio_10.txt on)

I'll see if I can repeat this with some remote MTL device. But this doesn't look good for TGL.
Similar result with MTL: no improvement with the xcc build, and about the same 4 MCPS slower with the gcc build of this PR.
So... I hate to say it but I'm pretty sure this change is just a big noop to the optimizer. Here's some test code that extracts the two variants in individually compilable form, and a just-non-trivial-enough outer loop to let the compiler try to use the code in a natural setting. And the generated code from xt-clang (RI-2022.10, at -Os) is essentially identical. Both variants reduce to SALTU instructions. There is no branching in the current code nor masking in the new one; the compiler is smart enough to recognize what's happening.

#define INT_MAX 0x7fffffff
/* Note: a bare 0x80000000 literal has type unsigned int and would
 * compare as +2^31; spell INT_MIN so it is genuinely negative. */
#define INT_MIN (-INT_MAX - 1)

static inline int sat32_cmp(long long x)
{
	return x > INT_MAX ? INT_MAX : (x < INT_MIN ? INT_MIN : x);
}

static inline int sat32_mask(long long x)
{
	long long mask_overflow = (INT_MAX - x) >> 63;
	long long mask_underflow = (x - INT_MIN) >> 63;
	x = (x & ~mask_overflow) | (INT_MAX & mask_overflow);
	x = (x & ~mask_underflow) | (INT_MIN & mask_underflow);
	return (int) x;
}
void sat_array_add_cmp(int *dst, int *a, int *b, int n)
{
for (int i = 0; i < n; i++) {
dst[i] = sat32_cmp(a[i] + (long long) b[i]);
}
}
void sat_array_add_mask(int *dst, int *a, int *b, int n)
{
	for (int i = 0; i < n; i++) {
		dst[i] = sat32_mask(a[i] + (long long) b[i]);
	}
}
And indeed, gcc[1] is not smart enough to figure it out, and emits the code more or less as written. But... honestly my read is that the branch-based code is better and not worse. It's a lot fewer instructions, and Xtensa pipelines are really short in practice. Certainly I've never bumped up against branch stalls as a performance case myself; most branches are very cheap.

[1] Also a clang I have built from the Espressif LLVM tree that I've been playing with. It actually appears to build Zephyr for SOF targets pretty well, if anything slightly better code than gcc, and all the variants can go in one toolchain instead of N. I need to get that cleaned up and submitted somewhere, with a writeup for Zephyr...
I used these compiler versions for the test.
Detailed results:
Algorithm settings used for xcc
Settings for gcc
(sorry, I should use those long names instead of numids with sof-ctl; in TGL and MTL the numids are not the same, while the long string names are)
You can get the front-end version with
@ShriramShastry I assume this PR has been split up and re-created into smaller PRs? If so, we should close this and focus on the new PRs.
Introduce arithmetic bitwise saturation operations across different integer sizes (8, 16, 24, 32 bits) in the processing block.

Replaced if-else checks with bitwise masks to handle overflow and underflow for int32, int24, int16, and int8 saturation functions.

For each bit width, created masks by shifting the results of comparisons (x - MIN_VAL) and (MAX_VAL - x) to the respective overflow/underflow detection bits.

Mask >> 31 or >> 63 results in either 0 (no overflow/underflow) or -1 (overflow/underflow occurred).

Applied bitwise operations to conditionally replace original values with boundary values (MIN or MAX) based on the computed masks.

This approach avoids branching, improves efficiency, and ensures accurate saturation within specified bit-width limits.
HiFi3 Inline Summary
sat_int32: Decreased from 17 to 6-9 cycles; Bit true: Mixed; Gain: ~47.06% to 64.71%
sat_int24: Decreased from 7 to 5 cycles; Bit true: Yes; Gain: ~28.57%
sat_int16: Decreased from 13 to 7 cycles; Bit true: Yes; Gain: ~46.15%
sat_int8: Decreased from 3 to 1 cycle; Bit true: Yes; Gain: ~66.67%
HiFi4 Inline Summary
sat_int32: Decreased from 11 to 6-8 cycles; Bit true: Mixed; Gain: ~27.27% to 45.45%
sat_int24: No change (5 cycles); Bit true: Yes
sat_int16: Decreased from 9 to 7 cycles; Bit true: Yes; Gain: ~22.22%
sat_int8: Decreased from 3 to 1 cycle; Bit true: Yes; Gain: ~66.67%
HiFi5 Inline Summary
sat_int32: Decreased from 11 to 6-8 cycles; Bit true: Mixed; Gain: ~27.27% to 45.45%
sat_int24: No change (5 cycles); Bit true: Yes
sat_int16: Decreased from 9 to 7 cycles; Bit true: Yes; Gain: ~22.22%
sat_int8: Decreased from 3 to 1 cycle; Bit true: Yes; Gain: ~66.67%
Summary: Inline functions showed significant performance gains across all architectures,
especially in cycle reductions, while noinline functions maintained consistent cycle counts with bit true results.