Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audio Math: Improve audio format saturation functions. #9344

Merged

Conversation

ShriramShastry
Copy link
Contributor

  • Updated sat_int32 to support 64-bit to 32-bit saturation with proper bitwise shifting.
  • Added vec_sat_int32x2 function to saturate and pack two 64-bit integers into 32x2 vector.
  • Improved sat_int24 for 32-bit to 24-bit saturation.
  • Added vec_sat_int24x2 to saturate and pack two 32-bit integers into a 32x2 vector.
  • Corrected sat_int16 to return a single 16-bit saturated integer.
  • Added vec_sat_int16x4 function to saturate and pack four 32-bit integers into 16x4 vector.
  • Improved sat_int8 to properly extract the lower 8 bits of a 32-bit integer.
  • Added vec_sat_int8x2 to pack two 32-bit integers in an 8x2 vector.

@ShriramShastry
Copy link
Contributor Author

ShriramShastry commented Aug 1, 2024

Performance results using Generic HiFi compilers

					Performance Comparison-hifi5:
		|Function    |  Array Size  | Branching Cycles  |  Before HiFi Cycles | After HiFi Cycles |  Bit True? |
		+------------+--------------+-------------------+---------------------+-------------------+------------+
		|sat_int32   |      32      |              241  |              122    |              105  |      Yes   |
		|sat_int24   |      32      |               77  |               59    |               43  |      Yes   |
		|sat_int16   |      32      |               80  |               71    |               45  |      Yes   |
		|sat_int8    |      32      |               77  |               91    |               22  |      Yes   |

					Performance Comparison-hifi4:
		|Function    |  Array Size  | Branching Cycles  |  Before HiFi Cycles | After HiFi Cycles |  Bit True? |
		+------------+--------------+-------------------+---------------------+-------------------+------------+
		|sat_int32   |      32      |              222  |              124    |               107 |      Yes   |
		|sat_int24   |      32      |               75  |               59    |                41 |      Yes   |
		|sat_int16   |      32      |               80  |               74    |                45 |      Yes   |
		|sat_int8    |      32      |               77  |               89    |                22 |      Yes   |

					Performance Comparison-hifi3
		|Function    |  Array Size  | Branching Cycles  |  Before HiFi Cycles | After HiFi Cycles |  Bit True? |
		+------------+--------------+-------------------+---------------------+-------------------+------------+
		|sat_int32   |      32      |              402  |              157    |              136  |      Yes   |
		|sat_int24   |      32      |              146  |               76    |               71  |      Yes   |
		|sat_int16   |      32      |              110  |               102   |               64  |      Yes   |
		|sat_int8    |      32      |              146  |               149   |               30  |      Yes   |





example usage in FW:

void dsp_write_loop_int24_hifi(int32_t* arr, size_t size) {
    ae_int32x2* ptr_out = (ae_int32x2*)arr;
    int reminder = size % 2;
    int i;

    for (i = 0; i < size / 2; i += 1) {
        size_t base_idx = i * 2;
        ptr_out[i] = vec_sat_int24x2(arr[base_idx], arr[base_idx + 1]);
    }

    // Process the remaining elements
    i = (size / 2) * 2;
    for (; i < size; i += 1) {
        arr[i] = sat_int24(arr[i]);
    }
}

@cujomalainey
Copy link
Member

Each line should be it's own commit, please split this.

Please read the section on atomic comic convention here

@ShriramShastry ShriramShastry force-pushed the shastry_saturation_hifi_performance branch from 098e7df to 5e6e50c Compare August 4, 2024 19:06
@ShriramShastry
Copy link
Contributor Author

Each line should be it's own commit, please split this.

Please read the section on atomic comic convention here

Done ! Could you please take another look?
Thank you!

@lgirdwood
Copy link
Member

Performance results using Generic HiFi compilers

					Performance Comparison-hifi5:
		|Function    |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
		+------------+--------------+-------------------+-------------------+------------+
		|sat_int32   |      32      |              241  |              105  |      Yes   |
		|sat_int24   |      32      |               77  |               43  |      Yes   |
		|sat_int16   |      32      |               80  |               45  |      Yes   |
		|sat_int8    |      32      |               77  |               22  |      Yes   |

					Performance Comparison-hifi4:
		|Function    |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
		+------------+--------------+-------------------+-------------------+------------+
		|sat_int32   |      32      |              222  |              107  |      Yes   |
		|sat_int24   |      32      |               75  |               41  |      Yes   |
		|sat_int16   |      32      |               80  |               45  |      Yes   |
		|sat_int8    |      32      |               77  |               22  |      Yes   |

					Performance Comparison-hifi3
		|Function    |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
		+------------+--------------+-------------------+-------------------+------------+
		|sat_int32   |      32      |              402  |              136  |      Yes   |
		|sat_int24   |      32      |              146  |               71  |      Yes   |
		|sat_int16   |      32      |              110  |               64  |      Yes   |
		|sat_int8    |      32      |              146  |               30  |      Yes   |

@ShriramShastry can you add data that shows the before results (I think these are all "after" data). That ways its easier to see the benefits. Thanks !

@ShriramShastry
Copy link
Contributor Author

ShriramShastry commented Aug 5, 2024

@ShriramShastry can you add data that shows the before results (I think these are all "after" data). That ways its easier to see the benefits. Thanks !

Thank you; please look for the updated results table.

				Performance Comparison - HiFi5
	|Function	|Before HiFi Cycles	|After HiFi Cycles	|% Change |
	+---------------+-----------------------+-----------------------+---------+
	|sat_int32	|122			|105			|-13.93   |
	|sat_int24	|59			|43			|-27.12   |
	|sat_int16	|71			|45			|-36.62   |
	|sat_int8	|91			|22			|-75.82   |
	
				Performance Comparison - HiFi4
	|Function	|Before HiFi Cycles	|After HiFi Cycles	|% Change |
	+---------------+-----------------------+-----------------------+---------+
	|sat_int32	|124			|107			|-13.71   |
	|sat_int24	|59			|41			|-30.51   |
	|sat_int16	|74			|45			|-39.19   |
	|sat_int8	|89			|22			|-75.28   |
	
				Performance Comparison - HiFi3
	|Function	|Before HiFi Cycles	|After HiFi Cycles	|% Change | 
	+---------------+-----------------------+-----------------------+---------+
	|sat_int32	|157			|136			|-13.38   |
	|sat_int24	|76			|71			|-6.58    |
	|sat_int16	|102			|64			|-37.25   |
	|sat_int8	|149			|30			|-79.87   |

Summary
The percentage change for sat_int8 is particularly significant across all HiFi versions, showing the largest reduction in cycles.
sat_int24 and sat_int16 also show considerable improvements in performance.
The change in sat_int32 cycles is more moderate compared to the others.

@lgirdwood
Copy link
Member

@singalsu good performance updates from @ShriramShastry pls review.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved the C code changes, @singalsu will need to approve the maths.

Copy link
Collaborator

@singalsu singalsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see performance improvement from this with MTL xcc builds of EQs, volume, DRC, TDFB. But also there wasn't any degrade of performance, same results within a fraction of MCPS. I think it's because most of the code is sat_int32() that wasn't improved (at least much) by looks of the code though you reported some improve. Can you check especially the s24 case.

#include <xtensa/config/defs.h>
#include <xtensa/tie/xt_hifi3.h>
#include <stdint.h>
#include <xtensa/tie/xt_hifi2.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what here needs HiFi2 headers? There should be HiFi3 operation available for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C Syntax recommendation
#include <xtensa/tie/xt_hifi2.h>

I am OK, to remove.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it compiles without that, I think better to remove to avoid confusion.

static inline int32_t sat_int32(int64_t x)
{
return (ae_int32)AE_ROUND32F48SSYM(AE_SLAI64(x, 16));
/* Shift left by 32 bits with saturation */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes an overflow with very large input values. But wouldn't this be a simpler fix:

return (ae_int32)AE_ROUND32F48SSYM(AE_SLAI64S(x, 16));

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do you have idea why your version is faster in your test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/thesofproject/sof/issues/9285

With input values of -9223372036854775808 and 9223372036854775807 . It is expected to return -2147483648 and 2147483647 but we are observing 0 and -1

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is possible overflow with AE_SLAI64() instead of AE_SLAI64S(). The proper link to issue is: #9285.

Not sure, but without looking disasembly I'm guessing your version has the small speed advantage from easier type casts. Maybe (int32_t)AE_MOVINT32_FROMINT64() is just a normal store operation. I'm OK with this if you can change the typecast from (int) to (int32_t), if the code still compiles, since the that's type declared type of the function.


/* Convert 64-bit integers to ae_f64 format */
d0 = int64_rtor_ae_f64(x);
d1 = int64_rtor_ae_f64(y);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would just C typecast d0 = (ae_f64)x; work ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, (ae_f64) is similar int64_rtor_ae_f64.

HiFi-5
d0 = (ae_f64)x;
d1 = (ae_f64)y;
Performance Comparison:
Function           |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
-------------------+--------------+-------------------+-------------------+------------+
review_sat_int32   |      32      |              214  |              105  |      Yes   |

d0 = int64_rtor_ae_f64(x);
d1 = int64_rtor_ae_f64(y);
Performance Comparison:
Function           |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
-------------------+--------------+-------------------+-------------------+------------+
review_sat_int32   |      32      |              241  |              105  |      Yes   |

Thanks !!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'd prefer that since it's more readable version. int64_rtor_ae_f64() makes readers wonder what it does.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "Branching Cycles" and why does it differ for these ways? Is (ae_f64) typecast better with smaller number? Still the HiFi cycles are the same. Not critical but if typecast way is good for you, I prefer it.

src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
@lgirdwood
Copy link
Member

I didn't see performance improvement from this with MTL xcc builds of EQs, volume, DRC, TDFB. But also there wasn't any degrade of performance, same results within a fraction of MCPS. I think it's because most of the code is sat_int32() that wasn't improved (at least much) by looks of the code though you reported some improve. Can you check especially the s24 case.

Its probably difficult to see the gain from logs if its hidden in larger infra especially as we as saving single instructions in some cases. If logs are marginal AND maths is correct then we should trust the tooling that these are valid improvements and merge as future use cases may hit these APIs harder where benefits in the logs will be more obvious.

@singalsu
Copy link
Collaborator

I didn't see performance improvement from this with MTL xcc builds of EQs, volume, DRC, TDFB. But also there wasn't any degrade of performance, same results within a fraction of MCPS. I think it's because most of the code is sat_int32() that wasn't improved (at least much) by looks of the code though you reported some improve. Can you check especially the s24 case.

Its probably difficult to see the gain from logs if its hidden in larger infra especially as we as saving single instructions in some cases. If logs are marginal AND maths is correct then we should trust the tooling that these are valid improvements and merge as future use cases may hit these APIs harder where benefits in the logs will be more obvious.

There is no answers to my questions yet, not removing my reject. I have especially concern about 24 bit saturate intrinsics compatibility with the tricky S24_4LE format in SOF. I can check it today myself of course.

ae_f64 shifted = AE_SLAI64S(x, 32);

/* Shift right by 32 bits */
return (int)AE_MOVINT32_FROMINT64(AE_SRAI64(shifted, 32));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the typecast should be (int32_t).

Copy link
Contributor Author

@ShriramShastry ShriramShastry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for examining the code. I've included my responses. Please have a look.

static inline int32_t sat_int32(int64_t x)
{
return (ae_int32)AE_ROUND32F48SSYM(AE_SLAI64(x, 16));
/* Shift left by 32 bits with saturation */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/thesofproject/sof/issues/9285

With input values of -9223372036854775808 and 9223372036854775807 . It is expected to return -2147483648 and 2147483647 but we are observing 0 and -1

#include <xtensa/config/defs.h>
#include <xtensa/tie/xt_hifi3.h>
#include <stdint.h>
#include <xtensa/tie/xt_hifi2.h>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

C Syntax recommendation
#include <xtensa/tie/xt_hifi2.h>

I am OK, to remove.


/* Convert 64-bit integers to ae_f64 format */
d0 = int64_rtor_ae_f64(x);
d1 = int64_rtor_ae_f64(y);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, (ae_f64) is similar int64_rtor_ae_f64.

HiFi-5
d0 = (ae_f64)x;
d1 = (ae_f64)y;
Performance Comparison:
Function           |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
-------------------+--------------+-------------------+-------------------+------------+
review_sat_int32   |      32      |              214  |              105  |      Yes   |

d0 = int64_rtor_ae_f64(x);
d1 = int64_rtor_ae_f64(y);
Performance Comparison:
Function           |  Array Size  | Branching Cycles  |  HiFi Cycles      |  Bit True? |
-------------------+--------------+-------------------+-------------------+------------+
review_sat_int32   |      32      |              241  |              105  |      Yes   |

Thanks !!

src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
Copy link
Collaborator

@singalsu singalsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for responses, many resolved issues. Some still left, mainly commit text improvement need I think

src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
#include <xtensa/config/defs.h>
#include <xtensa/tie/xt_hifi3.h>
#include <stdint.h>
#include <xtensa/tie/xt_hifi2.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it compiles without that, I think better to remove to avoid confusion.

static inline int32_t sat_int32(int64_t x)
{
return (ae_int32)AE_ROUND32F48SSYM(AE_SLAI64(x, 16));
/* Shift left by 32 bits with saturation */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is possible overflow with AE_SLAI64() instead of AE_SLAI64S(). The proper link to issue is: #9285.

Not sure, but without looking disasembly I'm guessing your version has the small speed advantage from easier type casts. Maybe (int32_t)AE_MOVINT32_FROMINT64() is just a normal store operation. I'm OK with this if you can change the typecast from (int) to (int32_t), if the code still compiles, since the that's type declared type of the function.


/* Convert 64-bit integers to ae_f64 format */
d0 = int64_rtor_ae_f64(x);
d1 = int64_rtor_ae_f64(y);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'd prefer that since it's more readable version. int64_rtor_ae_f64() makes readers wonder what it does.


/* Convert 64-bit integers to ae_f64 format */
d0 = int64_rtor_ae_f64(x);
d1 = int64_rtor_ae_f64(y);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "Branching Cycles" and why does it differ for these ways? Is (ae_f64) typecast better with smaller number? Still the HiFi cycles are the same. Not critical but if typecast way is good for you, I prefer it.

src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
src/include/sof/audio/format_hifi3.h Show resolved Hide resolved
- Changed year to 2024 and added author information.
- Updated the license identifier to BSD-3-Clause.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Reshuffle standard headers order.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Enhanced function documentation.
- Replaced AE_SLAI64 with AE_SLAI64S for better saturation control.
- Added intermediary shift left by 32 bits with saturation.
- Final shift right by 32 bits for accurate rounding and conversion.
- No performance degradation observed.

This check-in ensures more accurate and reliable results by using more
appropriate intrinsic functions and thorough step-by-step operations.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added to saturate and pack two 64-bit integers into 32x2 vector.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added detailed function documentation.
- Changed from AE_SRAI32(AE_SLAI32S(x, 8), 8) to AE_SAT24S for
  more accurate 24-bit saturation.
- Used ae_f32x2 type to ensure correct input handling.
- No performance degradation observed.

This check-in enhances the sat_int24 function by improving its logic
and documentation to ensure more accurate 32-bit to 24-bit conversion.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added to saturate and pack two 32-bit integers into a 32x2 vector.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added detailed function documentation.
- Changed from AE_SAT16X4(x, x) to a more descriptive process:
  - Moved the 32-bit value to ae_int32x2 type for accurate input
   handling.
  - Used AE_SAT16X4 to saturate and pack values.
  - Extracted the first 16-bit value with AE_MOVAD16_0.
- No performance degradation observed.

This check-in enhances the sat_int16 function by improving its logic
and documentation to ensure more accurate 32-bit to 16-bit conversion.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added to saturate and pack four 32-bit integers into 16x4 vector.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added detailed function documentation.
- Changed from if-else checks to bitwise operations for efficiency:
  - Shifted left by 24 bits to prepare for saturation.
  - Shifted right by 24 bits to sign-extend the value.
  - Extracted the 8-bit value using AE_MOVAD32_L.
- No performance degradation observed.

This check-in refines the sat_int8 function by improving its logic
and documentation to ensure a more efficient and accurate 32-bit to
8-bit conversion process.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
- Added to pack two 32-bit integers in an 8x2 vector.

Signed-off-by: Shriram Shastry <malladi.sastry@intel.com>
@ShriramShastry ShriramShastry force-pushed the shastry_saturation_hifi_performance branch 2 times, most recently from 8e2b48f to c6af29a Compare August 18, 2024 11:38
Copy link
Contributor Author

@ShriramShastry ShriramShastry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have address the review comments. Please take look,
Thank you.

@lgirdwood
Copy link
Member

@singalsu good for you ?

@@ -53,9 +53,23 @@ static inline ae_int32x2 vec_sat_int32x2(int64_t x, int64_t y)
/* Round and saturate both 64-bit values to 32-bit and pack them */
return (ae_int32x2)AE_ROUND32X2F64SSYM(d0, d1);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit about commit text, the new version isn't more accurate, it's mote efficient.

@lgirdwood lgirdwood merged commit 5e1e05d into thesofproject:main Aug 22, 2024
45 of 47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants