
Rewrite some batch algorithms with AVX2 #6839

Open
@Lloyd-Pottiger

Description

Enhancement

There are some batch algorithms that use SSE2, for example:

#if __SSE2__
/** A slightly more optimized version.
  * Based on the assumption that often pieces of consecutive values
  * completely pass or do not pass the filter.
  * Therefore, we will optimistically check the parts of `SIMD_BYTES` values.
  */
static constexpr size_t SIMD_BYTES = 16;
const __m128i zero16 = _mm_setzero_si128();
const UInt8 * filt_end_sse = filt_pos + size / SIMD_BYTES * SIMD_BYTES;

while (filt_pos < filt_end_sse)
{
    int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(_mm_loadu_si128(reinterpret_cast<const __m128i *>(filt_pos)), zero16));

    if (0 == mask)
    {
        /// Nothing is inserted.
    }
    else if (0xFFFF == mask)
    {
        res_data.insert(data_pos, data_pos + SIMD_BYTES);
    }
    else
    {
        for (size_t i = 0; i < SIMD_BYTES; ++i)
            if (filt_pos[i])
                res_data.push_back(data_pos[i]);
    }

    filt_pos += SIMD_BYTES;
    data_pos += SIMD_BYTES;
}
#endif

Since we enable AVX2 by default, we can rewrite them with AVX2.

To deploy TiFlash under the Linux AMD64 architecture, the CPU must support the AVX2 instruction set. Run cat /proc/cpuinfo | grep avx2 and confirm that there is output. By using such CPU instructions, TiFlash's vectorization engine can deliver better performance.
https://docs.pingcap.com/tidb/dev/tiflash-overview#architecture
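
An untested sketch of what the AVX2 variant could look like, assuming the same surrounding variables (filt_pos, data_pos, res_data, size) as in the SSE2 snippet above; names such as zero32 and filt_end_avx2 are chosen here only for illustration. The structure is unchanged: only the vector width (32 bytes), the 256-bit intrinsics, and the full-mask constant differ.

#if defined(__AVX2__)
/// Same idea as the SSE2 loop, but checks 32 filter bytes per iteration.
static constexpr size_t SIMD_BYTES = 32;
const __m256i zero32 = _mm256_setzero_si256();
const UInt8 * filt_end_avx2 = filt_pos + size / SIMD_BYTES * SIMD_BYTES;

while (filt_pos < filt_end_avx2)
{
    /// One bit per filter byte: bit i is set iff filt_pos[i] > 0
    /// (filter values are 0 or 1, so the signed compare is safe).
    UInt32 mask = static_cast<UInt32>(_mm256_movemask_epi8(_mm256_cmpgt_epi8(_mm256_loadu_si256(reinterpret_cast<const __m256i *>(filt_pos)), zero32)));

    if (0 == mask)
    {
        /// Nothing is inserted.
    }
    else if (0xFFFFFFFF == mask)
    {
        /// The whole 32-byte chunk passes the filter; copy it in one go.
        res_data.insert(data_pos, data_pos + SIMD_BYTES);
    }
    else
    {
        for (size_t i = 0; i < SIMD_BYTES; ++i)
            if (filt_pos[i])
                res_data.push_back(data_pos[i]);
    }

    filt_pos += SIMD_BYTES;
    data_pos += SIMD_BYTES;
}
#endif

As in the SSE2 version, the tail of fewer than SIMD_BYTES elements still needs the scalar fallback loop after the #endif. The mixed branch could also walk the set bits of mask (e.g. with __builtin_ctz) instead of re-reading filt_pos byte by byte, but that is a separate optimization.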

Labels

good first issue, help wanted, type/enhancement, type/performance
