
Rewrite some batch algorithms with AVX2 #6839

Open
@Lloyd-Pottiger

Description

Enhancement

There are some batch algorithms that use SSE2, for example:

#if __SSE2__
/** A slightly more optimized version.
  * Based on the assumption that often pieces of consecutive values
  * completely pass or do not pass the filter.
  * Therefore, we will optimistically check the parts of `SIMD_BYTES` values.
  */
static constexpr size_t SIMD_BYTES = 16;
const __m128i zero16 = _mm_setzero_si128();
const UInt8 * filt_end_sse = filt_pos + size / SIMD_BYTES * SIMD_BYTES;

while (filt_pos < filt_end_sse)
{
    int mask = _mm_movemask_epi8(_mm_cmpgt_epi8(_mm_loadu_si128(reinterpret_cast<const __m128i *>(filt_pos)), zero16));

    if (0 == mask)
    {
        /// Nothing is inserted.
    }
    else if (0xFFFF == mask)
    {
        res_data.insert(data_pos, data_pos + SIMD_BYTES);
    }
    else
    {
        for (size_t i = 0; i < SIMD_BYTES; ++i)
            if (filt_pos[i])
                res_data.push_back(data_pos[i]);
    }

    filt_pos += SIMD_BYTES;
    data_pos += SIMD_BYTES;
}
#endif

Since we enable AVX2 by default, we can rewrite them with AVX2.

To deploy TiFlash under the Linux AMD64 architecture, the CPU must support the AVX2 instruction set. Run cat /proc/cpuinfo | grep avx2 and confirm that there is output. By using such CPU instructions, TiFlash's vectorization engine can deliver better performance.
https://docs.pingcap.com/tidb/dev/tiflash-overview#architecture
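
An untested sketch of what the AVX2 variant could look like, assuming the same surrounding variables (filt_pos, data_pos, res_data, size) as in the SSE2 snippet above; names such as zero32 and filt_end_avx2 are chosen here only for illustration. The structure is unchanged: only the vector width (32 bytes), the 256-bit intrinsics, and the full-mask constant differ.

#if defined(__AVX2__)
/// Same idea as the SSE2 loop, but checks 32 filter bytes per iteration.
static constexpr size_t SIMD_BYTES = 32;
const __m256i zero32 = _mm256_setzero_si256();
const UInt8 * filt_end_avx2 = filt_pos + size / SIMD_BYTES * SIMD_BYTES;

while (filt_pos < filt_end_avx2)
{
    /// One bit per filter byte: bit i is set iff filt_pos[i] > 0
    /// (filter values are 0 or 1, so the signed compare is safe).
    UInt32 mask = static_cast<UInt32>(_mm256_movemask_epi8(_mm256_cmpgt_epi8(_mm256_loadu_si256(reinterpret_cast<const __m256i *>(filt_pos)), zero32)));

    if (0 == mask)
    {
        /// Nothing is inserted.
    }
    else if (0xFFFFFFFF == mask)
    {
        /// The whole 32-byte chunk passes the filter; copy it in one go.
        res_data.insert(data_pos, data_pos + SIMD_BYTES);
    }
    else
    {
        for (size_t i = 0; i < SIMD_BYTES; ++i)
            if (filt_pos[i])
                res_data.push_back(data_pos[i]);
    }

    filt_pos += SIMD_BYTES;
    data_pos += SIMD_BYTES;
}
#endif

As in the SSE2 version, the tail of fewer than SIMD_BYTES elements still needs the scalar fallback loop after the #endif. The mixed branch could also walk the set bits of mask (e.g. with __builtin_ctz) instead of re-reading filt_pos byte by byte, but that is a separate optimization.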

Labels

good first issue, help wanted, type/enhancement, type/performance
