---
# User change
title: "Optimize bitmap scanning in databases with SVE and NEON on Arm servers"

weight: 2

layout: "learningpathall"
---
## Overview

Bitmap scanning is a core operation in many database systems. It's essential for powering fast filtering in bitmap indexes, Bloom filters, and column filters. However, these scans can become performance bottlenecks in complex analytical queries.

In this Learning Path, you’ll learn how to accelerate bitmap scanning using Arm’s vector processing technologies, NEON and SVE, on Neoverse V2-based servers such as AWS Graviton4.

Specifically, you will:

* Explore how to use SVE instructions on Arm Neoverse V2-based servers like AWS Graviton4 to optimize bitmap scanning
* Compare scalar, NEON, and SVE implementations to demonstrate the performance benefits of specialized vector instructions

## What is bitmap scanning in databases?

Bitmap scanning involves searching through a bit vector to find positions where bits are set (`1`) or unset (`0`).
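To make the set/unset distinction concrete, here is a minimal sketch that collects the positions of unset bits. It is not part of this Learning Path's benchmark code, and the function name `scan_unset_bits` is hypothetical. It assumes bits are numbered least-significant-bit first within each byte, which is the layout the bit vector in later sections uses. The one subtlety is the last byte: when the bitmap length isn't a multiple of 8, its padding bits read as 0 and must not be reported.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: write the positions of all unset (0) bits into out_positions.
// Bits are numbered LSB-first within each byte: bit b of byte k is position k * 8 + b.
size_t scan_unset_bits(const uint8_t* data, size_t size_bits, uint32_t* out_positions) {
    size_t count = 0;
    size_t size_bytes = (size_bits + 7) / 8;
    for (size_t byte_idx = 0; byte_idx < size_bytes; byte_idx++) {
        // Invert the byte so that unset bits become set bits
        uint8_t inverted = (uint8_t)~data[byte_idx];
        for (int bit_pos = 0; bit_pos < 8; bit_pos++) {
            size_t global_pos = byte_idx * 8 + (size_t)bit_pos;
            // Padding bits in the last byte are not part of the bitmap
            if (global_pos >= size_bits) {
                break;
            }
            if (inverted & (1u << bit_pos)) {
                out_positions[count++] = (uint32_t)global_pos;
            }
        }
    }
    return count;
}
```

For example, with `data = {0x05, 0x02}` and `size_bits = 10`, bits 0, 2, and 9 are set, so the function returns 7 and reports positions 1, 3, 4, 5, 6, 7, and 8.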

In database systems, bitmaps are commonly used to represent:

* **Bitmap indexes**: each bit represents whether a row satisfies a particular condition
* **Bloom filters**: probabilistic data structures used to test set membership
* **Column filters**: bit vectors indicating which rows match certain predicates
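As an illustration of the column-filter and bitmap-index cases, the hypothetical sketch below builds a filter bitmap for the predicate `value > threshold` over an integer column, then combines two filters with a bitwise AND. The function names and the LSB-first bit layout are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical example: build a column-filter bitmap for "value > threshold".
// Bit i is set when row i matches. `bitmap` must hold (num_rows + 7) / 8 bytes.
void build_filter_bitmap(const int32_t* column, size_t num_rows,
                         int32_t threshold, uint8_t* bitmap) {
    memset(bitmap, 0, (num_rows + 7) / 8);
    for (size_t row = 0; row < num_rows; row++) {
        if (column[row] > threshold) {
            bitmap[row / 8] |= (uint8_t)(1u << (row % 8));
        }
    }
}

// Combine two filters for "predicate A AND predicate B": the result has a set
// bit only where both inputs have one.
void bitmap_and(const uint8_t* a, const uint8_t* b, uint8_t* out, size_t size_bytes) {
    for (size_t i = 0; i < size_bytes; i++) {
        out[i] = a[i] & b[i];
    }
}
```

With the column `{50, 120, 80, 300, 101, 99, 100, 150, 10, 200}` and a threshold of 100, rows 1, 3, 4, 7, and 9 match, so the filter bytes are `0x9A` and `0x02`. Scanning that bitmap for set bits recovers the matching row IDs.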

The operation of scanning a bitmap to find set bits is often in the critical path of query execution, making it a prime candidate for optimization.

## The evolution of vector processing for bitmap scanning

Here's how vector processing has evolved to improve bitmap scanning performance:

* **Generic scalar processing**: traditional bit-by-bit processing with conditional branches
* **Optimized scalar processing**: byte-level skipping to avoid processing empty bytes
* **NEON**: fixed-width 128-bit SIMD processing with vector operations
* **SVE**: scalable vector processing with predication and specialized instructions such as `MATCH` (an SVE2 instruction, which Neoverse V2 supports)
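To see the core idea behind the vector versions without using intrinsics, the sketch below emulates a 128-bit (16-byte) zero test with two 64-bit loads and uses it to skip empty regions of a bitmap. It is a portable scalar illustration written for this explanation, with hypothetical helper names, not the NEON or SVE code developed later. A NEON implementation can make the same 16-byte decision using a single 128-bit vector register.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical portable sketch: test a 16-byte (128-bit) chunk for any set bit
// using two 64-bit loads. memcpy avoids undefined behavior on unaligned access.
static bool chunk16_is_zero(const uint8_t* p) {
    uint64_t lo, hi;
    memcpy(&lo, p, sizeof lo);
    memcpy(&hi, p + 8, sizeof hi);
    return (lo | hi) == 0;
}

// Return the index of the first non-zero byte at or after `start`,
// or size_bytes if every remaining byte is zero.
size_t find_next_nonzero_byte(const uint8_t* data, size_t size_bytes, size_t start) {
    size_t i = start;
    // Skip whole 16-byte chunks that contain no set bits
    while (i + 16 <= size_bytes && chunk16_is_zero(data + i)) {
        i += 16;
    }
    // Finish byte by byte: either inside the non-zero chunk or in the short tail
    while (i < size_bytes && data[i] == 0) {
        i++;
    }
    return i;
}
```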

## Set up your Arm development environment

To follow this Learning Path, you need:

* An AWS Graviton4 instance running Ubuntu 24.04
* A GCC compiler with SVE support

First, install the required development tools:

```bash
sudo apt-get update
sudo apt-get install -y build-essential gcc g++
```
{{% notice Tip %}}
Compiler flags are only part of getting good performance on Arm; the compiler version matters too. For best results, use the latest available GCC release with SVE support. This Learning Path was tested with GCC 13, the default on Ubuntu 24.04, and newer versions should also work.
{{% /notice %}}
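Before writing vector code, it can help to confirm both that the compiler is targeting SVE and that the CPU supports it. The sketch below uses hypothetical function names. It checks the ACLE predefined macro `__ARM_FEATURE_SVE` at compile time and, on AArch64 Linux, queries the kernel's hardware capability bits with `getauxval(AT_HWCAP)` and `HWCAP_SVE`. The runtime check is guarded so the file still compiles on other platforms, where both functions return 0.

```c
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>  // getauxval, AT_HWCAP, and the AArch64 HWCAP_* bits
#endif

// Returns 1 if this file was compiled with SVE enabled
// (for example with -march=armv8-a+sve), 0 otherwise.
int compiled_with_sve(void) {
#ifdef __ARM_FEATURE_SVE
    return 1;
#else
    return 0;
#endif
}

// Returns 1 if the Linux kernel reports SVE support for the running CPU.
// On non-AArch64 or non-Linux builds this sketch simply returns 0.
int cpu_has_sve(void) {
#if defined(__aarch64__) && defined(__linux__) && defined(HWCAP_SVE)
    return (getauxval(AT_HWCAP) & HWCAP_SVE) != 0;
#else
    return 0;
#endif
}
```

Compile with SVE enabled, for example `gcc -O3 -march=armv8-a+sve`, and on Graviton4 you would expect both functions to return 1. Compiling without the `+sve` extension makes `compiled_with_sve()` return 0 even though the hardware supports SVE.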


Create a directory for your implementations:

```bash
mkdir -p bitmap_scan
cd bitmap_scan
```

## Next up: build the bitmap scanning foundation
With your development environment set up, you're ready to dive into the core of bitmap scanning. In the next section, you’ll define a minimal bitmap data structure and implement utility functions to set, clear, and inspect individual bits.
---
# User change
title: "Build and manage a bit vector in C"

weight: 3

layout: "learningpathall"

---
## Bitmap data structure

Start by defining a bitmap data structure that serves as the foundation for all the implementations that follow. It has three components:

* A byte array (`data`) that stores the bits
* The physical size in bytes (`size_bytes`)
* The logical size in bits (`size_bits`)
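The relationship between the logical and physical sizes, and between a bit position and its storage location, can be shown with a small hypothetical helper. It assumes the same layout as the `bitvector_set_bit` and `bitvector_get_bit` functions below: bit `pos` lives in byte `pos / 8` at bit `pos % 8`, least significant bit first.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: where a given bit position is stored
typedef struct {
    size_t byte_index;  // Which byte holds the bit
    uint8_t mask;       // Single-bit mask selecting it within that byte
} bit_location_t;

bit_location_t bit_location(size_t pos) {
    bit_location_t loc;
    loc.byte_index = pos / 8;
    loc.mask = (uint8_t)(1u << (pos % 8));
    return loc;
}

// Bytes needed to hold size_bits bits, rounding up to a whole byte
size_t bytes_for_bits(size_t size_bits) {
    return (size_bits + 7) / 8;
}
```

For example, position 13 maps to byte 1 with mask `0x20` (bit 5), and a 17-bit vector needs 3 bytes, leaving 7 unused padding bits in the last byte.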

For testing the different implementations in this Learning Path, you also need functions to generate and analyze the bitmaps.

Using a file editor of your choice, copy the code below into `bitvector_scan_benchmark.c`:

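```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

// Define a simple bit vector structure
typedef struct {
    uint8_t* data;      // Byte array holding the bits (bit i is in byte i / 8)
    size_t size_bytes;  // Physical size in bytes
    size_t size_bits;   // Logical size in bits
} bitvector_t;

// Create a new bit vector with all bits cleared (returns NULL on allocation failure)
bitvector_t* bitvector_create(size_t size_bits) {
    bitvector_t* bv = (bitvector_t*)malloc(sizeof(bitvector_t));
    if (bv == NULL) {
        return NULL;
    }
    bv->size_bits = size_bits;
    bv->size_bytes = (size_bits + 7) / 8;  // Round up to whole bytes
    bv->data = (uint8_t*)calloc(bv->size_bytes, 1);
    if (bv->data == NULL && bv->size_bytes > 0) {
        free(bv);
        return NULL;
    }
    return bv;
}

// Free bit vector resources
void bitvector_free(bitvector_t* bv) {
    if (bv == NULL) {
        return;
    }
    free(bv->data);
    free(bv);
}

// Set a bit in the bit vector (out-of-range positions are ignored)
void bitvector_set_bit(bitvector_t* bv, size_t pos) {
    if (pos < bv->size_bits) {
        bv->data[pos / 8] |= (uint8_t)(1u << (pos % 8));
    }
}

// Get a bit from the bit vector (out-of-range positions read as false)
bool bitvector_get_bit(bitvector_t* bv, size_t pos) {
    if (pos < bv->size_bits) {
        return (bv->data[pos / 8] & (1u << (pos % 8))) != 0;
    }
    return false;
}

// Generate a bit vector with approximately the specified density (0.0 to 1.0)
bitvector_t* generate_bitvector(size_t size_bits, double density) {
    bitvector_t* bv = bitvector_create(size_bits);
    if (bv == NULL || size_bits == 0) {
        return bv;  // Nothing to set (and avoids rand() % 0)
    }

    // Set bits according to density. rand() can pick the same position twice,
    // so the actual density may be slightly lower than requested.
    size_t num_bits_to_set = (size_t)(size_bits * density);

    for (size_t i = 0; i < num_bits_to_set; i++) {
        size_t pos = (size_t)rand() % size_bits;
        bitvector_set_bit(bv, pos);
    }

    return bv;
}

// Count set bits in the bit vector
size_t bitvector_count_scalar(bitvector_t* bv) {
    size_t count = 0;
    for (size_t i = 0; i < bv->size_bits; i++) {
        if (bitvector_get_bit(bv, i)) {
            count++;
        }
    }
    return count;
}
```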
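`bitvector_count_scalar` checks one bit at a time. If you want a faster reference count, for example to size a result buffer or to cross-check scan results, a per-byte population count is a common alternative. This sketch uses the GCC/Clang builtin `__builtin_popcount`. The function name is hypothetical, and it assumes padding bits past `size_bits` are zero, which holds as long as bits are only set through `bitvector_set_bit`.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical alternative to bitvector_count_scalar: count set bits one byte
// at a time with the GCC/Clang builtin __builtin_popcount.
// Assumes padding bits in the last byte are zero.
size_t count_set_bits_popcount(const uint8_t* data, size_t size_bytes) {
    size_t count = 0;
    for (size_t i = 0; i < size_bytes; i++) {
        count += (size_t)__builtin_popcount(data[i]);
    }
    return count;
}
```

Called as `count_set_bits_popcount(bv->data, bv->size_bytes)`, it should return the same value as `bitvector_count_scalar(bv)`.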

## Next up: implement and benchmark your first scalar bitmap scan

With your bit vector infrastructure in place, you're ready to scan it for set bits, the core operation underpinning bitmap-based filters in database systems.
---
# User change
title: "Implement scalar bitmap scanning in C"

weight: 4

layout: "learningpathall"


---
## Bitmap scanning implementations

Bitmap scanning is a fundamental operation in performance-critical systems such as databases, search engines, and filtering pipelines. It involves identifying the positions of set bits (`1`s) in a bit vector, which is often used to represent filtered rows, bitmap indexes, or membership flags.

In this section, you'll implement two scalar approaches to bitmap scanning in C that locate all set bit positions: a simple per-bit baseline, and an optimized version that reduces overhead for sparse data.

### Generic scalar implementation

This is the most straightforward implementation, checking each bit individually. It serves as the baseline for comparison against the other implementations to follow.

Copy the code below into the same file:

```c
// Generic scalar implementation of bit vector scanning (bit-by-bit)
// result_positions must have room for every set bit; positions are stored as
// uint32_t, so bitmaps are limited to 2^32 bits.
size_t scan_bitvector_scalar_generic(bitvector_t* bv, uint32_t* result_positions) {
    size_t result_count = 0;

    for (size_t i = 0; i < bv->size_bits; i++) {
        if (bitvector_get_bit(bv, i)) {
            result_positions[result_count++] = (uint32_t)i;
        }
    }

    return result_count;
}
```

This generic implementation examines every bit, even when most bits are not set. Each check pays per-bit overhead (a bounds check, a shift, and a mask, plus a function call if the compiler doesn't inline `bitvector_get_bit`), and the loop doesn't take advantage of any vector instructions.
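One common scalar technique for avoiding per-bit work, shown here as a hypothetical variant rather than one of this Learning Path's implementations, is to process 64 bits at a time and jump straight to each set bit with count-trailing-zeros (`__builtin_ctzll` in GCC and Clang), then clear that bit with `word &= word - 1`. The inner loop runs once per set bit instead of once per bit. The word is assembled with shifts so that bit numbering matches the byte-level layout regardless of host endianness.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar variant: scan 64 bits per step, visiting only set bits.
// Bit b of byte k maps to position k * 8 + b, as in the byte-level code.
size_t scan_words_ctz(const uint8_t* data, size_t size_bits, uint32_t* result_positions) {
    size_t size_bytes = (size_bits + 7) / 8;
    size_t result_count = 0;
    for (size_t base = 0; base < size_bytes; base += 8) {
        // Assemble up to 8 bytes into one word; byte k supplies bits 8k..8k+7
        size_t n = (size_bytes - base < 8) ? size_bytes - base : 8;
        uint64_t word = 0;
        for (size_t k = 0; k < n; k++) {
            word |= (uint64_t)data[base + k] << (8 * k);
        }
        while (word != 0) {
            int bit = __builtin_ctzll(word);  // Index of the lowest set bit
            size_t pos = base * 8 + (size_t)bit;
            if (pos >= size_bits) {
                break;  // Remaining set bits are padding past the logical end
            }
            result_positions[result_count++] = (uint32_t)pos;
            word &= word - 1;  // Clear the lowest set bit
        }
    }
    return result_count;
}
```

With bytes 0, 8, and 9 set to `0x05`, `0x80`, and `0x01` in an 80-bit bitmap, it reports positions 0, 2, 71, and 72.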

The following implementations address these inefficiencies, starting with byte-level skipping in scalar code.

### Optimized scalar implementation

This implementation adds byte-level skipping to avoid processing empty bytes.

Copy this optimized C scalar implementation code into the same file:

```c
// Optimized scalar implementation of bit vector scanning (byte-level)
size_t scan_bitvector_scalar(bitvector_t* bv, uint32_t* result_positions) {
    size_t result_count = 0;

    for (size_t byte_idx = 0; byte_idx < bv->size_bytes; byte_idx++) {
        uint8_t byte = bv->data[byte_idx];

        // Skip empty bytes
        if (byte == 0) {
            continue;
        }

        // Process each bit in the byte
        for (int bit_pos = 0; bit_pos < 8; bit_pos++) {
            if (byte & (1u << bit_pos)) {
                size_t global_pos = byte_idx * 8 + (size_t)bit_pos;
                // Ignore padding bits past the logical end of the bitmap
                if (global_pos < bv->size_bits) {
                    result_positions[result_count++] = (uint32_t)global_pos;
                }
            }
        }
    }

    return result_count;
}
```
Instead of iterating through each bit individually, this implementation processes one byte (8 bits) at a time. The main optimization over the previous scalar implementation is checking if an entire byte is zero and skipping it entirely. For sparse bitmaps, this can dramatically reduce the number of bit checks.
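To estimate how much byte-level skipping helps, assume each bit is set independently with probability p (the density). This is an idealization, since `generate_bitvector` picks random positions and only approximates independence. The fraction of bytes that can be skipped is then:

$$
P(\text{byte} = 0) = (1 - p)^{8}
$$

At p = 0.01, this is about 0.923, so roughly 92% of bytes are skipped. At p = 0.1 it drops to about 0.430, and at p = 0.5 to about 0.0039, where almost every byte still goes through the per-bit inner loop. Byte skipping therefore pays off mainly for sparse bitmaps.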

## Next up: accelerate bitmap scanning with NEON and SVE

You’ve now implemented two scalar scanning routines:

* A generic per-bit loop for correctness and simplicity

* An optimized scalar version that improves performance using byte-level skipping

These provide a solid foundation and performance baseline, but scalar methods can only take you so far. To unlock larger throughput gains, it’s time to use SIMD (Single Instruction, Multiple Data) execution.

In the next section, you’ll use Arm NEON and SVE vector instructions to accelerate bitmap scanning. These approaches process multiple bytes at once and can significantly outperform scalar loops, especially on modern Arm-based CPUs such as AWS Graviton4.