CoD Bloom Effect on CPU

Implementing a bloom effect on CPU for learning purposes.

Result

The Algorithm

As can be seen from the diagram below, we start by downsampling the original image multiple times while storing them in a list. then we take the smallest image and upsample it to be the same size as the previous (larger) image aka the image that has been downsampled one less time. Now that we have two images of the same size (e.g. 256x256 and 256x256) we mix them together. Here we use a simple lerp with alpha=0.2. We continue until we get the final image.

A few notes:

Bilinear tap is retrieving the bilinear interpolation of the 4 closest pixels around a target point (x, y). x and y are float values between 0-1 but we need to find their 4 nearest pixels and blend their values based on their distance to (x,y). See

bloom_cpp/src_shit/main.cpp

Line 36 in 560ed06

double BilinearTap(const MyImage& image, double x, double y, int channel) {
Downsampling is done by calculating the weighted sum of bilinear taps at 13 predefined points each with predefined weights. See

bloom_cpp/src_shit/main.cpp

Line 124 in 560ed06

const std::vector<std::pair<double,double>> Coords = {
Upsampling is done similarly but using 9 points with predefined weights. See

bloom_cpp/src_shit/main.cpp

Line 76 in 560ed06

const std::vector<std::pair<double, double>> Coords = {

The blending of downsampled and upsampled images look like this:

B-E are downsampled versions of the original image (A)
then upsampling:

E' = E
D' = lerp(D, E', 0.2)
C' = lerp(C, D', 0.2)
B' = lerp(B, C', 0.2)
A' = lerp(A, B', 0.2)

A' is the result

Performance Comparison

Measured on an Intel i7-12700 CPU

Implementation	time	fps
Python	39.5 secs	0.025 fps
C++ (Debug)	2.7 secs	0.37 fps
C++ (Release)	0.68 secs	1.47 fps
C++ (Improved)	0.11 secs	9 fps
C++ (Multithread 8t)	0.032 secs	31 fps
OpenGL	? secs	? fps

Note: Change ${SRC_DIR} in CMakeLists.txt to compile other versions of the code.

Performance Improvements

I tried different approaches to improve performance, here I list them for future reference. (CPU: i7-4930k)

Technique	Time	Speedup
Baseline	3.05 secs	1x
Flat vector instead of 3D vector	1.30 secs	2.3x
Inline the GetPixel and SetPixel	1.08 secs	2.82x
Flat Lerp instead of 3D Lerp	1.06 secs	2.87x
Arithmetic Optimisations in BilinearTap	0.98 secs	3.11x
static const double arrays instead of vectors for weights	0.94 secs	3.24x
Avoid divisions in the inner-loop	0.87 secs	3.50x
std::move and emplace_back	0.81 secs	3.76x
Claude's version	0.72 secs	4.23x
OpenMP Multithread with 10 threads (Claude)	0.16 secs	19x

1. Flat vector instead of 3D vector

This improved the performance from 3.05 secs to 1.30 secs per 1024x1024 image.

From

std::vector<std::vector<std::vector<double>>> pixels;

double GetPixel(int x, int y, int channel) const{
    return data[y][x][channel];
}

void SetPixel(int x, int y, int channel, double value) {
    data[y][x][channel] = value;
}

To

std::vector<double> pixels;

double GetPixel(int x, int y, int channel) const{
    return data[y * width + x * channels + channel];
}
void SetPixel(int x, int y, int channel, double value) {
    data[y * width + x * channels + channel] = value;
}

2. Inline GetPixel and SetPixel

GetPixel and SetPixel are called very often, let's reduce the number of function calls.

This improved the performance from 1.30 secs to 1.08 secs per 1024x1024 image.

3. Flat Lerp instead of 3D Lerp

This didn't help that much.

From

MyImage Lerp(const MyImage& a, const MyImage& b, double t) {
    MyImage result(a.width, a.height, a.channels);
    for (int i = 0; i < a.height; ++i) {
        for (int j = 0; j < a.width; ++j) {
            for (int ch = 0; ch < a.channels; ++ch) {
                result.SetPixel(j, i, ch, a.GetPixel(j, i, ch) * (1.0 - t) + b.GetPixel(j, i, ch) * t);
            }
        }
    }
    return result;
}

To

MyImage Lerp(const MyImage& a, const MyImage& b, double t) {
    MyImage result(a.width, a.height, a.channels);
    for(int i=0; i<a.data.size(); ++i) {
        result.data[i] = a.data[i] * (1.0 - t) + b.data[i] * t;
    }
    return result;
}

4. Micro-optimizations in BilinearTap

Reducing the number of multiplications and also fetching data for the four pixels directly from memory.

This improved the performance from 1.06 secs to 0.98 secs per 1024x1024 image.

Refer to BilinearTap() for more details.

5. static const double arrays instead of vectors for weights

Used static const double arrays instead of std::vector for weights.

This improved the performance from 0.98 secs to 0.94 secs per 1024x1024 image.

6. Avoid divisions in the inner-loop

Avoid divisions in the inner-loop by pre-computing the inverse of new_w and new_h.

This improved the performance from 0.94 secs to 0.88 secs per 1024x1024 image.

7. std::move and emplace_back

Claude recommended a few other optimizations, including:

Using std::move instead of copying
Using std::vector::emplace_back instead of std::vector::push_back

This improved the performance from 0.88 secs to 0.80 secs per 1024x1024 image.

8. OpenMP Multithreading

Claude wrote a version of the code with OpenMP multithreading. for this image-size (1024x1024) and i7-12700(16pt + 4et), 8 threads gave the best performance.

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
images		images
src		src
src_claude		src_claude
src_claude_openmp		src_claude_openmp
src_shit		src_shit
.gitignore		.gitignore
Bloom_CPP		Bloom_CPP
Bloom_CPP.exe		Bloom_CPP.exe
CMakeLists.txt		CMakeLists.txt
README.md		README.md
bloom_flowchart.png		bloom_flowchart.png
downsampled_0.png		downsampled_0.png
downsampled_1.png		downsampled_1.png
downsampled_2.png		downsampled_2.png
downsampled_3.png		downsampled_3.png
downsampled_4.png		downsampled_4.png
downsampled_5.png		downsampled_5.png
downsampled_6.png		downsampled_6.png
downsampled_7.png		downsampled_7.png
graph.png		graph.png
output.png		output.png
upsampled_1.png		upsampled_1.png
upsampled_2.png		upsampled_2.png
upsampled_3.png		upsampled_3.png
upsampled_4.png		upsampled_4.png
upsampled_5.png		upsampled_5.png
upsampled_6.png		upsampled_6.png
upsampled_7.png		upsampled_7.png
upsampled_8.png		upsampled_8.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

CoD Bloom Effect on CPU

Result

The Algorithm

Performance Comparison

Performance Improvements

1. Flat vector instead of 3D vector

2. Inline GetPixel and SetPixel

3. Flat Lerp instead of 3D Lerp

4. Micro-optimizations in BilinearTap

5. static const double arrays instead of vectors for weights

6. Avoid divisions in the inner-loop

7. std::move and emplace_back

8. OpenMP Multithreading

About

Uh oh!

Releases

Packages

Uh oh!

Languages

Gholamrezadar/bloom_cpp

Folders and files

Latest commit

History

Repository files navigation

CoD Bloom Effect on CPU

Result

The Algorithm

Performance Comparison

Performance Improvements

1. Flat vector instead of 3D vector

2. Inline GetPixel and SetPixel

3. Flat Lerp instead of 3D Lerp

4. Micro-optimizations in BilinearTap

5. static const double arrays instead of vectors for weights

6. Avoid divisions in the inner-loop

7. std::move and emplace_back

8. OpenMP Multithreading

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages