Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization for eliminating redundant memory operations #165

Merged
merged 8 commits into from
Jun 4, 2022

Conversation

yyzdtccjdtc
Copy link
Contributor

Optimization for eliminating redundant memory operations related to this issue.
#163 (comment)

@sharkdp
Copy link
Owner

sharkdp commented May 24, 2022

Nice. Can you tell us a bit more about how you benchmarked this?

@yyzdtccjdtc
Copy link
Contributor Author

Nice. Can you tell us a bit more about how you benchmarked this?

I tested it with the command line 'pastel distinct 2000 >/dev/null' and then took the average execution time. According to my test results, the running time was reduced from 9.54 seconds to 8.41 seconds, which has a 13.4% speedup.

Also I compiled it with the opt-level = 3.

@sharkdp
Copy link
Owner

sharkdp commented May 26, 2022

Unfortunately, I can not reproduce these benchmark results:

Command Mean [s] Min [s] Max [s] Relative
./pastel-master distinct 200 1.166 ± 0.023 1.141 1.208 1.00
./pastel-165 distinct 200 1.252 ± 0.023 1.215 1.298 1.07 ± 0.03

These results were created using a normal cargo build --release build. Benchmarking was done with hyperfine:

hyperfine -L version master,165 --warmup 3 './pastel-{version} distinct 200' --export-markdown results.md

@yyzdtccjdtc
Copy link
Contributor Author

Unfortunately, I can not reproduce these benchmark results:

Command Mean [s] Min [s] Max [s] Relative
./pastel-master distinct 200 1.166 ± 0.023 1.141 1.208 1.00
./pastel-165 distinct 200 1.252 ± 0.023 1.215 1.298 1.07 ± 0.03
These results were created using a normal cargo build --release build. Benchmarking was done with hyperfine:

hyperfine -L version master,165 --warmup 3 './pastel-{version} distinct 200' --export-markdown results.md

I believe if you try to change it to distinct 2000, you can see the difference.

@sharkdp
Copy link
Owner

sharkdp commented May 26, 2022

pastel distinct 2000 is not really a realistic use-case, to be honest. Everything up to 100... maybe. But who wants to generate 2000 "visually distinct" colors?

@yyzdtccjdtc
Copy link
Contributor Author

yyzdtccjdtc commented May 29, 2022

By carefully looking into the code I found the problem. The Lab structures in the Vec<(Color, Lab)> are generated here.

pastel/src/distinct.rs

Lines 82 to 85 in 47f9ddd

let colors = initial_colors
.iter()
.map(|c| (c.clone(), c.to_lab()))
.collect();

So my get_labs function to get the Lab structure out is a redundant operation.

    pub fn get_labs(&self) -> Vec<Lab> {
        self.colors.iter().map(|(_, l)| l.clone()).collect()
    }

My current approach is to generate two separate labs and colors vectors directly at the time of this SimulatedAnnealing structure creation.

According to my tests with ./pastel distinct 200 >results.md, the average execution time decreased from 0.7158s to 0.6389s, which is a 12% speedup. Hopefully, your test will achieve the same results as mine :)

src/distinct.rs Outdated
@@ -65,7 +65,8 @@ pub struct SimulationParameters {
}

pub struct SimulatedAnnealing<R: Rng> {
colors: Vec<(Color, Lab)>,
colors: Vec<Color>,
labs: Vec<Lab>,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we maybe call it lab_values everywhere, instead of labs?

src/lib.rs Outdated
@@ -938,7 +938,7 @@ impl fmt::Display for LMS {
}
}

#[derive(Debug, Clone, PartialEq)]
#[derive(Debug, Clone, PartialEq, Copy)]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want to derive copy? I'd prefer if we explicitly call .clone() when we really want it. The struct contains 4 64bit floats, so it's probably cheaper to usually pass it by reference, not by value, right?

src/distinct.rs Outdated
match self.distance_metric {
DistanceMetric::CIE76 => delta_e::cie76(&a.1, &b.1),
DistanceMetric::CIEDE2000 => delta_e::ciede2000(&a.1, &b.1),
DistanceMetric::CIE76 => delta_e::cie76(&a, &b),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please fix the clippy warning here: "this expression creates a reference which is immediately dereferenced by the compiler"

src/distinct.rs Outdated
DistanceMetric::CIE76 => delta_e::cie76(&a.1, &b.1),
DistanceMetric::CIEDE2000 => delta_e::ciede2000(&a.1, &b.1),
DistanceMetric::CIE76 => delta_e::cie76(&a, &b),
DistanceMetric::CIEDE2000 => delta_e::ciede2000(&a, &b),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

… and here.

@sharkdp
Copy link
Owner

sharkdp commented May 29, 2022

Now I can reproduce those results - thank you!

@yyzdtccjdtc
Copy link
Contributor Author

yyzdtccjdtc commented May 30, 2022

I changed all labs to lab_values and fixed the clippy warning. I also replaced the derive copy with clone() function at the same time.

The struct contains 4 64bit floats, so it's probably cheaper to usually pass it by reference, not by value, right?

I also looked into this issue a bit. I want to state in advance that this is a compiler sensitive issue.According to the output file of objdump, I found that if we use the reference like this
let at_lab = &lab_values[color], it will load the address of lab_values[color] first, then push this address one the stack, and finally it will first load the address from the stack and then load the three values from this address (because cie76 and ciede2000 functions only need the first three elements of Lab struct to do the calculation). So a total of 6 memory operations were performed.

But if we use copy() or clone() let at_lab = lab_values[color].clone();, it will load the values directly with two movupd instruction which can load 128 bits at a time. It then does the same three load operations as the reference to do the calculation. There are a total of 5 memory operations here, one less than reference.

And the execution time also proves this, the average time of reference is 0.68s (1.06x speedup), while the average time of clone is 0.64s (1.12x speedup).

If you think this modification is not so generic, I can also remove it and keep only the code related to the optimization of the separation of colors and lab_values vectors.

@yyzdtccjdtc
Copy link
Contributor Author

@sharkdp Please tell me if there are any other changes I need to make?

@sharkdp
Copy link
Owner

sharkdp commented Jun 4, 2022

No, looks good. Thank you very much!

@sharkdp sharkdp merged commit ed35893 into sharkdp:master Jun 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants