Fix performance regression in hvcat of simple matrices #57422

Open
wants to merge 10 commits into base: master

Conversation

BioTurboNick (Contributor) commented Feb 15, 2025

As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729.

I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and recalculated the linear index for each one, incurring several multiplications and additions per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.
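
For illustration, here is a minimal sketch of the difference between the two copy patterns; the function names and arguments (dest, src, rowoff, coloff) are hypothetical and not the actual Base code:

# Illustrative only; names are made up, not the identifiers used in Base.
# Per-element pattern: the linear destination index is recomputed for every
# element, costing a multiply and an add per element.
function fill_block_perelement!(dest::Matrix, src::Matrix, rowoff::Int, coloff::Int)
    nrows = size(dest, 1)
    for j in axes(src, 2), i in axes(src, 1)
        dest[(coloff + j - 1) * nrows + rowoff + i] = src[i, j]
    end
    return dest
end

# Copy-as-a-whole pattern: a CartesianIndices block selects the destination
# region once, and a single indexed assignment copies the entire array.
function fill_block_cartesian!(dest::Matrix, src::Matrix, rowoff::Int, coloff::Int)
    block = CartesianIndices((rowoff .+ axes(src, 1), coloff .+ axes(src, 2)))
    dest[block] = src
    return dest
end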

As I recall, a limitation when I originally wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Now that the compiler has been made a stdlib, I believe this approach is possible.

Before merging I also want to verify that hvncat performance wasn't hurt. Done

This should ideally be marked for 1.12 backport.

@KristofferC added the performance (Must go faster) and backport 1.12 (Change should be backported to release-1.12) labels Feb 15, 2025
BioTurboNick (Contributor, Author) commented Feb 15, 2025

I don't think the test failure is related; it occurred while testing the Profile module. EDIT: Yep, not related.

BioTurboNick (Contributor, Author) commented Feb 15, 2025

Unfortunately, there's extra overhead for the other cases this PR wasn't meant to touch, mostly because getindex with CartesianIndices relies on slow integer division via _ind2sub.

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.
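
To make the _ind2sub cost concrete, here is a rough sketch of the two access patterns (the function names are just for illustration; actual timings would need BenchmarkTools and depend on hardware and Julia version):

# Direct cartesian iteration carries the multidimensional counter along,
# so no per-element division is needed.
sum_cartesian(A) = sum(A[I] for I in CartesianIndices(A))

# Converting each linear index back to a CartesianIndex goes through
# divrem-style arithmetic (the _ind2sub path mentioned above).
function sum_via_ind2sub(A)
    C = CartesianIndices(A)
    s = zero(eltype(A))
    for i in eachindex(A)
        s += A[C[i]]   # C[i] performs the linear-to-cartesian conversion
    end
    return s
end

# e.g. with BenchmarkTools: @btime sum_cartesian($A) vs. @btime sum_via_ind2sub($A)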

@KristofferC mentioned this pull request Feb 15, 2025
BioTurboNick (Contributor, Author) commented Feb 16, 2025

I believe I've got it. The overhead of the block copy was too much for small arrays, so I added a branch to use the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4.
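
Roughly, the branch described above looks like the following sketch (the cutoff constant and function name are made up for illustration; this is not the PR's exact code):

const SMALL_BLOCK_CUTOFF = 4  # hypothetical; chosen from the 4-8 element crossover

function place_block!(dest::Matrix, src::Matrix, rowoff::Int, coloff::Int)
    if length(src) <= SMALL_BLOCK_CUTOFF
        # Tiny blocks: the setup cost of a ranged copy dominates, so a plain loop wins.
        for j in axes(src, 2), i in axes(src, 1)
            dest[rowoff + i, coloff + j] = src[i, j]
        end
    else
        # Larger blocks amortize the setup and benefit from the bulk copy.
        dest[rowoff .+ axes(src, 1), coloff .+ axes(src, 2)] = src
    end
    return dest
end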

Two other aspects addressed:

  • 1d arrays of pure numbers were a bit slow compared with cat, so I adopted its approach
  • Identified a significant performance reduction in an important case (see below) and found unusual time spent in setindex_shape_check. Adding @inline eliminated the bottleneck entirely, though that could be a symptom of a broader regression.
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e17() = [x ;;; x ;;;; y ;;; y] # 99.356 ns (6 allocations: 544 bytes), was 4x slower and many more allocations

EDIT: There was one trade-off I didn't find an optimal solution for, and I settled on resolving it in favor of all-arrays as the more common case (no change from master). If the elements to concatenate are all arrays, the dimension calculation in _typed_hvncat_dims is more efficient iterating over eachindex of the tuple and indexing into it. If the elements are a mixture of arrays and scalars, iterating over the elements with enumerate is more efficient. Swap the two and there's substantial overhead either from indexing into the tuple (mixed arrays and scalars) or from the iteration itself (all arrays). Ultimately not a big impact, but a bit of a gripe that the compiler can be fickle like that.
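
For reference, the two tuple-iteration styles in question look roughly like this (a stripped-down sketch; the names are hypothetical, `as` stands in for the tuple of hvncat arguments, and the real dimension bookkeeping is elided):

# Indexing into the tuple: efficient when every element is an array
# (homogeneous tuple), but adds overhead when arrays and scalars are mixed.
function count_elements_indexed(as::Tuple)
    total = 0
    for i in eachindex(as)
        total += length(as[i])
    end
    return total
end

# Iterating the elements directly: efficient for mixed arrays and scalars,
# but the iteration itself adds overhead in the all-array case.
function count_elements_iterated(as::Tuple)
    total = 0
    for (i, a) in enumerate(as)   # the index would feed the real dimension bookkeeping
        total += length(a)
    end
    return total
end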

@KristofferC mentioned this pull request Feb 17, 2025
This was referenced Mar 24, 2025
@KristofferC mentioned this pull request Apr 4, 2025
@KristofferC mentioned this pull request Apr 29, 2025