Conversation

@BioTurboNick
Contributor

@BioTurboNick BioTurboNick commented Feb 15, 2025

As pointed out by @Zentrik here, recent Nanosoldier runs suggested a significant performance regression for simple hvcats as a result of #39729 .

I revisited the code and determined that the main cause was that typed_hvncat iterated over each element and had to calculate the linear index for each one, resulting in many multiplication and addition operations per element. I realized that CartesianIndices could be used to restore the copy-as-a-whole pattern that typed_hvcat used, while retaining generality for arbitrary dimensions.
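To illustrate the difference (a minimal sketch with hypothetical names, not the actual Base implementation): the old path computed a linear index arithmetically for every element, while CartesianIndices lets each source block be addressed and copied as a whole.

```julia
# Sketch: stacking same-size 2-d blocks along dim 3 of a preallocated dest.
# Both functions produce the same result; names are illustrative only.

# Per-element approach: a linear index is computed for each element,
# costing several multiplies and adds per element.
function fill_linear!(dest, blocks)
    sd = strides(dest)
    offset = 0
    for b in blocks
        for j in axes(b, 2), i in axes(b, 1)
            li = 1 + (i - 1) * sd[1] + (j - 1) * sd[2] + offset
            dest[li] = b[i, j]
        end
        offset += length(b)
    end
    return dest
end

# Block-copy approach: CartesianIndices addresses each block's whole
# destination region, so the block is copied in one operation.
function fill_cartesian!(dest, blocks)
    for (k, b) in enumerate(blocks)
        dest[CartesianIndices(b), k] .= b   # copy the whole block at once
    end
    return dest
end
```

The block-copy form amortizes the index setup over the whole block, which is the copy-as-a-whole pattern described above.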

As I recall, a limitation when I wrote the hvncat code was that certain features were not available during compiler bootstrapping, requiring fully manual indexing. Now that the compiler has been made a stdlib, I believe this PR became possible.

Before merging I also want to check that I didn't hurt hvncat performance at all. Done

This should ideally be marked for 1.12 backport.

@KristofferC KristofferC added performance Must go faster backport 1.12 Change should be backported to release-1.12 labels Feb 15, 2025
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

I don't think the test failure is related? It occurred while testing the Profile module... EDIT: Yep, not related.

@BioTurboNick
Contributor Author

BioTurboNick commented Feb 15, 2025

Unfortunately, there's extra overhead for everything else this PR wasn't intended to address, mostly because getindex with CartesianIndices relies on slow integer division via _ind2sub.
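The cost comes from converting a linear index back to a Cartesian one, which needs one integer division per dimension. Roughly what _ind2sub does, sketched here with a hypothetical helper for the 3-d case:

```julia
# Sketch of linear-to-Cartesian conversion for a 3-d array: each
# dimension costs a divrem, and integer division is slow on most CPUs.
function ind2sub_sketch(dims::NTuple{3,Int}, i::Int)
    i0 = i - 1
    d1, r1 = divrem(i0, dims[1])   # one integer division
    d2, r2 = divrem(d1, dims[2])   # another
    return (r1 + 1, r2 + 1, d2 + 1)
end

# CartesianIndices performs an equivalent conversion when indexed linearly.
ci = CartesianIndices((4, 5, 6))
```

Doing this per element is what makes linear getindex into CartesianIndices comparatively expensive in hot loops.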

EDIT: Ugh, there seems to be an annoying trade-off in performance. I'll need to explore further.

@KristofferC KristofferC mentioned this pull request Feb 15, 2025
31 tasks
@BioTurboNick
Contributor Author

BioTurboNick commented Feb 16, 2025

I believe I got it. The overhead of the block copy was too much for small arrays, so I added a branch that uses the original loop for those. The crossover point seemed to be around 4-8 elements, so I branched at >4.
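The shape of that cutoff can be sketched as follows (hypothetical names and a simplified 1-d setting; the threshold value is the one described above, not a claim about the actual Base code):

```julia
# Sketch: choose the copy strategy by output size. Tiny outputs use a
# plain element loop (no setup cost); larger ones use block copies,
# which amortize their setup over many elements.
function cat_small_or_block!(dest, blocks)
    if length(dest) <= 4              # crossover observed around 4-8 elements
        i = 1
        for b in blocks, x in b       # per-element loop
            dest[i] = x
            i += 1
        end
    else
        offset = 0
        for b in blocks               # whole-block copy
            copyto!(dest, offset + 1, b, 1, length(b))
            offset += length(b)
        end
    end
    return dest
end
```

Both branches produce identical results; only the constant-factor costs differ, which is why a size threshold resolves the trade-off.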

Two other aspects addressed:

  • 1-d arrays of pure numbers were a bit slow compared with cat, so I adopted its approach
  • Identified a significant performance reduction in an important case (see below), and found unusual time spent in setindex_shape_check. Adding @inline eliminated the bottleneck entirely, though could that be a symptom of a broader regression?
```julia
const x = [1 2; 3 4;;; 5 6; 7 8] # cat([1 2; 3 4], [5 6; 7 8], dims=3)
const y = x .+ 1
e17() = [x ;;; x ;;;; y ;;; y] # 99.356 ns (6 allocations: 544 bytes), was 4x slower with many more allocations
```

EDIT: There was one trade-off I couldn't find an optimal solution for, so I settled on resolving it in favor of all-arrays as the more common case (no change from master).

If the elements to concatenate are all arrays, the dimension calculation in _typed_hvncat_dims is more efficient iterating over eachindex of the tuple and indexing into it. If the elements are a mixture of arrays and scalars, iterating over the elements directly is more efficient. Swap the two patterns and there's substantial overhead, either from indexing into the tuple (mixed arrays and scalars) or from the iteration itself (all arrays).

Ultimately not a big impact, but it's a bit of a gripe that the compiler can be fickle like that.
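The two scan patterns can be sketched like this (hypothetical helpers summing the extent along dim 1, treating scalars as length 1; the real function is _typed_hvncat_dims):

```julia
# Pattern that wins when all tuple elements are arrays:
# iterate eachindex and index into the tuple.
sumdim1_indexed(as::Tuple) =
    sum(i -> as[i] isa AbstractArray ? size(as[i], 1) : 1, eachindex(as))

# Pattern that wins for mixed arrays and scalars:
# iterate the tuple's elements directly.
function sumdim1_iterated(as::Tuple)
    n = 0
    for a in as
        n += a isa AbstractArray ? size(a, 1) : 1
    end
    return n
end
```

Both compute the same answer; which one the compiler specializes well depends on whether the tuple's element types are homogeneous, which is the fickleness described above.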

@KristofferC KristofferC mentioned this pull request Feb 17, 2025
24 tasks
This was referenced Mar 24, 2025
@KristofferC KristofferC mentioned this pull request Apr 4, 2025
51 tasks
@KristofferC KristofferC mentioned this pull request Apr 29, 2025
53 tasks
@KristofferC KristofferC mentioned this pull request May 9, 2025
58 tasks
@KristofferC KristofferC mentioned this pull request Jun 6, 2025
60 tasks
@KristofferC KristofferC mentioned this pull request Jul 22, 2025
20 tasks
@KristofferC KristofferC mentioned this pull request Aug 6, 2025
38 tasks
@KristofferC KristofferC mentioned this pull request Aug 19, 2025
27 tasks
@KristofferC KristofferC mentioned this pull request Sep 24, 2025
24 tasks
@KristofferC KristofferC mentioned this pull request Sep 30, 2025
47 tasks
Member

@vtjnash vtjnash left a comment

SGTM

@vtjnash
Member

vtjnash commented Oct 16, 2025

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Collaborator

Your job failed.

@KristofferC KristofferC mentioned this pull request Oct 21, 2025
19 tasks
@BioTurboNick
Contributor Author

Is the nanosoldier failure something to do with the PR, or does it just need to be rerun?
