Improve performance of functions with dynamic arguments #345

Merged
merged 21 commits into from
May 1, 2025

osa1
Member

@osa1 osa1 commented Apr 16, 2025

Matrix4.translate is used by Flutter and it often appears as an overhead when
profiling Flutter apps compiled to Wasm.

[Image: matrix4_translate_prof]

The function currently takes a dynamic argument and takes different code
paths based on the type.

This is inefficient when the call site already knows the argument's type.

In general, when we have a performance-critical generic function that checks
types of the arguments, it makes sense to introduce specialized versions of the
function based on precise argument types that it handles, so that callers can
call the more efficient specialized versions and avoid the type test overheads.

For each function with a dynamic argument, this PR introduces specialized
functions based on the precise argument types that the function handles, calls
those specialized functions from the dynamic function, and then inlines the
dynamic function.

This allows compilers to eliminate the type tests when the types are known in a
call site, and call the efficient functions directly.

For example, for Vector4.translate, we introduce:

  • Vector4.translateByDouble(double x, double y, double z)
  • Vector4.translateByVector3(Vector3 v3)
  • Vector4.translateByVector4(Vector4 v4)

Call sites that know the argument type can directly call these for performance.

Existing call sites will be automatically improved when updating the library,
as the type-testing functions will be inlined and type tests will be eliminated
in most (probably all) cases.
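
A minimal sketch of the pattern described above, not the actual vector_math code. `Mat4`, `Vec3` and `Vec4` are hypothetical stand-ins for the library's classes. The dynamic entry point only tests the type and delegates, so once it is inlined at a call site with a known argument type, a compiler can drop the type tests:

```dart
import 'dart:typed_data';

// Hypothetical stand-ins for the library's vector types (assumption).
class Vec3 {
  final double x, y, z;
  const Vec3(this.x, this.y, this.z);
}

class Vec4 {
  final double x, y, z, w;
  const Vec4(this.x, this.y, this.z, this.w);
}

// Hypothetical stand-in for Matrix4 (assumption), column-major storage.
class Mat4 {
  // Starts as the identity matrix.
  final Float64List _s = Float64List(16)
    ..[0] = 1.0
    ..[5] = 1.0
    ..[10] = 1.0
    ..[15] = 1.0;

  double operator [](int i) => _s[i];

  /// Generic entry point: only tests the type and delegates.
  @pragma('vm:prefer-inline')
  @pragma('wasm:prefer-inline')
  @pragma('dart2js:prefer-inline')
  void translate(dynamic x, [double y = 0.0, double z = 0.0]) {
    if (x is Vec3) {
      translateByVector3(x);
    } else if (x is Vec4) {
      translateByVector4(x);
    } else if (x is double) {
      translateByDouble(x, y, z, 1.0);
    } else {
      throw UnimplementedError();
    }
  }

  /// Specialized versions: no type tests, callable directly.
  void translateByVector3(Vec3 v) => translateByDouble(v.x, v.y, v.z, 1.0);

  void translateByVector4(Vec4 v) => translateByDouble(v.x, v.y, v.z, v.w);

  void translateByDouble(double tx, double ty, double tz, double tw) {
    _s[12] = _s[0] * tx + _s[4] * ty + _s[8] * tz + _s[12] * tw;
    _s[13] = _s[1] * tx + _s[5] * ty + _s[9] * tz + _s[13] * tw;
    _s[14] = _s[2] * tx + _s[6] * ty + _s[10] * tz + _s[14] * tw;
    _s[15] = _s[3] * tx + _s[7] * ty + _s[11] * tz + _s[15] * tw;
  }
}

void main() {
  // Old-style call site: goes through the dynamic entry point.
  final viaDynamic = Mat4()..translate(const Vec3(1.0, 2.0, 3.0));
  // New-style call site: calls the specialized function directly.
  final viaTyped = Mat4()..translateByVector3(const Vec3(1.0, 2.0, 3.0));
  print(viaDynamic[12] == viaTyped[12] &&
      viaDynamic[13] == viaTyped[13] &&
      viaDynamic[14] == viaTyped[14]);
}
```

Both call sites produce the same matrix; the difference is only whether the compiler has to prove the argument type before it can skip the `is` checks.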

dart2wasm benchmarks: (-O2)

// Before
Matrix4.translateByDoubleGeneric(RunTime): 8.032 us.
Matrix4.translateByVector3Generic(RunTime): 8.457468284493933 us.
Matrix4.translateByVector4Generic(RunTime): 9.094468169361408 us.

// After
Matrix4.translateByDoubleGeneric(RunTime): 4.86 us.
Matrix4.translateByVector3Generic(RunTime): 4.796994003757495 us.
Matrix4.translateByVector4Generic(RunTime): 4.795 us.
Matrix4.translateByDouble(RunTime): 4.8425 us.
Matrix4.translateByVector3(RunTime): 4.997243753445308 us.
Matrix4.translateByVector4(RunTime): 4.909493863132671 us.

dart2js benchmarks: (-O4)

// Before
Matrix4.translateByDoubleGeneric(RunTime): 6.224801171860612 us.
Matrix4.translateByVector3Generic(RunTime): 9.400219449286789 us.
Matrix4.translateByVector4Generic(RunTime): 10.751954304194207 us.

// After
Matrix4.translateByDoubleGeneric(RunTime): 4.21 us.
Matrix4.translateByVector3Generic(RunTime): 4.388117869308463 us.
Matrix4.translateByVector4Generic(RunTime): 4.9375 us.
Matrix4.translateByDouble(RunTime): 4.1782447771940285 us.
Matrix4.translateByVector3(RunTime): 4.436994453756933 us.
Matrix4.translateByVector4(RunTime): 5.0375 us.

AOT benchmarks:

// Before
Matrix4.translateByDoubleGeneric(RunTime): 6.5326339347321305 us.
Matrix4.translateByVector3Generic(RunTime): 6.445286109427781 us.
Matrix4.translateByVector4Generic(RunTime): 6.807083684061711 us.

// After
Matrix4.translateByDoubleGeneric(RunTime): 4.193214 us.
Matrix4.translateByVector3Generic(RunTime): 4.132444 us.
Matrix4.translateByVector4Generic(RunTime): 4.821352357295285 us.
Matrix4.translateByDouble(RunTime): 4.074234 us.
Matrix4.translateByVector3(RunTime): 4.061202 us.
Matrix4.translateByVector4(RunTime): 4.735463330670837 us.

JIT benchmarks:

// Before
Matrix4.translateByDoubleGeneric(RunTime): 4.189242 us.
Matrix4.translateByVector3Generic(RunTime): 8.532402136592522 us.
Matrix4.translateByVector4Generic(RunTime): 7.937426969102983 us.

// After
Matrix4.translateByDoubleGeneric(RunTime): 4.029674 us.
Matrix4.translateByVector3Generic(RunTime): 4.037351666666667 us.
Matrix4.translateByVector4Generic(RunTime): 4.901740372824534 us.
Matrix4.translateByDouble(RunTime): 4.247654940431325 us.
Matrix4.translateByVector3(RunTime): 4.058694 us.
Matrix4.translateByVector4(RunTime): 4.831515710605362 us.

Performance of the other improved functions (scale, scaled, operator *, and
leftTranslate) should improve similarly.

@coveralls

coveralls commented Apr 16, 2025

Coverage Status

coverage: 26.388% (+0.02%) from 26.372%
when pulling 7fb9a02 on osa1:faster_vector4_translate
into 39cafd4 on google:master.

@osa1 osa1 marked this pull request as ready for review April 16, 2025 11:44
@osa1
Member Author

osa1 commented Apr 16, 2025

Somehow I can't add reviewers to the PR.

@mkustermann @eyebrowsoffire could you have a look?

@spydon
Collaborator

spydon commented Apr 16, 2025

Maybe it's time to make a breaking release soon instead? Then I would remove all the dynamic arguments from all the classes, and we would also follow semver for #270

@osa1
Member Author

osa1 commented Apr 16, 2025

This change doesn't need to be a breaking change. We can add a deprecation to translate though as I would think that it's too slow for what it's intended for.

@spydon
Collaborator

spydon commented Apr 16, 2025

This change doesn't need to be a breaking change. We can add a deprecation to translate though as I would think that it's too slow for what it's intended for.

Yeah, it wouldn't be a breaking change right now. But if we follow semver, marking it as deprecated is just the first step towards a breaking change: when the deprecated method is removed, the major version should be bumped (https://semver.org/#how-should-i-handle-deprecating-functionality).

I'm not too fond of all the byType methods, but I also can't come up with a better solution, and it is definitely better than taking dynamic as an argument.

@kevmoo
Collaborator

kevmoo commented Apr 25, 2025

Let's get a changelog in here and bump the minor (.X.) version

@kevmoo
Collaborator

kevmoo commented Apr 25, 2025

@osa1 – by all means revert if you don't like the last commit

@osa1 osa1 force-pushed the faster_vector4_translate branch from f4bf7e3 to cdbed22 Compare April 29, 2025 11:15
@osa1
Member Author

osa1 commented Apr 29, 2025

by all means revert if you don't like the last commit

I dropped your commit as it changed benchmark results and made them worse, and then I had to debug this because I wasn't able to reproduce the benchmark results in my PR description.

(The "cleanup" change calls a less efficient function from a more efficient one and makes the translateByVector3 benchmark worse.)

@osa1 osa1 changed the title from "Add specialized Vector4.translate methods based on argument types" to "Improve performance of functions with dynamic arguments" Apr 30, 2025
@osa1
Member Author

osa1 commented Apr 30, 2025

@mraleph @rakudrama @mkustermann I think I addressed all of the feedback. Could you PTAL? Start with the PR description, I just applied the same changes to other functions that take dynamic arguments.

@kevmoo as far as I know still no one from the backend teams can approve reviews, can we give at least one person from each backend review rights? They already reviewed this PR, they just officially can't approve.

(Same as SDK, it makes sense for changes to libraries to be reviewed by backend teams to catch any performance or binary size issues)

@mkustermann mkustermann left a comment

Generally LGTM with the following comments

  • move stores up as far as possible
  • access [15] first
  • have single implementation that all others delegate to
  • make functions only have a single return & mark as @pragma('dart2js:prefer-inline')

If the last point improves performance in dart2js similar to dart2wasm & vm, then do we get meaningful perf benefits from having the {translate,scale,...}By{Double,...} operations?

@osa1
Member Author

osa1 commented Apr 30, 2025

move stores up as far as possible

Done. I haven't benchmarked this change in isolation.

have single implementation that all others delegate to

Done. Doesn't affect performance when I inline everything.

make functions only have a single return & mark as @pragma('dart2js:prefer-inline')

Done. Improves dart2js performance.

access [15] first

NOT done: doesn't make any difference in dart2wasm and AOT, but makes dart2js perform worse.


I'll update benchmark results in the PR description in a bit. Done.

}
}
}

void main() {
MatrixMultiplyBenchmark.main();

Collaborator

This structure of running each benchmark in turn is susceptible to JIT-effects where the code is initially specialized to the first benchmark, and then de-optimized and recompiled against less constrained inputs.
When this happens, the reported performance depends on the order of the benchmarks.
To counter this, I usually create a list of benchmarks, warm them all up, then do the actual measurements.
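
A rough sketch of the "warm everything up, then measure" approach described above. The `Benchmark` class and `runAll` helper are hypothetical and are not part of this package's benchmarks or of any benchmark library. The iteration counts are arbitrary assumptions:

```dart
// Hypothetical minimal benchmark description (assumption, not a library API).
class Benchmark {
  final String name;
  final void Function() run;
  Benchmark(this.name, this.run);
}

/// Average microseconds per call over [iterations] calls.
double measure(void Function() run, int iterations) {
  final watch = Stopwatch()..start();
  for (var i = 0; i < iterations; i++) {
    run();
  }
  watch.stop();
  return watch.elapsedMicroseconds / iterations;
}

/// Warms up every benchmark first, then measures each one.
Map<String, double> runAll(
  List<Benchmark> benchmarks, {
  int warmupIterations = 10000,
  int measuredIterations = 100000,
}) {
  // Phase 1: let the JIT see all input shapes before anything is timed, so
  // shared code is not specialized to whichever benchmark happens to run first.
  for (final b in benchmarks) {
    for (var i = 0; i < warmupIterations; i++) {
      b.run();
    }
  }
  // Phase 2: the actual measurements.
  return {
    for (final b in benchmarks) b.name: measure(b.run, measuredIterations),
  };
}

void main() {
  var sink = 0.0;
  final results = runAll([
    Benchmark('add', () => sink += 1.5),
    Benchmark('mul', () => sink *= 1.0000001),
  ]);
  results.forEach((name, us) => print('$name(RunTime): $us us.'));
}
```

With this structure, reordering the list should have less influence on the reported numbers than running and timing each benchmark in turn.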

Member Author

Let's do it separately. It's going to be a big change to update all of the benchmarks, and it can be done separately.

@rakudrama
Collaborator

Generally LGTM with the following comments

  • move stores up as far as possible

I'm worried about moving stores. There are constructors that take 'storage' as an input, so there can be aliasing via typed data views.
If you move a store, there is an obligation to explain to the reader why the order is safe for aliased inputs.
Yes, there are lots of bugs already with aliasing, but let's not add more.
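
To illustrate the general concern (this is not code from the package), here is a small sketch of how moving a store earlier can change results when an input and an output are typed data views over the same bytes. The two functions and their names are hypothetical:

```dart
import 'dart:typed_data';

/// Reads both inputs before storing anything: the result does not depend on
/// whether [out] shares memory with [input].
void sumDiffReadFirst(Float64List input, Float64List out) {
  final s = input[0] + input[1];
  final d = input[0] - input[1];
  out[0] = s;
  out[1] = d;
}

/// Same arithmetic with the first store moved up: if [out] aliases [input],
/// the second expression reads the value that was just overwritten.
void sumDiffStoreEarly(Float64List input, Float64List out) {
  out[0] = input[0] + input[1];
  out[1] = input[0] - input[1];
}

void main() {
  final storageA = Float64List.fromList([3.0, 1.0]);
  // Two separate view objects over the same underlying bytes.
  sumDiffReadFirst(
    Float64List.sublistView(storageA),
    Float64List.sublistView(storageA),
  );
  print(storageA); // elements are 4 and 2

  final storageB = Float64List.fromList([3.0, 1.0]);
  sumDiffStoreEarly(
    Float64List.sublistView(storageB),
    Float64List.sublistView(storageB),
  );
  print(storageB); // elements are 4 and 3: the early store changed input[0]
}
```

Whether a given function is exposed to this depends on whether it reads from one storage while writing to another that may overlap it.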

  • access [15] first

Can we make the statement check _m4storage[15]; work on all platforms? It works for VM-AOT and dart2js.
Moving statements for the effect of moving [15] first is less readable and open to changing read-write order that may affect aliased inputs.

  • have single implementation that all others delegate to
  • make functions only have a single return & mark as @pragma('dart2js:prefer-inline')

If the last point improves performance in dart2js similar to dart2wasm & vm, then do we get meaningful perf benefits from having the {translate,scale,...}By{Double,...} operations?

}
throw ArgumentError(arg);
return value;

Collaborator

I don't think it is worthwhile making the dynamic versions faster like this.
The user has to do a dynamic call or cast to use the result.
I'd rather they were deprecated with proper versioning.

Member Author

But it's free performance, why not make it faster?

(It's free because I assume all call sites will directly call it, so it'll always be inlined, and the type tests will always disappear as the argument types will also be known at the call sites. So no binary size increase.)

I added deprecations.

@osa1
Member Author

osa1 commented May 1, 2025

I'm worried about moving stores. There are constructors that take 'storage' as an input, so there can be aliasing via typed data views.
If you move a store, there is an obligation to explain to the reader why the order is safe for aliased inputs.
Yes, there are lots of bugs already with aliasing, but let's not add more.

Which function(s) do you mean that can have aliasing bugs?

The code with stores moved up is this:

void translateByDouble(double tx, double ty, double tz, double tw) {
  final t1 = _m4storage[0] * tx +
      _m4storage[4] * ty +
      _m4storage[8] * tz +
      _m4storage[12] * tw;
  _m4storage[12] = t1;

  final t2 = _m4storage[1] * tx +
      _m4storage[5] * ty +
      _m4storage[9] * tz +
      _m4storage[13] * tw;
  _m4storage[13] = t2;

  final t3 = _m4storage[2] * tx +
      _m4storage[6] * ty +
      _m4storage[10] * tz +
      _m4storage[14] * tw;
  _m4storage[14] = t3;

  final t4 = _m4storage[3] * tx +
      _m4storage[7] * ty +
      _m4storage[11] * tz +
      _m4storage[15] * tw;
  _m4storage[15] = t4;
}

There's only one array being used here, so aliasing can't be an issue.

Can we make the statement check _m4storage[15]; work on all platforms?

I reported benchmarks with this here: #345 (comment).

It doesn't make any difference in any of: AOT, dart2js, dart2wasm.

It's difficult to make it work with dart2wasm because wasm-opt doesn't track accessed locations and omit bounds checks based on that. As mentioned in the thread, I reported this use case to the wasm-opt team, but it's in the backlog with low priority.

@osa1
Member Author

osa1 commented May 1, 2025

I think all feedback is addressed now. @kevmoo we can merge this.

Make sure to squash and copy/paste PR description as the commit message. GitHub's UI messes with formatting when it automatically inserts PR description as the commit message when you click on "squash and merge".

@osa1
Member Author

osa1 commented May 1, 2025

Bumped major version instead as this is a breaking change. (New members added to a non-final class.)

This reverts commit 01daccc.
@osa1
Member Author

osa1 commented May 1, 2025

Per Martin's feedback I reverted the major version bump.

Flutter uses this package, and when it migrates to the new major version other packages that are commonly used with Flutter that also use vector_math won't compile. To avoid this we bump minor version.

The change is technically breaking (new members added to non-final class), but you shouldn't extend these classes anyway (kills performance). So hopefully no one's done it.

@kevmoo kevmoo merged commit 0279cb8 into google:master May 1, 2025
6 checks passed
@osa1 osa1 deleted the faster_vector4_translate branch May 1, 2025 16:23
copybara-service bot pushed a commit to dart-lang/sdk that referenced this pull request May 5, 2025
Revisions updated by `dart tools/rev_sdk_deps.dart`.

dartdoc (https://github.com/dart-lang/dartdoc/compare/95105e9..e4f9451):
  e4f9451a  2025-05-05  Jonas Finnemann Jensen  Fix duplicate entries of elements in a category when re-exported. (dart-lang/dartdoc#4043)
  876180bd  2025-05-01  dependabot[bot]  Bump github/codeql-action from 3.28.13 to 3.28.16 in the github-actions group (dart-lang/dartdoc#4045)

http (https://github.com/dart-lang/http/compare/63c477b..78d6114):
  78d6114  2025-05-02  Brian Quinlan  Add a new exception type `NSErrorClientException` (dart-lang/http#1763)
  7a2e7d5  2025-05-01  Brian Quinlan  Add a useful stringificiation to `WebSocketConnectionClosed` (dart-lang/http#1764)
  ccb6533  2025-05-01  dependabot[bot]  Bump the github-actions group across 1 directory with 4 updates (dart-lang/http#1761)
  7568b5c  2025-05-02  Alex Li  [web_socket] Adds `WebSocketException.toString()` (dart-lang/http#1756)
  3e4cceb  2025-05-01  Brian Quinlan  Make response headers tests pass on firefox (dart-lang/http#1762)
  5704b0c  2025-05-01  Brian Quinlan  [cronet_http/cupertino_http]: Fixes bugs where cancelling `StreamedResponse.stream` did not sever the connection (dart-lang/http#1760)

test (https://github.com/dart-lang/test/compare/c3755d8..55d1f9e):
  55d1f9ed  2025-05-05  Fichtelcoder  Fix typos in json_reporter documentation (dart-lang/test#2493)

vector_math (https://github.com/google/vector_math.dart/compare/39cafd4..0279cb8):
  0279cb8  2025-05-01  Ömer Sinan Ağacan  Improve performance of functions with dynamic arguments (google/vector_math.dart#345)

Change-Id: I9a67b997ebcf7ebe29162f8f524628013d53af5a
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/426581
Reviewed-by: Konstantin Shcheglov <scheglov@google.com>
Commit-Queue: Devon Carew <devoncarew@google.com>
@kevmoo
Collaborator

kevmoo commented May 12, 2025

dart-lang/sdk@3b444c5 landed and I'm guessing this rolled to Google3.

Should we do a release and get this rolled to Flutter?

Should we do a verification pass on Flutter first?

@osa1
Member Author

osa1 commented May 13, 2025

I think ideally for a package this small and simple we shouldn't need rolls for testing, we should be able to test it properly with just dart test and release.

I don't know how comprehensive testing in this package is as I did just one commit to the package, but I see tests for all the changed functions: translate, leftTranslate, scale, multiply.

Given that the changes are simple, and we have unit tests covering each of the changed functions, I would just publish.

/// Multiply this by a translation from the left.
@pragma('wasm:prefer-inline')
@pragma('vm:prefer-inline')
@pragma('dart2js:prefer-inline')

Collaborator

I get inlining the redirecting functions. But these are BIG functions. Should we really be inlining them?

Collaborator

@osa1 – wondering if we should hold off on the 2nd level of inlining until we know it's a net win...maybe?

CC @rakudrama

Collaborator

This could be counterproductive for JavaScript. I'd rather we didn't prefer-inline for dart2js.

JavaScript has its own JIT, and that can do inlining itself.
If this function is not inlined by dart2js, and super hot, it will be JIT-optimized.
Then all the call sites benefit, even if most of them are not hot.

Over-inlining can make the host function less likely to be JIT-ed, even if it is hot.
Either it is simply too big to attempt JIT optimization, or there are more opportunities for de-optimization, and eventually the JIT-optimizer declines to make further attempts.

Any function in JavaScript with more than a few operations is at risk of these counter-productive behaviors.

Collaborator

See #347

7 participants