[math] Add fast pose composition functions for point, vector3, and vector4 #23456

calderpg-tri · 2025-09-21T23:49:02Z

Adds implementations in fast_pose_composition_functions for the following operations:

RotationMatrix<double> * Vector3<double>
RigidTransform<double> * Vector3<double>
RigidTransform<double> * Vector4<double>

+@jwnimmer-tri for feature review, thanks!

This change is

calderpg-tri

Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @calderpg-tri)

a discussion (no related file):
working most of the test failures appears to be simple tolerance issues, where the new version is slightly different:

Expected error: 2.0000000000000002e-15
Observed error: 5.5025428657984321e-15
...
Expected error: 4.0000000000000003e-15
Observed error: 4.1936940031739312e-15
...
diff = 3.552713678800501e-15, tolerance = 2.220446049250313e-16, relative tolerance = 2.1040402406621104e-15
...
diff = 8.881784197001252e-16, tolerance = 2.220446049250313e-16, relative tolerance = 3.730793199006042e-16
...
Expected error: 2.9999999999999998e-15
Observed error: 3.2213814948889308e-15
...
Expected error: 5e-15
Observed error: 5.2440690678778878e-15
...
Expected error: 2.9999999999999998e-15
Observed error: 3.2213814948889308e-15
...
diff = 5.329070518200751e-15, tolerance = 3.552713678800501e-15
...
diff = 1.9984014443252818e-15, tolerance = 1.8651746813702627e-15

calderpg-tri

The one test failure I don't quite understand is from //tools/install/libdrake:py/exported_symbols_test which seems to be finding new? symbols from hwy

FUNC WEAK DEFAULT
 hwy::ChosenTarget::LoadMask() const
 _ZNK3hwy12ChosenTarget8LoadMaskEv

FUNC WEAK DEFAULT
 hwy::ChosenTarget::LoadMask() const
 _ZNK3hwy12ChosenTarget8LoadMaskEv

Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes

jwnimmer-tri

I will help out with this PR but I don't have a lot of extra time this week, and I'm not the SME for this code, so I'll ask +assignee:@sherm1 to please join me in feature review.

I'll check on the symbol test failure locally.

@jwnimmer-tri reviewed 4 of 6 files at r1.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @calderpg-tri)

math/fast_pose_composition_functions.h line 44 at r1 (raw file):

                  RotationMatrix<double>* R_AC);

/* Composes a RotationMatrix<double> with a 3-element vector, resulting in a new

@sherm1 please weigh in on the new mnemonics of v3 and p and v4 for the RHS operands.

math/fast_pose_composition_functions.h line 49 at r1 (raw file):

It is OK for v_A to overlap with one or both inputs. */
void ComposeRv3(const RotationMatrix<double>& R_AB, const double* v_B,
                double* v_A);

All of the new functions should be using Eigen::Vector3d and Eigen::Vector4d for their argument types, not raw pointers. This matches how the existing API pattern, and adds clarity about the array sizes and (lack of) memory aliasing between the matrix and either vector.

Therefore, we also should amend the "overlap" sentence in the documentation. Only when two arguments are the same type may they overlap in the first place, so it's really only ever the A and B vectors that may overlap; the R_AB or X_AB matrix can never overlap with the vectors anyway.

But really, looking at the call sites, in fact the input and output vectors will never alias each other in practice, so I think we should change the contract here that aliasing is forbidden. Then, we can drop the extra "temp" copying inside of the implementation (and the extra function that copies to temp and calls "NoAlias" function -- we only need the "NoAlias" in the first place).

The existing composition functions support operator* and operator*=. Since we don't have mul-assign operators for vectors, the aliasing is never an issue.

math/fast_pose_composition_functions.cc line 733 at r1 (raw file):

  // Column stu:                            s   t   u   _  =
  auto stu_ = hn::Mul(xyz_, WWW_);      //  xW  yW  zW  _

BTW I wonder if doing this term (xyz*WWW) last would safe us a cycle? As a masked load, the xyz_ will take longer than the unmasked loads, so in theory something like abc*XXX could get started faster than xyz*WWW could, and we can overlap the extra masking latency onto the FMA chain.

jwnimmer-tri

See #23457 for the exported symbols test fix. (Feel free to cherry-pick it here for now to placate CI.)

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @calderpg-tri)

calderpg-tri

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes

math/fast_pose_composition_functions.h line 49 at r1 (raw file):

All of the new functions should be using Eigen::Vector3d and Eigen::Vector4d for their argument types, not raw pointers

I tried this first, using the same style of forward declaration as is used here for RigidTransform and RotationMatrix, and got a bunch of incomplete type compilation failures for Vector3<T> in rigid_transform and rotation_matrix.

But really, looking at the call sites, in fact the input and output vectors will never alias each other in practice, so I think we should change the contract here that aliasing is forbidden.

That seems reasonable. fast_pose_composition_functions_test does exercise the aliasing case, but that is entirely artificial.

jwnimmer-tri

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @calderpg-tri)

math/fast_pose_composition_functions.h line 49 at r1 (raw file):

I tried this first, using the same style of forward declaration ...

Ah, of course. We can't reasonably forward-declare Eigen types.

Still, it would be nice to find a way to denote the array extent in the signature.

Could we use fixed-size arrays (like double[3]) in the signatures? IIRC those are passed by-pointer so would be the same machine code but more clear to the reader.

calderpg-tri

Thanks!

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes

math/fast_pose_composition_functions.h line 49 at r1 (raw file):

Previously, jwnimmer-tri (Jeremy Nimmer) wrote…

I tried this first, using the same style of forward declaration ...

Ah, of course. We can't reasonably forward-declare Eigen types.

Still, it would be nice to find a way to denote the array extent in the signature.

Could we use fixed-size arrays (like double[3]) in the signatures? IIRC those are passed by-pointer so would be the same machine code but more clear to the reader.

Done

sherm1

Great to see us get more value out of SIMD! That said, I examined these small operations in Godbolt some time back and came to the conclusion at the time that the Highway versions were slower than what g++ and clang++ generated with -O2. See my comments here for the SIMD attempts I made that failed to improve anything.

IIRC there are two main problems with trying to get a win here:

the targeted load & store, and shuffle operations are so expensive that the compilers' use of 2-wide SIMD + 1 scalar was faster than attempting to use 4-wide SIMD to produce 3 results, and
we lose the ability to inline these small operations by forcing an actual call into these handcrafted methods (introducing unnecessary argument passing and stack operations as well as forcing the caller to protect its register contents).

Since the purpose of handcrafting these methods is purely to get more performance, we need to be sure that that objective is achieved. What's needed is benchmarks that are relevant to the applications you care about, and published timings showing the value of the new code compared to the old. Do you have those measurements already? I've been using multibody/benchmarking/cassie for my multibody performance work -- possibly that would work for you too? Some examples of published performance results used to justify optimizations: #21853, #22253, #23061.

I'll hold off detailed review until we have some performance numbers.

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)

sherm1

BTW in case these small operations don't pay off, the same operation applied to a bunch of vectors at once might.

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)

calderpg-tri

My primary motivation here is to replace the current use of Isometry3d in the sphere-model collision checkers (which perform a large number of Isometry3d * Vector4d), with RigidTransformd. Using the existing collision_checking_benchmark and planning_benchmark in Anzu on my AMD 7965WX, this is what I see for the three cases:

Anzu master (Isometry3d * Vector4d)

Collision checker type	Binary check	Displacement check
Voxel	4.654853301e-06	2.2129226112e-05
MbPEnv	1.101966348e-05	2.7652729506e-05
SGCC	2.189804233e-05	6.491391658099999e-05

Planning operation	Time
Roadmap build	22.103597
PRM query	0.005182799645000001
BiRRT query	0.0035363034780000004

Anzu branch (RigidTransformd * Vector4d)

Collision checker type	Binary check	Displacement check
Voxel	6.012198614e-06	2.2356318583e-05
MbPEnv	1.1892335881999999e-05	2.8617117438000002e-05
SGCC	2.1872209202e-05	6.4194232426e-05

Planning operation	Time
Roadmap build	31.761208995
PRM query	0.007026622368999998
BiRRT query	0.0051097984800000055

Anzu branch + this Drake PR (RigidTransformd * Vector4d)

Collision checker type	Binary check	Displacement check
Voxel	4.57489179e-06	1.9536279983e-05
MbPEnv	1.0714951889e-05	2.6504123077e-05
SGCC	2.2281717575e-05	6.3413477381e-05

Planning operation	Time
Roadmap build	25.662554353
PRM query	0.005922336646000002
BiRRT query	0.004069964112

and my general takeaway is that the switch to RigidTransformd * Vector4d on its own comes with a pretty noticeable performance hit and then the SIMD implementation gets most of that back. I will see if the Cassie benchmark is useful as well.

(To be clear, these benchmarks are not as clean and principled as the Cassie benchmark, but they have been useful)

It's worth noting that the performance benefit (or not) of these SIMD operations is likely going to depend heavily on the CPU in use. Something with an early AVX2 implementation (Haswell, Broadwell, Zen 1, etc) is probably not going to see any improvement, but at this point I don't think we should be optimizing Drake code for hardware that old.

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes

sherm1

Thanks, Calder. This is good data but doesn't appear yet to be an apples-to-apples comparison. I can see why our C++ RigidTransform*Vector4 might perform poorly since it does runtime error checking that your routines don't do. Before replacing that with a family of tricky hand coded functions it would be worth writing the equivalent C++ functions and timing those.

We could consider simplifying the existing RigidTransform::operator*(Vector4) method to replace the error checking with a cautionary comment and a DEBUG_ASSERT. It seems a little extreme to worry about the provided value of W in Release builds.

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)

jwnimmer-tri · 2025-09-22T23:10:42Z

I can see why our C++ RigidTransform*Vector4 might perform poorly since it does runtime error checking that your routines don't do.

My impression of the code is that the if (vec_B(3) == 1 || vec_B(3) == 0) { guard remained intact in this PR, and thus in the reported numbers.

sherm1

Oh, sorry -- I missed that, thanks. In that case I think it would be worth a quick test that makes a non-error checking more-likely-to-be-inlined version of RigidTransform::operator*(Vector4) in C++ to see if that would get us back to the Eigen performance level.

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)

jwnimmer-tri · 2025-09-23T13:29:06Z

The next reply due here is on Calder with the new experiment (and then Sherm to offer more steering), but I'll also point out some context that might help steer the discussion:

Sherm's linked prior exploration was benchmarking RigidTransform * Vector3, while Calder's new benchmark is doing RigidTransform * Vector4d. It's entirely possible that the Vector3 changes here are nil / for the worse (there are no benchmark results reported in this PR yet for those operations) but the Vector4 change is an improvement.

If we don't have benchmark leverage on the Vector3 changes, then possibly this PR should be narrowed down to just the Vector4 change (whether that be removing the if-else-throw code, or switching it to use SIMD highway).

sherm1

Yes, that's helpful. We shouldn't lump the R*v3 and R*v4 methods together -- since gains here are likely to be marginal at best it is possible that v3 is zero or negative while v4 is positive. I can run my usual suite of benchmarks against these changes if that's helpful Calder -- I don't think Drake uses v4 at all but it does make extensive use of R*v3.

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)

calderpg-tri

I can run my usual suite of benchmarks against these changes if that's helpful Calder

I would say go ahead, so long as you have a new enough machine to test on (I would think Skylake{-X,-SP}/Zen 2 would be the oldest useful hardware)

Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes

sherm1 · 2025-09-24T22:48:56Z

I ran //multibody/benchmarking/cassie on two machines, with today's Drake master and with this PR. The machines:

Puget Haswell-EP Xeon E5-2699 (older)
Laptop TigerLake Xeon W-11855M (recent, has AVX512 etc)

The benchmark tests nine multibody plant computations on the Cassie robot. None of these use the R*v4 operator but they do use R*v3 to varying degrees. (The tests are: position kinematics, velocity kinematics, composite body inertias, Jacobian by slow method, Fast block Jacobian, mass matrix via inverse dynamics, fast mass matrix, inverse dynamics, forward dynamics.)

The handcoded R*v3 produced a small but noticeable slowdown on both machines.
Old Puget: average -3.9%, worst cases were -10% on MM via ID and -9% on ID alone
New Laptop: average -4.8%, worst cases were -18% on MM via ID and -9% on ID alone.

In my view this is sufficient evidence to say we should focus on R*v4 in this PR and not attempt to hand code R*v3.

Add fast pose composition functions for point, vector3, and vector4

5e8da09

calderpg-tri commented Sep 22, 2025

View reviewed changes

calderpg-tri assigned jwnimmer-tri Sep 22, 2025

jwnimmer-tri assigned sherm1 Sep 22, 2025

jwnimmer-tri reviewed Sep 22, 2025

View reviewed changes

jwnimmer-tri mentioned this pull request Sep 22, 2025

[workspace] Fix a little more highway symbol leakage #23457

Merged

jwnimmer-tri reviewed Sep 22, 2025

View reviewed changes

calderpg-tri commented Sep 22, 2025

View reviewed changes

jwnimmer-tri reviewed Sep 22, 2025

View reviewed changes

Update function signatures and remove aliasing from test cases

46eb388

calderpg-tri commented Sep 22, 2025

View reviewed changes

sherm1 reviewed Sep 22, 2025

View reviewed changes

calderpg-tri commented Sep 22, 2025

View reviewed changes

sherm1 reviewed Sep 22, 2025

View reviewed changes

sherm1 reviewed Sep 23, 2025

View reviewed changes

calderpg-tri commented Sep 23, 2025

View reviewed changes

[math] Add fast pose composition functions for point, vector3, and vector4 #23456

Are you sure you want to change the base?

[math] Add fast pose composition functions for point, vector3, and vector4 #23456

Uh oh!

Conversation

calderpg-tri commented Sep 21, 2025 • edited by sherm1 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

calderpg-tri left a comment

Choose a reason for hiding this comment

Uh oh!

calderpg-tri left a comment

Choose a reason for hiding this comment

Uh oh!

jwnimmer-tri left a comment

Choose a reason for hiding this comment

Uh oh!

jwnimmer-tri left a comment

Choose a reason for hiding this comment

Uh oh!

calderpg-tri left a comment

Choose a reason for hiding this comment

Uh oh!

jwnimmer-tri left a comment

Choose a reason for hiding this comment

Uh oh!

calderpg-tri left a comment

Choose a reason for hiding this comment

Uh oh!

sherm1 left a comment

Choose a reason for hiding this comment

Uh oh!

sherm1 left a comment

Choose a reason for hiding this comment

Uh oh!

calderpg-tri left a comment

Choose a reason for hiding this comment

Uh oh!

sherm1 left a comment

Choose a reason for hiding this comment

Uh oh!

jwnimmer-tri commented Sep 22, 2025

Uh oh!

sherm1 left a comment

Choose a reason for hiding this comment

Uh oh!

jwnimmer-tri commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sherm1 left a comment

Choose a reason for hiding this comment

Uh oh!

calderpg-tri left a comment

Choose a reason for hiding this comment

Uh oh!

sherm1 commented Sep 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

calderpg-tri commented Sep 21, 2025 •

edited by sherm1

Loading

jwnimmer-tri commented Sep 23, 2025 •

edited

Loading