-
Notifications
You must be signed in to change notification settings - Fork 1.3k
[math] Add fast pose composition functions for point, vector3, and vector4 #23456
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[math] Add fast pose composition functions for point, vector3, and vector4 #23456
Conversation
calderpg-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes (waiting on @calderpg-tri)
a discussion (no related file):
working most of the test failures appears to be simple tolerance issues, where the new version is slightly different:
Expected error: 2.0000000000000002e-15
Observed error: 5.5025428657984321e-15
...
Expected error: 4.0000000000000003e-15
Observed error: 4.1936940031739312e-15
...
diff = 3.552713678800501e-15, tolerance = 2.220446049250313e-16, relative tolerance = 2.1040402406621104e-15
...
diff = 8.881784197001252e-16, tolerance = 2.220446049250313e-16, relative tolerance = 3.730793199006042e-16
...
Expected error: 2.9999999999999998e-15
Observed error: 3.2213814948889308e-15
...
Expected error: 5e-15
Observed error: 5.2440690678778878e-15
...
Expected error: 2.9999999999999998e-15
Observed error: 3.2213814948889308e-15
...
diff = 5.329070518200751e-15, tolerance = 3.552713678800501e-15
...
diff = 1.9984014443252818e-15, tolerance = 1.8651746813702627e-15
calderpg-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The one test failure I don't quite understand is from //tools/install/libdrake:py/exported_symbols_test which seems to be finding new? symbols from hwy
FUNC WEAK DEFAULT
hwy::ChosenTarget::LoadMask() const
_ZNK3hwy12ChosenTarget8LoadMaskEv
FUNC WEAK DEFAULT
hwy::ChosenTarget::LoadMask() const
_ZNK3hwy12ChosenTarget8LoadMaskEv
Reviewable status: 1 unresolved discussion, needs platform reviewer assigned, needs at least two assigned reviewers, missing label for release notes
jwnimmer-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will help out with this PR but I don't have a lot of extra time this week, and I'm not the SME for this code, so I'll ask +assignee:@sherm1 to please join me in feature review.
I'll check on the symbol test failure locally.
@jwnimmer-tri reviewed 4 of 6 files at r1.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @calderpg-tri)
math/fast_pose_composition_functions.h line 44 at r1 (raw file):
RotationMatrix<double>* R_AC); /* Composes a RotationMatrix<double> with a 3-element vector, resulting in a new
@sherm1 please weigh in on the new mnemonics of v3 and p and v4 for the RHS operands.
math/fast_pose_composition_functions.h line 49 at r1 (raw file):
It is OK for v_A to overlap with one or both inputs. */ void ComposeRv3(const RotationMatrix<double>& R_AB, const double* v_B, double* v_A);
All of the new functions should be using Eigen::Vector3d and Eigen::Vector4d for their argument types, not raw pointers. This matches how the existing API pattern, and adds clarity about the array sizes and (lack of) memory aliasing between the matrix and either vector.
Therefore, we also should amend the "overlap" sentence in the documentation. Only when two arguments are the same type may they overlap in the first place, so it's really only ever the A and B vectors that may overlap; the R_AB or X_AB matrix can never overlap with the vectors anyway.
But really, looking at the call sites, in fact the input and output vectors will never alias each other in practice, so I think we should change the contract here that aliasing is forbidden. Then, we can drop the extra "temp" copying inside of the implementation (and the extra function that copies to temp and calls "NoAlias" function -- we only need the "NoAlias" in the first place).
The existing composition functions support operator* and operator*=. Since we don't have mul-assign operators for vectors, the aliasing is never an issue.
math/fast_pose_composition_functions.cc line 733 at r1 (raw file):
// Column stu: s t u _ = auto stu_ = hn::Mul(xyz_, WWW_); // xW yW zW _
BTW I wonder if doing this term (xyz*WWW) last would safe us a cycle? As a masked load, the xyz_ will take longer than the unmasked loads, so in theory something like abc*XXX could get started faster than xyz*WWW could, and we can overlap the extra masking latency onto the FMA chain.
jwnimmer-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #23457 for the exported symbols test fix. (Feel free to cherry-pick it here for now to placate CI.)
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @calderpg-tri)
calderpg-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes
math/fast_pose_composition_functions.h line 49 at r1 (raw file):
All of the new functions should be using
Eigen::Vector3dandEigen::Vector4dfor their argument types, not raw pointers
I tried this first, using the same style of forward declaration as is used here for RigidTransform and RotationMatrix, and got a bunch of incomplete type compilation failures for Vector3<T> in rigid_transform and rotation_matrix.
But really, looking at the call sites, in fact the input and output vectors will never alias each other in practice, so I think we should change the contract here that aliasing is forbidden.
That seems reasonable. fast_pose_composition_functions_test does exercise the aliasing case, but that is entirely artificial.
jwnimmer-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), missing label for release notes (waiting on @calderpg-tri)
math/fast_pose_composition_functions.h line 49 at r1 (raw file):
I tried this first, using the same style of forward declaration ...
Ah, of course. We can't reasonably forward-declare Eigen types.
Still, it would be nice to find a way to denote the array extent in the signature.
Could we use fixed-size arrays (like double[3]) in the signatures? IIRC those are passed by-pointer so would be the same machine code but more clear to the reader.
calderpg-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes
math/fast_pose_composition_functions.h line 49 at r1 (raw file):
Previously, jwnimmer-tri (Jeremy Nimmer) wrote…
I tried this first, using the same style of forward declaration ...
Ah, of course. We can't reasonably forward-declare Eigen types.
Still, it would be nice to find a way to denote the array extent in the signature.
Could we use fixed-size arrays (like
double[3]) in the signatures? IIRC those are passed by-pointer so would be the same machine code but more clear to the reader.
Done
sherm1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great to see us get more value out of SIMD! That said, I examined these small operations in Godbolt some time back and came to the conclusion at the time that the Highway versions were slower than what g++ and clang++ generated with -O2. See my comments here for the SIMD attempts I made that failed to improve anything.
IIRC there are two main problems with trying to get a win here:
- the targeted load & store, and shuffle operations are so expensive that the compilers' use of 2-wide SIMD + 1 scalar was faster than attempting to use 4-wide SIMD to produce 3 results, and
- we lose the ability to inline these small operations by forcing an actual call into these handcrafted methods (introducing unnecessary argument passing and stack operations as well as forcing the caller to protect its register contents).
Since the purpose of handcrafting these methods is purely to get more performance, we need to be sure that that objective is achieved. What's needed is benchmarks that are relevant to the applications you care about, and published timings showing the value of the new code compared to the old. Do you have those measurements already? I've been using multibody/benchmarking/cassie for my multibody performance work -- possibly that would work for you too? Some examples of published performance results used to justify optimizations: #21853, #22253, #23061.
I'll hold off detailed review until we have some performance numbers.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)
sherm1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW in case these small operations don't pay off, the same operation applied to a bunch of vectors at once might.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)
calderpg-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My primary motivation here is to replace the current use of Isometry3d in the sphere-model collision checkers (which perform a large number of Isometry3d * Vector4d), with RigidTransformd. Using the existing collision_checking_benchmark and planning_benchmark in Anzu on my AMD 7965WX, this is what I see for the three cases:
- Anzu
master(Isometry3d * Vector4d)
| Collision checker type | Binary check | Displacement check |
|---|---|---|
| Voxel | 4.654853301e-06 | 2.2129226112e-05 |
| MbPEnv | 1.101966348e-05 | 2.7652729506e-05 |
| SGCC | 2.189804233e-05 | 6.491391658099999e-05 |
| Planning operation | Time |
|---|---|
| Roadmap build | 22.103597 |
| PRM query | 0.005182799645000001 |
| BiRRT query | 0.0035363034780000004 |
- Anzu branch (
RigidTransformd * Vector4d)
| Collision checker type | Binary check | Displacement check |
|---|---|---|
| Voxel | 6.012198614e-06 | 2.2356318583e-05 |
| MbPEnv | 1.1892335881999999e-05 | 2.8617117438000002e-05 |
| SGCC | 2.1872209202e-05 | 6.4194232426e-05 |
| Planning operation | Time |
|---|---|
| Roadmap build | 31.761208995 |
| PRM query | 0.007026622368999998 |
| BiRRT query | 0.0051097984800000055 |
- Anzu branch + this Drake PR (
RigidTransformd*Vector4d)
| Collision checker type | Binary check | Displacement check |
|---|---|---|
| Voxel | 4.57489179e-06 | 1.9536279983e-05 |
| MbPEnv | 1.0714951889e-05 | 2.6504123077e-05 |
| SGCC | 2.2281717575e-05 | 6.3413477381e-05 |
| Planning operation | Time |
|---|---|
| Roadmap build | 25.662554353 |
| PRM query | 0.005922336646000002 |
| BiRRT query | 0.004069964112 |
and my general takeaway is that the switch to RigidTransformd * Vector4d on its own comes with a pretty noticeable performance hit and then the SIMD implementation gets most of that back. I will see if the Cassie benchmark is useful as well.
(To be clear, these benchmarks are not as clean and principled as the Cassie benchmark, but they have been useful)
It's worth noting that the performance benefit (or not) of these SIMD operations is likely going to depend heavily on the CPU in use. Something with an early AVX2 implementation (Haswell, Broadwell, Zen 1, etc) is probably not going to see any improvement, but at this point I don't think we should be optimizing Drake code for hardware that old.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes
sherm1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Calder. This is good data but doesn't appear yet to be an apples-to-apples comparison. I can see why our C++ RigidTransform*Vector4 might perform poorly since it does runtime error checking that your routines don't do. Before replacing that with a family of tricky hand coded functions it would be worth writing the equivalent C++ functions and timing those.
We could consider simplifying the existing RigidTransform::operator*(Vector4) method to replace the error checking with a cautionary comment and a DEBUG_ASSERT. It seems a little extreme to worry about the provided value of W in Release builds.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)
My impression of the code is that the |
sherm1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry -- I missed that, thanks. In that case I think it would be worth a quick test that makes a non-error checking more-likely-to-be-inlined version of RigidTransform::operator*(Vector4) in C++ to see if that would get us back to the Eigen performance level.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)
|
The next reply due here is on Calder with the new experiment (and then Sherm to offer more steering), but I'll also point out some context that might help steer the discussion: Sherm's linked prior exploration was benchmarking If we don't have benchmark leverage on the Vector3 changes, then possibly this PR should be narrowed down to just the Vector4 change (whether that be removing the if-else-throw code, or switching it to use SIMD highway). |
sherm1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's helpful. We shouldn't lump the R*v3 and R*v4 methods together -- since gains here are likely to be marginal at best it is possible that v3 is zero or negative while v4 is positive. I can run my usual suite of benchmarks against these changes if that's helpful Calder -- I don't think Drake uses v4 at all but it does make extensive use of R*v3.
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes (waiting on @calderpg-tri)
calderpg-tri
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can run my usual suite of benchmarks against these changes if that's helpful Calder
I would say go ahead, so long as you have a new enough machine to test on (I would think Skylake{-X,-SP}/Zen 2 would be the oldest useful hardware)
Reviewable status: 4 unresolved discussions, LGTM missing from assignees jwnimmer-tri(platform),sherm1(platform), commits need curation (https://drake.mit.edu/reviewable.html#curated-commits), missing label for release notes
|
I ran //multibody/benchmarking/cassie on two machines, with today's Drake master and with this PR. The machines:
The benchmark tests nine multibody plant computations on the Cassie robot. None of these use the The handcoded In my view this is sufficient evidence to say we should focus on |
Adds implementations in
fast_pose_composition_functionsfor the following operations:RotationMatrix<double> * Vector3<double>RigidTransform<double> * Vector3<double>RigidTransform<double> * Vector4<double>+@jwnimmer-tri for feature review, thanks!
This change is