
Merging the Eigen Backend and Common Interface for Linear Solvers #111

Closed
wants to merge 3,365 commits into from

Conversation


@ProfFan ProfFan commented Sep 9, 2019

This is a placeholder for Travis to run CI tests. Also a place for commenting on the possible design goals and decisions.



@@ -100,26 +100,34 @@ namespace gtsam {
}

/* ************************************************************************* */
vector<boost::tuple<size_t, size_t, double> > GaussianFactorGraph::sparseJacobian() const {
vector<boost::tuple<size_t, size_t, double> >
GaussianFactorGraph::sparseJacobian(
Member
Could we do this function (with its tests) in a separate PR? Then this PR can be exclusively about linear solvers.

Collaborator Author

This change is in 837a09e, which is a late commit in Mandy's work on the Eigen backend. I can definitely separate them into different PRs, but I don't really see it as necessary because they will be rewritten in this PR anyway.

Member

@gchenfc please move these GFG functions and their tests to a separate PR to be merged first?

#include <gtsam/linear/VectorValues.h>

namespace gtsam {
class LinearSolver {
Member

Docs

Collaborator Author

Will do :)

gtsam/linear/LinearSolver.h (outdated, resolved)
@@ -0,0 +1,5 @@
//
Member

doxygen. Does this file need to exist?

Collaborator Author

Will change if there is nothing to implement

Collaborator Author

Also, does all the Ordering stuff actually belong to LinearSolverParams?

Member

Yeah, it’s an important property for direct solvers...

gtsam/linear/Scatter.cpp (outdated, resolved)
gtsam/linear/SparseEigenSolver.cpp (resolved)
gtsam/linear/SparseEigenSolver.cpp (outdated, resolved)
gtsam/linear/SparseEigenSolver.h (outdated, resolved)
gtsam/nonlinear/NonlinearOptimizer.cpp (outdated, resolved)
gtsam/nonlinear/NonlinearOptimizerParams.h (outdated, resolved)

ProfFan commented Sep 10, 2019

Hopefully Ordering (as a KeyVector) should be assignable, at least when one of them is empty :)

@ProfFan ProfFan changed the title [Placeholder, DO_NOT_MERGE] Merging the Eigen Backend and Common Interface for Linear Solvers [DO_NOT_MERGE] Merging the Eigen Backend and Common Interface for Linear Solvers Sep 12, 2019

ProfFan commented Sep 23, 2019

#121

@ProfFan ProfFan added this to the GTSAM 4.1 milestone Sep 23, 2019
@dellaert (Member)

@ProfFan I propose you do a PR "Linearize straight to sparse matrix" to this branch.


ProfFan commented May 31, 2020

I'll add some results from my optimization efforts:

Merging the sparse matrix generation

Contrary to initial projections, the speedup gained is minimal. It appears that sparse matrix generation with Eigen's setFromTriplets is already heavily optimized internally, so manual tuning is possible but does not guarantee better performance.

Profiling

I profiled the timeSFMBAL script with the following results:

(profiling screenshot omitted)

Flamegraph: https://storage.cloud.google.com/pastebin-bucket/flamegraph-gtsam-eigen.svg

It turns out that the major overheads are:

  1. Sparse matrix multiplication AT * A (25%)
  2. Eigen's LDLT Cholesky (22%) computes an ordering even when it is just the natural (identity) ordering

Nevertheless, the current setup allows us to have other backends as well, so I will start experimenting with SuiteSparse and cuSPARSE to see if there are any benefits.


ProfFan commented May 31, 2020

Fixed a bug in the sparse Eigen solver, now the result:

/Users/proffan/Projects/Development/CV/SLAM/gtsam_build/timing/timeSFMBAL
native-profiler-starter: waiting for profiler...
native-profiler: starting target executable itself...
Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    6.35e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    6.01e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    5.97e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    5.96e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    6.35e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    6.16e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    6.06e+00
-Total: 0 CPU (0 times, 0 wall, 31.23 children, min: 0 max: 0)
|   -optimize: 67.24 CPU (1 times, 67.5212 wall, 31.23 children, min: 67.24 max: 67.24)
|   |   -EigenOptimizer obtainSparseMatrix: 10.64 CPU (7 times, 10.6967 wall, 10.64 children, min: 10.64 max: 10.64)
|   |   |   -GaussianFactorGraph sparseJacobian: 5.8 CPU (7 times, 5.82378 wall, 5.8 children, min: 5.8 max: 5.8)
|   |   |   -EigenOptimizer convertSparse: 4.84 CPU (7 times, 4.87271 wall, 4.84 children, min: 4.84 max: 4.84)
|   |   -EigenOptimizer optimizeEigenCholesky create solver: 19.76 CPU (7 times, 19.8424 wall, 19.76 children, min: 19.76 max: 19.76)
|   |   -EigenOptimizer optimizeEigenCholesky solve: 0.83 CPU (7 times, 0.808024 wall, 0.83 children, min: 0.83 max: 0.83)

Compared to SEQUENTIAL_CHOLESKY:

-Total: 0 CPU (0 times, 0 wall, 77.28 children, min: 0 max: 0)
|   -optimize: 77.28 CPU (1 times, 77.4759 wall, 77.28 children, min: 77.28 max: 77.28)

We are 14% faster with Eigen's solver.

Flamegraph: https://storage.googleapis.com/pastebin-bucket/flamegraph_gtsam_eigen_solver_v2.svg


ProfFan commented Jun 1, 2020

This commit gains a further 1 s timing advantage:

-Total: 0 CPU (0 times, 0 wall, 12.26 children, min: 0 max: 0)
|   -optimize: 66.02 CPU (1 times, 66.1306 wall, 12.26 children, min: 66.02 max: 66.02)
|   |   -EigenOptimizer optimizeEigenCholesky: 32.71 CPU (7 times, 32.7449 wall, 12.26 children, min: 32.71 max: 32.71)
|   |   |   -EigenOptimizer optimizeEigenCholesky create solver: 9.99 CPU (7 times, 9.99557 wall, 9.99 children, min: 9.99 max: 9.99)
|   |   |   -EigenOptimizer optimizeEigenCholesky solve: 2.27 CPU (7 times, 2.25941 wall, 2.27 children, min: 2.27 max: 2.27)

However, profiling shows that further improvement is unlikely under the Eigen Sparse framework.

@dellaert (Member) left a comment

> Eigen's LDLT Cholesky (22%) computes an ordering even when it is just the natural (identity) ordering

It is crucial that Eigen does not compute its own ordering. If needed (really!!??), we need to make a copy of Eigen's LDL that does not. We cannot compare any solver with any other solver if the ordering changes. And almost the whole point of GTSAM is that we control the ordering, nobody else.

* -------------------------------------------------------------------------- */

/**
* @file LinearSolver.h
Member
fix

Collaborator Author

fixed


/**
* @file LinearSolver.h
* @brief Common Interface for Linear Solvers
Member

fix

Collaborator Author

fixed


ProfFan commented Jun 1, 2020

> Eigen's LDLT Cholesky (22%) computes an ordering even when it is just the natural (identity) ordering

> It is crucial that Eigen does not compute its own ordering. If needed (really!!??), we need to make a copy of Eigen's LDL that does not. We cannot compare any solver with any other solver if the ordering changes. And almost the whole point of GTSAM is that we control the ordering, nobody else.

@dellaert To clarify: what I mean is that Eigen spends some time computing the natural (identity) ordering as part of the solving process. That is an overhead, but we still control the ordering, because this (Eigen) ordering is the identity.


ProfFan commented Jun 1, 2020

I did some more tests with the Eigen solver and TBB. TBB is giving me a boost on my macOS 10.15. (1:04 without TBB, 0:54 with TBB).

If we turn on MKL and use MKL's PardisoLDLT the speed is even faster (0:48), but that is out of scope, as we cannot control the ordering with it.


ProfFan commented Jun 2, 2020

I solved the issue that Eigen is computing a "null" ordering. Now we do not have that overhead anymore (saved another 2 seconds)


dellaert commented Jun 2, 2020

> I solved the issue that Eigen is computing a "null" ordering. Now we do not have that overhead anymore (saved another 2 seconds)

Nice. Let's merge the other PR (after you get rid of O(m) mallocs :-)) and then have another chat about the timing.

@ProfFan ProfFan changed the title [DO_NOT_MERGE] Merging the Eigen Backend and Common Interface for Linear Solvers Merging the Eigen Backend and Common Interface for Linear Solvers Jun 4, 2020

ProfFan commented Jun 4, 2020

1:00 without TBB with current develop, 0:45 with TBB.


dellaert commented Jun 4, 2020

> 1:00 without TBB with current develop, 0:45 with TBB.

Not fully understanding. TBB with eigen solver, or with sequential cholesky?

Maybe make a colab that does all the timings and formats them nicely? (all currently supported linear solvers in this branch, i.e. old solvers + Eigen Cholesky and QR)


ProfFan commented Jun 7, 2020

With SuiteSparse (CHOLMOD) solver & TBB ON:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    4.41e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    4.38e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    4.38e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    4.41e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    4.42e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    4.39e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    4.39e+00
-Total: 0 CPU (0 times, 0 wall, 2.5 children, min: 0 max: 0)
|   -optimize: 58.61 CPU (1 times, 37.785 wall, 2.5 children, min: 58.61 max: 58.61)
|   |   -SuiteSparseSolver optimizeEigenCholesky: 22.09 CPU (7 times, 22.154 wall, 2.5 children, min: 22.09 max: 22.09)
|   |   |   -SuiteSparseSolver optimizeEigenCholesky create solver: 1.31 CPU (7 times, 1.34256 wall, 1.31 children, min: 1.31 max: 1.31)
|   |   |   -SuiteSparseSolver optimizeEigenCholesky solve: 1.19 CPU (7 times, 1.18271 wall, 1.19 children, min: 1.19 max: 1.19)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  57.24s user 1.82s system 153% cpu 38.480 total


dellaert commented Jun 7, 2020

> With SuiteSparse (CHOLMOD) solver & TBB ON:

Cool. I’m not totally understanding the timing numbers (wall 22s == CPU 22s so does cholmod not use multicore?)


ProfFan commented Jun 7, 2020

@dellaert The actual timing is the last line: 57.24s user 1.82s system 153% cpu 38.480 total.

Just got a mind-blowing result, I reran the timing on my Linux (same hardware), and SEQUENTIAL_CHOLESKY (GTSAM solver) beats SuiteSparse and Eigen sparse by 2x and 3x. Also on Linux the time is 10x faster than on macOS...

On my Linux:

| Solver | Time |
| --- | --- |
| SEQUENTIAL_CHOLESKY | 2.7 s |
| EIGEN_CHOLESKY | 5 s |
| SUITESPARSE_CHOLESKY | 4.79 s |


ProfFan commented Jun 7, 2020

Exactly the same iterations:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030740e+05    4.08e+06    1.00e-04     1    3.79e-01
   1  6.148214e+04    4.16e+04    3.33e-05     1    3.23e-01
   2  1.956152e+04    4.19e+04    3.33e-05     1    3.22e-01
   3  1.814774e+04    1.41e+03    1.11e-05     1    3.25e-01
   4  1.803417e+04    1.14e+02    3.98e-06     1    3.19e-01
   5  1.803390e+04    2.62e-01    1.33e-06     1    3.20e-01
   6  1.803390e+04    8.14e-05    4.42e-07     1    3.19e-01
-Total: 0 CPU (0 times, 0 wall, 3.2 children, min: 0 max: 0)
|   -optimize: 3.2 CPU (1 times, 2.64848 wall, 3.2 children, min: 3.2 max: 3.2)
timing/timeSFMBAL  3.13s user 0.20s system 119% cpu 2.786 total

But 20x faster...


dellaert commented Jun 7, 2020

> @dellaert The actual timing is the last line: 57.24s user 1.82s system 153% cpu 38.480 total.
>
> Just got a mind-blowing result, I reran the timing on my Linux (same hardware), and SEQUENTIAL_CHOLESKY (GTSAM solver) beats SuiteSparse and Eigen sparse by 2x and 3x. Also on Linux the time is 10x faster than on macOS...
>
> On my Linux:
>
> | Solver | Time |
> | --- | --- |
> | SEQUENTIAL_CHOLESKY | 2.7 s |
> | EIGEN_CHOLESKY | 5 s |
> | SUITESPARSE_CHOLESKY | 4.79 s |

Wow! Yeah, malloc on Mac has always been super-slow, which is one guess as to what is going on. Is this with clang or with gcc? Finally, is this with or without TBB? SEQUENTIAL_CHOLESKY does not really exploit multi-threading; for that, try MULTIFRONTAL_CHOLESKY.


dellaert commented Jun 7, 2020

> I reran the timing on my Linux (same hardware).

What do you mean? Just linux boot on your macbook?


ProfFan commented Jun 8, 2020

> I reran the timing on my Linux (same hardware).

> What do you mean? Just linux boot on your macbook?

My workstation (i7-8700K, 6c12t, 4.7GHz), dual-booted with macOS and Linux.

BTW, MULTIFRONTAL_CHOLESKY is 2.6s.

I think based on the current benchmarks it is better to get a dataset bigger than BAL to guide further efforts.


dellaert commented Jun 8, 2020

You can just use a larger BAL dataset: https://grail.cs.washington.edu/projects/bal/


ProfFan commented Jun 8, 2020

Some preliminary observations:

| Solver | Speed | Multithread | Memory |
| --- | --- | --- | --- |
| GTSAM SEQ | Fast (3x cuSparse) | No | Moderate |
| GTSAM MULT | Very fast (1.2x SEQ) | Yes | Very high |
| Eigen Sparse | Medium | No | Moderate |
| SuiteSparse | Slow (0.5x Eigen) | Yes | Moderate |
| CUDA cuSparse | Faster than Eigen by 1.5x | Yes | Moderate |

There also seems to be a bug in SEQUENTIAL_CHOLESKY, as the convergence is different from all other solvers:

Normal:

Processing: ../gtsam/examples/Data/problem-394-100368-pre.txt
Initial error: 4.54573e+06, values: 100762
iter      cost      cost_change    lambda  success iter_time
   0  3.441329e+05    4.20e+06    1.00e-04     1    1.95e+01
   1  3.016591e+05    4.25e+04    3.33e-05     1    2.13e+01
   2  2.978071e+05    3.85e+03    1.18e-05     1    1.76e+01
   3  2.977714e+05    3.57e+01    3.92e-06     1    1.87e+01

SEQUENTIAL_CHOLESKY:

Processing: ../gtsam/examples/Data/problem-394-100368-pre.txt
Initial error: 4.54573e+06, values: 100762
iter      cost      cost_change    lambda  success iter_time
   0      inf    6.94e-310    1.00e-04     0    1.72e+00
iter      cost      cost_change    lambda  success iter_time
   0      inf    6.94e-310    2.00e-04     0    1.83e+00
iter      cost      cost_change    lambda  success iter_time
   0      inf    6.94e-310    8.00e-04     0    1.82e+00
^C

dellaert and others added 27 commits January 16, 2023 15:33
@ProfFan ProfFan closed this Jan 20, 2023
@varunagrawal varunagrawal deleted the feature/fan/Eigen branch October 22, 2023 19:55
5 participants