
Merging the Eigen Backend and Common Interface for Linear Solvers #111

Closed
wants to merge 3,365 commits into from

Conversation


@ProfFan ProfFan commented Sep 9, 2019

This is a placeholder for Travis to run CI tests. Also a place for commenting on the possible design goals and decisions.



@@ -100,26 +100,34 @@ namespace gtsam {
}

/* ************************************************************************* */
vector<boost::tuple<size_t, size_t, double> > GaussianFactorGraph::sparseJacobian() const {
vector<boost::tuple<size_t, size_t, double> >
GaussianFactorGraph::sparseJacobian(
Member
Could we do this function (with its tests) in a separate PR? Then this PR can be exclusively about linear solvers.

Collaborator Author

This change is in 837a09e, which is a late commit in Mandy's work on the Eigen backend. I can definitely separate them into different PRs, but I don't really see it as necessary because they will be rewritten in this PR anyway.

Member

@gchenfc please move these GFG functions and their tests to a separate PR to be merged first?

#include <gtsam/linear/VectorValues.h>

namespace gtsam {
class LinearSolver {
Member

Docs

Collaborator Author

Will do :)

gtsam/linear/LinearSolver.h (outdated, resolved)
@@ -0,0 +1,5 @@
//
Member

doxygen. Does this file need to exist?

Collaborator Author

Will change if there is nothing to implement

Collaborator Author

Also, does all the Ordering stuff actually belong to LinearSolverParams?

Member

Yeah, it’s an important property for direct solvers...

gtsam/linear/Scatter.cpp (outdated, resolved)
gtsam/linear/SparseEigenSolver.cpp (resolved)
gtsam/linear/SparseEigenSolver.cpp (outdated, resolved)
gtsam/linear/SparseEigenSolver.h (outdated, resolved)
gtsam/nonlinear/NonlinearOptimizer.cpp (outdated, resolved)
gtsam/nonlinear/NonlinearOptimizerParams.h (outdated, resolved)

ProfFan commented Sep 10, 2019

Hopefully Ordering (as a KeyVector) should be assignable, at least when one of them is empty :)

@ProfFan ProfFan changed the title [Placeholder, DO_NOT_MERGE] Merging the Eigen Backend and Common Interface for Linear Solvers [DO_NOT_MERGE] Merging the Eigen Backend and Common Interface for Linear Solvers Sep 12, 2019

ProfFan commented Sep 23, 2019

#121

@ProfFan ProfFan added this to the GTSAM 4.1 milestone Sep 23, 2019
@dellaert (Member)

@ProfFan I propose you do a PR "Linearize straight to sparse matrix" to this branch.


ProfFan commented May 31, 2020

I'll add some results from my optimization efforts:

Merging the sparse matrix generation

Contrary to initial projections, the speedup gained is minimal. It appears that sparse matrix generation with Eigen's setFromTriplets is already heavily optimized internally, so manual tuning is possible but does not guarantee better performance.

Profiling

I profiled the timeSFMBAL script with the following results:

(profiling screenshot omitted)

Flamegraph: https://storage.cloud.google.com/pastebin-bucket/flamegraph-gtsam-eigen.svg

It turns out that the major overheads are:

  1. Sparse matrix multiplication AT * A (25%)
  2. Eigen's LDLT Cholesky (22%) computes an ordering even when it is just the natural (identity) ordering

Nevertheless, the current setup allows us to have other backends as well, so I will start experimenting with SuiteSparse and cuSPARSE to see if there are any benefits.


ProfFan commented May 31, 2020

Fixed a bug in the sparse Eigen solver, now the result:

/Users/proffan/Projects/Development/CV/SLAM/gtsam_build/timing/timeSFMBAL
native-profiler-starter: waiting for profiler...
native-profiler: starting target executable itself...
Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    6.35e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    6.01e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    5.97e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    5.96e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    6.35e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    6.16e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    6.06e+00
-Total: 0 CPU (0 times, 0 wall, 31.23 children, min: 0 max: 0)
|   -optimize: 67.24 CPU (1 times, 67.5212 wall, 31.23 children, min: 67.24 max: 67.24)
|   |   -EigenOptimizer obtainSparseMatrix: 10.64 CPU (7 times, 10.6967 wall, 10.64 children, min: 10.64 max: 10.64)
|   |   |   -GaussianFactorGraph sparseJacobian: 5.8 CPU (7 times, 5.82378 wall, 5.8 children, min: 5.8 max: 5.8)
|   |   |   -EigenOptimizer convertSparse: 4.84 CPU (7 times, 4.87271 wall, 4.84 children, min: 4.84 max: 4.84)
|   |   -EigenOptimizer optimizeEigenCholesky create solver: 19.76 CPU (7 times, 19.8424 wall, 19.76 children, min: 19.76 max: 19.76)
|   |   -EigenOptimizer optimizeEigenCholesky solve: 0.83 CPU (7 times, 0.808024 wall, 0.83 children, min: 0.83 max: 0.83)

Compared to SEQUENTIAL_CHOLESKY:

-Total: 0 CPU (0 times, 0 wall, 77.28 children, min: 0 max: 0)
|   -optimize: 77.28 CPU (1 times, 77.4759 wall, 77.28 children, min: 77.28 max: 77.28)

We are 14% faster with Eigen's solver.

Flamegraph: https://storage.googleapis.com/pastebin-bucket/flamegraph_gtsam_eigen_solver_v2.svg


ProfFan commented Jun 1, 2020

This commit gains a further 1 s timing advantage:

-Total: 0 CPU (0 times, 0 wall, 12.26 children, min: 0 max: 0)
|   -optimize: 66.02 CPU (1 times, 66.1306 wall, 12.26 children, min: 66.02 max: 66.02)
|   |   -EigenOptimizer optimizeEigenCholesky: 32.71 CPU (7 times, 32.7449 wall, 12.26 children, min: 32.71 max: 32.71)
|   |   |   -EigenOptimizer optimizeEigenCholesky create solver: 9.99 CPU (7 times, 9.99557 wall, 9.99 children, min: 9.99 max: 9.99)
|   |   |   -EigenOptimizer optimizeEigenCholesky solve: 2.27 CPU (7 times, 2.25941 wall, 2.27 children, min: 2.27 max: 2.27)

However, profiling shows that further improvement is unlikely under the Eigen Sparse framework.

@dellaert (Member) left a comment

> Eigen's LDLT Cholesky (22%) computes an ordering even when it is just the natural (identity) ordering

It is crucial that Eigen does not compute its own ordering. If needed (really!!??), we need to make a copy of Eigen's LDL that does not. We cannot compare any solver with any other solver if the ordering changes. And almost the whole point of GTSAM is that we control the ordering, nobody else.

* -------------------------------------------------------------------------- */

/**
* @file LinearSolver.h
Member
fix

Collaborator Author

fixed


/**
* @file LinearSolver.h
* @brief Common Interface for Linear Solvers
Member

fix

Collaborator Author

fixed


ProfFan commented Jun 1, 2020

> Eigen's LDLT Cholesky (22%) computes an ordering even when it is just the natural (identity) ordering

> It is crucial that Eigen does not compute its own ordering. If needed (really!!??), we need to make a copy of Eigen's LDL that does not. We cannot compare any solver with any other solver if the ordering changes. And almost the whole point of GTSAM is that we control the ordering, nobody else.

@dellaert To clarify: what I mean is that Eigen spends some time computing the natural (identity) ordering as part of the solving process. That is an overhead, but we still control the ordering, because this (Eigen) ordering is the identity.


ProfFan commented Jun 1, 2020

I did some more tests with the Eigen solver and TBB. TBB is giving me a boost on my macOS 10.15. (1:04 without TBB, 0:54 with TBB).

If we turn on MKL and use MKL's PardisoLDLT the speed is even faster (0:48), but that is out of scope, as we cannot control the ordering with it.


ProfFan commented Jun 2, 2020

I solved the issue that Eigen is computing a "null" ordering. Now we do not have that overhead anymore (saved another 2 seconds)


dellaert commented Jun 2, 2020

> I solved the issue that Eigen is computing a "null" ordering. Now we do not have that overhead anymore (saved another 2 seconds)

Nice. Let's merge the other PR (after you get rid of O(m) mallocs :-)) and then have another chat about the timing.

@ProfFan ProfFan changed the title [DO_NOT_MERGE] Merging the Eigen Backend and Common Interface for Linear Solvers Merging the Eigen Backend and Common Interface for Linear Solvers Jun 4, 2020

ProfFan commented Jun 4, 2020

1:00 without TBB with current develop, 0:45 with TBB.


dellaert commented Jun 4, 2020

> 1:00 without TBB with current develop, 0:45 with TBB.

Not fully understanding. TBB with eigen solver, or with sequential cholesky?

Maybe make a colab that does all the timings and formats them nicely? (all currently supported linear solvers in this branch, i.e. old solvers + Eigen Cholesky and QR)


ProfFan commented Jun 7, 2020

With SuiteSparse (CHOLMOD) solver & TBB ON:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030608e+05    4.08e+06    1.00e-04     1    4.41e+00
   1  6.149065e+04    4.16e+04    3.33e-05     1    4.38e+00
   2  1.956166e+04    4.19e+04    3.33e-05     1    4.38e+00
   3  1.814774e+04    1.41e+03    1.11e-05     1    4.41e+00
   4  1.803417e+04    1.14e+02    3.98e-06     1    4.42e+00
   5  1.803390e+04    2.62e-01    1.33e-06     1    4.39e+00
   6  1.803390e+04    8.14e-05    4.42e-07     1    4.39e+00
-Total: 0 CPU (0 times, 0 wall, 2.5 children, min: 0 max: 0)
|   -optimize: 58.61 CPU (1 times, 37.785 wall, 2.5 children, min: 58.61 max: 58.61)
|   |   -SuiteSparseSolver optimizeEigenCholesky: 22.09 CPU (7 times, 22.154 wall, 2.5 children, min: 22.09 max: 22.09)
|   |   |   -SuiteSparseSolver optimizeEigenCholesky create solver: 1.31 CPU (7 times, 1.34256 wall, 1.31 children, min: 1.31 max: 1.31)
|   |   |   -SuiteSparseSolver optimizeEigenCholesky solve: 1.19 CPU (7 times, 1.18271 wall, 1.19 children, min: 1.19 max: 1.19)
LD_LIBRARY_PATH=/opt/intel/lib ../gtsam_build/timing/timeSFMBAL  57.24s user 1.82s system 153% cpu 38.480 total


dellaert commented Jun 7, 2020

> With SuiteSparse (CHOLMOD) solver & TBB ON:

Cool. I’m not totally understanding the timing numbers (wall 22s == CPU 22s so does cholmod not use multicore?)


ProfFan commented Jun 7, 2020

@dellaert The actual timing is the last line: 57.24s user 1.82s system 153% cpu 38.480 total.

Just got a mind-blowing result, I reran the timing on my Linux (same hardware), and SEQUENTIAL_CHOLESKY (GTSAM solver) beats SuiteSparse and Eigen sparse by 2x and 3x. Also on Linux the time is 10x faster than on macOS...

On my Linux:

| Solver | Time |
| --- | --- |
| SEQUENTIAL_CHOLESKY | 2.7 s |
| EIGEN_CHOLESKY | 5 s |
| SUITESPARSE_CHOLESKY | 4.79 s |


ProfFan commented Jun 7, 2020

Exactly the same iterations:

Initial error: 4.18566e+06, values: 22122
iter      cost      cost_change    lambda  success iter_time
   0  1.030740e+05    4.08e+06    1.00e-04     1    3.79e-01
   1  6.148214e+04    4.16e+04    3.33e-05     1    3.23e-01
   2  1.956152e+04    4.19e+04    3.33e-05     1    3.22e-01
   3  1.814774e+04    1.41e+03    1.11e-05     1    3.25e-01
   4  1.803417e+04    1.14e+02    3.98e-06     1    3.19e-01
   5  1.803390e+04    2.62e-01    1.33e-06     1    3.20e-01
   6  1.803390e+04    8.14e-05    4.42e-07     1    3.19e-01
-Total: 0 CPU (0 times, 0 wall, 3.2 children, min: 0 max: 0)
|   -optimize: 3.2 CPU (1 times, 2.64848 wall, 3.2 children, min: 3.2 max: 3.2)
timing/timeSFMBAL  3.13s user 0.20s system 119% cpu 2.786 total

But 20x faster...


dellaert commented Jun 7, 2020

> @dellaert The actual timing is the last line: 57.24s user 1.82s system 153% cpu 38.480 total.
>
> Just got a mind-blowing result, I reran the timing on my Linux (same hardware), and SEQUENTIAL_CHOLESKY (GTSAM solver) beats SuiteSparse and Eigen sparse by 2x and 3x. Also on Linux the time is 10x faster than on macOS...
>
> On my Linux:
>
> | Solver | Time |
> | --- | --- |
> | SEQUENTIAL_CHOLESKY | 2.7 s |
> | EIGEN_CHOLESKY | 5 s |
> | SUITESPARSE_CHOLESKY | 4.79 s |

Wow! Yeah, malloc on Mac has always been super-slow, which is one guess as to what is going on. Is this with clang or with gcc? Finally, is this with or without TBB? SEQUENTIAL_CHOLESKY does not really exploit multi-threading; for that, try MULTIFRONTAL_CHOLESKY.


dellaert commented Jun 7, 2020

> I reran the timing on my Linux (same hardware).

What do you mean? Just linux boot on your macbook?


ProfFan commented Jun 8, 2020

> I reran the timing on my Linux (same hardware).

> What do you mean? Just linux boot on your macbook?

My workstation (i7-8700K, 6c12t, 4.7GHz), dual-booted with macOS and Linux.

BTW, MULTIFRONTAL_CHOLESKY is 2.6s.

I think based on the current benchmarks it is better to get a dataset bigger than BAL to guide further efforts.


dellaert commented Jun 8, 2020

You can just use a larger BAL dataset: https://grail.cs.washington.edu/projects/bal/


ProfFan commented Jun 8, 2020

Some preliminary observations:

| Solver | Speed | Multithread | Memory |
| --- | --- | --- | --- |
| GTSAM SEQ | Fast (3x cuSparse) | No | Moderate |
| GTSAM MULT | Very fast (1.2x SEQ) | Yes | Very high |
| Eigen Sparse | Medium | No | Moderate |
| SuiteSparse | Slow (0.5x Eigen) | Yes | Moderate |
| CUDA cuSparse | Faster than Eigen by 1.5x | Yes | Moderate |

There also seems to be a bug in SEQUENTIAL_CHOLESKY, as the convergence is different from all other solvers:

Normal:

Processing: ../gtsam/examples/Data/problem-394-100368-pre.txt
Initial error: 4.54573e+06, values: 100762
iter      cost      cost_change    lambda  success iter_time
   0  3.441329e+05    4.20e+06    1.00e-04     1    1.95e+01
   1  3.016591e+05    4.25e+04    3.33e-05     1    2.13e+01
   2  2.978071e+05    3.85e+03    1.18e-05     1    1.76e+01
   3  2.977714e+05    3.57e+01    3.92e-06     1    1.87e+01

SEQUENTIAL_CHOLESKY:

Processing: ../gtsam/examples/Data/problem-394-100368-pre.txt
Initial error: 4.54573e+06, values: 100762
iter      cost      cost_change    lambda  success iter_time
   0      inf    6.94e-310    1.00e-04     0    1.72e+00
iter      cost      cost_change    lambda  success iter_time
   0      inf    6.94e-310    2.00e-04     0    1.83e+00
iter      cost      cost_change    lambda  success iter_time
   0      inf    6.94e-310    8.00e-04     0    1.82e+00
^C

dellaert and others added 27 commits January 16, 2023 15:33
@ProfFan ProfFan closed this Jan 20, 2023
@varunagrawal varunagrawal deleted the feature/fan/Eigen branch October 22, 2023 19:55
5 participants