[WIP] recursive, depth-first optimizer (#73)
* initial recursive optimal

* add recursive greedy implementation

* tidy up and replace optimal/greedy


update docs

* implement memory_limit for optimal & branch

* docs and tweaks

* update docs, add 'branch-1' to 'auto' mode

* clean up docstrings and function signatures
jcmgray authored and dgasmith committed Nov 28, 2018
1 parent 9d395da commit e134977
Showing 15 changed files with 584 additions and 355 deletions.
3 changes: 2 additions & 1 deletion docs/source/api.rst
@@ -9,8 +9,9 @@ Function Reference
opt_einsum.contract_path
opt_einsum.contract_expression
opt_einsum.contract.ContractExpression
opt_einsum.contract.PathInfo
opt_einsum.paths.optimal
opt_einsum.paths.greedy
opt_einsum.paths.branch
opt_einsum.parser.get_symbol
opt_einsum.sharing.shared_intermediates
42 changes: 42 additions & 0 deletions docs/source/branching_path.rst
@@ -0,0 +1,42 @@
==================
The Branching Path
==================

While the ``optimal`` path is guaranteed to find the smallest estimated FLOP
cost, it spends a lot of time exploring paths which are unlikely to be
optimal. For instance, outer products are usually not advantageous unless
absolutely necessary. Additionally, by trying a 'good' path first, it should
be possible to quickly establish a threshold FLOP cost which can then be used
to prune many bad paths.

The **branching** strategy (provided by :func:`~opt_einsum.paths.branch`) does
this by taking the recursive, depth-first approach of
:func:`~opt_einsum.paths.optimal`, whilst also sorting potential contractions
based on a heuristic cost, as in :func:`~opt_einsum.paths.greedy`.

There are two main flavours:

- ``optimize='branch-all'``: explore **all** inner products, starting with
those that look best according to the cost heuristic.
- ``optimize='branch-2'``: similar, but at each step only explore the
  estimated best **two** possible contractions, leading to a maximum of
  ``2^N`` paths assessed.
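
Both flavours are requested via the ``optimize`` keyword. A minimal usage
sketch (the equation and array shapes here are arbitrary, chosen just for
illustration):

.. code:: python

    import numpy as np
    import opt_einsum as oe

    # a small ring contraction of six matrices
    eq = 'ab,bc,cd,de,ef,fa->'
    arrays = [np.random.rand(8, 8) for _ in range(6)]

    # explore all inner products, best-looking first
    path, info = oe.contract_path(eq, *arrays, optimize='branch-all')

    # only follow the two most promising options at each step
    path2, info2 = oe.contract_path(eq, *arrays, optimize='branch-2')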

In both cases, :func:`~opt_einsum.paths.branch` takes an active approach to
pruning paths well before they hit the best *total* FLOP count, by comparing
them to the FLOP count (times some factor) achieved by the best path at the
same point in the contraction.

There is also ``'branch-1'``, which, since it only explores a single path at
each step, does not really 'branch' -- this is essentially the approach of
``'greedy'``.
In comparison, ``'branch-1'`` will be slower for large expressions, but for
small to medium expressions it might find slightly higher quality contractions
due to considering individual FLOP costs at each step.

The default ``optimize='auto'`` mode of ``opt_einsum`` will use
``'branch-all'`` for 5 or 6 tensors, though it should be able to handle
12-13 tensors in a matter of seconds. Likewise, ``'branch-2'`` will be used
for 7 or 8 tensors, though it should be able to handle 20-22 tensors in a
matter of seconds. Finally, ``'branch-1'`` will be used by ``'auto'`` for
expressions of up to 14 tensors.
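
As a quick sketch, ``'branch-1'`` is selected in exactly the same way as the
other strategies (reusing the hypothetical ``eq`` and ``arrays`` from the
example above):

.. code:: python

    # follow only the single most promising option at each step
    path, info = oe.contract_path(eq, *arrays, optimize='branch-1')
    print(info)  # summary, including the estimated FLOP count
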
13 changes: 0 additions & 13 deletions docs/source/eager_path.rst

This file was deleted.

45 changes: 36 additions & 9 deletions docs/source/greedy_path.rst
@@ -2,21 +2,41 @@
The Greedy Path
===============

The ``'greedy'`` approach provides a very efficient strategy for finding
contraction paths for expressions with large numbers of tensors.
It does this by eagerly choosing contractions in three stages:

1. Eagerly compute any **Hadamard** products (in arbitrary order -- this is
commutative).
2. Greedily contract pairs of remaining tensors, at each step choosing the
pair that maximizes ``reduced_size`` -- these are generally **inner**
products.
3. Greedily compute any pairwise **outer** products, at each step choosing
the pair that minimizes ``sum(input_sizes)``.

The cost heuristic ``reduced_size`` is simply the combined size of the pair of
tensors to be contracted, minus the size of the resulting tensor.
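
As a rough sketch, the heuristic for a single candidate pair might be computed
as follows -- note these helpers are illustrative only, not the actual
opt_einsum internals (``size_dict`` is assumed to map each index to its
dimension):

.. code:: python

    from functools import reduce
    from operator import mul

    def term_size(indices, size_dict):
        # number of elements in a tensor with these indices
        return reduce(mul, (size_dict[i] for i in indices), 1)

    def reduced_size(inds_a, inds_b, inds_out, size_dict):
        # combined size of the two inputs minus the size of the result
        return (term_size(inds_a, size_dict)
                + term_size(inds_b, size_dict)
                - term_size(inds_out, size_dict))

    # e.g. a matrix multiplication 'ab,bc->ac' with all dimensions 10:
    >>> reduced_size('ab', 'bc', 'ac', {'a': 10, 'b': 10, 'c': 10})
    100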

The ``greedy`` algorithm has space and time complexity ``O(n * k)`` where ``n``
is the number of input tensors and ``k`` is the maximum number of tensors that
share any dimension (excluding dimensions that occur in the output or in every
tensor). As such, the algorithm scales well to very large sparse contractions
of low-rank tensors, and indeed, often finds the optimal, or close to optimal
path in such cases.

The ``greedy`` functionality is provided by :func:`~opt_einsum.paths.greedy`,
and is selected by the default ``optimize='auto'`` mode of ``opt_einsum`` for
expressions with many inputs. Expressions of up to a thousand tensors should
still take well under a second to find paths for.
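
For instance, a quick sketch of finding a path for a chain of ten matrix
multiplications (the shapes here are arbitrary):

.. code:: python

    import string

    import numpy as np
    import opt_einsum as oe

    # 'ab,bc,cd,...,jk->ak' - a chain of ten matrices
    letters = string.ascii_lowercase
    eq = ','.join(letters[i:i + 2] for i in range(10)) + '->ak'
    arrays = [np.random.rand(16, 16) for _ in range(10)]

    # 'path' will be a list of nine pairwise contractions
    path, info = oe.contract_path(eq, *arrays, optimize='greedy')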


Optimal Scaling Misses
----------------------

The greedy algorithm, while inexpensive, can occasionally miss optimal scaling
in some circumstances, as seen below. The ``greedy`` algorithm prioritizes
expressions which remove the largest indices first; in this particular case
that is the incorrect choice, and it is difficult for any heuristic algorithm
to "see ahead" as would be needed here.

It should be stressed that these cases are quite rare and, by default,
``contract`` uses the ``optimal`` path for four and fewer inputs, as the cost
of evaluating the ``optimal`` path is similar to that of the ``greedy`` path.
Similarly, for 5-8 inputs, ``contract`` uses one of the branching strategies,
which can find higher quality paths.

.. code:: python
@@ -57,3 +77,10 @@ It should be stressed these cases are quite rare and by default ``contract`` use
4 False xtf,xyf->tfy ytpf,fr,tfy->tpr
4 False tfy,ytpf->tfp fr,tfp->tpr
4 TDOT tfp,fr->tpr tpr->tpr
So we can see that the ``greedy`` algorithm finds a path which is about 16
times slower than the ``optimal`` one. In such cases, it might be worth using
one of the more exhaustive optimization strategies: ``'optimal'``,
``'branch-all'`` or ``'branch-2'`` (all of which will find the optimal path in
this example).
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -134,8 +134,8 @@ Table of Contents

path_finding
optimal_path
branching_path
greedy_path

.. toctree::
:maxdepth: 1
13 changes: 8 additions & 5 deletions docs/source/optimal_path.rst
@@ -3,11 +3,14 @@ The Optimal Path
================

The optimal path can be found by searching through every possible way to contract the tensors together; this includes all combinations with the new intermediate tensors as well.
While this algorithm scales like N! and can often become more costly to
compute than the unoptimized contraction itself, it provides an excellent
benchmark.
The function that computes this path in opt_einsum is called
:func:`~opt_einsum.paths.optimal` and works by performing a recursive,
depth-first search. By keeping track of the best path found so far, in terms
of total estimated FLOP count, the search can quickly prune many paths as
soon as they exceed this best cost.
This optimal strategy is used by default with the ``optimize='auto'`` mode of
``opt_einsum`` for 4 tensors or fewer, though it can handle expressions of up
to 9-10 tensors in a matter of seconds.
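
To make the idea concrete, here is a heavily simplified, self-contained
sketch of such a depth-first search with cost-based pruning. It is an
illustration only -- the cost model is crude and this is *not* the actual
:func:`~opt_einsum.paths.optimal` implementation:

.. code:: python

    from itertools import combinations
    from math import inf, prod

    def flop_cost(a, b, size):
        # crude estimate: product of the sizes of all indices involved
        return prod(size[i] for i in set(a) | set(b))

    def new_term(a, b, others, output):
        # indices surviving a pairwise contraction: those that still
        # appear elsewhere (in the remaining terms or the final output)
        kept = set(''.join(others)) | set(output)
        return ''.join(i for i in sorted(set(a) | set(b)) if i in kept)

    def search(terms, size, output, cost=0, path=(), best=None):
        best = {'flops': inf, 'path': ()} if best is None else best
        if len(terms) == 1:
            if cost < best['flops']:
                best['flops'], best['path'] = cost, path
            return best
        for i, j in combinations(range(len(terms)), 2):
            new_cost = cost + flop_cost(terms[i], terms[j], size)
            if new_cost >= best['flops']:
                continue  # prune: already worse than the best full path
            rest = [t for k, t in enumerate(terms) if k not in (i, j)]
            term = new_term(terms[i], terms[j], rest, output)
            search(rest + [term], size, output, new_cost,
                   path + ((i, j),), best)
        return best

    # e.g. the cheapest order for a small matrix chain:
    best = search(['ab', 'bc', 'cd'], {c: 8 for c in 'abcd'}, 'ad')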


Let us look at an example:

144 changes: 120 additions & 24 deletions docs/source/path_finding.rst
@@ -1,6 +1,37 @@
============
Introduction
============

Performing an optimized tensor contraction to speed up ``einsum`` involves two
key stages:

1. Finding a pairwise contraction order, or **'path'**.
2. Performing the sequence of contractions given this path.

The better the quality of path found in the first step, the quicker the actual
contraction in the second step can be -- often dramatically. However, finding
the *optimal* path is an NP-hard problem that can quickly become intractable,
meaning that a balance must be struck between the time spent finding a path,
and its quality. :mod:`opt_einsum` handles this by using several path finding
algorithms, which can be manually specified using the ``optimize`` keyword.
These are:

* The ``'optimal'`` strategy - an exhaustive search of all possible paths
* The ``'branch'`` strategy - a more restricted search of many likely paths
* The ``'greedy'`` strategy - finds a path one step at a time using a cost
heuristic

By default (``optimize='auto'``), :func:`~opt_einsum.contract` will select the
best of these it can while aiming to keep path finding times below around 1ms.
An analysis of each strategy's performance can be found at the bottom of this
page.

Additionally, if you want to find the path separately from performing the
contraction, or just want to inspect information about the path found, you can
use the function :func:`~opt_einsum.contract_path`.


Examining the Path
~~~~~~~~~~~~~~~~~~

As an example, consider the following expression found in perturbation theory (one of ~5,000 such expressions):

@@ -26,45 +57,52 @@ This is a single possible path to the final answer (and notably, not the most op
einsum_string = 'bdik,acaj,ikab,ajac,ikbd->'
# Build random views to represent this contraction
unique_inds = set(einsum_string) - {',', '-', '>'}
index_size = [10, 17, 9, 10, 13, 16, 15, 14, 12]
sizes_dict = dict(zip(unique_inds, index_size))
views = oe.helpers.build_views(einsum_string, sizes_dict)
path, path_info = oe.contract_path(einsum_string, *views)
>>> print(path)
[(0, 4), (1, 3), (0, 1), (0, 1)]
>>> print(path_info)
Complete contraction: bdik,acaj,ikab,ajac,ikbd->
Naive scaling: 7
Optimized scaling: 4
Naive FLOP count: 2.387e+8
Optimized FLOP count: 8.068e+4
Theoretical speedup: 2958.354
Largest intermediate: 1.530e+3 elements
--------------------------------------------------------------------------------
scaling BLAS current remaining
--------------------------------------------------------------------------------
4 0 ikbd,bdik->ikb acaj,ikab,ajac,ikb->
4 GEMV/EINSUM ikb,ikab->a acaj,ajac,a->
3 0 ajac,acaj->a a,a->
1 DOT a,a-> ->
We can then check that actually performing the contraction produces the expected result:

.. code:: python
import numpy as np
einsum_result = np.einsum("bdik,acaj,ikab,ajac,ikbd->", *views)
contract_result = oe.contract("bdik,acaj,ikab,ajac,ikbd->", *views)
>>> np.allclose(einsum_result, contract_result)
True
By contracting terms in the correct order we can see that this expression can be computed with N^4 scaling. Even with the overhead of finding the best order or 'path' and small dimensions,
``opt_einsum`` is roughly 3000 times faster than pure einsum for this expression.


Format of the Path
~~~~~~~~~~~~~~~~~~

Let us look at the structure of a canonical ``einsum`` path found in NumPy and its optimized variant:

.. code:: python
@@ -75,7 +113,7 @@ Let us look at the structure of a canonical ``einsum`` path found in NumPy and i
In opt_einsum each element of the list represents a single contraction.
In the above example the ``einsum_path`` would effectively compute the result
as a single contraction identical to that of ``einsum``, while the
``opt_path`` would perform four contractions in order to reduce the overall
scaling.
The first tuple in the ``opt_path``, ``(1, 3)``, pops the second and fourth
terms, then contracts them together to produce a new term which is appended
to the list of terms; this is continued until all terms are contracted.
An example should illuminate this:

.. code:: python
@@ -91,3 +129,61 @@ An example should illuminate this:
4 False bik,ikab->a a,a->
terms = ['a', 'a'] contraction = (0, 1)
1 DOT a,a-> ->
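
To make the pop-and-append mechanics concrete, here is a minimal sketch that
replays the path over the list of terms (the ``'?'`` placeholder stands in
for the intermediate indices that the real algorithm computes at each step):

.. code:: python

    terms = ['bdik', 'acaj', 'ikab', 'ajac', 'ikbd']
    opt_path = [(1, 3), (0, 2), (0, 2), (0, 1)]

    for i, j in opt_path:
        # pop the higher position first so the lower one stays valid
        term_b = terms.pop(max(i, j))
        term_a = terms.pop(min(i, j))
        # the real algorithm computes the surviving indices here
        terms.append('?')
        print((i, j), ':', term_a, term_b, '->', terms)
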
A path specified in this format can be supplied directly to
:func:`~opt_einsum.contract` using the ``optimize`` keyword:

.. code:: python
contract_result = oe.contract("bdik,acaj,ikab,ajac,ikbd->", *views, optimize=opt_path)
>>> np.allclose(einsum_result, contract_result)
True
Performance Comparison
~~~~~~~~~~~~~~~~~~~~~~

The following graphs should give some indication of the tradeoffs between path
finding time and path quality. They are generated by finding paths with each
possible algorithm for many randomly generated networks of ``n`` tensors with
varying connectivity.

First we have the time to find each path as a function of the number of terms
in the expression:

.. image:: path_finding_time.png
:width: 750

Clearly the exhaustive (``'optimal'``, ``'branch-all'``) and exponential
(``'branch-2'``) searches eventually scale badly, but for modest numbers of
terms they incur only a small overhead.

Next we can look at the average FLOP speedup (as compared to the easiest path
to find, ``'greedy'``):

.. image:: path_found_flops.png
:width: 750


One can see that the hierarchy of path qualities is:

1. ``'optimal'`` (used by auto for ``n <= 4``)
2. ``'branch-all'`` (used by auto for ``n <= 6``)
3. ``'branch-2'`` (used by auto for ``n <= 8``)
4. ``'branch-1'`` (used by auto for ``n <= 14``)
5. ``'greedy'`` (used by auto for anything larger)

There are a few important caveats to note with this graph. Firstly, the
benefits of more advanced path finding are very dependent on the complexity of
the expression. For 'simple' contractions, all the different approaches will
*mostly* find the same path (as here). However, for 'tricky' contractions, there
will be certain cases where the more advanced algorithms will find much better
paths. As such, while this graph gives a good idea of the *relative* performance
of each algorithm, the 'average speedup' is not a perfect indicator since
worst-case performance might be more critical.

Note that the speedups for any of the methods as compared to a standard
``einsum`` or a naively chosen path (such as ``path=[(0, 1), (0, 1), ...]``)
are all exponentially large and not shown.
Binary file added docs/source/path_finding_time.png
Binary file added docs/source/path_found_flops.png
3 changes: 2 additions & 1 deletion docs/source/reusing_paths.rst
@@ -96,7 +96,8 @@ two contractions to compute the output.
.. note::

The constant part of an expression is lazily generated upon the first call
(specific to each backend), though it can also be explicitly built by calling
:meth:`~opt_einsum.contract.ContractExpression.evaluate_constants`.

We can confirm the advantage of using expressions and constants by timing the
following scenarios, first setting
