[WIP] recursive, depth-first optimizer (#73)
* initial recursive optimal

* add recursive greedy implementation

* tidy up and replace optimal/greedy


update docs

* implement memory_limit for optimal & branch

* docs and tweaks

* update docs, add 'branch-1' to 'auto' mode

* clean up docstrings and function signatures
jcmgray authored and dgasmith committed Nov 28, 2018
1 parent 9d395da commit e134977
Showing 15 changed files with 584 additions and 355 deletions.
3 changes: 2 additions & 1 deletion docs/source/api.rst
@@ -9,8 +9,9 @@ Function Reference
opt_einsum.contract_path
opt_einsum.contract_expression
opt_einsum.contract.ContractExpression
opt_einsum.contract.PathInfo
opt_einsum.paths.optimal
opt_einsum.paths.greedy
opt_einsum.paths.branch
opt_einsum.parser.get_symbol
opt_einsum.sharing.shared_intermediates
42 changes: 42 additions & 0 deletions docs/source/branching_path.rst
@@ -0,0 +1,42 @@
==================
The Branching Path
==================

While the ``optimal`` path is guaranteed to find the smallest estimated FLOP
cost, it spends a lot of time exploring paths which are unlikely to be
optimal. For instance, outer products are usually not advantageous unless
absolutely necessary. Additionally, by trying a 'good' path first, it should
be possible to quickly establish a threshold FLOP cost which can then be used
to prune many bad paths.

The **branching** strategy (provided by :func:`~opt_einsum.paths.branch`) does
this by taking the recursive, depth-first approach of
:func:`~opt_einsum.paths.optimal`, whilst also sorting potential contractions
based on a heuristic cost, as in :func:`~opt_einsum.paths.greedy`.

There are two main flavours:

- ``optimize='branch-all'``: explore **all** inner products, starting with
those that look best according to the cost heuristic.
- ``optimize='branch-2'``: similar, but at each step only explore the
  estimated best **two** possible contractions, leading to a maximum of
  ``2^N`` paths assessed.
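
Both flavours are requested via the ``optimize`` keyword. A minimal usage
sketch (the equation and array shapes here are arbitrary, chosen just for
illustration):

.. code:: python

    import numpy as np
    import opt_einsum as oe

    # a small ring contraction of six matrices
    eq = 'ab,bc,cd,de,ef,fa->'
    arrays = [np.random.rand(8, 8) for _ in range(6)]

    # explore all inner products, best-looking first
    path, info = oe.contract_path(eq, *arrays, optimize='branch-all')

    # only follow the two most promising options at each step
    path2, info2 = oe.contract_path(eq, *arrays, optimize='branch-2')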

In both cases, :func:`~opt_einsum.paths.branch` takes an active approach to
pruning paths well before they hit the best *total* FLOP count, by comparing
them to the FLOP count (times some factor) achieved by the best path at the
same point in the contraction.

There is also ``'branch-1'``, which, since it only explores a single path at
each step, does not really 'branch' -- this is essentially the approach of
``'greedy'``.
In comparison, ``'branch-1'`` will be slower for large expressions, but for
small to medium expressions it might find slightly higher quality contractions
due to considering individual FLOP costs at each step.

The default ``optimize='auto'`` mode of ``opt_einsum`` will use
``'branch-all'`` for 5 or 6 tensors, though it should be able to handle
12-13 tensors in a matter of seconds. Likewise, ``'branch-2'`` will be used
for 7 or 8 tensors, though it should be able to handle 20-22 tensors in a
matter of seconds. Finally, ``'branch-1'`` will be used by ``'auto'`` for
expressions of up to 14 tensors.
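
As a quick sketch, ``'branch-1'`` is selected in exactly the same way as the
other strategies (reusing the hypothetical ``eq`` and ``arrays`` from the
example above):

.. code:: python

    # follow only the single most promising option at each step
    path, info = oe.contract_path(eq, *arrays, optimize='branch-1')
    print(info)  # summary, including the estimated FLOP count
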
13 changes: 0 additions & 13 deletions docs/source/eager_path.rst

This file was deleted.

45 changes: 36 additions & 9 deletions docs/source/greedy_path.rst
@@ -2,21 +2,41 @@
The Greedy Path
===============

The ``'greedy'`` approach provides a very efficient strategy for finding
contraction paths for expressions with large numbers of tensors.
It does this by eagerly choosing contractions in three stages:

1. Eagerly compute any **Hadamard** products (in arbitrary order -- this is
commutative).
2. Greedily contract pairs of remaining tensors, at each step choosing the
pair that maximizes ``reduced_size`` -- these are generally **inner**
products.
3. Greedily compute any pairwise **outer** products, at each step choosing
the pair that minimizes ``sum(input_sizes)``.

The cost heuristic ``reduced_size`` is simply the combined size of the pair of
tensors to be contracted, minus the size of the resulting tensor.
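
As a rough sketch, the heuristic for a single candidate pair might be computed
as follows -- note these helpers are illustrative only, not the actual
opt_einsum internals (``size_dict`` is assumed to map each index to its
dimension):

.. code:: python

    from functools import reduce
    from operator import mul

    def term_size(indices, size_dict):
        # number of elements in a tensor with these indices
        return reduce(mul, (size_dict[i] for i in indices), 1)

    def reduced_size(inds_a, inds_b, inds_out, size_dict):
        # combined size of the two inputs minus the size of the result
        return (term_size(inds_a, size_dict)
                + term_size(inds_b, size_dict)
                - term_size(inds_out, size_dict))

    # e.g. a matrix multiplication 'ab,bc->ac' with all dimensions 10:
    >>> reduced_size('ab', 'bc', 'ac', {'a': 10, 'b': 10, 'c': 10})
    100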

The ``greedy`` algorithm has space and time complexity ``O(n * k)`` where ``n``
is the number of input tensors and ``k`` is the maximum number of tensors that
share any dimension (excluding dimensions that occur in the output or in every
tensor). As such, the algorithm scales well to very large sparse contractions
of low-rank tensors, and indeed, often finds the optimal, or close to optimal
path in such cases.

The ``greedy`` functionality is provided by :func:`~opt_einsum.paths.greedy`,
and is selected by the default ``optimize='auto'`` mode of ``opt_einsum`` for
expressions with many inputs. Expressions of up to a thousand tensors should
still take well under a second to find paths for.
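
For instance, a quick sketch of finding a path for a chain of ten matrix
multiplications (the shapes here are arbitrary):

.. code:: python

    import string

    import numpy as np
    import opt_einsum as oe

    # 'ab,bc,cd,...,jk->ak' - a chain of ten matrices
    letters = string.ascii_lowercase
    eq = ','.join(letters[i:i + 2] for i in range(10)) + '->ak'
    arrays = [np.random.rand(16, 16) for _ in range(10)]

    # 'path' will be a list of nine pairwise contractions
    path, info = oe.contract_path(eq, *arrays, optimize='greedy')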


Optimal Scaling Misses
----------------------

The greedy algorithm, while inexpensive, can occasionally miss optimal scaling
in some circumstances, as seen below. The ``greedy`` algorithm prioritizes
expressions which remove the largest indices first; in this particular case
that is the incorrect choice, and it is difficult for any heuristic algorithm
to "see ahead" as would be needed here.

It should be stressed that these cases are quite rare and, by default,
``contract`` uses the ``optimal`` path for four and fewer inputs, as the cost
of evaluating the ``optimal`` path is similar to that of the ``greedy`` path.
Similarly, for 5-8 inputs, ``contract`` uses one of the branching strategies,
which can find higher quality paths.

.. code:: python
@@ -57,3 +77,10 @@ It should be stressed these cases are quite rare and by default ``contract`` use
4 False xtf,xyf->tfy ytpf,fr,tfy->tpr
4 False tfy,ytpf->tfp fr,tfp->tpr
4 TDOT tfp,fr->tpr tpr->tpr
So we can see that the ``greedy`` algorithm finds a path which is about 16
times slower than the ``optimal`` one. In such cases, it might be worth using
one of the more exhaustive optimization strategies: ``'optimal'``,
``'branch-all'`` or ``'branch-2'`` (all of which will find the optimal path in
this example).
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -134,8 +134,8 @@ Table of Contents

path_finding
optimal_path
branching_path
greedy_path

.. toctree::
:maxdepth: 1
13 changes: 8 additions & 5 deletions docs/source/optimal_path.rst
@@ -3,11 +3,14 @@ The Optimal Path
================

The optimal path can be found by searching through every possible way to contract the tensors together; this includes all combinations with the new intermediate tensors as well.
While this algorithm scales like N! and can often become more costly to
compute than the unoptimized contraction itself, it provides an excellent
benchmark.
The function that computes this path in opt_einsum is called
:func:`~opt_einsum.paths.optimal` and works by performing a recursive,
depth-first search. By keeping track of the best path found so far, in terms
of total estimated FLOP count, the search can quickly prune many paths as
soon as they exceed this best cost.
This optimal strategy is used by default with the ``optimize='auto'`` mode of
``opt_einsum`` for 4 tensors or fewer, though it can handle expressions of up
to 9-10 tensors in a matter of seconds.
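
To make the idea concrete, here is a heavily simplified, self-contained
sketch of such a depth-first search with cost-based pruning. It is an
illustration only -- the cost model is crude and this is *not* the actual
:func:`~opt_einsum.paths.optimal` implementation:

.. code:: python

    from itertools import combinations
    from math import inf, prod

    def flop_cost(a, b, size):
        # crude estimate: product of the sizes of all indices involved
        return prod(size[i] for i in set(a) | set(b))

    def new_term(a, b, others, output):
        # indices surviving a pairwise contraction: those that still
        # appear elsewhere (in the remaining terms or the final output)
        kept = set(''.join(others)) | set(output)
        return ''.join(i for i in sorted(set(a) | set(b)) if i in kept)

    def search(terms, size, output, cost=0, path=(), best=None):
        best = {'flops': inf, 'path': ()} if best is None else best
        if len(terms) == 1:
            if cost < best['flops']:
                best['flops'], best['path'] = cost, path
            return best
        for i, j in combinations(range(len(terms)), 2):
            new_cost = cost + flop_cost(terms[i], terms[j], size)
            if new_cost >= best['flops']:
                continue  # prune: already worse than the best full path
            rest = [t for k, t in enumerate(terms) if k not in (i, j)]
            term = new_term(terms[i], terms[j], rest, output)
            search(rest + [term], size, output, new_cost,
                   path + ((i, j),), best)
        return best

    # e.g. the cheapest order for a small matrix chain:
    best = search(['ab', 'bc', 'cd'], {c: 8 for c in 'abcd'}, 'ad')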


Let us look at an example:

144 changes: 120 additions & 24 deletions docs/source/path_finding.rst
@@ -1,6 +1,37 @@
============
Introduction
============

Performing an optimized tensor contraction to speed up ``einsum`` involves two
key stages:

1. Finding a pairwise contraction order, or **'path'**.
2. Performing the sequence of contractions given this path.

The better the quality of path found in the first step, the quicker the actual
contraction in the second step can be -- often dramatically. However, finding
the *optimal* path is an NP-hard problem that can quickly become intractable,
meaning that a balance must be struck between the time spent finding a path,
and its quality. :mod:`opt_einsum` handles this by using several path finding
algorithms, which can be manually specified using the ``optimize`` keyword.
These are:

* The ``'optimal'`` strategy - an exhaustive search of all possible paths
* The ``'branch'`` strategy - a more restricted search of many likely paths
* The ``'greedy'`` strategy - finds a path one step at a time using a cost
heuristic

By default (``optimize='auto'``), :func:`~opt_einsum.contract` will select the
best of these it can while aiming to keep path finding times below around 1ms.
An analysis of each strategy's performance can be found at the bottom of this
page.

Additionally, if you want to find the path separately from performing the
contraction, or just want to inspect information about the path found, you can
use the function :func:`~opt_einsum.contract_path`.


Examining the Path
~~~~~~~~~~~~~~~~~~

As an example, consider the following expression found in perturbation theory (one of ~5,000 such expressions):

@@ -26,45 +57,52 @@ This is a single possible path to the final answer (and notably, not the most op
einsum_string = 'bdik,acaj,ikab,ajac,ikbd->'
# Build random views to represent this contraction
unique_inds = set(einsum_string) - {',', '-', '>'}
index_size = [10, 17, 9, 10, 13, 16, 15, 14, 12]
sizes_dict = dict(zip(unique_inds, index_size))
views = oe.helpers.build_views(einsum_string, sizes_dict)
path, path_info = oe.contract_path(einsum_string, *views)
>>> print(path)
[(0, 4), (1, 3), (0, 1), (0, 1)]
>>> print(path_info)
Complete contraction: bdik,acaj,ikab,ajac,ikbd->
Naive scaling: 7
Optimized scaling: 4
Naive FLOP count: 2.387e+8
Optimized FLOP count: 8.068e+4
Theoretical speedup: 2958.354
Largest intermediate: 1.530e+3 elements
--------------------------------------------------------------------------------
scaling BLAS current remaining
--------------------------------------------------------------------------------
4 0 ikbd,bdik->ikb acaj,ikab,ajac,ikb->
4 GEMV/EINSUM ikb,ikab->a acaj,ajac,a->
3 0 ajac,acaj->a a,a->
1 DOT a,a-> ->
We can then check that actually performing the contraction produces the expected result:

.. code:: python
import numpy as np
einsum_result = np.einsum("bdik,acaj,ikab,ajac,ikbd->", *views)
contract_result = oe.contract("bdik,acaj,ikab,ajac,ikbd->", *views)
>>> np.allclose(einsum_result, contract_result)
True
By contracting terms in the correct order we can see that this expression can be computed with N^4 scaling. Even with the overhead of finding the best order or 'path' and small dimensions,
``opt_einsum`` is roughly 3000 times faster than pure einsum for this expression.


Format of the Path
~~~~~~~~~~~~~~~~~~

Let us look at the structure of a canonical ``einsum`` path found in NumPy and its optimized variant:

.. code:: python
@@ -75,7 +113,7 @@ Let us look at the structure of a canonical ``einsum`` path found in NumPy and i
In opt_einsum each element of the list represents a single contraction.
In the above example the ``einsum_path`` would effectively compute the result
as a single contraction identical to that of ``einsum``, while the
``opt_path`` would perform four contractions in order to reduce the overall
scaling.
The first tuple in the ``opt_path``, ``(1, 3)``, pops the second and fourth
terms, then contracts them together to produce a new term which is appended
to the list of terms; this is continued until all terms are contracted.
An example should illuminate this:

.. code:: python
@@ -91,3 +129,61 @@ An example should illuminate this:
4 False bik,ikab->a a,a->
terms = ['a', 'a'] contraction = (0, 1)
1 DOT a,a-> ->
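
To make the pop-and-append mechanics concrete, here is a minimal sketch that
replays the path over the list of terms (the ``'?'`` placeholder stands in
for the intermediate indices that the real algorithm computes at each step):

.. code:: python

    terms = ['bdik', 'acaj', 'ikab', 'ajac', 'ikbd']
    opt_path = [(1, 3), (0, 2), (0, 2), (0, 1)]

    for i, j in opt_path:
        # pop the higher position first so the lower one stays valid
        term_b = terms.pop(max(i, j))
        term_a = terms.pop(min(i, j))
        # the real algorithm computes the surviving indices here
        terms.append('?')
        print((i, j), ':', term_a, term_b, '->', terms)
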
A path specified in this format can be supplied directly to
:func:`~opt_einsum.contract` using the ``optimize`` keyword:

.. code:: python
contract_result = oe.contract("bdik,acaj,ikab,ajac,ikbd->", *views, optimize=opt_path)
>>> np.allclose(einsum_result, contract_result)
True
Performance Comparison
~~~~~~~~~~~~~~~~~~~~~~

The following graphs should give some indication of the tradeoffs between path
finding time and path quality. They are generated by finding paths with each
possible algorithm for many randomly generated networks of ``n`` tensors with
varying connectivity.

First we have the time to find each path as a function of the number of terms
in the expression:

.. image:: path_finding_time.png
:width: 750

Clearly the exhaustive (``'optimal'``, ``'branch-all'``) and exponential
(``'branch-2'``) searches eventually scale badly, but for modest numbers of
terms they incur only a small overhead.

Next we can look at the average FLOP speedup (as compared to the easiest path
to find, ``'greedy'``):

.. image:: path_found_flops.png
:width: 750


One can see that the hierarchy of path qualities is:

1. ``'optimal'`` (used by auto for ``n <= 4``)
2. ``'branch-all'`` (used by auto for ``n <= 6``)
3. ``'branch-2'`` (used by auto for ``n <= 8``)
4. ``'branch-1'`` (used by auto for ``n <= 14``)
5. ``'greedy'`` (used by auto for anything larger)

There are a few important caveats to note with this graph. Firstly, the
benefits of more advanced path finding are very dependent on the complexity of
the expression. For 'simple' contractions, all the different approaches will
*mostly* find the same path (as here). However, for 'tricky' contractions, there
will be certain cases where the more advanced algorithms will find much better
paths. As such, while this graph gives a good idea of the *relative* performance
of each algorithm, the 'average speedup' is not a perfect indicator since
worst-case performance might be more critical.

Note that the speedups for any of the methods as compared to a standard
``einsum`` or a naively chosen path (such as ``path=[(0, 1), (0, 1), ...]``)
are all exponentially large and not shown.
Binary file added docs/source/path_finding_time.png
Binary file added docs/source/path_found_flops.png
3 changes: 2 additions & 1 deletion docs/source/reusing_paths.rst
@@ -96,7 +96,8 @@ two contractions to compute the output.
.. note::

The constant part of an expression is lazily generated upon the first call
(specific to each backend), though it can also be explicitly built by calling
:meth:`~opt_einsum.contract.ContractExpression.evaluate_constants`.

We can confirm the advantage of using expressions and constants by timing the
following scenarios, first setting
