Need FP64 tensor contractions and can't buy a datacenter GPU because you already maxed out your home equity line of credit?
Set your CPU on fire with TBLIS!
pytblis.einsum and pytblis.tensordot are drop-in replacements for
numpy.einsum and numpy.tensordot.
In addition, low-level wrappers are provided for
tblis_tensor_add, tblis_tensor_mult, tblis_tensor_reduce, tblis_tensor_shift, and tblis_tensor_dot.
These are named pytblis.add, pytblis.mult, and so on.
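For instance, because the high-level functions are drop-in replacements, existing numpy code can be switched over just by changing which function is called. A minimal sketch (the import fallback is only there so the example runs even where pytblis is not installed):

```python
import numpy as np

try:
    import pytblis
    einsum, tensordot = pytblis.einsum, pytblis.tensordot
except ImportError:
    # Fall back to numpy so this sketch still runs without pytblis
    einsum, tensordot = np.einsum, np.tensordot

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

C = einsum("ij,jk->ik", A, B)   # same call signature as np.einsum
D = tensordot(A, B, axes=1)     # same call signature as np.tensordot

assert np.allclose(C, np.einsum("ij,jk->ik", A, B))
assert np.allclose(C, D)
```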
Finally, there are mid-level convenience wrappers for tblis_tensor_mult and
tblis_tensor_add:
```python
def contract(
    subscripts: str,
    a: ArrayLike,
    b: ArrayLike,
    alpha: scalar = 1.0,
    beta: scalar = 0.0,
    out: Optional[ArrayLike] = None,
    conja: bool = False,
    conjb: bool = False,
) -> ArrayLike
```

and

```python
def transpose_add(
    subscripts: str,
    a: ArrayLike,
    alpha: scalar = 1.0,
    beta: scalar = 0.0,
    out: Optional[ArrayLike] = None,
    conja: bool = False,
    conjout: bool = False,
) -> ArrayLike
```

These are used as follows:
```python
C = pytblis.contract("ij,jk->ik", A, B, alpha=1.0, beta=0.5, out=C, conja=True, conjb=False)
```

does C ← 1.0·conj(A)·B + 0.5·C, and

```python
B = pytblis.transpose_add("iklj->ijkl", A, alpha=-1.0, beta=1.0, out=B)
```

does B ← −1.0·permute(A) + 1.0·B, where the permutation of A's axes is given by the subscripts.
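The alpha/beta parameters follow the usual BLAS-like accumulate convention. A numpy reference sketch of that convention (contract_ref is an illustrative helper written for this example, not part of pytblis):

```python
import numpy as np

def contract_ref(subscripts, a, b, alpha=1.0, beta=0.0, out=None,
                 conja=False, conjb=False):
    """Numpy sketch of BLAS-like alpha/beta contraction semantics."""
    a = np.conj(a) if conja else a
    b = np.conj(b) if conjb else b
    res = alpha * np.einsum(subscripts, a, b)
    if out is None:
        return res
    out[...] = beta * out + res   # scale the output buffer, then accumulate
    return out

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = np.ones((2, 4))
contract_ref("ij,jk->ik", A, B, alpha=2.0, beta=0.5, out=C)
assert np.allclose(C, 2.0 * (A @ B) + 0.5)
```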
Some additional documentation (work in progress) is available at pytblis.readthedocs.io.
Supported datatypes: np.float32, np.float64, np.complex64,
np.complex128. Mixing arrays of different precisions isn't yet supported.
New in version v0.0.11: pytblis.contract fully supports contractions between
complex and/or real tensors of the same floating-point precision, provided that
alpha and beta are both real. Internally, this contracts the real and imaginary
parts separately with TBLIS. As of v0.0.14, this feature is enabled by default.
It can be turned off in pytblis.contract and pytblis.einsum by passing
complex_real_contractions=False.
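The real/imaginary split described above can be sketched in plain numpy (mixed_contract is illustrative, not the actual pytblis internals; real alpha assumed):

```python
import numpy as np

def mixed_contract(subscripts, a, b, alpha=1.0):
    """Contract a complex tensor with a real one by splitting the
    complex operand into real and imaginary parts."""
    re = np.einsum(subscripts, a.real, b)  # real part: a real-real contraction
    im = np.einsum(subscripts, a.imag, b)  # imaginary part: also real-real
    return alpha * (re + 1j * im)

rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
out = mixed_contract("ij,jk->ik", A, B)
assert np.allclose(out, np.einsum("ij,jk->ik", A, B))
```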
I will try to get this package added to conda-forge. In the meantime, conda packages may be downloaded from my personal channel:

```shell
conda install pytblis -c conda-forge -c chillenb
```
The pre-built macOS wheels on PyPI use pthreads for multithreading. The Linux wheels now use OpenMP, which is much more efficient. Mac users who want OpenMP should install from source or use the conda packages.

```shell
pip install pytblis   # pre-built wheel; not as performant on macOS
```
Don't use OpenBLAS configured with pthreads: it causes oversubscription when
combined with other multithreaded libraries, in particular anything that uses
OpenMP. Instead, use MKL (`libblas=*=*mkl`) or the OpenMP variant of OpenBLAS
(`libopenblas=*=*openmp*`).
To build from source with pip:

```shell
pip install --no-binary pytblis pytblis
```

The default compile options will give good performance. OpenMP is the default
thread model when building from source. You can pass additional options to CMake
via CMAKE_ARGS to change the thread model, compile for other CPU
microarchitectures, etc.
To link pytblis against an existing TBLIS installation:

- Install TBLIS.
- Run

  ```shell
  CMAKE_ARGS="-DTBLIS_ROOT=wherever_tblis_is_installed" pip install .
  ```

See dev_install.sh for an example. This script installs TBLIS
in ./local_tblis_prefix and then links pytblis against it.
If you use TBLIS in your academic work, it's a good idea to cite:
- High-Performance Tensor Contraction without Transposition
- Strassen's Algorithm for Tensor Contraction
TBLIS is not my work, and its developers are not responsible for flaws in these Python bindings.
The implementation of einsum and the tests are modified versions of those from opt_einsum.
pytblis was developed in the Zhu Group, Department of Chemistry, Yale University.