This program evaluates restricted RI-CCSD energy. By pure Rust code.
To non-chemists: RI-CCSD can be seen as a group of dense 2-4 dimension tensor numerical computations. Most of tasks in RI-CCSD can be converted to matrix multiplication, so it is mostly compute-bounded.
RI-CCSD is 4-dimension problem in it's nature; coding with libraries that focus on 2-dimension matrices is usually not convenient, if not impossible.
Refer incomplete article list (Q-Chem, Psi4 fnocc, FHI-Aims, Gamess US, to name a few) to interested readers.
~ 600 lines of code riccsd.rs, with comments that how numpy implements with np.einsum
.
This showcase uses RSTSR as tensor library with commit 403e815. This crate is under development, and currently has not been published to crates.io.
This showcase hope to show that, with some sugar from tensor (n-dimensional array) library, rust can handle problems that most FLOPs come from matmul of large dense matrices, with
- acceptable lines of code (maybe 1.5-4 times of numpy if
np.einsum
is not allowed) - acceptable efficiency
- rayon parallel with acceptable unsafe mutables
In other words, rust is a proper candidate for a few scientific computing tasks, to balance program efficiency / code readibility / development efficiency / memory reliablity.
This conclusion is not trivial, and existing tensor libraries in rust language seem not fully prepared for this task; so a showcase project is here to demonstrate the possibility.
Computation device information:
- personal computer
- AMD Ryzen 7945HX, 16 physical cores
- only one NUMA node
- 2 x 32 GB memory (5600 MT/s)
System information:
- (H2O)10 cluster, PP5 structure from 10.1021/jp104865w.
- basis: cc-pVDZ
- auxiliary basis: cc-pVDZ-ri (both for SCF and CCSD, which is not recommended for real-world evaluation, but this project is only efficiency benchmark)
-
$n_\mathrm{occ} = 40$ (frozen core),$n_\mathrm{vir} = 190$ ,$n_\mathrm{aux} = 820$ .
this showcase | Psi4 | PySCF | |
---|---|---|---|
corr eng (a.u.) | -2.1735512 | -2.1735494 | -2.1735499 |
time each iter (sec) | ~ 18.5 | ~ 19.0 | ~ 29.5 |
version | - | 1.9.1 (conda-forge) | 2.7.0 (pypi) |
math library | OpenBLAS (compiled) | Intel OneAPI (conda-forge) | OpenBLAS (pypi) |
math library threading | pthread | TBB | serial |
algorithms | DF | DF (fnocc) | Conv |
DF refers to density fitting integral and algorithms, Conv refers to conventional integral.
Some important notes:
- Psi4 uses Intel OneAPI (MKL), which is not very efficienct on AMD CPUs, so this is actually not fair comparasion. Estimated 20% efficiency boost if using OpenBLAS.
- Psi4 have multiple CC engines. FNOCC is more efficient, while OCC have more functionalities.
- By comparing to conventional integral algorithms, density fitting (RI-CCSD) actually increases FLOPs, but only decreases memory footprints for large species (if I get both RI-CCSD and Conv-CCSD algorithms correctly).
- Time of each iter: ~ 18.5 sec
-
$O(n_\mathrm{occ}^3 n_\mathrm{vir}^3)$ term: ~ 6.0 sec- FLOPs estimation: slightly larger than
$2 \times 4 \times n_\mathrm{occ}^3 n_\mathrm{vir}^3 = 3.27 \ \mathrm{T}$ - ~ 540 GFLOP/sec, 48% CPU maximum L1 bandwidth
- FLOPs estimation: slightly larger than
-
$O(n_\mathrm{occ}^2 n_\mathrm{vir}^4)$ term (pp-Ladder): ~ 8.8 sec- FLOPs estimation: slightly larger than
$2 \times (n_\mathrm{vir}^4 n_\mathrm{aux} + 0.5 \times n_\mathrm{occ}^2 n_\mathrm{vir}^4) = 3.98 \ \mathrm{T}$ - ~ 450 GFLOP/sec, 40% CPU maximum L1 bandwidth
- FLOPs estimation: slightly larger than
We expect 50% efficiency usage is achievable, but that requires more fine-tuned code.
This project has not optimized for lowering memory footprints. This code accepts dupilcating some
This project is only for efficiency and code style demonstration. Usability is not the first concern.
Binary file is available for this showcase. Refer to release page.
To use this binary, some preparation is required:
export RAYON_NUM_THREADS=16 # number of parallel
export RUST_MIN_STACK=16777216 # could be larger if stack overflow
export LD_LIBRARY_PATH=<your pthread openblas directory>:$LD_LIBRARY_PATH
./showcase_rust_riccsd_glibc_2.17 <directory of your npy files>
This project requires the user (more details in env file or vscode setting)
- Provide
libopenblas.so
(pthread scheme) in$LD_LIBRARY_PATH
. Due to how rust's FFI works, OpenMP compiled OpenBLAS does not work; - Provide
*.npy
files, in pyscf convention (see python_scripts for details):- mo_coeff.npy (in c-contiguous, shape (nao, nmo))
- mo_energy.npy
- mo_coeff.npy
- cderi.npy (in lower-triangular packed AO basis)