Spin-weighted spherical harmonics and refactored use of SIMD #43

MikaelSlevinsky · 2020-04-26T02:02:18Z

New features in this PR:

Support for spin-weighted spherical harmonic transforms. They are orthonormalized in L^2, with the complex Fourier series as longitudinal basis and with complex coefficients.
Three new functions (each in float and double variants): Horner's rule, Clenshaw's algorithm for Chebyshev series, and Clenshaw's algorithm for orthogonal polynomial series.
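For readers unfamiliar with these evaluation schemes, here is a minimal scalar sketch of Horner's rule and of Clenshaw's algorithm for a Chebyshev series. The function names are hypothetical and this is not the library's SIMD implementation. The argument order loosely follows the (n, c, incc, m, x, f) template given later in this description, and the assumption that coefficients are stored in ascending degree is mine.

```c
/* Hypothetical scalar reference sketches, not the library's SIMD kernels.
   Assumption: c[k*incc] is the coefficient of degree k, for k = 0..n-1. */

/* Horner's rule: f[j] = sum_k c[k*incc] * x[j]^k. */
void horner_sketch(int n, const double *c, int incc,
                   int m, const double *x, double *f)
{
    for (int j = 0; j < m; j++) {
        double xj = x[j];              /* read first: x and f may alias */
        double acc = 0.0;
        for (int k = n - 1; k >= 0; k--)
            acc = acc * xj + c[k * incc];
        f[j] = acc;
    }
}

/* Clenshaw's algorithm: f[j] = sum_k c[k*incc] * T_k(x[j]),
   using the recurrence T_{k+1}(x) = 2 x T_k(x) - T_{k-1}(x). */
void clenshaw_chebyshev_sketch(int n, const double *c, int incc,
                               int m, const double *x, double *f)
{
    for (int j = 0; j < m; j++) {
        double xj = x[j];              /* read first: x and f may alias */
        double bk, bk1 = 0.0, bk2 = 0.0;
        for (int k = n - 1; k >= 1; k--) {
            bk = c[k * incc] + 2.0 * xj * bk1 - bk2;
            bk2 = bk1;
            bk1 = bk;
        }
        f[j] = (n > 0 ? c[0] : 0.0) + xj * bk1 - bk2;
    }
}
```

With coefficients 1, 2, 3 and x = 0.5, the Horner sketch gives 1 + 2(0.5) + 3(0.25) = 2.75 and the Chebyshev sketch gives T_0 + 2 T_1 + 3 T_2 = 1 + 1 - 1.5 = 0.5.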

Improvements in this PR:

The code is now designed with a cross-compiler in mind. For performance-critical tasks, SIMD is hidden from the user interface and is instead dispatched based on CPU ID. This allows a cross-compiler to include functions with more advanced SIMD than the host computer supports, while a runtime check ensures that only the best SIMD level available on the running machine is dispatched (closes #12, "Do not declare backups if Intel instruction sets are not available", and #41, "Use runtime checks to dispatch on SIMD").
The computational kernels for the spherical/triangular/disk harmonics are refactored to not only use the correct types of registers, but also help the compiler maximize throughput. This relies on a property of Givens rotations that two adjacent rotations commute if they do not act on the same rows. This property allows one to re-order the Givens rotations to increase the ratio of computation to memory loads/stores. The computational kernels and execute drivers are largely generated by a macro, which means the code may already be prepared for AVX-1024 when the instruction sets are available in GCC. Part of this is the introduction of the ft_simd struct to store a bit-field of a variety of SIMD extensions.
The real-to-real FFTW routines now use fftw_execute_dft_r2c and fftw_execute_dft_c2r instead of FFTW_R2HC and FFTW_HC2R-type real-to-real transforms to avoid a global transpose of the data.
The performance benchmark timings were not scaling as O(n³) because one needs to call a function a few times, typically at least twice, before peak performance is realized. These are now updated and the macro FT_TIME helps to bring this support system-wide.
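The commutation property of Givens rotations mentioned in the list above can be checked numerically. The following is a small sketch under stated assumptions: the function names, the sign convention of the rotation, and the test vector are all hypothetical and are not taken from the library's kernels.

```c
/* Apply a Givens rotation (cosine c, sine s) to entries p and q of v.
   The sign convention used here is an assumption. */
void givens_apply_sketch(double *v, int p, int q, double c, double s)
{
    double vp = v[p], vq = v[q];
    v[p] = c * vp + s * vq;
    v[q] = c * vq - s * vp;
}

/* Apply rotation 1 on rows (p1, q1) and rotation 2 on rows (p2, q2) to the
   same 4-vector in both orders; return the largest entrywise difference. */
double givens_order_gap_sketch(int p1, int q1, int p2, int q2)
{
    const double c1 = 0.6, s1 = 0.8;   /* c1^2 + s1^2 = 1 */
    const double c2 = 0.8, s2 = 0.6;   /* c2^2 + s2^2 = 1 */
    double a[4] = {1.0, 2.0, 3.0, 4.0};
    double b[4] = {1.0, 2.0, 3.0, 4.0};
    double gap = 0.0;

    /* Order A: rotation 1, then rotation 2. */
    givens_apply_sketch(a, p1, q1, c1, s1);
    givens_apply_sketch(a, p2, q2, c2, s2);

    /* Order B: rotation 2, then rotation 1. */
    givens_apply_sketch(b, p2, q2, c2, s2);
    givens_apply_sketch(b, p1, q1, c1, s1);

    for (int i = 0; i < 4; i++) {
        double d = a[i] - b[i];
        if (d < 0.0) d = -d;
        if (d > gap) gap = d;
    }
    return gap;
}
```

Under these assumptions, givens_order_gap_sketch(0, 1, 2, 3) compares rotations on disjoint rows and should return zero up to rounding, while givens_order_gap_sketch(0, 1, 1, 2) shares row 1 and returns a gap of order one. This is why rotations on disjoint rows can be re-ordered freely to keep more data in registers between loads and stores.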

New Examples in this PR:

spinweighted.c is a basic tutorial on how to use spin-weighted spherical harmonic transforms.

Releases no longer trigger the attachment of binaries, as compilation with -march=native may fail on a host computer.

The template for Horner's rule is:

ft_horner(n, c, incc, m, x, f)

For Clenshaw's algorithm, it is:

ft_clenshaw(n, c, incc, m, x, f)

The coefficients have stride incc, but the points and the output array must be contiguous. The points and output pointers may be aliased, so in-place evaluation works with a single array serving as both x and f.

The design is as follows:

1. Find the native SIMD flags on the machine that is compiling the code.
2. Define as many internal functions as possible using the highest level of SIMD that those flags permit. Otherwise, define fallbacks.
3. Define the exported functions with SIMD dispatch based on GCC's cpuid.

This approach allows one to cross-compile and generate optimal code for machines that are as good as the cross compiler. Thanks to the fallbacks, the code is also callable from machines that are newer than the cross compiler (in case the cross compiler is not state-of-the-art), though their newer SIMD is not accessible.
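The three design steps can be illustrated with a minimal sketch. Note the substitutions: it uses the GCC/Clang builtin __builtin_cpu_supports and the target function attribute in place of the raw cpuid queries and build-time flags that the PR describes, and every type and function name in it is hypothetical (the struct is only a stand-in for the idea behind ft_simd, not its actual layout).

```c
#include <stddef.h>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#define SKETCH_X86_GNU 1
#else
#define SKETCH_X86_GNU 0
#endif

/* Hypothetical bit-field of SIMD extensions. */
typedef struct {
    unsigned sse2    : 1;
    unsigned avx     : 1;
    unsigned avx2    : 1;
    unsigned avx512f : 1;
} simd_flags_sketch;

/* Query the CPU that is running the code, not the one that compiled it. */
simd_flags_sketch detect_simd_sketch(void)
{
    simd_flags_sketch s = {0, 0, 0, 0};
#if SKETCH_X86_GNU
    __builtin_cpu_init();
    s.sse2    = __builtin_cpu_supports("sse2")    ? 1 : 0;
    s.avx     = __builtin_cpu_supports("avx")     ? 1 : 0;
    s.avx2    = __builtin_cpu_supports("avx2")    ? 1 : 0;
    s.avx512f = __builtin_cpu_supports("avx512f") ? 1 : 0;
#endif
    return s;
}

/* Fallback kernel: plain C, valid on every machine. */
static void scale_default(size_t n, double a, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}

#if SKETCH_X86_GNU
/* Same loop, but the compiler may emit AVX2 instructions for this one
   function even if the rest of the file targets an older baseline. */
__attribute__((target("avx2")))
static void scale_avx2(size_t n, double a, double *x)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}
#endif

/* Exported entry point: the caller never sees the SIMD variants. */
void scale_dispatch(size_t n, double a, double *x)
{
    simd_flags_sketch s = detect_simd_sketch(); /* real code would cache this */
#if SKETCH_X86_GNU
    if (s.avx2) {
        scale_avx2(n, a, x);
        return;
    }
#endif
    (void)s;
    scale_default(n, a, x);
}
```

The AVX2 variant is only ever called after the runtime check succeeds, so a binary built this way should still run on a machine without AVX2 by taking the fallback path.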

MikaelSlevinsky added 30 commits April 4, 2020 12:13

attempt to fix appveyor

6b10664

attempt to fix linux builds

8b5047e

The reported error is:
/usr/bin/ld: /tmp/ccuNTFaV.o: relocation R_X86_64_PC32 against undefined symbol `memset@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value

windows fix bis

99d46df

link too?

0785e0a

add tests for AVX512F

e79da25

apparently, linux build machines already have it

check SIMD checks

720988b

AVX512F is an extended feature, which needs to be checked with __get_cpuid_count(7, ...)

5218346

attempt cpuid

74e031a

fix spacing

17b8307

temporarily remove the recurrence module

0fc3695

separate assembly from source in an attempt to get library flags closer to source?

e900e57

flatten the directory tree?

7176171

specify the assembly without *

864ac1e

gotta pump up those levels

7b0eaa4

The beginning of the great upheaval (le grand dérangement)

9b21c11

double4 => double8

75fbfe0

restore all drivers tests

42a43ce

make the addition theorem work for any degree-n Legendre polynomial

01d2400

not just degree 4

cleanup execute_sph and kernel_sph

0cd54fc

does this hack fix the bug?

4739a3b

fftw_execute_r2r => fftw_execute_dft_c2r

deaaea7

The dft_c2r obviates the need for a global transpose (in colswap) outside the fftw call. Instead, it requires setting and extracting some data from fftw_complex arrays.

complex.h is apparently not always loaded in cblas.h

a75dcb9

add spin-weighted spherical harmonic transforms!

3603fd8

try to remove most warnings across gcc versions & platforms

ca9595e

up error tolerance for triangular transforms for Windows

ok better fix

daf7bfa

revert fix

44cfc8a

no warning in gcc-8 is a warning in gcc-7 and vice versa

refactor disk harmonic drivers!

6e3e33b

remove commented code

7d95c54

finish cleaning up test_drivers and test_fftw

13dfcb0

MikaelSlevinsky added 5 commits April 25, 2020 15:42

add spin-weighted example

c38267c

Add complex conjugation to docs remove deploy binaries

the algorithms work with N x M arrays, therefore the docs should reflect that the degrees are all N-1

835e46a

\dh unrecognized => tab complete

hide SIMD tetrahedral prototypes from interface

8d14dec

update perf benchmark

a0d2cd8

add with

e89e163

MikaelSlevinsky merged commit decaec0 into master Apr 26, 2020

MikaelSlevinsky deleted the feat-recurrence-simd branch April 26, 2020 15:29
