Skip to content

Spin-weighted spherical harmonics and refactored use of SIMD #43

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Apr 26, 2020

Conversation

MikaelSlevinsky
Copy link
Owner

New features in this PR:

  • Support for spin-weighted spherical harmonic transforms. They are orthonormalized in L^2, with the complex Fourier series as longitudinal basis and with complex coefficients.
  • Three new functions (x2 for float/double) for Horner's rule and Clenshaw's algorithm for Chebyshev series and for orthogonal polynomial series as well.

Improvements in this PR:

  • The code is now designed with a cross-compiler in mind. For performance-critical tasks, SIMD is hidden from the user interface and instead is dispatched based on CPU ID. This allows a cross-compiler to include functions with more advanced SIMD than legal for the host computer, but a runtime check ensures that only the best SIMD level is dispatched (closes Do not declare backups if Intel instruction sets are not available #12 and Use runtime checks to dispatch on SIMD #41).
  • The computational kernels for the spherical/triangular/disk harmonics are refactored to not only use the correct types of registers, but also help the compiler maximize throughput. This relies on a property of Givens rotations that two adjacent rotations commute if they do not act on the same rows. This property allows one to re-order the Givens rotations to increase the ratio of computation to memory loads/stores. The computational kernels and execute drivers are largely generated by a macro, which means the code may already be prepared for AVX-1024 when the instruction sets are available in GCC. Part of this is the introduction of the ft_simd struct to store a bit-field of a variety of SIMD extensions.
  • The real-to-real FFTW routines now use fftw_execute_dft_r2c and fftw_execute_dft_r2c instead of FFTW_R2HC and FFTW_HC2R-type real-to-real transforms to avoid a global transpose of the data.
  • The performance benchmark timings were not scaling as O(n3) because one needs to call a function a few times, typically at least twice, before peak performance is realized. These are now updated and the macro FT_TIME helps to bring this support system-wide.

New Examples in this PR:

  • spinweighted.c is a basic tutorial on how to use spin-weighted spherical harmonic transforms.

Releases no longer trigger the attachment of binaries, as compilation with -march=native may fail on a host computer.

The template for Horner's rule is:

ft_horner(n, c, incc, m, x, f)

For Clenshaw's algorithm, it's

ft_clenshaw(n, c, incc, m, x, f)

The coefficients have stride incc, but the points and the output array must be contiguous.

The points and output pointers may be aliased so that in-place evaluation can work with one array for x and f.

The design is as follows:
1. Find the native SIMD flags on the machine that is compiling the code.
2. Define as many internal functions using the highest level of SIMD with SIMD flags. Otherwise, define fallbacks.
3. Define the exported functions with SIMD dispatch based on GCC's cpuid.

This approach allows one to cross-compile and generate optimal code for machines that are as good as the cross compiler. It also allows the code to be callable from machines that are newer than the cross compiler (in case the cross compiler is not state-of-the-art) due to the fallbacks, though newer SIMD is not accessible.
The reported error is:

/usr/bin/ld: /tmp/ccuNTFaV.o: relocation R_X86_64_PC32 against undefined symbol `memset@@GLIBC_2.2.5' can not be used when making a shared object; recompile with -fPIC
500/usr/bin/ld: final link failed: Bad value
apparently, linux build machines already have it
The dft_c2r obviates the need for a global transpose (in colswap) outside the fftw call. Instead, it requires setting and extracting some data from fftw_complex arrays.
up error tolerance for triangular transforms for Windows
no warning in gcc-8 is a warning in gcc-7 and vice versa
@MikaelSlevinsky MikaelSlevinsky merged commit decaec0 into master Apr 26, 2020
@MikaelSlevinsky MikaelSlevinsky deleted the feat-recurrence-simd branch April 26, 2020 15:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not declare backups if Intel instruction sets are not available
1 participant