Crosscheck with Paper, complex matrix optimizations #137

Merged: 59 commits merged into carstenbauer:master on Nov 1, 2021

Conversation

@ffreyer (Collaborator) commented on Sep 12, 2021

This PR will add a crosscheck/example comparing to https://arxiv.org/pdf/1912.08848.pdf.

TODO

  • finish complex linalg optimizations
  • implement model
  • get matching observables (t5 = 0, L = 8 for simplicity)
    • Z - spin susceptibility (figure 1d red)
    • Superfluid Stiffness (figure 2b)
    • s-wave pairing correlation (figure 3a, solid blue)
    • charge susceptibility (figure 3a, solid red)
  • document the crosscheck as an example
  • update other documentation

General Changes:

  • switch to $\Delta(i) \Delta^\dagger(k) + \Delta^\dagger(i) \Delta(k)$ for the pairing correlation. This matches the paper above (removing terms that are 0 in DQMC) and some others (e.g. the triangular crosscheck); see the sketch after this list.
  • fix some issues with loading moved BufferedConfigRecorder files and saving uninitialized BufferedConfigRecorder
  • better heuristic for counting the number of nearest neighbors (by actually counting equal-length distances)
  • add warning if T is not (approximately) hermitian
  • rename dagger to swapop (see Better syntax for DQMC measurements #135)
  • add superfluid_stiffness based on this paper
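
For context, here is a minimal sketch of what the combined estimator reduces to after Wick's theorem, assuming a spin-symmetric equal-time Greens function G[i, k] = ⟨c_i c†_k⟩ (per spin sector) and s-wave pairs Δ(i) = c_{i↑} c_{i↓}. This is an illustration, not the package's actual kernel:

```julia
# Illustrative sketch only (not MonteCarlo.jl's pc kernel): symmetrized s-wave
# pairing correlation ⟨Δ(i)Δ†(k) + Δ†(i)Δ(k)⟩ from a spin-symmetric equal-time
# Greens function G with G[i, k] = ⟨c_i c†_k⟩ per spin sector.
function combined_pairing_sketch(G::AbstractMatrix, i::Integer, k::Integer)
    # ⟨Δ(i)Δ†(k)⟩ = G↑(i, k) * G↓(i, k)
    term1 = G[i, k] * G[i, k]
    # ⟨Δ†(i)Δ(k)⟩ = (δ_ki - G(k, i))↑ * (δ_ki - G(k, i))↓
    IG = (i == k) - G[k, i]
    return term1 + IG * IG
end
```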

Complex Linear Algebra

The model from the reference paper uses complex hoppings. To run it efficiently, this PR implements all the previously neglected complex linear algebra. It also cleans up BlockDiagonal so that it works with ComplexF64 and can be combined with complex StructArrays. Closes #78

  • cleanup/fix BlockDiagonal for generic matrices (i.e. complex)
    • let all vmul! (and similar) methods fall back onto other vmul! methods. This performs a little worse but keeps things generic. Maybe @inline would help?
    • remove some resulting redundancies
    • generalize most methods to arbitrarily sized blocks (though typically still square)
  • finish implementing Complex StructArrays
    • closes complex matrices are not optimized #78
    • note that BLAS uses threads by default while our LoopVectorization methods do not. With multiple threads BLAS outperforms us; with BLAS.set_num_threads(1) our methods win. I assume the single-threaded BLAS performance is the relevant one in the one-simulation-per-core setting (i.e. on a supercomputer). A sketch of such a kernel follows right after this list.
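
For illustration, a matrix multiply over the split real/imaginary storage might look roughly like the sketch below. This is not the package's actual vmul! implementation, just a minimal version assuming the complex matrices are StructArrays exposing .re and .im field arrays:

```julia
using StructArrays, LoopVectorization

# Sketch: C = A * B for complex matrices stored as StructArrays, operating on
# the separate real and imaginary backing arrays with LoopVectorization.
function vmul_sketch!(C::StructArray{ComplexF64, 2},
                      A::StructArray{ComplexF64, 2},
                      B::StructArray{ComplexF64, 2})
    Cre, Cim = C.re, C.im
    Are, Aim = A.re, A.im
    Bre, Bim = B.re, B.im
    @turbo for i in axes(Are, 1), j in axes(Bre, 2)
        cre = zero(eltype(Cre))
        cim = zero(eltype(Cim))
        for k in axes(Are, 2)
            # (a + ib)(c + id) = (ac - bd) + i(ad + bc)
            cre += Are[i, k] * Bre[k, j] - Aim[i, k] * Bim[k, j]
            cim += Are[i, k] * Bim[k, j] + Aim[i, k] * Bre[k, j]
        end
        Cre[i, j] = cre
        Cim[i, j] = cim
    end
    return C
end
```

For the threading comparison above, LinearAlgebra.BLAS.set_num_threads(1) is what pins BLAS to a single thread.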

Lattice Iterator revamp/changes

The reference paper calculates the current-current correlations with unsynchronized directions. However, it's not just every bond up to the K-th farthest: it's every bond that has a hopping associated with it, excluding reversed bonds (these are explicitly included in the formula pre-Wick's) and bonds with dir[bond][1] == 0. This gets rid of quite a few bonds. With our current iterators and NN and NNN bonds, for example, we'd include 9 directions (on-site, 4x NN, 4x NNN) but only really need 3 (the +x NN direction and the 2 NNN directions with positive x component). And because the directions are unsynchronized, their number enters squared, so we would do 9²/3² = 9 times the necessary work.

To fix this I've revamped the lattice iterator system here. The main goal was to allow passing directional indices directly, e.g. EachLocalQuadByDistance([2, 5, 9]). I also wanted to keep memory usage down without relying on only one of each lattice iterator type being created at the start of the simulation. So I made the following changes:

There is now a LatticeIteratorCache attached to DQMC. Perhaps later I will merge this with the lattice. The cache contains a dict that gets filled with maps such as dir_idx -> (src, trg), which are accessible via cache[Dir2SrcTrg()] and can be created via push!(cache, Dir2SrcTrg(), lattice). Every lattice iterator creates the maps it needs (no duplicates) and holds references to them for iteration. So now the difference between 1 and 1000 EachLocalQuadByDistance iterators is just 3000 integers and 3000 references.
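
A rough sketch of that mechanism (LatticeIteratorCache, Dir2SrcTrg and the cache[...] / push!(...) calls are taken from the description above; the Dict layout and the directions_to_pairs helper are made up for illustration):

```julia
# Illustrative sketch, not the package code: a cache of lattice-derived maps
# that iterators share by reference instead of rebuilding them.
struct Dir2SrcTrg end   # key: "directional index -> list of (src, trg) pairs"

struct LatticeIteratorCache
    maps::Dict{DataType, Any}
end
LatticeIteratorCache() = LatticeIteratorCache(Dict{DataType, Any}())

# Hypothetical helper: group all (src, trg) site pairs of `lattice` by the
# index of their direction, so result[dir_idx] is a Vector{Tuple{Int, Int}}.
function directions_to_pairs(lattice)
    # ... would be filled from the lattice geometry ...
    return Vector{Tuple{Int, Int}}[]
end

# Build each map at most once; repeated push!-es with the same key are no-ops.
function Base.push!(cache::LatticeIteratorCache, ::Dir2SrcTrg, lattice)
    get!(() -> directions_to_pairs(lattice), cache.maps, Dir2SrcTrg)
    return cache
end

Base.getindex(cache::LatticeIteratorCache, ::Dir2SrcTrg) = cache.maps[Dir2SrcTrg]
```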

The second change I made is that every iterator now has a template version and a runtime version. The template is lightweight, is used as a field in DQMCMeasurements, and can be saved without extra care. The runtime version (same type name with a _ in front) follows from the template and constructs the maps it needs during creation.
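
Continuing the sketch above, the template/runtime split could look roughly like this (the underscore naming follows the description; the fields and constructor are illustrative):

```julia
# Illustrative sketch: lightweight template vs. heavier runtime iterator.
struct EachLocalQuadByDistance          # template: cheap, serializable
    dir_idxs::Vector{Int}
end

struct _EachLocalQuadByDistance{M}      # runtime version built from the template
    dir_idxs::Vector{Int}
    dir2srctrg::M                       # reference into the shared cache
end

# The runtime iterator registers the map it needs in the cache (creating it
# only if it does not exist yet) and keeps a reference to it.
function _EachLocalQuadByDistance(template::EachLocalQuadByDistance, cache, lattice)
    push!(cache, Dir2SrcTrg(), lattice)          # from the sketch above
    _EachLocalQuadByDistance(template.dir_idxs, cache[Dir2SrcTrg()])
end
```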

@ffreyer (Collaborator, Author) commented on Sep 21, 2021

https://carstenbauer.github.io/MonteCarlo.jl/dev/examples/triangular_Hubbard/ clearly does not use $\Delta^\dagger \Delta + \Delta \Delta^\dagger$ for its pairing correlation. It fits way worse than $\Delta \Delta^\dagger$. This needs to be adjusted if I go through with switching the default from pc_kernel to pc_combined_kernel.
The reference for this PR clearly does use the combined kernel though...

@ffreyer (Collaborator, Author) commented on Sep 22, 2021

I benchmarked the new model with StructArrays - thermalization is twice as fast (roughly 0.5s -> 0.25s in my test), but measurements are just as slow (2s -> 2s, occupation and pairing susceptibility).

I see two potential reasons for this:

  1. The real and imaginary parts of the Greens function are not indexed in separate loops, which is not cache friendly.
  2. Lattice iterators themselves are probably not cache friendly. They're essentially semi-random lists of indices. But given that they are static per simulation there is probably some optimization potential here as well...

@ffreyer (Collaborator, Author) commented on Sep 27, 2021

Making use of StructArrays in measurements is tough at the moment. There is no left vs. right multiplier like in mul!, matrix elements could technically be multiplied by i, or conjugates could be taken. So instead I'm just converting complex StructArrays back to plain complex matrices. This isn't optimal, but it's still a lot better... (~2s -> ~1.33s)
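
For illustration, the conversion amounts to something like this sketch (variable names made up):

```julia
using StructArrays

# A StructArray-backed complex matrix keeps its real and imaginary parts in
# two separate Float64 arrays.
G_split = StructArray{ComplexF64}((rand(8, 8), rand(8, 8)))

# Convert back to a plain dense complex matrix before running measurement kernels.
G_dense = complex.(G_split.re, G_split.im)    # Matrix{ComplexF64}
```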

I think LoopVectorization is what's making the tests slow. Precompilation doesn't help, it just moves the time spent from the include to the testset...
@ffreyer changed the title from "Crosscheck with Paper" to "Crosscheck with Paper, complex matrix optimizations" on Oct 12, 2021
@ffreyer merged commit b348442 into carstenbauer:master on Nov 1, 2021
@ffreyer mentioned this pull request on Dec 13, 2021