Version 1.1.0 - October 2021
- Add support for NVSHMEM communication for the Dslash operators, for
significantly improved strong scaling. See
https://github.com/lattice/quda/wiki/Multi-GPU-with-NVSHMEM for more
details.
- Addition of the MSPCG preconditioned CG solver for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Multi-Splitting-Preconditioned-Conjugate-Gradient-(MSPCG),-an-application-of-the-additive-Schwarz-Method
for more details.
- Addition of the Exact One Flavor Algorithm (EOFA) for Möbius
fermions. See
https://github.com/lattice/quda/wiki/The-Exact-One-Flavor-Algorithm-(EOFA)
for more details.
- Addition of a fully GPU native Implicitly Restarted Arnoldi
eigensolver (as opposed to partially relying on ARPACK). See
https://github.com/lattice/quda/wiki/QUDA%27s-eigensolvers#implicitly-restarted-arnoldi-eigensolver
for more details.
- Significantly reduced latency for reduction kernels through the use
of heterogeneous atomics. Requires CUDA 11.0+.
- Addition of support for a split-grid multi-RHS solver. See
https://github.com/lattice/quda/wiki/Split-Grid for more details.
- Continued work on enhancing and refining the staggered multigrid
algorithm. The MILC interface can now drive the staggered multigrid
solver.
- Multigrid setup can now use tensor cores on Volta, Turing and Ampere
GPUs to accelerate the calculation. Enable with the
`QudaMultigridParam::use_mma` parameter.
- Improved support of managed memory through the addition of a
prefetch API. This can dramatically improve the performance of the
multigrid setup when oversubscribing the memory.
- Improved the performance of using MILC RHMC with QUDA.
- Add support for a new internal data order FLOAT8. This is the
default data order for nSpin=4 half and quarter precision fields,
though the prior FLOAT4 order can be enabled with the cmake option
QUDA_FLOAT8=OFF.
- Removal of the singularity from the reconstruct-8 and reconstruct-9
compressed gauge field ordering. This enables support for free
fields with these orderings.
- The clover parameter convention has been codified: one can either
1.) pass in QudaInvertParam::kappa and QudaInvertParam::csw
separately, and QUDA will infer the necessary clover coefficient, or
2.) pass an explicit value of QudaInvertParam::clover_coeff
(e.g. CHROMA's use case) and that will override the above inference.
- QUDA now includes fast-compilation options (QUDA_FAST_COMPILE_DSLASH
and QUDA_FAST_COMPILE_REDUCE), which enable much faster build times
for development at the expense of reduced performance.
- Add support for compiling QUDA using clang for both the host and
device compiler.
- While the bulk of the work associated with making QUDA portable to
different architectures will form the soul of QUDA 2.0, some of the
initial refactoring associated with this has been applied.
- Significant cleanup of the tests directory to reduce boilerplate.
- General improvements to the cmake build system using modern cmake
features. We now require cmake 3.15.
- Extended the ctest list to include some optional benchmarks.
- Fix a long-standing issue with multi-node Kepler GPU and Intel dual
socket systems.
- Improved ASAN integration: SANITIZE builds now work out of the box
with no need to set the ASAN_OPTIONS environment variable.
- Add support for the extended QIO branch (now required for MILC).
- Bump QMP version to 2.5.3.
- Updated to Eigen 3.3.9.
- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/24?closed=1
Version 1.0.0 - 10 January 2020
- Add support for CUDA 10.2: QUDA 1.0.0 is supported on CUDA 7.5-10.2
using either GCC or clang compilers. CUDA 10.x and either GCC >=
6.x or clang >= 6.x are highly recommended.
- Significant improvements to the CMake build system and removal of the
legacy configure build.
- Added more targeted compilation options to constrain which
precisions and reconstruct types are compiled. QUDA_PRECISION is a
cmake parameter encoding a 4-bit number that selects which precisions
are enabled, with 1 = quarter, 2 = half, 4 = single and 8 = double;
the default is 14, which enables double, single and half precision.
QUDA_RECONSTRUCT is a 3-bit number selecting which reconstruct types
are enabled, with 1 = reconstruct-8/9, 2 = reconstruct-12/13 and 4 =
reconstruct-18; the default is 7, which enables all reconstruct
types.
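As an illustration of the bitmask convention above, a build restricted
to double and single precision with only reconstruct-12/13 and
reconstruct-18 might be configured as follows (a sketch; the source
path is a placeholder):

```shell
# Hypothetical configure step:
#   QUDA_PRECISION=12   (8 = double  + 4 = single)
#   QUDA_RECONSTRUCT=6  (4 = reconstruct-18 + 2 = reconstruct-12/13)
cmake -DQUDA_PRECISION=12 -DQUDA_RECONSTRUCT=6 /path/to/quda
```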
- Completely rewrote all dslash kernels using the accessor
framework. This dramatically reduces code complexity and improves
performance.
- New physics functionality added: gauge Laplace kernel, Gaussian
quark smearing, topological charge density.
- QUDA can now be built to either utilize texture-memory reads or to
use direct memory accessing (cmake option QUDA_TEX). The default
has textures on, though we note that since Pascal it can be
advantageous to disable textures and utilize direct reads.
- QUDA is no longer supported on the Fermi generation of GPUs (sm_20
and sm_21). Compilation and running should still be possible but
will require compilation with texture objects disabled.
- Added support for quarter precision (QUDA_QUARTER_PRECISION) for
the linear operator and associated solvers.
- Implemented both CA-CG and CA-GCR communication-avoiding solvers, for
use either as stand-alone solvers or as a means to accelerate
multigrid.
- Continued evolution and optimization of the multigrid framework.
Regardless, we advise users to use the latest develop branch when
using multigrid, since it continues to be a fast-moving target with
continual focus on optimization and improvement.
- An implementation of the Thick Restarted Lanczos Method (TRLM) for
eigenvector solving of the normal operator.
- Lanczos-accelerated multigrid through the use of coarse-grid
deflation and / or using singular vectors to define the prolongator.
- Removal of the legacy contraction and co-variant derivative
algorithms, and replacement with accessor-based rewrites.
- Improved heavy-quark residual convergence, which ensures correct
convergence for MILC heavy quark observables.
- Experimental support for Just-In-Time (JIT) compilation using Jitify.
- Significantly improved unit testing framework using ctest.
- QUDA can now be built to target Google's address sanitizer
(CMAKE_BUILD_TYPE option is SANITIZE) for improved debugging.
- QUDA can now download and install the USQCD libraries QMP and QIO
automatically as part of the compilation process. To enable this,
the option QUDA_DOWNLOAD_USQCD=ON should be set. As with the Eigen
installation, this requires access to the internet.
- QUDA can now download and install the ARPACK library automatically
if the QUDA_DOWNLOAD_ARPACK option is enabled.
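The two download options above can be combined in a single configure
invocation; a sketch (the source path is a placeholder):

```shell
# Fetch and build QMP/QIO and ARPACK automatically during the build
# (requires internet access at configure/build time)
cmake -DQUDA_DOWNLOAD_USQCD=ON -DQUDA_DOWNLOAD_ARPACK=ON /path/to/quda
```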
- Updated to CUB 1.8.
- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/21?closed=1
Version 0.9.0 - 24 July 2018
- Add support for CUDA 9.x: QUDA 0.9.0 is supported on CUDA 7.0-9.2.
- Continued focus on optimization of multi-GPU execution, with
particular emphasis on Dslash scaling. For more details on
optimizing multi-GPU performance, see
https://github.com/lattice/quda/wiki/Multi-GPU-Support
- On systems that support it, QUDA now uses direct peer-to-peer
communication between GPUs within the same node. The Dslash policy
autotuner will ascertain the optimal communication route to take,
whether it be to route through CPU memory, use DMA copy engines or
directly write the halo buffer to neighboring GPUs.
- On systems that support it, QUDA will take advantage of GPU Direct
RDMA. This is enabled through setting the environment variable
QUDA_ENABLE_GDR=1 which will augment the dslash tuning policies to
include policies using GPU-aware MPI to facilitate direct GPU-NIC
communication. This can improve strong scaling by up to 3x.
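As a sketch, enabling GDR for a run might look like the following
(the launcher and executable name are placeholders; a GPU-aware MPI
stack is assumed):

```shell
# Let the dslash policy autotuner consider GPU-aware MPI
# (GPU Direct RDMA) communication policies
export QUDA_ENABLE_GDR=1
mpirun -np 4 ./my_quda_app
```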
- Improved precision when using half precision (use rounding instead
of truncation when converting to/from float).
- Add support for symmetric preconditioning for 4-d preconditioned
Shamir and Möbius Dirac operators.
- Added initial support for multi-right-hand-side staggered Dirac
operator (treat the rhs index as a fifth dimension).
- Added initial implementation of block CG linear solver.
- Added BiCGStab(l) linear solver. The parameter "l" corresponds to
the size of the space to perform GCR-style residual minimization.
This is typically much better behaved than BiCGStab for the Wilson
and Wilson-clover linear systems.
- Initial version of adaptive multigrid fully implemented into QUDA.
- Creation of a multi-blas and multi-reduction framework; this is
essential for high performance in pipelined, block and
communication-avoiding solvers that work on "matrices of vectors" as
opposed to "scalars of vectors". The max tile size used by the
multi-blas framework is set by the QUDA_MAX_MULTI_BLAS_N cmake
parameter, which defaults to 4 for reduced compile time. For
production use of such solvers, this should be increased to 8..16.
- Optimization of multi-shift solver using multi-blas framework to permit
kernel fusion of all shift updates.
- Complete rewrite and optimization of clover inversion, HISQ force
kernels, HISQ link fattening algorithms using accessors.
- QUDA can now directly load/store from MILC's site structure array.
This removes the need to unpack and pack data prior to calling QUDA,
and dramatically reduces CPU overhead.
- Removal of legacy data structures and kernels. In particular
original single-GPU only ASQTAD fermion force has been removed.
- Implementation of STOUT fattening kernel.
- Significant improvement to the cmake build system to improve
compilation speed and aid productivity. In particular, QUDA now
supports being built as a shared library which greatly reduces link
time.
- Autoconf and configure build system is no longer supported.
- Automated unit testing of dslash_test and blas_test are now enabled
using ctest.
- Adds support for MPS, enabled through setting the environment
variable QUDA_ENABLE_MPS=1. This allows GPUs to be oversubscribed by
multiple processes, which can improve overall job throughput.
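A minimal sketch of an MPS-oversubscribed run (assumes the MPS control
daemon is already running; the launcher and executable name are
placeholders):

```shell
# Two ranks sharing one GPU via the CUDA Multi-Process Service
export QUDA_ENABLE_MPS=1
mpirun -np 2 ./my_quda_app
```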
- Implemented self-profiler that builds on top of autotuning
framework. Kernel profile is output to profile_n.tsv, where n=0,
with n incremented with each call to saveProfile (which dumps the
profile to disk). An equivalent algorithm policy profile is output
to profile_async_n.tsv which contains policies such as a complete
dslash. Filename prefix and path can be overridden using
QUDA_PROFILE_OUTPUT_BASE environment variable.
- Implemented simple tracing facility that dumps the flow of kernels
called through a single execution to trace.tsv. Enabled with
environment variable QUDA_ENABLE_TRACE=1.
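The profiling and tracing facilities above are both driven by
environment variables; for example (the output path is a placeholder):

```shell
# Override the profile filename prefix/path and enable kernel tracing
export QUDA_PROFILE_OUTPUT_BASE=/tmp/quda
export QUDA_ENABLE_TRACE=1
```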
- Multiple bug fixes and clean up to the library. Many of these are
listed here: https://github.com/lattice/quda/milestone/15?closed=1
Version 0.8.0 - 1st February 2016
- Removed all Tesla-generation GPU support from QUDA (sm_1x). As a
result, QUDA now requires a minimum of the Fermi-generation GPUs.
- Added support for building QUDA using cmake. This gives a much more
flexible and extensible build system as well as allowing
out-of-source-directory building. For details see:
https://github.com/lattice/quda/wiki/Building-QUDA-with-cmake
- Improved strong scaling of the multi-shift solver by overlapping the
shift updates with the subsequent iteration's dslash comms waiting.
- Improved performance of multi-shift solver by preventing unnecessary
refinement of shifted solutions once the residual falls below
floating point precision.
- Significantly improved performance of FloatNOrder accessor functors
to ensure vectorized memory accesses as well as removal of
unnecessary type conversions. This gives a significant speedup to
all algorithms that use these accessors.
- Significant improvement in compilation time using C++ traits to
prune build options.
- Added support for gauge-field reconstruction to naive staggered
fermions.
- Added hyper-cubic random number generator with multi-GPU support.
- Added topological charge computation.
- Added final computational routines to allow for complete off-load of
MILC staggered RHMC to QUDA (momActionQuda - compute the momentum
contribution to the action, projectSU3Quda - project the gauge field
back onto the SU(3) manifold).
- In the MILC interface staggered solver, the resident gauge field is
reused until it is invalidated by constructing new links (or
overridden with the `num_iters` back door flag).
- Improved gauge field unitarization robustness and added check for
NaN in the results.
- Some cleanup and kernel-fusion optimization of gauge force HISQ
force kernels. This also improves compilation time and reduces
library size.
- Added support for imaginary chemical potential to the staggered phase
application / removal kernel, as well as fixing bugs in this
routine.
- Algorithms that previously used double-precision atomics now use a
cub reduction. This drastically improves performance of such
routines.
- QUDA can now be configured to enable NVTX markup on the TimeProfile
class and MILC interface to give improved visual profiling.
- All gauge field copies now check for NaN when `HOST_DEBUG=yes` to
improve debugging.
- Set tunecache.tsv to be invalid if git id changes to ensure a valid
tune cache is used.
- Reduced BLAS tuning overhead, by setting the maximum grid size to be
twice the SM count to avoid an unnecessarily large parameter sweep.
- Added new profile that records total time spent in QUDA.
- Fixed bugs in long-link field generation.
- Multiple bug fixes to the library. Many of the fixes are listed here:
https://github.com/lattice/quda/pulls?q=is%3Apr+is%3Aclosed+milestone%3A%22QUDA+0.8.0%22
https://github.com/lattice/quda/issues?q=is%3Aissue+milestone%3A%22QUDA+0.8.0%22+is%3Aclosed
Version 0.7.2 - 07th October 2015
- Add support for separate temporal-spatial plaquette
- Fixed memory leak in MPI communications
- Fixed issues with assignment of GPUs to processes when using the QMP
backend with multiple nodes with multiple GPUs
- Fixed bug in MR solver which led to incorrect convergence
- Similar to the NVTX markup support for MPI added in 0.7.1, we now
support NVTX markup for calls to the MILC interface. Enabled by using
"--enable-milc-nvtx" when configuring QUDA.
- Multiple bug fixes to the library. Many of the fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.2%22+is%3Aclosed
Version 0.7.1 - 11th June 2015
- Added Maxwell-generation GPU support.
- Added automatic support for NVTX markup of MPI calls for visualizing
MPI calls in the visual profiler. Enabled by using
"--enable-mpi-nvtx" when configuring QUDA.
- Modified clover derivative code to use gauge::FloatNOrder structs,
which in the process adds support for different reconstruct types.
- Added autotuning support to clover derivative and sigma trace
computations.
- Multiple fixes and improvements to GPU_COMMS feature of QUDA: fixed
a bug when using full-field fermions, improved support on Cray
systems, and added much more robust checking of message memory when
host debugging is enabled.
- Multi-GPU dslash now correctly reports flops and bandwidth when
autotuning.
- Fixed a bug whereby the 5-d domain wall dslash was applied twice
on every invocation.
- Fixed a bug when using both improved staggered fermions and naive
staggered fermions with auto-tuning enabled.
- Fixed a bug with using fused exterior kernels with auto tuning that
could result in incorrect results.
- To aid debugging, QUDA now prints its version, including a git id
tag, when initialized.
- Drastically improved Doxygen markup of the MILC interface.
- Multiple bug fixes that affect stability and correctness throughout
the library. Many of these fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.1%22+is%3Aclosed
Version 0.7.0 - 4th February 2015
- Added support for twisted-clover, 4-d preconditioned domain wall and
4-d preconditioned Möbius fermions.
- Reworked auto-tuning framework to drastically reduce the lookup
overhead of querying the tune cache. This has the effect of
improving the strong scaling (greater than 10% improvement in solver
performance seen at scale).
- Support for GPU-aware MPI and GPUDirect RDMA for faster multi-GPU
communication. This option is enabled using the --enable-gpu-comms
option (GPU_COMMS in make.in), and requires a GPU-aware MPI stack
(MVAPICH or OpenMPI).
- Reduction in communication latency for half-precision dslash through
merging the main quark and norm fields into a contiguous buffer for
host to device transfers. This reduces API overhead and increases
sustained PCIe bandwidth.
- Added support for double buffering of the MPI receive buffers in the
multi-GPU dslash to allow for early preposting of MPI_Recv.
- Implemented an initial multi-threaded dslash (parallelizing between
MPI and CUDA API calls) to reduce overall CPU frequency sensitivity.
This implementation is embryonic: it simply provides for early
preposting of MPI_Recv and will be extended to parallelize between
MPI_Test and CUDA event querying.
- Added an alternative multi-GPU dslash where the update of the
boundary regions is deployed in a single kernel after all
communication is complete. This reduces kernel launch overhead and
ensures communication is done with maximum priority.
- Reworked multi-GPU dslash interface: there are now different
policies supported for a variety of execution flows. Supported
policies at the moment are QUDA_DSLASH (legacy multi-gpu that
utilizes face buffers for communication buffers), QUDA_DSLASH2 (the
default - regular multi-GPU dslash with CPU-routed communication),
QUDA_FUSED_DSLASH (use a single kernel to update all boundaries
after all communication has finished), QUDA_GPU_COMMS_DSLASH (all
communication emanates directly from GPU memory locations),
QUDA_PTHREADS_DSLASH (multi-threaded dslash). This can be described
as experimental, and changing the policy type has yet to be exposed
to the interface.
- New routines for construction of the clover matrix field and
inversion of the clover matrices (with optional computation of the
trace log of the clover field). Presently exposed by using
loadCloverQuda with NULL pointers to host fields to force
construction instead of download of the clover field.
- Implemented support for exact momentum exponentiation to
complement the pre-existing Taylor expanded variant
(updateGaugeFieldQuda).
- Partial implementation of the clover-field force terms
(clover_deriv_quda.cu and clover_trace_quda.cu).
- All extended gauge field creation routines have been offloaded to
QUDA, minimizing PCIe traffic and CPU time. This has
led to a significant speedup in routines that need this, e.g., the
gauge force.
- Initial support for extended fermion-field creation routines (only
supports staggered fields).
- Fermion field outer product implemented in QUDA. Only exposed for
staggered fermions at present (computeStaggeredOprodQuda).
- EigCG eigenvector deflation algorithm and subsequent initCG
implemented for the preconditioned normal operator. Added a
deflation_test to demonstrate the use of this algorithm.
- Implemented Lanczos eigenvector solver (no unit test yet for
demonstrating this - presently only hooked into the CPS).
- Implemented initial support for communication-avoiding s-step
solvers: CG - QUDA_MPCG_INVERTER and BiCGstab -
QUDA_MPBICGSTAB_INVERTER. These are only proofs of concept at the
moment and need to be optimized.
- Implemented initial support for overlapping domain-decomposition
preconditioners. Presently only proof of concept and needs further
development.
- Implemented initial support for applying different phases to a gauge
field. Presently only proof of concept and needs further
development. Will be useful for minimizing memory and PCIe traffic
in staggered HMC.
- Implemented support for computation of the gauge field plaquette.
- Implemented initial support for fermion-field contractions.
- Added support for the CGNE solver, to complement the already
existing CGNR.
- Improvements to stability and robustness of the solvers in mixed
precision. QUDA will default to always using a high precision
solution accumulator since this drastically improves convergence,
especially using half precision.
- Improved the stability and robustness of CG when used in combination
with the Fermilab heavy-quark residual stopping criterion. This has
been validated against the MILC implementation.
- Separated dslash_quda.cu into multiple files to allow for parallel
building to increase compilation speed.
- Added interface support for Luescher's chiral basis for fermion
fields: page 11 of doc/dirac.ps in the DD-HMC code package
http://luscher.web.cern.ch/luscher/DD-HMC. This is selected through
setting QudaInvertParam::gamma_basis = QUDA_CHIRAL_GAMMA_BASIS.
- QUDA will now complain and exit if it detects that a stale tunecache
is being used.
- Removed official support for obsolete compute capabilities 1.1 and
1.2. This makes the minimum supported device compute capability 1.3
(GT200).
- Multiple bug fixes that affect stability and correctness throughout
the library. Many of these fixes are listed here:
https://github.com/lattice/quda/issues?q=milestone%3A%22QUDA+0.7.0+%22+is%3Aclosed.
- Although not strictly related to this release, we have started to
collect common running settings and hints in the QUDA wiki:
https://github.com/lattice/quda/wiki.
Version 0.6.1 - 10th March 2014
- All unit tests now enable/disable CPU-side verification with the "--verify
true/false" flag. The default is true.
- The google test API is now used in some of the unit tests
(dslash_test, staggered_dslash_test and blas_test). (Eventually all
unit tests will be built using this.)
- Various bugs have been fixed in fermion_force_test,
hisq_paths_force_test, hisq_unitarize_force_test and
unitarize_link_test
Version 0.6.0 - 23rd January 2014
- Support for reconstruct 9/13 for the long link in HISQ fermions.
This provides up to a 25% speedup over using no reconstruction.
Owing to architecture constraints, reconstruct 9/13 is not supported
on "Tesla" architectures, and is only supported on superseding
architectures (Fermi, Kepler, etc.).
- Implemented the long link calculation for HISQ and asqtad fermions.
This has the net result of speeding up the gauge fattening by
about a factor 1.6x.
- Implemented a gauge field update routine that evolves the gauge
field by a given step size using a momentum field. This is exposed
as the function updateGaugeFieldQuda(...).
- Added support for qdpjit field ordering. When used in conjunction
with the device interface, this allows Chroma (when compiled using
qdpjit) to avoid all CPU <-> GPU transfers.
- Completely rewritten gauge and clover field copying routines using a
generic template-driven approach. Due to the large number of possible
input / output combinations, and to keep the compilation time under
control, the different interfaces need to be opted into at configure
time (MILC and QDP interfaces are enabled by default).
- The QUDA interface (loadGaugeQuda, loadCloverQuda, invertQuda and
invertMultishiftQuda) now supports device-side pointers as well as
host-side pointers. The location of a given pointer is set by the
QudaFieldLocation members of QudaGaugeParam (location) and
QudaInvertParam (input_location, output_location, clover_location).
- Added new interface support for QDPJIT ordered fields (dirac, clover
and gauge fields).
- When doing mixed-precision solvers, all low-precision copies of
gauge and clover fields are created from the pre-existing GPU copies
instead of re-copying from the CPU. This lowers the PCIe overhead
by up to 1.75x.
- Significantly improved performance of both degenerate and
non-degenerate twisted-mass CG solver (up to 17% and 32%, respectively).
- ColorSpinorField is now derived from LatticeField, with all
LatticeField derivations now using common page-locked and device memory
buffers. This has the effect of reducing the overall page-locked
memory footprint.
- The source vector is now scaled such that its norm is equal to
unity. This prevents underflow from occurring when the source vector
is too small.
- Fixed double precision definition of *= vector operator, which caused a
truncation to single precision for certain solver types.
- Fixed memory over-allocation when doing clover fermions in half precision.
- Memory leak fix to clover fermions.
- Added a workaround to allow QUDA to compile with GCC 4.7.x.
- Many small fixes and overall code cleanup.
Version 0.5.0 - 20 March 2013
- Added full support for CUDA 5.0, including the Tesla K20 and other
GK110 ("Kepler 2") GPUs. QUDA has yet to be fully optimized for
GK110, however.
- Added multi-GPU support for the domain wall action, to be further
optimized in a future release.
- Added official support for the QDP-JIT library, enabled via the
"--enable-qdp-jit" configure option. With the combination of QUDA
and QDP-JIT, Chroma runs almost entirely on the GPU.
- Added a fortran interface, found in include/quda_fortran.h and
lib/quda_fortran.F90.
- QUDA is now compatible with the Berlin QCD (BQCD) package,
supporting both Wilson and Clover solvers, including support for
multiple GPUs. This currently requires a specific branch of BQCD
(https://github.com/lattice/bqcd-r399-quda).
- Added a new interface function, initCommsGridQuda(), for declaring
the mapping of MPI ranks (or QMP node IDs) to the logical grid used
for communication. This finally completes the MPI interface, which
previously relied on an undocumented function internal to QUDA.
- Added a new interface function, setVerbosityQuda(), to allow for
finer-grained control of status reporting. See the description in
include/quda.h for details.
- Merged wilson_dslash_test and domain_wall_dslash_test together into
a unified dslash_test, and likewise for invert_test. The staggered
tests are still separate for now.
- Moved all internal symbols behind a namespace, "quda", for better
insulation from external applications and libraries.
- Vastly improved the stability and accuracy of the multi-shift CG
solver. The invertMultiShiftQuda() interface function now supports
mixed precision and implements per-shift refinement after the
multi-shift solver completes to ensure accuracy of the final result.
The old invertMultiShiftQudaMixed() interface function has been
removed. In addition, the multi-shift solver now supports setting
the convergence tolerance on a per-pole basis via the tol_offset[]
member of QudaInvertParam.
- Improved the stability and accuracy of mixed-precision CG. As a
result, mixed double/single CG yields a virtually identical iteration
count to pure double CG, and using half precision is now a win.
- Added support for the Fermilab heavy-quark residual as a stopping
condition in BiCGstab, CG, and GCR. To minimize the impact on
performance, the heavy-quark residual is only measured every 10
iterations (for BiCGstab and CG) or only when the solution is computed
(for GCR). This stopping condition has also been incorporated into the
sequential CG refinement stage of the multi-shift solver. The
tolerance for the heavy-quark residual is set via the "tol_hq"
member of QudaInvertParam (and "tol_hq_offset" for the
multi-shift solver). The "residual_type" member selects the
desired stopping condition(s): L2 relative residual, Fermilab
heavy-quark residual, or both. Note that the heavy-quark residual
is not supported on cards with compute capability 1.1, 1.2, or 1.3
(i.e., those predating the "Fermi" architecture) due to hardware
limitations.
- The value of the true residual(s) are now returned in the true_res
and (for multi-shift) true_res_offset members of the QudaInvertParam
struct. When using heavy quark residual stopping condition, the
true_res_hq and true_res_hq_offset members are additionally filled
with the heavy-quark residual value(s).
- The BiCGstab solver now supports an initial-guess strategy. This is
presently only supported when employing a one-pass solve and does
not yet work for a two-pass solve (e.g., of the normal equations).
- Enabled double-precision textures by default, since the Fermi
  double-precision instability has been fixed in the driver accompanying
  the CUDA 5.0 production release.
- Fixed a bug related to the sharing of page-locked (pinned) memory
between CUDA and Infiniband that affected correct operation of both
Chroma and MILC on some systems.
- Renamed the "QUDA_NORMEQ_SOLVE" solve_type to "QUDA_NORMOP_SOLVE",
and likewise for "QUDA_NORMOP_PC_SOLVE". This better reflects their
behavior, since a "NORMOP" solve will always involve the normal operator
(A^dag A) but might not correspond to solving the normal equations
of the original system.
- Fixed a long-standing issue so that solve_type and solution_type are
now interpreted as described in the NEWS entry for QUDA 0.3.0 below.
More specifically,
solution_type specifies *what* linear system is to be solved.
solve_type specifies *how* the linear system is to be solved.
We have the following four cases (plus preconditioned variants):
solution_type solve_type Effect
------------- ---------- ------
MAT DIRECT Solve Ax=b
MATDAG_MAT DIRECT Solve A^dag y = b, followed by Ax=y
MAT NORMOP Solve (A^dag A) x = (A^dag b)
MATDAG_MAT NORMOP Solve (A^dag A) x = b
An even/odd preconditioned (PC) solution_type generally requires a PC
solve_type and vice versa. As an exception, the un-preconditioned
MAT solution_type may be used with any solve_type, including
DIRECT_PC and NORMOP_PC.
As also noted in the entry for 0.3.0 below, with the CG inverter,
solve_type should generally be set to 'QUDA_NORMOP_PC_SOLVE',
which will solve the even/odd-preconditioned normal equations via
CGNR. (The full solution will be reconstructed if necessary based
on solution_type.) For BiCGstab (with Wilson or Wilson-clover
fermions), 'QUDA_DIRECT_PC_SOLVE' is generally best.
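  The CGNR idea behind a NORMOP solve can be sketched in a few lines of
  plain C (a toy 2x2 dense example, not QUDA code: CG requires a
  Hermitian positive-definite operator, so for a general matrix A we run
  CG on the normal operator A^T A with source A^T b, and the result also
  solves the original system A x = b):

      #include <math.h>
      #include <stdio.h>

      #define N 2

      static void matvec(const double M[N][N], const double v[N], double out[N])
      {
        for (int i = 0; i < N; i++) {
          out[i] = 0.0;
          for (int j = 0; j < N; j++) out[i] += M[i][j] * v[j];
        }
      }

      static double dot(const double u[N], const double v[N])
      {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += u[i] * v[i];
        return s;
      }

      /* The normal operator: out = A^T (A v). */
      static void normal_op(const double A[N][N], const double v[N], double out[N])
      {
        double tmp[N];
        matvec(A, v, tmp);
        for (int i = 0; i < N; i++) {
          out[i] = 0.0;
          for (int j = 0; j < N; j++) out[i] += A[j][i] * tmp[j];
        }
      }

      int main(void)
      {
        const double A[N][N] = {{4.0, 1.0}, {2.0, 3.0}};  /* non-symmetric */
        const double b[N] = {1.0, 2.0};

        /* Source for the normal equations: bn = A^T b. */
        double bn[N];
        for (int i = 0; i < N; i++) bn[i] = A[0][i] * b[0] + A[1][i] * b[1];

        /* Plain CG applied to (A^T A) x = A^T b. */
        double x[N] = {0.0, 0.0}, r[N], p[N], Ap[N];
        for (int i = 0; i < N; i++) { r[i] = bn[i]; p[i] = r[i]; }
        double rr = dot(r, r);
        for (int iter = 0; iter < 100 && rr > 1e-24; iter++) {
          normal_op(A, p, Ap);
          double alpha = rr / dot(p, Ap);
          for (int i = 0; i < N; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
          double rr_new = dot(r, r);
          for (int i = 0; i < N; i++) p[i] = r[i] + (rr_new / rr) * p[i];
          rr = rr_new;
        }

        /* x solves the original (unsymmetrized) system A x = b. */
        double Ax[N];
        matvec(A, x, Ax);
        printf("x = (%g, %g), residual = (%g, %g)\n",
               x[0], x[1], b[0] - Ax[0], b[1] - Ax[1]);
        return 0;
      }

  QUDA applies the same construction to the (even/odd-preconditioned)
  lattice Dirac operator rather than a dense matrix.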
- General cleanup and other minor fixes. See
https://github.com/lattice/quda/issues?milestone=7 for a breakdown
of all issues closed in this release.
Version 0.4.0 - 4 April 2012
- CUDA 4.0 or later is now required to build the library.
- The "make.inc.example" template has been replaced by a configure script.
See the README file for build instructions and "configure --help" for
a list of configure options.
- Emulation mode is no longer supported.
- Added support for using multiple GPUs in parallel via MPI or QMP.
This is supported by all solvers for the Wilson, clover-improved
Wilson, twisted mass, and improved staggered fermion actions.
Multi-GPU support for domain wall will be forthcoming in a future
release.
- Reworked auto-tuning so that BLAS kernels are tuned at runtime,
Dirac operators are also tuned, and tuned parameters may be cached
to disk between runs. Tuning is enabled via the "tune" member of
QudaInvertParam and is essential for achieving optimal performance
in the solvers. See the README file for details on enabling
caching, which avoids the overhead of tuning for all but the first
run at a given set of parameters (action, precision, lattice volume,
etc.).
- Added NUMA affinity support. Given a sufficiently recent Linux
kernel and a system with dual I/O hubs (IOHs), QUDA will attempt to
associate each GPU with the "closest" socket. This feature is
disabled by default under OS X and may be disabled under Linux via
the "--disable-numa-affinity" configure flag.
- Improved stability on Fermi-based GeForce cards by disabling double
precision texture reads. These may be re-enabled on Fermi-based
Tesla cards for improved performance, as described in the README
file.
- As of QUDA 0.4.0, support has been dropped for the very first
generation of CUDA-capable devices (implementing "compute
capability" 1.0). These include the Tesla C870, the Quadro FX 5600
and 4600, and the GeForce 8800 GTX.
- Added command-line options for most of the tests. See, e.g.,
"wilson_dslash_test --help"
- Added CPU reference implementations of all BLAS routines, which allows
tests/blas_test to check for correctness.
- Implemented various structural and performance improvements
throughout the library.
- Deprecated the QUDA_VERSION macro (which corresponds to an integer
in octal). Please use QUDA_VERSION_MAJOR, QUDA_VERSION_MINOR, and
QUDA_VERSION_SUBMINOR instead.
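  The pitfall that motivated the deprecation is that a C integer literal
  with leading zeros is octal, so the "two digits per component" encoding
  does not compare numerically the way its decimal digits suggest.  For
  example, with the 0.2.4 definition quoted in the entry for that release
  below:

      #include <stdio.h>

      #define QUDA_VERSION 000204  /* leading zero => octal literal */

      int main(void)
      {
        /* The literal's numeric value is 2*64 + 0*8 + 4 = 132, not 204. */
        printf("QUDA_VERSION = %d\n", (int)QUDA_VERSION);  /* prints 132 */

        /* Recovering the components requires octal digit pairs: */
        int subminor = QUDA_VERSION & 077;         /* low two octal digits */
        int minor    = (QUDA_VERSION >> 6) & 077;  /* next two octal digits */
        int major    = QUDA_VERSION >> 12;
        printf("%d.%d.%d\n", major, minor, subminor);     /* prints 0.2.4 */
        return 0;
      }

  The separate QUDA_VERSION_MAJOR/MINOR/SUBMINOR macros avoid this
  ambiguity entirely.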
Version 0.3.2 - 18 January 2011
- Fixed a regression in 0.3.1 that prevented the BiCGStab solver from
working correctly with half precision on Fermi.
Version 0.3.1 - 22 December 2010
- Added support for domain wall fermions. The length of the fifth
dimension and the domain wall height are set via the 'Ls' and 'm5'
members of QudaInvertParam. Note that the convention is to include
the minus sign in m5 (e.g., m5 = -1.8 would be a typical value).
- Added support for twisted mass fermions. The twisted mass parameter
and flavor are set via the 'mu' and 'twist_flavor' members of
QudaInvertParam. Similar to clover fermions, both symmetric and
asymmetric even/odd preconditioning are supported. The symmetric
case is better optimized and generally also exhibits faster
convergence.
- Improved performance in several of the BLAS routines, particularly
on Fermi.
- Improved performance in the CG solver for Wilson-like (and domain
wall) fermions by avoiding unnecessary allocation and deallocation
of temporaries, at the expense of increased memory usage. This will
be improved in a future release.
- Made building of the individual Dirac operators optional (selected in
  make.inc) to keep build time in check.
- Added declaration for MatDagMatQuda() to the quda.h header file and
removed the non-existent functions MatPCQuda() and
MatPCDagMatPCQuda(). The latter two functions have been absorbed
into MatQuda() and MatDagMatQuda(), respectively, since
preconditioning may be selected via the solution_type member of
QudaInvertParam.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
prevented the use of MatPC solution types.
- Fixed a bug in the Wilson and Wilson-clover Dirac operators that
would cause a crash when QUDA_MASS_NORMALIZATION is used.
- Fixed an allocation bug in the Wilson and Wilson-clover
Dirac operators that might have led to undefined behavior for
non-zero padding.
- Fixed a bug in blas_test that might have led to incorrect autotuning
for the copyCuda() routine.
- Various internal changes: removed temporary cudaColorSpinorField
argument to solver functions; modified blas functions to use C++
complex<double> type instead of cuDoubleComplex type; improved code
hygiene by ensuring that all textures are bound in dslash_quda.cu
and unbound after kernel execution; etc.
Version 0.3.0 - 1 October 2010
- CUDA 3.0 or later is now required to build the library.
- Several changes have been made to the interface that require setting
new parameters in QudaInvertParam and QudaGaugeParam. See below for
details.
- The internals of QUDA have been significantly restructured to facilitate
future extensions. This is an ongoing process and will continue
through the next several releases.
- The inverters might require more device memory than they did before.
This will be corrected in a future release.
- The CG inverter now supports improved staggered fermions (asqtad or
HISQ). Code has also been added for asqtad link fattening, the asqtad
fermion force, and the one-loop improved Symanzik gauge force, but
these are not yet exposed through the interface in a consistent way.
- A multi-shift CG solver for improved staggered fermions has been
added, callable via invertMultiShiftQuda(). This function does not
yet support Wilson or Wilson-clover.
- It is no longer possible to mix different precisions for the
spinors, gauge field, and clover term (where applicable). In other
words, it is required that the 'cuda_prec' member of QudaGaugeParam
match both the 'cuda_prec' and 'clover_cuda_prec' members of
QudaInvertParam, and likewise for the "sloppy" variants. This
change has greatly reduced the time and memory required to build the
library.
- Added 'solve_type' to QudaInvertParam. This determines how the linear
system is solved, in contrast to solution_type which determines what
system is being solved. When using the CG inverter, solve_type should
generally be set to 'QUDA_NORMEQ_PC_SOLVE', which will solve the
even/odd-preconditioned normal equations via CGNR. (The full
solution will be reconstructed if necessary based on solution_type.)
For BiCGStab, 'QUDA_DIRECT_PC_SOLVE' is generally best. These choices
correspond to what was done by default in earlier versions of QUDA.
- Added 'dagger' option to QudaInvertParam. If 'dagger' is set to
QUDA_DAG_YES, then the matrices appearing in the chosen solution_type
will be conjugated when determining the system to be solved by
invertQuda() or invertMultiShiftQuda(). This option must also be set
(typically to QUDA_DAG_NO) before calling dslashQuda(), MatPCQuda(),
MatPCDagMatPCQuda(), or MatQuda().
- Eliminated 'dagger' argument to dslashQuda(), MatPCQuda(), and MatQuda()
in favor of the new 'dagger' member of QudaInvertParam described above.
- Removed the unused blockDim and blockDim_sloppy members from
QudaInvertParam.
- Added 'type' parameter to QudaGaugeParam. For Wilson or Wilson-clover,
this should be set to QUDA_WILSON_LINKS.
- The dslashQuda() function now takes an argument of type
  QudaParityType to determine the parity (even or odd) of the output
  spinor. This was previously specified by an integer.
- Added support for loading all elements of the gauge field matrices,
without SU(3) reconstruction. Set the 'reconstruct' member of
QudaGaugeParam to 'RECONSTRUCT_NO' to select this option, but note
that it should not be combined with half precision unless the
elements of the gauge matrices are bounded by 1. This restriction
will be removed in a future release.
- Renamed dslash_test to wilson_dslash_test, renamed invert_test to
wilson_invert_test, and added staggered variants of these test
programs.
- Improved performance of the half-precision Wilson Dslash.
- Temporarily removed 3D Wilson Dslash.
- Added an 'OS' option to make.inc.example, to simplify compiling for
Mac OS X.
Version 0.2.5 - 24 June 2010
- Fixed regression in 0.2.4 that prevented the library from compiling
when GPU_ARCH was set to sm_10, sm_11, or sm_12.
Version 0.2.4 - 22 June 2010
- Added initial support for CUDA 3.x and Fermi (not yet optimized).
- Incorporated look-ahead strategy to increase stability of the BiCGStab
inverter.
- Added definition of QUDA_VERSION to quda.h. This is an integer with
two digits for each of the major, minor, and subminor version
numbers. For example, QUDA_VERSION is 000204 for this release.
Version 0.2.3 - 2 June 2010
- Further improved performance of the blas routines.
- Added 3D Wilson Dslash in anticipation of temporal preconditioning.
Version 0.2.2 - 16 February 2010
- Fixed a bug that prevented reductions (and hence the inverter) from working
correctly in emulation mode.
Version 0.2.1 - 8 February 2010
- Fixed a bug that would sometimes cause the inverter to fail when spinor
padding is enabled.
- Significantly improved performance of the blas routines.
Version 0.2 - 16 December 2009
- Introduced new interface functions newQudaGaugeParam() and
newQudaInvertParam() to allow for enhanced error checking. See
invert_test for an example of their use.
- Added auto-tuning blas to improve performance (see README for details).
- Improved stability of the half precision 8-parameter SU(3)
reconstruction (with thanks to Guochun Shi).
- Cleaned up the invert_test example to remove unnecessary dependencies.
- Fixed bug affecting saveGaugeQuda() that caused su3_test to fail.
- Tuned parameters to improve performance of the half-precision clover
Dslash on sm_13 hardware.
- Formally adopted the MIT/X11 license.
Version 0.1 - 17 November 2009
- Initial public release.