% Build this document with LuaTeX, a modern Unicode-aware LaTeX engine
% that uses system TTF and OTF font files.
% This is needed for the fontspec, microtype, and nolig packages.
\newcommand{\bodyfontsize}{10bp}
\newcommand{\bodybaselineskip}{12bp}
\newif\ifebook
% Uncomment the line below to format as an ebook
% \ebooktrue
% We're using KOMA Script to hand-tune footnotes and TOC appearance.
% It should be available in your texlive distribution,
% which is how most distros package LaTeX.
\documentclass[fontsize=\bodyfontsize, numbers=endperiod]{scrartcl}
% Margins: see http://practicaltypography.com/page-margins.html and
% http://practicaltypography.com/line-length.html
% We're aiming for 80-ish characters per line.
\ifebook
\usepackage[
papersize={4.5in,6in},
footnotesep=\bodybaselineskip,
left=0.5in,right=0.5in,top=0.75in,bottom=0.75in,
]{geometry}
\else
\usepackage[
letterpaper,
footnotesep=\bodybaselineskip,
left=0.75in,right=0.75in,top=1in,bottom=1in,
]{geometry}
\fi
% Font specification.
% Use Equity for the body text,
% Neue Haas Grotesk for sans serif,
% and mononoki for monospace text.
% Feel free to pick your own fonts or comment these lines out to use TeX's
% traditional Computer Modern.
\usepackage[no-math]{fontspec}
\usepackage[fleqn]{amsmath}
\setmainfont[
Ligatures=TeX,
Numbers=Lowercase,
SmallCapsFeatures={StylisticSet=10}, % Set everything in \textsc{} as small caps
StylisticSet=01, % Slightly smaller quotes, asterisks, etc.
UprightFeatures={
SizeFeatures={
{Size={-12},Font=Equity Text A, SmallCapsFont=Equity Caps A},
{Size={12-},Font=Equity Text B, SmallCapsFont=Equity Caps B},
}
},
BoldFeatures={
SizeFeatures={
{Size={-12},Font=Equity Text A Bold, SmallCapsFont=Equity Caps A Bold},
{Size={12-},Font=Equity Text B Bold, SmallCapsFont=Equity Caps B Bold},
}
},
ItalicFeatures={
SizeFeatures={
{Size={-12},Font=Equity Text A Italic},
{Size={12-},Font=Equity Text B Italic},
}
},
BoldItalicFeatures={
SizeFeatures={
{Size={-12},Font=Equity Text A Bold Italic},
{Size={12-},Font=Equity Text B Bold Italic},
}
},
]{Equity Text A}
\setsansfont[Ligatures=TeX,
Style=Alternate, % Straight-legged R
UprightFont = *-55Rg,
ItalicFont = *-56It,
BoldFont = *-65Md,
BoldItalicFont = *-66MdIt
]{NHaasGroteskTXPro}
\setmonofont[Scale=MatchLowercase]{mononoki}
\usepackage[italic]{mathastext}
\usepackage{polyglossia}
\setdefaultlanguage[variant=american]{english}
\usepackage{microtype} % Font expansion, protrusion, and other goodness
% Disable ligatures across grapheme boundaries
% (see the package manual for details.)
\usepackage[english]{selnolig}
% Use symbols for footnotes, resetting each page
\usepackage[perpage,bottom,symbol*]{footmisc}
% Left flush footnotes. See the KOMA Script manual.
\deffootnote[1em]{1em}{1em}{\thefootnotemark}
% Set the width of the rule separating body text and footnotes
\setfootnoterule{0.7\textwidth}
% Like many fonts, Equity's asterisk is already set in a "superscripted" form.
% Superscripting *that* makes it annoyingly small.
% To fix this, we have to redefine footnote marks so that they aren't superscript,
% then raise all the other symbols.
%
% Feel free to remove this if your body type doesn't have this peculiarity,
% but unfortunately many do.
% See http://tex.stackexchange.com/a/16241
%
% We use the Unicode symbols themselves (instead of \dagger, \ddagger, \P, etc.)
% because the latter fall back to Computer Modern/Latin Modern in some cases,
% (e.g., if you're using mathastext instead of unicode-math).
% Alternatively, you could use \textdagger, \textddagger, etc.,
% but this seems more concise.
\DefineFNsymbols*{tweaked}{%
{*}%
{\textsuperscript†}%
{\textsuperscript‡}%
{\textsuperscript{◊}}%
{\textsuperscript{¶}}%
{**}%
{\textsuperscript{††}}%
{\textsuperscript{‡‡}}%
}
\setfnsymbol{tweaked}
\deffootnotemark{\thefootnotemark}
% Use the tocstyle package to give all ToC items sans serif font.
% We also want our monospace font to match the uppercase height of the sans serif.
% (Matching x-heights make it quite small in comparison.)
% Note: tocstyle is currently in alpha, so feel free to tweak this as needed.
% At time of writing, it seems to be the simplest way to get the intended effect.
\usepackage{tocstyle}
\settocfeature{pagenumberhook}{\addfontfeature{Numbers=Tabular}}
\settocfeature{entryvskip}{2pt}
% Don't use a sans font for description labels.
\addtokomafont{descriptionlabel}{\rmfamily\mdseries}
\setkomafont{disposition}{\rmfamily}
\setkomafont{section}{\large\itshape}
\setkomafont{subsection}{\normalsize\itshape}
\renewcommand*\thesection{\upshape\arabic{section}}
% Use uppercase numbers for numbered lists.
% (We're using lowercase ones for the body text.)
% See http://tex.stackexchange.com/a/133186
\usepackage{enumitem}
\setlist[enumerate]{font=\addfontfeatures{Numbers=LowercaseOff}}
% Custom footer
\usepackage{scrlayer-scrpage}
\setkomafont{pagefoot}{\sffamily\upshape}
\pagestyle{scrheadings}
\usepackage{minted} % Syntax highlighting via Pygments
\usepackage{graphicx}
\usepackage[font=footnotesize,justification=raggedright]{caption}
\usepackage{tikz} % Duck and cover.
\newcommand{\codesize}{\fontsize{\bodyfontsize}{\bodybaselineskip}}
% Syntax highlighting for ARM asm (minted doesn't do this well)
\usepackage{listings}
\lstset{
basicstyle=\ttfamily\codesize\selectfont,
keywordstyle=\color{darkGreen}\bfseries,
commentstyle=\textcolor[rgb]{0.25,0.50,0.50}
}
% listings definitions for ARM assembly.
% Get them from https://github.com/frosc/arm-assembler-latex-listings,
% install as shown at http://tex.stackexchange.com/a/1138/92465
\usepackage{lstlangarm} % See above
\usepackage{changepage} % For adjustwidth
\usepackage{metalogo} % for \LuaLaTeX
\ifebook
\else
\usepackage{multicol}
\fi
\setlength\columnsep{2em}
%\setlength\parskip{0em}
\setlength\parindent{1.5em}
\title{What every systems programmer should know about concurrency}
\author{Matt Kline}
\date{April 28, 2020}
% Custom footer
% Hyperlinks
\usepackage[unicode,pdfusetitle]{hyperref}
\usepackage{xcolor}
\definecolor{darkGreen}{HTML}{008000}
\hypersetup{
colorlinks=true, % Use colors
linkcolor=violet, % Intra-doc links
urlcolor=blue % URLs are blue
}
% Use \punckern to overlap periods, commas, and footnote markers
% for a tighter look.
% Care should be taken to not make it too tight - f" and the like can overlap
% if you're not careful.
\newcommand{\punckern}{\kern-0.4ex}
% For placing commas close to, or under, quotes they follow.
% We're programmers, and we blatantly disregard American typographical norms
% to put the quotes inside, but we can at least make it look a bit nicer.
\newcommand{\quotekern}{\kern-0.5ex}
% Create an unbreakable string of text in a monospaced font.
% Useful for `command --line --args`
\newcommand{\monobox}[1]{\mbox{\texttt{#1}}}
\newcommand{\keyword}[1]{\monobox{\color{darkGreen}#1}}
% C++ looks nicer if the ++ is in a monospace font and raised a bit.
% Also, use uppercase numbers to match the capital C.
\newcommand{\cpp}[1]{C\kern-0.1ex\raisebox{0.15ex}{\texttt{++}}{\addfontfeature{Numbers=LowercaseOff}#1}}
\newcommand{\clang}[1]{C{\addfontfeature{Numbers=LowercaseOff}#1}}
\newcommand{\csharp}{C\raisebox{0.25ex}{\#}}
\newcommand{\fig}[1]{Figure~\ref{#1}}
% Italicize new terms
\newcommand{\introduce}[1]{\textit{#1}}
\newcommand{\secref}[1]{\hyperref[#1]{\textsc{\S}\ref*{#1}}}
% Feel free to futz with the spacing above and below these.
% (Given the number of code snippets and figures,
% they can really make or break the spacing of the paragraphs.
% Keep in mind that TeX only lays out a page at a time,
% so you need to more or less look at how each page is individually handled.
\newenvironment{colfigure}
{\par\vspace{1\baselineskip minus 0.5\baselineskip}\noindent\minipage{\linewidth}}
{\endminipage\vspace{1\baselineskip minus 0.7\baselineskip}}
% See http://tex.stackexchange.com/a/68310
\makeatletter
\let\runauthor\@author
\let\rundate\@date
\let\runtitle\@title
\makeatother
% Spend a bit more time to get better word spacing.
% See http://tex.stackexchange.com/a/52855/92465
\emergencystretch=1em
% Consider more material for a given page, which helps a bit with footnotes.
% See the multicol manual.
\ifebook
\else
\setcounter{collectmore}{2}
\fi
\begin{document}
% Custom title instead of \maketitle
\begin{center}
\fontspec[Ligatures=TeX, StylisticSet=1]{NHaasGroteskDsPro-55Rg}\Large \runtitle
\bigskip
\large
\runauthor
\smallskip
\normalsize
\rundate
\end{center}
\bigskip
\ifebook
\else
\begin{adjustwidth}{1in}{1in}
\fi
\begin{center}
\large \bfseries\itshape Abstract
\end{center}
\smallskip
\noindent
Systems programmers are familiar with tools like mutexes, semaphores,
and condition variables.
But how do they work?
How do we write concurrent code when they're not available,
like when we're working below the operating system in an embedded environment,
or when we can't block due to hard time constraints?
And since your compiler and hardware conspire to turn your code into things
you didn't write, running in orders you never asked for,
how do multithreaded programs work at all?
Concurrency is a complicated and unintuitive topic,
but let's try to cover some fundamentals.
\bigskip
\tableofcontents
\ifebook
\else
\end{adjustwidth}
\fi
\medskip
\ifebook
\else
\begin{multicols}{2}
\fi
\section{Background}
\label{background}
Modern computers run many instruction streams concurrently.
On single-core machines, they take turns,
sharing the \textsc{cpu} in short slices of time.
On multi-core machines, several can run in parallel.
We call them many names---processes, threads, tasks,
interrupt service routines, and more---but most of the same principles apply
across the board.
While computer scientists have built lots of great abstractions,
these instruction streams
(let's call them all \emph{threads} for the sake of brevity)
ultimately interact by sharing bits of state.
For this to work, we need to understand the order in which
threads read and write to memory.
Consider a simple example where thread \textit{A} shares
an integer with others.
It writes the integer to some variable,
then sets a flag to instruct other threads to read whatever it just stored.
As code, this might resemble:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
int v;
bool v_ready = false;

void threadA()
{
    // Write the value
    // and set its ready flag.
    v = 42;
    v_ready = true;
}
\end{minted}
\end{colfigure}
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
void threadB()
{
    // Await a value change and read it.
    while (!v_ready) { /* wait */ }
    const int my_v = v;
    // Do something with my_v...
}
\end{minted}
\end{colfigure}
We need to make sure that other threads only observe \textit{A's} write to
\texttt{v\_ready}
\emph{after A's} write to \texttt{v}.
(If another thread can ``see'' \texttt{v\_ready} become true before it sees
\texttt{v} become 42, this simple scheme won't work.)
You would think it's trivial to guarantee this order,
but nothing is as it seems.
For starters, any optimizing compiler will
rewrite your code to run faster on the hardware
it's targeting.
So long as the resulting instructions run to the same effect
\emph{for the current thread},
reads and writes can be moved to avoid pipeline stalls\footnote{%
Most \textsc{cpu} designs execute parts of several instructions in parallel
to increase their throughput (see \fig{pipeline}).
When the result of one instruction is needed by
a subsequent instruction in the pipeline, the \textsc{cpu} may need
to suspend forward progress, or \introduce{stall}, until that result is ready.}
or improve locality.\punckern\footnote{%
\textsc{ram} is not read in single bytes, but in chunks called
\introduce{cache lines}.
If variables that are used together can be placed on the same cache line,
they will be read and written all at once.
This usually provides a massive speedup,
but as we'll see in \secref{false-sharing},
can bite us when a line must be shared between cores.
}
Variables can be assigned to the same memory location if they're never used
at the same time.
Calculations can be made speculatively, before a branch is taken,
then ignored if the compiler guessed incorrectly.\punckern\footnote{This is
especially common when using profile-guided optimization.}
%(These sorts of optimizations sometimes called the ``as-if'' rule in \cpp{}.)
Even if the compiler didn't change our code,
we'd still be in trouble, since our hardware does it too!
A modern \textsc{cpu} processes instructions in
a \emph{much} more complicated fashion than traditional pipelined approaches
like the one shown in \fig{pipeline}.
It contains many data paths, each for different types of instructions,
and schedulers which reorder and route instructions through these paths.
\begin{colfigure}
\centering
\includegraphics[keepaspectratio,width=0.7\linewidth]{pipeline}
\captionof{figure}{A traditional five-stage \textsc{cpu} pipeline
with fetch, decode, execute, memory access, and write-back stages.
Modern designs are much more complicated,
often reordering instructions on the fly.
\\ Image courtesy of
\href{https://commons.wikimedia.org/wiki/File:Fivestagespipeline.png}{Wikipedia}.}
\label{pipeline}
\end{colfigure}
It's also easy to make naïve assumptions about how memory works.
If we imagine a multi-core processor,
we might think of something resembling \fig{ideal-machine},
where each core takes turns performing reads and writes to the system's memory.
\begin{colfigure}
\centering
\includegraphics[keepaspectratio, width=0.8\linewidth]{ideal-machine}
\captionof{figure}{An idealized multi-core processor where cores
take turns accessing a single shared set of memory.}
\label{ideal-machine}
\end{colfigure}
But the world isn't so simple.
While processor speeds have increased exponentially over the past decades,
\textsc{ram} hasn't been able to keep up,
creating an ever-widening gulf between the time it takes to run an
instruction and the time needed to retrieve its data from memory.
Hardware designers have compensated by placing a growing number of
hierarchical caches directly on the \textsc{cpu} die.
Each core also usually has a \introduce{store buffer} that handles
pending writes while subsequent instructions are executed.
Keeping this memory system \introduce{coherent},
so that writes made by one core are observable by others,
even if those cores use different caches,
is quite challenging.
\begin{colfigure}
\centering
\includegraphics[keepaspectratio, width=\linewidth]{actual-machine}
\captionof{figure}{A common memory hierarchy for modern multi-core processors}
\label{dunnington}
\end{colfigure}
All of these complications mean that there is no consistent
concept of ``now'' in a multithreaded program,
especially on a multi-core \textsc{cpu}.
Creating some sense of order between threads
is a team effort of the hardware, the compiler,
the programming language, and your application.
Let's explore what we can do,
and what tools we will need.
\section{Enforcing law and order}
\label{seqcst}
Creating order in multithreaded programs requires different approaches on
each \textsc{cpu} architecture.
For many years, systems languages like C and \cpp{}
had no notion of concurrency,
forcing developers to use assembly or compiler extensions.
This was finally fixed in 2011, when both languages' \textsc{iso} standards
added synchronization tools.
So long as you use them correctly,
the compiler will prevent any reorderings---both by its own optimizer,
and by the \textsc{cpu}---that cause data races.\punckern\footnote{The
ISO~\clang{11} standard lifted its concurrency facilities,
almost verbatim, from the \cpp{11} standard. Everything you see here
should be identical in both languages, barring some arguably cleaner syntax in
\cpp{}.}
Let's try our previous example again.
For it to work, the ``ready'' flag needs to use an \introduce{atomic type}.
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
int v = 0;
std::atomic_bool v_ready(false);

void threadA()
{
    v = 42;
    v_ready = true;
}
\end{minted}
\end{colfigure}
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
void threadB()
{
    while (!v_ready) { /* wait */ }
    const int my_v = v;
    // Do something with my_v...
}
\end{minted}
\end{colfigure}
The C and \cpp{} standard libraries define a series of these types
in \mintinline{c}{<stdatomic.h>} and \mintinline{cpp}{<atomic>}, respectively.
They look and act just like the integer types they mirror
(e.g., \monobox{bool}~\textrightarrow~\monobox{atomic\_bool},
\monobox{int}~\textrightarrow~\monobox{atomic\_int}, etc.),
but the compiler ensures that other variables' loads and stores aren't
reordered around theirs.
Informally, we can think of atomic variables
as rendezvous points for threads.
By making \monobox{v\_ready} atomic,
\monobox{v = 42}\, is now guaranteed to happen before
\monobox{v\_ready = true}\, in thread~\textit{A},
just as \monobox{my\_v = v}\, must happen after reading \monobox{v\_ready}\,
in thread~\textit{B}.
Formally, atomic types establish a \textit{single total modification order}
where,
``[\ldots] the result of any execution is the same as if the reads and writes
occurred in some order, and the operations of each individual
processor appear in this sequence in the order specified by its program.''
This model, defined by Leslie Lamport in 1979,
is called \introduce{sequential consistency}.
\section{Atomicity}
\label{atomicity}
But order is only one of the vital ingredients for inter-thread communication.
The other is what atomic types are named for: atomicity.
Something is \introduce{atomic} if it cannot be divided into smaller parts.
If threads don't use atomic reads and writes to share data, we're still in trouble.
Consider a program with two threads.
One processes a list of files,
incrementing a counter each time it finishes working on one.
The other handles the user interface, periodically reading
the counter to update a progress bar.
If that counter is a 64-bit integer, we can't access it atomically on 32-bit
machines, since we need two loads or stores to read or write the entire value.
If we're particularly unlucky,
the first thread could be halfway through writing the counter
when the second thread reads it, receiving garbage.
These unfortunate occasions are called \introduce{torn reads and writes}.
If reads and writes to the counter are atomic, however,
our problem disappears.
We can see that, compared to the difficulties of establishing the right order,
atomicity is fairly straightforward:
just make sure that any variables used for thread synchronization
are no larger than the \textsc{cpu} word size.
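For counters that must be larger than a word,
wrapping the variable in an atomic type makes every load and store transfer
the whole value. As a sketch of the file-processing example
(the names here are illustrative, not from any particular codebase):
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
#include <atomic>
#include <cstdint>

// The worker's counter, now safe from
// torn reads and writes on any machine.
std::atomic<uint64_t> files_processed(0);

void workerFinishedFile()
{
    // Only one thread writes the counter,
    // so an atomic load followed by an
    // atomic store suffices here.
    files_processed = files_processed + 1;
}

uint64_t progressSnapshot()
{
    // The UI thread always reads a
    // whole, untorn value.
    return files_processed;
}
\end{minted}
\end{colfigure}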
\section{Arbitrarily-sized “atomic” types}
Along with \texttt{atomic\_int} and friends,
\cpp{} provides the template \texttt{std::atomic<T>} for defining
arbitrary atomic types. C, lacking a similar
language feature but wanting to provide the same functionality,
added an \keyword{\_Atomic} keyword.
If \texttt{T} is larger than the machine's word size,
the compiler and the language runtime automatically
surround the variable's reads and writes with locks.
If you want to make sure this isn't happening,\punckern\footnote{\ldots
which is most of the time, since we're usually using atomic operations to
avoid locks in the first place.}
you can check with:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
std::atomic<Foo> bar;
ASSERT(bar.is_lock_free());
\end{minted}
\end{colfigure}
In most cases,\punckern\footnote{The language standards
permit atomic types to be \emph{sometimes} lock-free.
This might be necessary for architectures that
don't guarantee atomicity for unaligned reads and writes.}
this information is known at compile time.
Consequently, \cpp{17} added \texttt{is\_always\_lock\_free}:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
static_assert(
    std::atomic<Foo>::is_always_lock_free);
\end{minted}
\end{colfigure}
\section{Read-modify-write}
\label{rmw}
Loads and stores are all well and good,
but sometimes we need to read a value, modify it,
and write it back as a single atomic step.
There are a few common \introduce{read-modify-write} \textsc{(rmw)} operations.
In \cpp{}, they're represented as member functions of \texttt{std::atomic<T>}.
In \clang, they're freestanding functions.
\subsection{Exchange}
\label{exchange}
The simplest atomic \textsc{rmw} operation is an \introduce{exchange}:
the current value is read and replaced with a new one.
To see where this might be useful,
let's tweak our example from \secref{atomicity}:
instead of displaying the total number of processed files,
the \textsc{ui} might want to show how many were processed per second.
We could implement this by having the \textsc{ui} thread read the counter then
zero it each second.
But we could get the following race condition
if reading and zeroing are separate steps:
\begin{enumerate}
\item The \textsc{ui} thread reads the counter.
\item Before the \textsc{ui} thread has the chance to zero it,
the worker thread increments it again.
\item The \textsc{ui} thread now zeroes the counter, and the previous increment
is lost.
\end{enumerate}
If the \textsc{ui} thread atomically exchanges the current value with zero,
the race disappears.
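A sketch of that fix, reusing the counter from before:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
#include <atomic>

std::atomic<int> counter(0);

// Called by the UI thread once per second:
int filesPerSecond()
{
    // Read the current count and reset it
    // to zero in one atomic step, so no
    // increment can slip in between.
    return counter.exchange(0);
}
\end{minted}
\end{colfigure}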
\subsection{Test and set}
\introduce{Test-and-set} works on a Boolean value:
we read it, set it to \mintinline{cpp}{true}, and provide the value it
held beforehand.
C and \cpp{} offer a type dedicated to this purpose, called \monobox{atomic\_flag}.
We could use it to build a simple spinlock:
\label{spinlock}
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
std::atomic_flag af;

void lock()
{
    while (af.test_and_set()) { /* wait */ }
}

void unlock() { af.clear(); }
\end{minted}
\end{colfigure}
If we call \mintinline{cpp}{lock()} and the previous value is
\mintinline{cpp}{false}, we are the first to acquire the lock,
and can proceed with exclusive access to whatever the lock protects.
If the previous value is \mintinline{cpp}{true},
someone else has acquired the lock and we must
wait until they release it by clearing the flag.
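A sketch of the spinlock guarding some shared data follows.
(One hedge: before \cpp{20}, a \monobox{std::atomic\_flag} should be
initialized with \monobox{ATOMIC\_FLAG\_INIT} to guarantee it starts clear.)
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
#include <atomic>

std::atomic_flag af = ATOMIC_FLAG_INIT;
int sharedData = 0; // Protected by the lock

void lock()
{
    while (af.test_and_set()) { /* wait */ }
}
void unlock() { af.clear(); }

void increment()
{
    lock();
    ++sharedData; // Exclusive access here.
    unlock();
}
\end{minted}
\end{colfigure}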
\subsection{Fetch and…}
We can also read a value,
perform some simple operation on it
(addition, subtraction, bitwise \textsc{and}, \textsc{or}, \textsc{xor}),
and return its previous value---all as one atomic operation.
You might have noticed in the exchange example that
the worker thread's additions must also be atomic,
or else we could get a race where:
\begin{enumerate}
\item The worker thread loads the current counter value and adds one.
\item Before that thread can store the value back,
the \textsc{ui} thread zeroes the counter.
\item The worker now performs its store, as if the counter had never been cleared.
\end{enumerate}
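With a fetch-and-add, the worker's increment becomes a single indivisible
step. A sketch, reusing the counter from the exchange example:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
#include <atomic>

std::atomic<int> counter(0);

void workerFinishedFile()
{
    // Load, add one, and store, all as one
    // atomic operation. (counter++ on an
    // atomic type does the same thing.)
    counter.fetch_add(1);
}
\end{minted}
\end{colfigure}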
\subsection{Compare and swap}
\label{cas}
Finally, we have \introduce{compare-and-swap} \textsc{(cas)},
sometimes called \introduce{compare-and-exchange}.
It allows us to conditionally exchange a value \emph{if} its previous value
matches some expected one.
In \clang{} and \cpp{}, \textsc{cas} resembles the following,
except that the entire function executes as one atomic step:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
template <typename T>
bool atomic<T>::compare_exchange_strong(
    T& expected, T desired)
{
    if (*this == expected) {
        *this = desired;
        return true;
    }
    else {
        expected = *this;
        return false;
    }
}
\end{minted}
\end{colfigure}
\begin{samepage}
\noindent You might be perplexed by the \texttt{\_strong} suffix.
Is there a ``weak'' \textsc{cas}?
Yes, but hold onto that thought---we'll talk about it in
\secref{spurious-ll/sc-failures}.
\end{samepage}
Let's say we have some long-running task that we might want to cancel.
We'll give it three states: \textit{idle}, \textit{running},
and \textit{cancelled}, and write a loop that exits when
it is cancelled.
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
enum class TaskState : int8_t {
    Idle, Running, Cancelled
};

std::atomic<TaskState> ts;

void taskLoop()
{
    ts = TaskState::Running;
    while (ts == TaskState::Running) {
        // Do good work.
    }
}
\end{minted}
\end{colfigure}
If we want to cancel the task if it's running, but do nothing if it's idle,
we could \textsc{cas}:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
bool cancel()
{
    auto expected = TaskState::Running;
    return ts.compare_exchange_strong(
        expected, TaskState::Cancelled);
}
\end{minted}
\end{colfigure}
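A quick sketch of those semantics in action---\monobox{cancel()}
succeeds only when the task was actually running:
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
#include <atomic>
#include <cstdint>

enum class TaskState : int8_t {
    Idle, Running, Cancelled
};

std::atomic<TaskState> ts(TaskState::Idle);

bool cancel()
{
    auto expected = TaskState::Running;
    // Fails (and does nothing) unless
    // ts currently holds Running.
    return ts.compare_exchange_strong(
        expected, TaskState::Cancelled);
}
\end{minted}
\end{colfigure}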
\section{Atomic operations as building blocks}
Atomic loads, stores, and \textsc{rmw} operations are the building
blocks for every single concurrency tool.
It's useful to split those tools into two camps:
\introduce{blocking} and \introduce{lockless}.
Blocking synchronization methods are usually simpler to reason about,
but they can make threads pause for arbitrary amounts of time.
For example, consider a mutex,
which forces threads to take turns accessing shared data.
If some thread locks the mutex
and another tries to do the same,
the second thread must wait---or \introduce{block}---until
the first thread releases the lock,
however long that may be.
Blocking mechanisms are also susceptible to \introduce{deadlock} and
\introduce{livelock}---bugs where the entire system ``gets stuck''
due to threads waiting for each other.
In contrast, lockless synchronization methods
ensure that the program is always making forward progress.
These are \introduce{non-blocking}
since no thread can cause another to wait indefinitely.
Consider a program that streams audio,
or an embedded system where a sensor triggers an interrupt service routine
\textsc{(isr)} when new data arrives.
We want lock-free algorithms and data structures in these situations,
since blocking could break them.
(In the first case, the user's audio will begin to stutter if sound data
isn't provided at the bitrate it is consumed.
In the second, subsequent sensor inputs could be missed if the \textsc{isr}
does not complete as quickly as possible.)
It's important to point out that lockless algorithms are not somehow better
or faster than blocking ones---they are just different tools
designed for different jobs.
We should also note that algorithms aren't automatically lock-free just because
they only use atomic operations.
Our primitive spinlock from \secref{spinlock} is still a blocking
algorithm even though it doesn't use any \textsc{os}-provided syscalls to
put the blocked thread to sleep.\punckern\footnote{Putting a blocked thread
to sleep is often an optimization,
since the operating system's scheduler can run other threads on the \textsc{cpu}
until the sleeping one is unblocked.
Some concurrency libraries even offer hybrid locks which spin briefly,
then sleep.
(This avoids the cost of context switching away from the current thread if it
is blocked for less than the spin length, but avoids wasting \textsc{cpu}
time in a long-running loop.)}
Of course, there are situations where either blocking
or lockless approaches would work.\punckern\footnote{You
may also hear of \introduce{wait-free} algorithms---they are a subset of
lock-free ones which are guaranteed to complete in some
bounded number of steps.}
Whenever performance is a concern, \emph{profile!}
Performance depends on many factors,
ranging from the number of threads at
play to the specifics of your \textsc{cpu}.
And as always, consider the tradeoffs you make between
complexity and performance---concurrency is a perilous art.
\section{Sequential consistency on weakly-ordered hardware}
Different hardware architectures provide different ordering guarantees,
or \introduce{memory models}.
For example, x64 is relatively \introduce{strongly-ordered},
and can be trusted to preserve some system-wide order of
loads and stores in most cases.
Other architectures like \textsc{arm} are \introduce{weakly-ordered},
so you can't assume that loads and stores are executed in
program order unless the \textsc{cpu} is given special instructions---called
\introduce{memory barriers}---to not shuffle them around.
It's helpful to see how atomic operations work in a weakly-ordered system,
both to understand what's happening in hardware,
and to see why the \clang{} and \cpp{} concurrency models were designed as they
were.\punckern\footnote{It's worth noting that the concepts we discuss here aren't
specific to \clang{} and \cpp{}.
Other systems programming languages like D and Rust have converged on
similar models.}
Let's examine \textsc{arm}, since it's both popular and straightforward.
Consider the simplest atomic operations: loads and stores.
Given some \mintinline{cpp}{atomic_int foo},
% Shield your eyes.
% Essentially,
% 1. On the left, place getFoo() and setFoo() functions.
% 2. On the right, place the assembly they're compiled to.
% 3. In the middle, place an arrow for each (futzing with height a bit)
% with the text "becomes" over it.
\begin{colfigure}
\begin{minipage}{0.35\linewidth}
\begin{minted}[fontsize=\codesize]{cpp}
int getFoo()
{
return foo;
}
\end{minted}
\end{minipage}
\raisebox{-1ex}{
\begin{tikzpicture}
\draw [->, line width=1pt] (0, 0) -- node[above]{\itshape becomes} (0.17\linewidth, 0);
\end{tikzpicture}
}
\begin{minipage}{0.43\linewidth}
\begin{lstlisting}[language={[ARM]Assembler}]
getFoo:
ldr r3, <&foo>
dmb
ldr r0, [r3, #0]
dmb
bx lr
\end{lstlisting}
\end{minipage}
\end{colfigure}
%Similarly,
\begin{colfigure}
\begin{minipage}{0.35\linewidth}
\begin{minted}[fontsize=\codesize]{cpp}
void setFoo(int i)
{
foo = i;
}
\end{minted}
\end{minipage}
\raisebox{-1ex}{
\begin{tikzpicture}
\draw [->, line width=1pt] (0, 0) -- node[above]{\itshape becomes} (0.17\linewidth, 0);
\end{tikzpicture}
}
\begin{minipage}{0.43\linewidth}
\begin{lstlisting}[language={[ARM]Assembler}]
setFoo:
ldr r3, <&foo>
dmb
str r0, [r3, #0]
dmb
bx lr
\end{lstlisting}
\end{minipage}
\end{colfigure}
We load the address of our atomic variable into a scratch register
(\texttt{r3}),
sandwich our load or store between memory barriers (\keyword{dmb}),
then return.
The barriers give us sequential consistency---the first ensures that
prior reads and writes can't be placed after our operation,
and the second ensures that subsequent reads and writes can't be placed
before it.
\section{Implementing atomic read-modify-write operations with LL/SC instructions}
Like many other
\textsc{risc}\footnote{\introduce{Reduced instruction set computer},
in contrast to a \introduce{complex instruction set computer} \textsc{(cisc)}
architecture like x64.}
architectures, \textsc{arm} lacks dedicated \textsc{rmw} instructions.
And since the processor can context switch to another thread
at any time,
we can't build \textsc{rmw} ops from normal loads and stores.
Instead, we need special instructions:
\introduce{load-link} and \introduce{store-conditional} \textsc{(ll/sc)}.
The two work in tandem:
A load-link reads a value from an address---like any other load---but also
instructs the processor to monitor that address.
Store-conditional writes the given value \emph{only if}
no other stores were made to that address
since the corresponding load-link.
Let's see them in action with an atomic fetch and add.
On \textsc{arm},
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
void incFoo() { ++foo; }
\end{minted}
compiles to:
\begin{lstlisting}[language={[ARM]Assembler}]
incFoo:
ldr r3, <&foo>
dmb
loop:
ldrex r2, [r3] // LL foo
adds r2, r2, #1 // Increment
strex r1, r2, [r3] // SC
cmp r1, #0 // Check the SC result.
bne loop // Loop if the SC failed.
dmb
bx lr
\end{lstlisting}
\end{colfigure}
We \textsc{ll} the current value, add one, and immediately try to store it back
with an \textsc{sc}. If that fails, another thread may have
written to \texttt{foo} since our \textsc{ll},
so we try again.
In this way, at least one thread is always making forward progress in atomically
modifying \texttt{foo}, even if several are attempting to do so at
once.\punckern\footnote{\ldots though generally,
we want to avoid cases where multiple threads are vying for the same variable
for any significant amount of time.}
\subsection{Spurious LL/SC failures}
\label{spurious-ll/sc-failures}
As you might imagine, it would take too much \textsc{cpu} hardware to track
load-linked addresses for every single byte on the machine.
To reduce this cost, many processors monitor them at some coarser granularity,
such as the cache line.
This means that an \textsc{sc}
can fail if it's preceded by a write to \emph{any} address in the monitored block,
not just the specific one that was load-linked.
This is especially troublesome for compare and swap,
and is the raison d'être for \monobox{compare\_exchange\_weak}.
To see why, consider a function that atomically multiplies a value,
even though there's no atomic instruction to read-multiply-write in any
common architecture.
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
void atomicMultiply(int by)
{
int expected = foo;
// Which CAS should we use?
while (!foo.compare_exchange_?(
expected, expected * by)) {
// Empty loop.
// (On failure, expected is updated with
// foo's most recent value.)
}
}
\end{minted}
\end{colfigure}
Many lockless algorithms use \textsc{cas} loops like this to atomically update
a variable when calculating its new value isn't atomic. They:
\begin{enumerate}
\item Read the variable.
\item Perform some (non-atomic) operation on its value.
\item \textsc{cas} the new value with the previous one.
\item If the \textsc{cas} failed, another thread beat us to the punch,
so try again.
\end{enumerate}
If we use \monobox{compare\_exchange\_strong} for this family of algorithms,
the compiler must emit nested loops:
an inner one to protect us from spurious \textsc{sc} failures,
and an outer one which repeatedly performs our operation until no other thread
has interrupted us.
But unlike the \monobox{\_strong} version, a weak \textsc{cas}
is allowed to fail spuriously, just like the \textsc{ll/sc} mechanism
that implements it.
So, with \monobox{compare\_exchange\_weak},
the compiler is free to generate a single loop,
since we don't care about the difference between retries from spurious
\textsc{sc} failures and retries caused by another thread modifying our variable.
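So, to answer the question posed in the comment above, the weak version is the right choice for this loop. A self-contained sketch (with \mintinline{cpp}{foo} declared in the snippet for illustration):
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
#include <atomic>

std::atomic<int> foo{3};

void atomicMultiply(int by)
{
    int expected = foo;
    // A spurious SC failure just takes us around
    // the loop once more, exactly like a failure
    // caused by another thread's write.
    while (!foo.compare_exchange_weak(
        expected, expected * by)) {
        // Empty loop.
        // (On failure, expected is updated with
        // foo's most recent value.)
    }
}
\end{minted}
\end{colfigure}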
\section{Do we always need sequentially consistent operations?}
\label{lock-example}
All of our examples so far have been sequentially consistent to prevent
reorderings that break our code.
We've also seen how weakly-ordered architectures like \textsc{arm}
use memory barriers to create sequential consistency.
But as you might expect,
these barriers can have a noticeable impact on performance.
After all,
they inhibit optimizations that your compiler and hardware would otherwise make.
What if we could avoid some of this slowdown?
Consider a simple case like the spinlock from \secref{spinlock}.
Between the \texttt{lock()} and \texttt{unlock()} calls,
we have a \introduce{critical section}
where we can safely modify shared state protected by the lock.
Outside this critical section,
we only read and write to things that aren't
shared with other threads.
\begin{colfigure}
\begin{minted}[fontsize=\codesize]{cpp}
deepThought.calculate(); // non-shared
lock(); // Lock; critical section begins
sharedState.subject =
"Life, the universe and everything";
sharedState.answer = 42;
unlock(); // Unlock; critical section ends