% Build this document with LuaTeX, a modern Unicode-aware LaTeX engine
% that uses system TTF and OTF font files.
% This is needed for the fontspec, microtype, and nolig packages.
\newcommand{\bodyfontsize}{10bp}
\newcommand{\bodybaselineskip}{12bp}
\RequirePackage[table]{xcolor}
% We're using KOMA Script to hand-tune footnotes and TOC appearance.
% It should be available in your texlive distribution,
% which is how most distros package LaTeX.
\documentclass[fontsize=10pt, oneside]{scrartcl}
% Margins: see http://practicaltypography.com/page-margins.html and
% http://practicaltypography.com/line-length.html
% We're aiming for 80-ish characters per line.
\usepackage[
letterpaper,
footnotesep=\bodybaselineskip,
left=0.75in,right=0.75in,top=1in,bottom=1in,
]{geometry}
% Font specification.
\usepackage[no-math]{fontspec}
\usepackage[fleqn]{amsmath}
\usepackage[italic]{mathastext}
\usepackage{polyglossia}
\setdefaultlanguage[variant=american]{english}
\usepackage{microtype} % Font expansion, protrusion, and other goodness
% Disable ligatures across grapheme boundaries
% (see the package manual for details.)
\usepackage[english]{selnolig}
% Use symbols for footnotes, resetting each page
\usepackage[perpage,bottom,symbol*]{footmisc}
% Don't use a sans font for description labels.
\addtokomafont{descriptionlabel}{\rmfamily\mdseries}
\setkomafont{disposition}{\rmfamily}
\setkomafont{section}{\large\itshape}
\setkomafont{subsection}{\normalsize\itshape}
\renewcommand*\thesection{\upshape\arabic{section}}
\usepackage{enumitem}
% Custom footer
\usepackage{scrlayer-scrpage}
\setkomafont{pagefoot}{\sffamily\upshape}
\pagestyle{scrheadings}
\usepackage{minted} % Syntax highlighting via Pygments
\usepackage{graphicx}
\usepackage[font=footnotesize,justification=raggedright]{caption}
\usepackage{tikz} % Duck and cover.
\newcommand{\codesize}{\fontsize{\bodyfontsize}{\bodybaselineskip}}
% Syntax highlighting for Arm asm (minted doesn't do this well)
\usepackage{listings}
\lstset{
basicstyle=\ttfamily\codesize\selectfont,
keywordstyle=\color{darkGreen}\bfseries,
commentstyle=\textcolor[rgb]{0.25,0.50,0.50}
}
% listings definitions for Arm assembly.
% Get them from https://github.com/sysprog21/arm-assembler-latex-listings
\usepackage{lstlangarm} % See above
\usepackage{changepage} % For adjustwidth
\usepackage{metalogo} % for \LuaLaTeX
\input{lib/codeblock}
\setlength\columnsep{2em}
%\setlength\parskip{0em}
\setlength\parindent{1.5em}
\title{Concurrency Primer\footnote{%
The original title was ``What every systems programmer should know about concurrency''.}
}
\author{Matt Kline and Ching-Chun (Jim) Huang}
\date{\today}
% Custom footer
% Hyperlinks
\usepackage[unicode,pdfusetitle]{hyperref}
\usepackage{xcolor}
\definecolor{darkGreen}{HTML}{008000}
\hypersetup{
colorlinks=true, % Use colors
linkcolor=violet, % Intra-doc links
urlcolor=blue % URLs are blue
}
% Use \punckern to overlap periods, commas, and footnote markers
% for a tighter look.
% Care should be taken to not make it too tight - f" and the like can overlap
% if you're not careful.
\newcommand{\punckern}{\kern-0.4ex}
% For placing commas close to, or under, quotes they follow.
% We're programmers, and we blatantly disregard American typographical norms
% to put the quotes inside, but we can at least make it look a bit nicer.
\newcommand{\quotekern}{\kern-0.5ex}
% Create an unbreakable string of text in a monospaced font.
% Useful for `command --line --args`
\newcommand{\monobox}[1]{\mbox{\texttt{#1}}}
\newcommand{\keyword}[1]{\monobox{\color{darkGreen}#1}}
% C++ looks nicer if the ++ is in a monospace font and raised a bit.
% Also, use uppercase numbers to match the capital C.
\newcommand{\cplusplus}{C\kern-0.1ex\raisebox{0.15ex}{\texttt{++}}}
\newcommand{\clang}{C}
\newcommand{\csharp}{C\raisebox{0.25ex}{\#}}
\newcommand{\fig}[1]{Figure~\ref{#1}}
% Italicize new terms
\newcommand{\introduce}[1]{\textit{#1}}
\newcommand{\secref}[1]{\hyperref[#1]{\textsc{\S}\ref*{#1}}}
% See http://tex.stackexchange.com/a/68310
\makeatletter
\let\runauthor\@author
\let\rundate\@date
\let\runtitle\@title
\makeatother
% Spend a bit more time to get better word spacing.
% See http://tex.stackexchange.com/a/52855/92465
\emergencystretch=1em
\begin{document}
% Custom title instead of \maketitle
\begin{center}
\Large \runtitle
\bigskip
\large
\runauthor
\smallskip
\normalsize
\rundate
\end{center}
\bigskip
\begin{center}
\large \bfseries\itshape Abstract
\end{center}
\smallskip
\noindent
Systems programmers are acquainted with tools such as mutexes, semaphores, and condition variables.
But how do these tools work, and how do we write concurrent code in their absence,
for example, when working in an embedded environment beneath the operating system,
or when hard time constraints prohibit blocking?
Furthermore, since the compiler and hardware often conspire to transform code into an unanticipated order,
how do multithreaded programs work at all? Concurrency is a complex and counterintuitive topic,
but let us endeavor to explore its fundamental principles.
\bigskip
\section{Background}
\label{background}
Modern computers execute multiple instruction streams concurrently.
On single-core systems, these streams alternate, sharing the \textsc{CPU} in brief time slices.
Multi-core systems, however, allow several streams to run in parallel.
These streams are known by various names such as processes, threads, tasks,
interrupt service routines (ISR), among others, yet many of the same principles govern them all.
Despite the development of numerous sophisticated abstractions by computer scientists,
these instruction streams—hereafter referred to as ``\emph{threads}'' for simplicity—primarily interact through shared state.
Proper functioning hinges on understanding the sequence in which threads read from and write to memory.
Consider a simple scenario where thread \textit{A} communicates an integer to other threads:
it writes the integer to a variable and then sets a flag, signaling other threads to read the newly stored value.
This operation could be conceptualized in code as follows:
\begin{ccode}
int v;
bool v_ready = false;
void threadA()
{
// Write the value
// and set its ready flag.
v = 42;
v_ready = true;
}
\end{ccode}
\begin{ccode}
void threadB()
{
// Await a value change and read it.
while (!v_ready) { /* wait */ }
const int b_v = v;
// Do something with b_v...
}
\end{ccode}
We must ensure that other threads only observe \textit{A}'s write to \cc|v_ready| \emph{after A's} write to \cc|v|.
If another thread can ``see'' \cc|v_ready| becoming true before observing \cc|v| becoming $42$,
this simple scheme will not work correctly.
One might assume it is straightforward to ensure this order,
yet the reality is often more complex.
First, any optimizing compiler will restructure your code to enhance performance on its target hardware.
The primary objective is to maintain the operational effect within \emph{the current thread},
allowing reads and writes to be rearranged to prevent pipeline stalls\footnote{%
Most \textsc{CPU} architectures execute segments of multiple instructions concurrently to improve throughput (refer to \fig{fig:pipeline}).
A stall, or suspension of forward progress, occurs when an instruction must wait for the result of a preceding one in the pipeline.} or to optimize data locality.\punckern\footnote{%
\textsc{RAM} accesses data not byte by byte, but in larger units known as \introduce{cache lines}.
Grouping frequently used variables on the same cache line means they are processed together,
significantly boosting performance. However, as discussed in \secref{shared-resources},
this strategy can lead to complications when cache lines are shared across cores.}
Variables may be allocated to the same memory location if their usage does not overlap.
Furthermore, calculations might be performed speculatively ahead of a branch decision and subsequently discarded if the branch prediction proves incorrect.\punckern\footnote{%
Profile-guided optimization (PGO) often employs this strategy.}
%(These sorts of optimizations sometimes called the ``as-if'' rule in \cplusplus{}.)
Even without compiler alterations,
we would face challenges because our hardware complicates matters further!
Modern \textsc{CPU}s operate in a fashion far more complex than what traditional pipelined methods,
like those depicted in \fig{fig:pipeline}, suggest.
They are equipped with multiple data paths tailored for various instruction types and schedulers that reorder and direct instructions through these paths.
\includegraphics[keepaspectratio,width=0.7\linewidth]{images/pipeline}
\captionof{figure}{A traditional five-stage \textsc{CPU} pipeline with fetch, decode, execute, memory access, and write-back stages.
Modern designs are much more complicated, often reordering instructions on the fly.}
\label{fig:pipeline}
It is quite common to form oversimplified views about memory operations.
Picturing a multi-core processor setup might lead us to envision a model similar to \fig{fig:ideal-machine},
wherein each core alternately accesses and manipulates the system's memory.
\includegraphics[keepaspectratio, width=0.8\linewidth]{images/ideal-machine}
\captionof{figure}{An idealized multi-core processor where cores
take turns accessing a single shared set of memory.}
\label{fig:ideal-machine}
The reality is far from straightforward.
Although processor speeds have surged exponentially in recent decades,
\textsc{RAM} has struggled to match pace,
leading to a significant gap between the execution time of an instruction and the time required to fetch its data from memory.
To mitigate this, hardware designers have incorporated increasingly complex hierarchical caches directly onto the \textsc{CPU} die.
Additionally, each core often features a \introduce{store buffer} to manage pending writes while allowing further instructions to proceed.
Ensuring this memory system remains \introduce{coherent},
thus allowing writes made by one core to be observable by others even when utilizing different caches,
presents a significant challenge.
\includegraphics[keepaspectratio, width=0.8\linewidth]{images/mp-cache}
\captionof{figure}{A common memory hierarchy for modern multi-core processors}
\label{fig:dunnington}
The myriad complexities of multithreaded programs on multi-core \textsc{CPU}s mean that there is no uniform concept of ``now''.
Establishing some semblance of order among threads requires a concerted effort involving the hardware,
compiler, programming language, and your application.
Let's delve into our options and the tools necessary for this endeavor.
\section{Enforcing law and order}
\label{seqcst}
Establishing order in multithreaded programs varies across different \textsc{CPU} architectures.
For years, systems languages like \clang{} and \cplusplus{} lacked built-in concurrency mechanisms,
compelling developers to rely on assembly or compiler-specific extensions.
This gap was bridged in 2011 when the \textsc{ISO} standards for both languages introduced synchronization tools.
Provided these tools are used correctly,
the compiler ensures that neither its optimization processes nor the \textsc{CPU} will perform reorderings that could lead to data races.\punckern\footnote{%
The ISO~\clang{11} standard adopted its concurrency features,
almost directly, from the \cplusplus{11} standard.
Thus, the functionalities discussed should be the same in both languages,
with some minor syntactical differences favoring \cplusplus{} for clarity.
}
To ensure our earlier example functions as intended,
the ``ready'' flag must utilize an \introduce{atomic type}.
\begin{ccode}
#include <stdatomic.h>
int v = 0;
atomic_bool v_ready = false;
void *threadA()
{
v = 42;
v_ready = true;
}
\end{ccode}
\begin{ccode}
int b_v;
void *threadB()
{
while(!v_ready) { /* wait */ }
b_v = v;
/* Do something */
}
\end{ccode}
The \clang{} and \cplusplus{} standard libraries define a series of these types in \cc|<stdatomic.h>| and \cpp|<atomic>|,
respectively.
They look and act just like the integer types they mirror (e.g., \monobox{bool}~\textrightarrow~\monobox{atomic\_bool},
\monobox{int}~\textrightarrow~\monobox{atomic\_int}, etc.),
but the compiler ensures that other variables' loads and stores are not reordered around theirs.
Informally, we can think of atomic variables as rendezvous points for threads.
By making \monobox{v\_ready} atomic,
\monobox{v = 42}\, is now guaranteed to happen before \monobox{v\_ready = true}\, in thread~\textit{A},
just as \monobox{b\_v = v}\, must happen after reading \monobox{v\_ready}\,
in thread~\textit{B}.
Formally, atomic types establish a \textit{single total modification order} where,
``[\ldots] the result of any execution is the same as if the reads and writes occurred in some order, and the operations of each individual processor appear in this sequence in the order specified by its program.''
This model, defined by Leslie Lamport in 1979,
is called \introduce{sequential consistency}.
Notice that using atomic variables as an lvalue expression, such as \monobox{v\_ready = true} and \monobox{while(!v\_ready)}, is a convenient alternative to explicitly using \monobox{atomic\_load} or \monobox{atomic\_store}.\punckern\footnote{%
Atomic loads and stores are not necessarily generated as special atomic instructions.
Under a weaker consistency model, they may be ordinary loads and stores,
and their code generation can vary across architectures.
Check out \href{https://llvm.org/docs/Atomics.html\#atomics-and-codegen}{LLVM's documentation} for an example of how this is handled.}
As stated in C11 6.7.2.4 and 6.7.3, the properties associated with atomic types are meaningful only for expressions that are
lvalues.
Lvalue-to-rvalue conversion (which models a memory read from an atomic location to a CPU register) strips atomicity along with other qualifiers.
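For illustration, the flag operations from the example above can be written with the explicit functions; both spellings default to sequentially consistent ordering:
\begin{ccode}
#include <stdatomic.h>
#include <stdbool.h>

extern atomic_bool v_ready;

void explicit_forms(void)
{
    /* Same effect as the lvalue forms used earlier. */
    atomic_store(&v_ready, true);       /* v_ready = true; */
    bool ready = atomic_load(&v_ready); /* ... = v_ready;  */
    (void) ready;
}
\end{ccode}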
\section{Atomicity}
\label{atomicity}
But order is only one of the vital ingredients for inter-thread communication.
The other is what atomic types are named for: atomicity.
Something is \introduce{atomic} if it cannot be divided into smaller parts.
If threads do not use atomic reads and writes to share data, we are still in trouble.
Consider a program with two threads.
One thread processes a list of files, incrementing a counter each time it finishes working on one.
The other thread handles the user interface, periodically reading the counter to update a progress bar.
If that counter is a 64-bit integer, we cannot access it atomically on 32-bit machines,
since doing so requires two loads or stores to read or write the entire value.
If we are particularly unlucky, the first thread could be halfway through writing the counter when the second thread reads it,
receiving garbage.
These unfortunate occasions are called \introduce{torn reads and writes}.
If reads and writes to the counter are atomic, however, our problem disappears.
We can see that, compared to the difficulties of establishing the right order,
atomicity is fairly straightforward:
just make sure that any variables used for thread synchronization
are no larger than the \textsc{CPU} word size.
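As a minimal sketch of the progress-counter scenario (\monobox{process\_one\_file} and \monobox{draw\_bar} are hypothetical helpers), making the counter atomic rules out torn accesses, though a 32-bit target may implement it with a lock, as \secref{atomictype} explains:
\begin{ccode}
#include <stdatomic.h>
#include <stdint.h>

extern void process_one_file(int i); /* hypothetical */
extern void draw_bar(uint64_t done); /* hypothetical */

_Atomic uint64_t files_done; /* shared progress counter */

void file_worker(int n)
{
    for (int i = 0; i < n; ++i) {
        process_one_file(i);
        atomic_store(&files_done, (uint64_t) i + 1); /* never torn */
    }
}

void ui_thread(void)
{
    draw_bar(atomic_load(&files_done)); /* never torn */
}
\end{ccode}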
\includegraphics[keepaspectratio, width=0.8\linewidth]{images/atomicity}
\captionof{figure}{A flowchart depicting how two concurrent programs communicate and coordinate through a shared resource to achieve a goal.}
\label{fig:atomicity}
\fig{fig:atomicity} summarizes the concepts of the first three sections.
In \secref{background}, we observe the importance of maintaining the correct order of operations, $t3 \to t4 \to t5 \to t6 \to t7$, so that two concurrent programs can function as expected.
In \secref{seqcst}, we see how two concurrent programs communicate to guarantee the ordering $t5 \to t6$.
In \secref{atomicity}, we understand that certain operations must be treated as a single atomic step to ensure the orderings $t3 \to t4 \to t5$ and $t6 \to t7$.
\section{Arbitrarily-sized ``atomic'' types}
\label{atomictype}
Along with \cc|atomic_int| and friends,
\cplusplus{} provides the template \cpp|std::atomic<T>| for defining arbitrary atomic types.
\clang{}, lacking a similar language feature but wanting to provide the same functionality,
added an \keyword{\_Atomic} keyword.
If \texttt{T} is larger than the machine's word size,
the compiler and the language runtime automatically surround the variable's reads and writes with locks.
If you want to make sure this is not happening,\punckern\footnote{%
\ldots which is most of the time,
since we are usually using atomic operations to avoid locks in the first place.}
you can check with:
\begin{cppcode}
std::atomic<Foo> bar;
ASSERT(bar.is_lock_free());
\end{cppcode}
In most cases,\punckern\footnote{%
The language standards permit atomic types to be \emph{sometimes} lock-free.
This might be necessary for architectures that do not guarantee atomicity for unaligned reads and writes.}
this information is known at compile time.
Consequently, \cplusplus{17} added \cpp|is_always_lock_free|:
\begin{cppcode}
static_assert(std::atomic<Foo>::is_always_lock_free);
\end{cppcode}
\section{Read-modify-write}
\label{rmw}
So far we have introduced the importance of order and atomicity.
In \secref{seqcst}, we saw how an atomic object keeps the compiler from reordering a single store or load with respect to other operations in its thread.
Only after establishing the correct order within each thread can we pursue a correct order across threads.
Having achieved that, we can explore how concurrent threads coordinate and collaborate smoothly.
In \secref{atomicity}, we saw the need for atomicity to ensure that a group of operations is not only sequentially executed but also completes without interruption by operations from other threads.
Together, these establish a correct order of operations across different threads.
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/atomic-rmw}
\captionof{figure}{Exchange, Test and Set, Fetch and…, and Compare and Swap can all be transformed into atomic RMW operations, ensuring that sequences like $t1 \to t2 \to t3$ become a single atomic step.}
\label{fig:atomic-rmw}
Atomic loads and stores are all well and good when we do not need to consider the previous state of atomic variables, but sometimes we need to read a value, modify it, and write it back as a single atomic step.
As shown in \fig{fig:atomic-rmw}, the modification is based on the previous state that is visible for reading, and the result is then written back.
A complete \introduce{read-modify-write} operation is performed atomically to ensure visibility to subsequent operations.
Furthermore, for communication between concurrent threads, a shared resource is required, as shown in \fig{fig:atomicity}.
Think back to the discussion in previous sections.
In order for concurrent threads to collaborate on operating a shared resource, we need a way to communicate.
Thus, the need for a channel for communication arises with the appearance of the shared resource.
As discussed earlier, the process of accessing shared resources responsible for communication must also ensure both order and non-interference.
To prevent the recursive protection of shared resources,
atomic operations can be introduced for the shared resources responsible for communication, as shown in \fig{fig:atomic-types}.
There are a few common \introduce{read-modify-write} (\textsc{RMW}) operations that make these compound operations a single atomic step.
In \cplusplus{}, they are represented as member functions of \cpp|std::atomic<T>|.
In \clang{}, they are freestanding functions.
\includegraphics[keepaspectratio, width=1\linewidth]{images/atomic-types}
\captionof{figure}{Test and Set (Left) and Compare and Swap (Right) leverage their functionality of checking and their atomicity to make other RMW operations perform atomically.
The red color represents atomic RMW operations, while the blue color represents RMW operations that behave atomically.}
\label{fig:atomic-types}
\subsection{Exchange}
\label{exchange}
\introduce{Exchange} transforms an \textsc{RMW} into modifying a private variable first,
and then directly swapping the private variable with the shared variable.
Only the second step must be a single atomic step:
the Read that loads the shared variable, followed by the Modify and Write that exchange it with the private variable.
This allows programmers to extensively modify the private variable beforehand and write it to the shared variable only when necessary.
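A minimal sketch of this pattern, assuming a hypothetical shared configuration word:
\begin{ccode}
#include <stdatomic.h>

atomic_int shared_config; /* hypothetical shared variable */

void update_config(void)
{
    int draft = 0;
    /* Extensively modify the private draft beforehand... */
    draft |= 0x1;
    /* ...then publish it in one atomic step, receiving the
     * previous value of the shared variable in return. */
    int old = atomic_exchange(&shared_config, draft);
    (void) old; /* may be inspected or reclaimed */
}
\end{ccode}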
\subsection{Test and set}
\label{Testandset}
\introduce{Test-and-set} works on a Boolean value:
we read it, set it to \cpp|true|, and provide the value it held beforehand.
\clang{} and \cplusplus{} offer a type dedicated to this purpose, called \monobox{atomic\_flag}.
The initial value of an \monobox{atomic\_flag} is indeterminate until it is initialized with the \monobox{ATOMIC\_FLAG\_INIT} macro.
\introduce{Test-and-set} operations are not limited to \textsc{RMW} functions;
they can also be used to construct a simple spinlock.
In this scenario, the flag acts as a shared resource for communication between threads.
Thus, spinlock implemented with \introduce{Test-and-set} operations ensures that entire \textsc{RMW} operations on shared resources are performed atomically, as shown in \fig{fig:atomic-types}.
\label{spinlock}
\begin{ccode}
atomic_flag af = ATOMIC_FLAG_INIT;
void lock()
{
while (atomic_flag_test_and_set(&af)) { /* wait */ }
}
void unlock() { atomic_flag_clear(&af); }
\end{ccode}
If we call \cc|lock()| and the previous value is \cc|false|,
we are the first to acquire the lock,
and can proceed with exclusive access to whatever the lock protects.
If the previous value is \cc|true|,
someone else has acquired the lock and we must wait until they release it by clearing the flag.
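Usage is then straightforward; a brief sketch guarding a hypothetical non-atomic counter with the lock above:
\begin{ccode}
int counter; /* protected by af */

void increment_counter(void)
{
    lock();    /* spin until test-and-set returns false */
    ++counter; /* exclusive access in the critical section */
    unlock();  /* clear the flag so others may enter */
}
\end{ccode}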
\subsection{Fetch and…}
\introduce{Fetch-and…} transforms an \textsc{RMW} into directly modifying the shared variable (with addition, subtraction,
or bitwise \textsc{AND}, \textsc{OR}, \textsc{XOR}) and returning its previous value,
all as part of a single atomic operation.
Compared with \introduce{Exchange} (\secref{exchange}),
\introduce{Fetch-and…} suits cases where programmers only need to make a simple modification to the shared variable.
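For example, a unique ticket dispenser needs nothing more than a single \monobox{atomic\_fetch\_add}; a minimal sketch:
\begin{ccode}
#include <stdatomic.h>

atomic_int next_ticket;

int take_ticket(void)
{
    /* Atomically add 1 and return the previous value,
     * so every caller receives a distinct ticket. */
    return atomic_fetch_add(&next_ticket, 1);
}
\end{ccode}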
\subsection{Compare and swap}
\label{cas}
Finally, we have \introduce{compare-and-swap} (\textsc{CAS}),
sometimes called \introduce{compare-and-exchange}.
It allows us to conditionally exchange a value \emph{if} its previous value matches the expected one.
In \clang{} and \cplusplus{}, as noted in C11 7.17.7.4, \textsc{CAS} resembles the following,
if it were executed atomically:
\begin{ccode}
/* A is an atomic type. C is the non-atomic type corresponding to A */
bool atomic_compare_exchange_strong(A *obj, C *expected, C desired)
{
    if (memcmp(obj, expected, sizeof(*obj)) == 0) {
        memcpy(obj, &desired, sizeof(*obj));
        return true;
    } else {
        memcpy(expected, obj, sizeof(*obj));
        return false;
    }
}
\end{ccode}
\begin{samepage}
\noindent The \cpp|_strong| suffix may leave you wondering if there is a corresponding ``weak'' \textsc{CAS}.
Indeed, there is. However, we will delve into that topic later in \secref{spurious-llsc-failures}.
\end{samepage}
Because \textsc{CAS} compares against an expected value,
its use extends beyond \textsc{RMW} functions.
Here is how it works: first, read the shared resource and use that value as the expected value.
Modify a private variable, and then \textsc{CAS}: compare the current shared variable with the expected one.
If they match, the modification is known to be exclusive, so write by swapping the shared variable with the private variable.
If they do not match, another thread has interfered.
In that case, update the expected value with the current shared value and retry the modification in a loop.
This iterative process allows \textsc{CAS} to serve as a communication mechanism between threads,
ensuring that entire \textsc{RMW} operations on shared resources are performed atomically.
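A minimal sketch of this retry loop, atomically doubling a shared value:
\begin{ccode}
#include <stdatomic.h>

atomic_int shared;

void double_shared(void)
{
    /* Read the shared value as the expected value. */
    int expected = atomic_load(&shared);
    /* On failure, the current shared value is written back
     * into `expected`, and we retry with it. */
    while (!atomic_compare_exchange_weak(&shared, &expected,
                                         expected * 2))
        ;
}
\end{ccode}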
As shown in \fig{fig:atomic-types}, compared with \introduce{Test-and-set} (\secref{Testandset}),
a thread that employs \textsc{CAS} can use the shared resource itself for the check.
It uses atomic \textsc{CAS} to ensure that the Modify step is atomic,
coupled with a while loop so that the entire \textsc{RMW} behaves atomically.
However, atomic \textsc{RMW} operations here are merely a programming tool for achieving program logic correctness.
Whether they execute as actual atomic instructions depends on how the compiler translates them for the particular hardware instruction set.
\introduce{Exchange}, \introduce{Fetch-and-Add}, \introduce{Test-and-set}, and \textsc{CAS} are, at the instruction level, different styles of atomic \textsc{RMW} instructions.
An \textsc{ISA} may provide only some of them,
leaving the rest for compilers to synthesize.
For example, in IA-32/64 and IBM System/360/z architectures,
\introduce{Test-and-set} functionality is directly supported by hardware instructions,
while x86 has \monobox{XCHG} and \monobox{XADD} for \introduce{Exchange} and \introduce{Fetch-and-Add} and implements \introduce{Test-and-set} with \monobox{XCHG}.
Arm, in another style, provides \textsc{LL/SC} (Load Linked/Store Conditional) flavored instructions for all of these operations,
with dedicated \textsc{CAS} instructions added in Armv8/v9-A.
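For instance, on an \textsc{ISA} that offers only \textsc{CAS}, a compiler might synthesize \introduce{Fetch-and-Add} roughly as in this hedged sketch:
\begin{ccode}
#include <stdatomic.h>

int fetch_add_via_cas(atomic_int *obj, int arg)
{
    int expected = atomic_load(obj);
    /* Retry until no other thread slips in between the load
     * and the swap; `expected` is refreshed on each failure. */
    while (!atomic_compare_exchange_weak(obj, &expected,
                                         expected + arg))
        ;
    return expected; /* the value before the addition */
}
\end{ccode}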
\subsection{Example}
\label{rmw_example}
The following example is a simplified implementation of a thread pool that demonstrates the use of the \clang{}11 atomics library.
\inputminted{c}{./examples/rmw_example.c}
The output of the program is:
\begin{ccode}
PI calculated with 100 terms: 3.141592653589793
\end{ccode}
\textbf{Exchange}
In function \monobox{thread\_pool\_destroy}, \monobox{atomic\_exchange(\&thrd\_pool->state, cancelled)} reads the current state and replaces it with ``cancelled''.
A warning message is printed if the pool is destroyed while workers are still ``running''.
If the exchange is not performed atomically, we may initially get the state as ``running''. Subsequently, a thread could set the state to ``cancelled'' after finishing the last one, resulting in a false warning.
\textbf{Test and set}
In this example, the scenario is as follows:
first, the main thread acquires the lock \monobox{future->flag}, setting it to true,
which is akin to creating a job and then transferring its ownership to the worker.
Subsequently, the main thread is blocked until the worker clears the flag.
This indicates that the main thread will wait until the worker completes the job and returns ownership back to the main thread, which ensures correct cooperation.
\textbf{Fetch and…}
In the function \monobox{thread\_pool\_destroy}, \monobox{atomic\_fetch\_and} is utilized as a means to set the state to ``idle''.
Yet, in this case, it is not necessary, as the pool needs to be reinitialized for further use regardless.
Its return value could be further utilized, for instance, to report the previous state and perform additional actions.
\textbf{Compare and swap}
Once threads are created in the thread pool as workers, they will continuously search for jobs to do.
Jobs are taken from the tail of the job queue.
To take a job without having it stolen by another worker halfway through, we need to atomically update the pointer to the last job.
Otherwise, the last job is subject to a data race.
The while loop in the function \monobox{worker},
\begin{ccode}
while (!atomic_compare_exchange_weak(&thrd_pool->head->prev, &job,
                                     job->prev)) {
}
\end{ccode}
keeps trying to claim the job atomically until it succeeds.
Built-in post-increment and post-decrement operators and compound assignments on atomic objects, such as \monobox{++} and \monobox{+=}, are also read-modify-write atomic operations with total sequentially consistent ordering.
They behave equivalently to a \cc|do while| \textsc{CAS} loop. See the \clang{}11 standard, 6.5.2.4 and 6.5.16.2, for more details.
What if claiming a job, which updates \cc|thrd_pool->head->prev|, is not done atomically?
Two or more threads could have races updating \cc|thrd_pool->head->prev| and working on the same job.
Data races are undefined behavior in \clang{}11 and \cplusplus{}11.
Working on the same job can lead to duplication of the calculation of \cc|job->future->result|,
use after free and double free on the job.
But even when jobs are claimed atomically, a thread may still end up holding a job that has already been freed.
This is a defect of the example code.
Jobs in the example are dynamically allocated and freed after a worker finishes each one.
This can leave dangling pointers in workers that are still holding and attempting to claim the job.
If jobs are to be dynamically allocated, safe memory reclamation should be implemented for such shared objects.
\textsc{RCU}, hazard pointers, and reference counting are the major ways of solving this problem.
\subsection{Further improvements}
At the beginning of \secref{rmw}, we described how a global total order is established by combining local order and inter-thread order imposed by atomic objects.
But should every object, including non-atomic ones, participate in a single global order established by atomic objects?
\introduce{Sequential consistency} solves the ordering problem in \secref{seqcst}, but it may force too much ordering, since some ordinary operations do not require it.
By default, atomic operations in the \clang{}11 atomics library use \monobox{memory\_order\_seq\_cst}; operations suffixed with \monobox{\_explicit} accept an additional argument specifying which memory order to use.
How to leverage memory orders to optimize performance will be covered later in \secref{lock-example}.
\section{Shared Resources}
\label{shared-resources}
From \secref{rmw}, we have understood that there are two types of shared resources that need to be considered.
The first type is shared resources that concurrent threads will access in order to collaborate to achieve a goal.
The second type is shared resources that serve as a communication channel for concurrent threads,
ensuring correct access to shared resources.
However, all of these considerations stem from a programming perspective,
where we only distinguish between shared resources and private resources.
Given all the complexities to consider, modern hardware adds another layer to the puzzle,
as depicted in \fig{fig:dunnington}.
Remember, memory moves between the main \textsc{RAM} and the \textsc{CPU} in segments known as cache lines.
These cache lines also represent the smallest unit of data transferred between cores and caches.
When one core writes a value and another reads it,
the entire cache line containing that value must be transferred from the first core's cache(s) to the second core's cache(s),
ensuring a coherent ``view'' of memory across cores. This dynamic can significantly affect performance.
This slowdown is even more insidious when it occurs between unrelated variables that happen to be placed on the same cache line,
as shown in \fig{fig:false-sharing}.
When designing concurrent data structures or algorithms,
this \introduce{false sharing} must be taken into account.
One way to avoid it is to pad atomic variables with a cache line of private data,
but this is obviously a large space-time trade-off.
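A minimal sketch of such padding, assuming 64-byte cache lines:
\begin{ccode}
#include <stdatomic.h>

/* Alignment forces each counter onto its own cache line
 * (the struct size also rounds up to 64 bytes), so writes
 * from one thread do not invalidate the other's line. */
struct padded_counter {
    _Alignas(64) atomic_int value;
};

struct padded_counter counter_a; /* written by thread A */
struct padded_counter counter_b; /* written by thread B */
\end{ccode}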
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/false-sharing}
\captionof{figure}{Processor 1 and Processor 2 operate independently on variables A and B.
Simultaneously, they read the cache line containing these two variables.
In the next time step, each processor modifies A and B in their private L1 cache separately.
Subsequently, both processors write their modified cache line to the shared L2 cache.
At this moment, the expansion of the scope of shared resources to encompass cache lines highlights the importance of considering cache coherence issues.}
\label{fig:false-sharing}
Beyond the data that threads deliberately share,
we also need to consider shared resources that serve as a communication channel, e.g., a spinlock (see \secref{spinlock}).
Processors using locks as a communication channel also need to transfer the cache line.
When a processor broadcasts the release of a lock,
multiple processors on different chips attempt to acquire the lock simultaneously.
To ensure a consistent state of the lock across all private L1 cache lines,
which is a part of cache coherence,
the cache line containing the lock will be continually transferred among the caches of those cores.
Unless the critical sections are considerably lengthy,
the time spent managing this cache line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation,
as discussed in Paul~E.\ McKenney's
\href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
for a deeper exploration.}
despite the algorithm's non-blocking nature.
With these high communication costs, only one processor succeeds in acquiring the lock again in the case of a mutex or spinlock, as shown in \fig{fig:spinlock}.
The other processors that failed to acquire the lock continue to wait,
yielding little practical benefit (only one processor makes progress) at significant communication overhead.
This disparity severely limits the scalability of the spinlock.
\includegraphics[keepaspectratio, width=0.9\linewidth]{images/spinlock}
\captionof{figure}{Three processors use a lock as a communication channel to ensure correct access to the shared L2 cache.
Processors 2 and 3 are trying to acquire a lock that is held by processor 1.
Therefore, when processor 1 unlocks,
the state of lock needs to be updated on other processors' private L1 cache.}
\label{fig:spinlock}
\section{Concurrency tools and synchronization mechanisms}
\label{concurrency-tool}
Atomic loads, stores, and \textsc{RMW} operations are the building blocks for every single concurrency tool.
It is useful to split those tools into two camps:
\introduce{blocking} and \introduce{lockless}.
As mentioned in \secref{rmw}, threads can use blocking tools to communicate with each other,
and these tools can even assist in synchronization between threads.
The blocking mechanism is quite simple:
a thread makes its own progress by blocking others.
However, this simplicity can cause threads to pause for unpredictable durations, which in turn affects the progress of the overall system.
Take a mutex as an example:
it requires threads to access shared data sequentially.
If a thread locks the mutex and another attempts to lock it too,
the second thread must wait, or \introduce{block},
until the first one unlocks it, regardless of the wait time.
Additionally, blocking mechanisms are prone to \introduce{deadlock} and \introduce{livelock},
issues that lead to the system becoming immobilized as threads perpetually wait on each other.
Suppose the first thread acquires one mutex while the second thread locks another,
and then each attempts to lock the mutex held by the other.
Neither can proceed, and the deadlock occurs.
Therefore, we can see that deadlock occurs when different threads acquire locks in incompatible orders,
leading to system immobilization as threads perpetually wait on each other.
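A minimal sketch of that incompatible ordering, using \textsc{POSIX} threads:
\begin{ccode}
#include <pthread.h>

pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg)
{
    pthread_mutex_lock(&m1);
    pthread_mutex_lock(&m2); /* waits while thread2 holds m2 */
    pthread_mutex_unlock(&m2);
    pthread_mutex_unlock(&m1);
    return arg;
}

void *thread2(void *arg)
{
    pthread_mutex_lock(&m2);
    pthread_mutex_lock(&m1); /* waits while thread1 holds m1 */
    pthread_mutex_unlock(&m1);
    pthread_mutex_unlock(&m2);
    return arg;
}
\end{ccode}
If each thread acquires its first mutex before the other releases it, both block forever; acquiring locks in a single global order avoids this.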
Additionally, in \secref{shared-resources},
we can see another problem with the lock: its scalability is limited.
After understanding the issue that blocking mechanisms are prone to,
we try to achieve synchronization between threads without lock.
Consider the program below, first executed by a single thread:
\begin{cppcode}
while (x == 0)
x = 1 - x;
\end{cppcode}
When executed by a single thread, these operations complete within a finite time.
However, with two threads executing concurrently,
if each thread's \cpp|x = 1 - x| is immediately undone by the other's,
the value of \cpp|x| observed at each test remains 0 and neither loop ever exits, which is a livelock.
Therefore, even without any locks in concurrent threads,
we still cannot guarantee that the overall system can make progress toward achieving the programmer's goals.
Consequently, we should not focus on comparing which communication tools or synchronization mechanisms are better,
but rather on exploring how to effectively use these tools in a given scenario to facilitate smooth communication between threads and achieve the programmer's goals.
\section{Lock-free}
In \secref{concurrency-tool}, we explored different mechanisms based on the characteristics of concurrency tools,
as described in \secref{atomicity} and \secref{rmw}.
In this section, we explore strategies that help programmers design concurrent programs
in which the threads collectively ensure progress of the overall system while also improving scalability,
which is the original goal of designing a concurrent program.
First of all, we must figure out the scope of our problem.
Understanding the relationship between the progress of each thread and the progress of the entire system is necessary.
\subsection{Type of progress}
Consider the scenario where many concurrent threads collaborate and each thread is divided into many operations. We can distinguish the following progress guarantees.
\textbf{Wait-Free} Every operation in every thread will be completed within a limited time.
This also implies that each operation contributes to the overall progress of the system.
\textbf{Lock-Free} At any given moment, among all operations in every thread,
at least one operation contributes to the overall progress of the system.
However, it does not guarantee that starvation will not occur.
\textbf{Obstruction-Free} At any given time, if there is only a single thread operating without interference from other threads,
its instructions can be completed within a finite time. However, when threads are working concurrently,
it does not guarantee progress.
The three guarantees relate as follows:
obstruction-free includes lock-free, and lock-free includes wait-free.
Wait-freedom is the strongest guarantee,
allowing each thread to make progress without ever being blocked by other threads.
\includegraphics[keepaspectratio, width=1 \linewidth]{images/progress-type}
\captionof{figure}{In a wait-free system, each thread is guaranteed to make progress at every moment because no thread can block others.
This ensures that the overall system can always make progress.
In a lock-free system, at Time 1, Thread 1 may cause other threads to wait while it performs its operation.
However, even if Thread 1 suspends at Time 2, it does not subsequently block other threads.
This allows Thread 2 to make progress at Time 3, ensuring that the overall system continues to make progress even if one thread is suspended.
In an obstruction-free system, when Thread 1 is suspended at Time 2,
it causes other threads to be blocked as a result. This means that by Time 3,
Thread 2 and Thread 3 are still waiting, preventing the system from making progress thereafter.
Therefore, obstruction-free systems may halt progress if one thread is suspended,
leading to the potential blocking of other threads and even stalling the system.}
\label{fig:progress-type}
The main goal is that the whole system,
which contains all concurrent threads,
is always making forward progress.
To achieve this goal, we rely on concurrency tools,
including atomic operations and operations that behave atomically, as described in \secref{rmw}.
Additionally, we carefully select a synchronization mechanism, as described in \secref{concurrency-tool},
which may involve shared resources for communication (e.g., a spinlock), as described in \secref{shared-resources}.
Furthermore, we design our program with appropriate data structures and algorithms.
Therefore, lock-free doesn't mean we cannot use any lock;
we just need to ensure that the blocking mechanism will not limit the scalability and that the system can avoid the problems described in \secref{concurrency-tool} (e.g., long time of waiting, deadlock).
Next, we take the single-producer, multiple-consumer (\textsc{SPMC}) problem as an example to demonstrate how to achieve fully lock-free programming by improving the implementation step by step.\punckern\footnote{%
The first three solutions, \secref{spmc-solution1}, \secref{spmc-solution2}, and \secref{spmc-solution3}, are drawn from Herb Sutter's
\href{https://youtu.be/c1gO9aB9nbs?si=7qJs-0qZAVqLHr1P}{talk from CppCon~2014}.}
In this problem, one producer generates tasks and adds them to a job queue,
while multiple consumers take tasks from the job queue and execute them.
\subsection{SPMC solution - lock-based}
\label{spmc-solution1}
First, consider a lock-based algorithm.
At any time, only one consumer can hold the lock to access the job queue,
because the lock here is a mutex, that is, a mutual-exclusion lock.
Other consumers attempting to access the job queue are blocked until that consumer releases the lock.
The following text explains the meaning of each state in \fig{fig:spmc-solution1}.

\textbf{state 1}: The producer adds tasks to the job queue while multiple consumers wait for tasks to become available, each ready to take on any job that appears in the queue.

\textbf{state 2} $\to$ \textbf{state 3}: After adding a task to the job queue,
the producer releases the mutex and wakes up the consumers,
which have been trying to acquire the lock on the job queue.

\textbf{state 3} $\to$ \textbf{state 4}: Consumer 1 acquires the mutex for the job queue,
retrieves a task from it, and then releases the mutex.

\textbf{state 5}: Next, the other consumers acquire the mutex for the job queue in turn,
only to find no tasks in it,
because the producer has not added more tasks.

\textbf{state 6}: Consequently, the consumers wait on a condition variable.
During this time, the consumers are not busy-waiting but sleeping until the producer wakes them up;
the condition variable is an advanced mechanism built on the mutex.
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/spmc-solution1}
\captionof{figure}{The interaction between the producer and consumer in SPMC Solution 1,
including their state transitions.}
\label{fig:spmc-solution1}
This implementation is not lock-free for two reasons.
First, if the producer suspends,
the consumers have no jobs available,
so they block and the entire system halts,
making it merely obstruction-free, as shown in \fig{fig:progress-type}.
Second, the consumers concurrently access a shared resource: the job.
If one consumer acquires the lock on the job queue but is suddenly suspended before completing and unlocking,
the other consumers are blocked.
Meanwhile, the producer keeps adding jobs, yet the system fails to make any progress,
which again is merely obstruction-free.
Therefore, neither aspect of this implementation is lock-free.
\subsection{SPMC solution - lock-based and lock-free}
\label{spmc-solution2}
As described in \secref{spmc-solution1}, there is a problem when the producer suspends;
the whole system cannot make any progress.
Additionally, consumers contend for the lock of the job queue to access the job;
however, after they acquire the lock, they may still need to wait when the queue is empty.
To solve this issue, the introduction of lock-based and lock-free algorithm is presented in this section.
The following text explains the meaning of each state in \fig{fig:spmc-solution2}.

\textbf{state 0}: The producer prepares all the jobs in advance.

\textbf{state 1}: Consumer 1 acquires the lock on the job queue, takes a job, and releases the lock.

\textbf{state 2}: When consumer 2 acquires the lock, it is guaranteed to find jobs still in the queue.

With this approach, once a consumer obtains the lock on the job queue,
a job is guaranteed to be available unless all jobs have already been taken by other consumers.
Thus, no consumer waits for lack of jobs;
the only wait is for acquiring the lock on the job queue.
\includegraphics[keepaspectratio, width=0.7\linewidth]{images/spmc-solution2}
\captionof{figure}{The interaction between the producer and consumer in Solution 2,
including their state transitions.}
\label{fig:spmc-solution2}
This implementation is both lock-based and lock-free.
The algorithm is designed so that the producer adds all jobs to the job queue before the consumers begin taking them.
This ensures that if the producer suspends or adds jobs slowly,
the consumers are not blocked by the lack of a job;
they simply conclude that they have finished all the jobs the producer added.
In that respect, this implementation qualifies as lock-free, as shown in \fig{fig:progress-type}.
The part that takes a job remains lock-based, not lock-free,
for the same second reason described in \secref{spmc-solution1}.
\subsection{SPMC solution - fully lock-free}
\label{spmc-solution3}
As described in \secref{shared-resources},
communication between processors across a chip occurs through cache lines,
which incurs high costs. Using locks further decreases overall performance and limits scalability.
When locks are necessary for concurrent threads to communicate,
reducing the shared resources used for communication (e.g., a spinlock or mutex) and their granularity is crucial.
Therefore, to achieve fully lock-free operation, we change the data structure to reduce the granularity of the locks.
\includegraphics[keepaspectratio, width=1\linewidth]{images/spmc-solution3}
\captionof{figure}{The left side shows that the lock protects the entire job queue to ensure exclusive access to its head for multiple threads.
The right side illustrates that each thread has its own slot for accessing jobs,
not only achieving exclusivity through data structure but also eliminating the need for shared resources for communication.}
\label{fig:spmc-solution3}
Providing each consumer with their own unique slot to access jobs addresses the problem at its root,
directly avoiding competition.
By doing so, consumers no longer rely on a shared resource for communication.
Consequently, other consumers will not be blocked by a suspended consumer holding a lock.
This approach ensures that the system maintains its progress,
as each consumer operates independently within their own slot,
which is lock-free, as shown in \fig{fig:progress-type}.
\subsection{SPMC solution - fully lock-free with CAS}
\label{SPMC-solution4}
In addition to reducing granularity,
there is another way to avoid the problem, described in \secref{spmc-solution2},
of a consumer acquiring the lock on the job queue and then being suspended, blocking the other consumers.
As described in \secref{cas}, we can use \textsc{CAS} with a loop so that the write operation achieves semantic atomicity.
\secref{spmc-solution2} uses a shared resource (an advanced form of mutex) for blocking synchronization:
the thread holding the lock causes the other threads to wait until it releases the lock.
With \textsc{CAS}, by contrast, threads that fail to update the shared variable can still continue to execute Read and Modify.
Therefore, if one thread is blocked on a failed \textsc{CAS},
another thread must have made progress,
which is lock-free, as shown in \fig{fig:progress-type}.
As described in \secref{spmc-solution2}, a blocking mechanism based on a mutex allows only one thread to be active when accessing the job queue.
Although \textsc{CAS} lets threads continue to execute Read and Modify,
this does not increase overall progress,
because that work is wasted whenever the atomic \textsc{CAS} fails.
Thus, lock-free algorithms are not inherently faster than blocking ones.
The reason for using lock-free techniques is to ensure that one blocked thread
does not cause other threads to block,
thereby ensuring that the overall system makes progress over a long period of time.
\subsection{Conclusion about lock-free}
In conclusion,
we can see that both blocking and lockless approaches have their place in software development.
They serve different purposes with their own design philosophies.
When performance is a key consideration, it is crucial to profile your application,
take advantage of every concurrency tool or mechanism, and accompany them with appropriate data structures and algorithms.
The performance impact varies with numerous factors, such as thread count and CPU architecture specifics.
Balancing complexity and performance is essential in concurrency,
a domain fraught with challenges.
\subsection{ABA problem}
CAS has been introduced as one of the read-modify-write operations.
However, the target object not changing does not necessarily mean that no other threads modified it halfway through.
If the target object is changed by another thread and then changed back, the comparison still sees the expected value.
In this case, the target object has indeed been modified, yet the operation proceeds as if it had not, compromising its atomicity.
We call this \introduce{ABA problem}.
Consider the following scenario:
\inputminted{c}{./examples/simple_aba_example.c}
The execution result would be:
\begin{ccode}
A: v = 42
B: v = 47
B: v = 42
A: v = 52
\end{ccode}
In the example provided, the ABA problem results in thread A being unaware that variable \cc{v} has been altered.
Since the comparison sees \cc{v} unchanged, \cc{v + 10} is swapped in.
Here, sleeping is used only to force the ABA problem to occur.
In a real-world scenario, instead of sleeping, thread A could be paused by a context switch to other tasks, including preemption by higher-priority tasks.
This example seems harmless, but things can get nasty when atomic \textsc{RMW} operations are used on more complex data structures.
In a broader context, the ABA problem occurs when changes happen between the load and the comparison, but the comparison cannot detect that the state of the target object is not the latest, yielding a false positive.
Returning to the thread pool example in \secref{rmw}, it contains an ABA problem as well.
In the \monobox{worker} function, a thread tries to claim a job:
\begin{ccode}
job_t *job = atomic_load(&thrd_pool->head->prev);
...
while (!atomic_compare_exchange_weak(&thrd_pool->head->prev, &job,
                                     job->prev))
    ;
\end{ccode}
Consider the following scenario:
\begin{enumerate}
\item There is only one job left.
\item Thread A loads the pointer to the job by \cc{atomic_load()}.
\item Thread A is preempted.
\item Thread B claims the job and successfully updates \cc{thrd_pool->head->prev}.
\item Thread B sets thread pool state to idle.
\item Main thread finishes waiting and adds more jobs.
\item The memory allocator reuses the recently freed memory as new job addresses.
\item By coincidence, the first job added has the same address as the one thread A held.
\item Thread A is back in the running state. The comparison succeeds, so it updates \cc{thrd_pool->head->prev} with the old \cc{job->prev}, which is already a dangling pointer.
\item Another thread loads the dangling pointer from \cc{thrd_pool->head->prev}.
\end{enumerate}
Notice that even though \cc{job->prev} is not explicitly loaded before the comparison, the compiler could place the load before it.
In the end, the dangling pointer could either point to garbage or trigger a segmentation fault.
It could be even worse if a nested ABA problem occurred in thread B.
Also, the chance of allocating a job at the same address is higher when using a memory pool, giving the ABA problem more opportunities to occur.
In fact, pre-allocated memory should be used to achieve lock-freedom anyway, since \monobox{malloc} may involve a mutex in multi-threaded environments.
Because comparison alone cannot determine whether the target object has changed, \textsc{CAS} can return true as a false positive.
Thus, the atomicity that \textsc{CAS} is supposed to provide is not guaranteed.
The general concept for solving this problem is to add more information so that different states are distinguishable, and then decide whether to act on the old state or retry with the new state.
If acting on the old state is chosen, then safe memory reclamation should be considered, as the memory may have already been freed by other threads.
More aggressively, one might consider a programming paradigm in which operations on the target object have no side effects that modify it.
In a later section, we will introduce a different way of implementing atomic \textsc{RMW} operations using \textsc{LL/SC} instructions; the exclusive status provided by \textsc{LL/SC} avoids the pitfalls introduced by comparison.
To make different states distinguishable, a common solution is to increment a version number each time the target object is changed.
By bundling the target object and the version number into one comparison, every change yields a distinguishable result.
Given a sufficiently large version number, there should be no repeated versions in practice.
There are multiple methods for storing the version number, depending on the evaluation of the duration before a version number wraps around.
In the thread pool example, the target object is a pointer. The unused bits in a pointer can be utilized to store the version number.
In addition to embedding the version number into a pointer, we could consider utilizing an additional 32-bit or 64-bit value next to the target object for the version number.
It requires the compare-and-swap instruction to be capable of comparing a wider size at once.
Sometimes, this is referred to as \introduce{double-width compare-and-swap}.
On x86-64 processors, atomic instructions that load or store more than a \textsc{CPU} word require additional hardware support;
you can use \monobox{grep cx16 /proc/cpuinfo} to check whether the processor supports 16-byte compare-and-swap.
For hardware that does not support the desired size, software implementations, which may involve locks, are used instead, as mentioned in \secref{atomictype}.
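A hedged sketch of the version-number idea, bundling a pointer and a counter into one double-width \textsc{CAS} (the names here are illustrative, not taken from the example code):
\begin{ccode}
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    void *ptr;     /* the target object */
    uintptr_t ver; /* incremented on every update */
} tagged_ptr;

/* 16 bytes on 64-bit targets; needs -mcx16 on x86-64 and
 * may fall back to a lock on other hardware. */
_Atomic tagged_ptr head;

bool replace_head(tagged_ptr expected, void *desired)
{
    tagged_ptr next = {desired, expected.ver + 1};
    /* Fails if either the pointer or the version moved, so an
     * A->B->A history is caught by the version mismatch. */
    return atomic_compare_exchange_strong(&head, &expected, next);
}
\end{ccode}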
Back to the example: the ABA problem in the following code is fixed by using a version number that increments each time a job is added to the empty queue. On x86-64, add the compiler flag \monobox{-mcx16} to enable 16-byte compare-and-swap in the \monobox{worker} function.
\inputminted{c}{./examples/rmw_example_aba.c}
Notice that in \cc{struct idle_job}, a union is used for type punning, bundling the pointer and the version number for compare-and-swap.
Directly casting a job pointer to a pointer to a 16-byte object would be undefined behavior (the alignments differ), so type punning is used instead.
With this technique, \cc{struct idle_job} can still be accessed normally elsewhere, minimizing code modification.
Compiler optimizations are conservative about type punning, but it is acceptable for atomic operations.
See \secref{fusing}.
Another way to prevent the ABA problem in the example is to use a safe memory reclamation mechanism.
In contrast to acting on the old state, as mentioned previously, the address of a job is not freed until no thread is using it.
This prevents the memory allocator or memory pool from reusing the address and causing problems.
\section{Sequential consistency on weakly-ordered hardware}
Different hardware architectures provide different ordering guarantees, or \introduce{memory models}.
For instance, x64 architecture\punckern\footnote{%
Also known as x86-64, x64 is a 64-bit extension of the x86 instruction set, officially unveiled in 1999.
This extension heralded the introduction of two novel operation modes:
64-bit mode for leveraging the full potential of 64-bit processing and compatibility mode for maintaining support for 32-bit applications.
Initially developed by AMD and publicly released in 2000, the x64 architecture has since been adopted by Intel and VIA,
signaling a unified industry shift towards 64-bit computing.
This wide adoption marked the effective obsolescence of the Intel Itanium architecture (IA-64),
despite its initial design to supersede the x86 architecture.
} is known to be \introduce{strongly-ordered},
generally ensuring a global sequence for loads and stores in most scenarios.
Conversely, architectures like \textsc{Arm} are considered \introduce{weakly-ordered},
meaning one should not expect loads and stores to follow the program sequence without explicit instructions to the \textsc{CPU}.
These instructions, known as \introduce{memory barriers}, are essential to prevent the reordering of these operations.
It is helpful to see how atomic operations work in a weakly-ordered system,
both to understand what's happening in hardware,
and to see why the \clang{} and \cplusplus{} concurrency models were designed as they were.\punckern\footnote{%
It is worth noting that the concepts we discuss here are not specific to \clang{} and \cplusplus{}.
Other systems programming languages like D and Rust have converged on similar models.}
Let's examine \textsc{Arm}, since it is both popular and straightforward.
Consider the simplest atomic operations: loads and stores.
Given some \cc|atomic_int foo|,
\newline