\begin{aosachapter}{DBDB: Dog Bed Database}{s:data-store}{Taavi Burns}
\emph{As the newest bass (and sometimes tenor) in
\href{http://www.countermeasuremusic.com}{Countermeasure}, Taavi strives
to break the mould\ldots{} sometimes just by ignoring its existence.
This is certainly true through the diversity of workplaces in his
career: IBM (doing C and Perl), FreshBooks (all the things), Points.com
(doing Python), and now at PagerDuty (doing Scala). Aside from
that---when not gliding along on his Brompton folding bike---you might
find him playing Minecraft with his son or engaging in parkour (or rock
climbing, or other adventures) with his wife. He knits continental.}
\aosasecti{Introduction}\label{introduction}
DBDB (Dog Bed Database) is a Python library that implements a simple
key/value database. It lets you associate a key with a value, and store
that association on disk for later retrieval.
DBDB aims to preserve data in the face of computer crashes and error
conditions. It also avoids holding all data in RAM at once so you can
store more data than you have RAM.
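In use, DBDB behaves like a Python dictionary that you can
\texttt{commit()} to disk. A minimal session might look like this
(\texttt{example.db} is just an illustrative file name):
\begin{verbatim}
# A minimal DBDB session.
import dbdb

db = dbdb.connect('example.db')  # open (or create) the database file
db['foo'] = 'bar'                # stage an update in memory
db.commit()                      # make it durable on disk
print(db['foo'])                 # prints: bar
\end{verbatim}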
\aosasecti{Memory}\label{memory}
I remember the first time I was really stuck on a bug. When I finished
typing in my BASIC program and ran it, weird sparkly pixels showed up on
the screen, and the program aborted early. When I went back to look at
the code, the last few lines of the program were gone.
One of my mom's friends knew how to program, so we set up a call. Within
a few minutes of speaking with her, I found the problem: the program was
too big, and had encroached onto video memory. Clearing the screen
truncated the program, and the sparkles were artifacts of Applesoft
BASIC's behaviour of storing program state in RAM just beyond the end of
the program.
From that moment onwards, I cared about memory allocation. I learned
about pointers and how to allocate memory with malloc. I learned how my
data structures were laid out in memory. And I learned to be very, very
careful about how I changed them.
Some years later, while reading about a process-oriented language called
Erlang, I learned that it didn't actually have to copy data to send
messages between processes, because everything was immutable. I then
discovered immutable data structures in Clojure, and it really began to
sink in.
When I read about CouchDB in 2013, I just smiled and nodded, recognising
the structures and mechanisms for managing complex data as it changes.
I learned that you can design systems built around immutable data.
Then I agreed to write a book chapter.
I thought that describing the core data storage concepts of CouchDB (as
I understood them) would be fun.
While trying to write a binary tree algorithm that mutated the tree in
place, I got frustrated with how complicated things were getting. The
number of edge cases and trying to reason about how changes in one part
of the tree affected others was making my head hurt. I had no idea how I
was going to explain all of this.
Remembering lessons learned, I took a peek at a recursive algorithm for
updating immutable binary trees and it turned out to be relatively
straightforward.
I learned, once again, that it's easier to reason about things that
don't change.
So starts the story.
\aosasecti{Why Is it Interesting?}\label{why-is-it-interesting}
Most projects require a database of some kind. You really shouldn't
write your own; there are many edge cases that will bite you, even if
you're just writing JSON to disk:
\begin{aosaitemize}
\item
What happens if your filesystem runs out of space?
\item
What happens if your laptop battery dies while saving?
\item
What if your data size exceeds available memory? (Unlikely for most
applications on modern desktop computers\ldots{} but not unlikely for
a mobile device or server-side web application.)
\end{aosaitemize}
However, if you want to \emph{understand} how a database handles all of
these problems, writing one for yourself can be a good idea.
The techniques and concepts we discuss here should be applicable to any
problem that needs to have rational, predictable behaviour when faced
with failure.
Speaking of failure\ldots{}
\aosasecti{Characterizing Failure}\label{characterizing-failure}
Databases are often characterized by how closely they adhere to the ACID
properties: atomicity, consistency, isolation, and durability.
Updates in DBDB are atomic and durable, two attributes which are
described later in the chapter. DBDB provides no consistency guarantees
as there are no constraints on the data stored. Isolation is likewise
not implemented.
Application code can, of course, impose its own consistency guarantees,
but proper isolation requires a transaction manager. We won't attempt
that here; however, you can learn more about transaction management in
the CircleDB chapter (\aosachapref{s:functionalDB}).
We also have other system-maintenance problems to think about. Stale
data is not reclaimed in this implementation, so repeated updates (even
to the same key) will eventually consume all disk space. (You will
shortly discover why this is the case.)
\href{http://www.postgresql.org/}{PostgreSQL} calls this reclamation
``vacuuming'' (which makes old row space available for re-use), and
\href{http://couchdb.apache.org/}{CouchDB} calls it ``compaction'' (by
rewriting the ``live'' parts of the data store into a new file, and
atomically moving it over the old one).
DBDB could be enhanced to add a compaction feature, but it is left as an
exercise for the reader\footnote{Bonus feature: Can you guarantee that
the compacted tree structure is balanced? This helps maintain
performance over time.}.
\aosasecti{The Architecture of DBDB}\label{the-architecture-of-dbdb}
DBDB separates the concerns of ``put this on disk somewhere'' (how data
are laid out in a file; the physical layer) from the logical structure
of the data (a binary tree in this example; the logical layer) from the
contents of the key/value store (the association of key \texttt{a} to
value \texttt{foo}; the public API).
Many databases separate the logical and physical aspects, as it is
often useful to provide alternative implementations of each to get
different performance characteristics, e.g.~DB2's SMS (files in a
filesystem) versus DMS (raw block device) tablespaces, or MySQL's
\href{http://dev.mysql.com/doc/refman/5.7/en/storage-engines.html}{alternative
engine implementations}.
\aosasecti{Discovering the Design}\label{discovering-the-design}
Most of the chapters in this book describe how a program was built from
inception to completion. However, that is not how most of us interact
with the code we're working on. We most often discover code that was
written by others, and figure out how to modify or extend it to do
something different.
In this chapter, we'll assume that DBDB is a completed project, and walk
through it to learn how it works. Let's explore the structure of the
entire project first.
\aosasectii{Organisational Units}\label{organisational-units}
Units are ordered here by distance from the end user; that is, the first
module is the one that a user of this program would likely need to know
the most about, while the last is something they should have very little
interaction with.
\begin{aosaitemize}
\item
\texttt{tool.py} defines a command-line tool for exploring a database
from a terminal window.
\item
\texttt{interface.py} defines a class (\texttt{DBDB}) which implements
the Python dictionary API using the concrete \texttt{BinaryTree}
implementation. This is how you'd use DBDB inside a Python program.
\item
\texttt{logical.py} defines the logical layer. It's an abstract
interface to a key/value store.
\begin{aosaitemize}
\item
\texttt{LogicalBase} provides the API for logical updates (like get,
set, and commit) and defers to a concrete subclass to implement the
updates themselves. It also manages storage locking and
dereferencing internal nodes.
\item
\texttt{ValueRef} is a Python object that refers to a binary blob
stored in the database. The indirection lets us avoid loading the
entire data store into memory all at once.
\end{aosaitemize}
\item
\texttt{binary\_tree.py} defines a concrete binary tree algorithm
underneath the logical interface.
\begin{aosaitemize}
\item
\texttt{BinaryTree} provides a concrete implementation of a binary
tree, with methods for getting, inserting, and deleting key/value
pairs. \texttt{BinaryTree} represents an immutable tree; updates are
performed by returning a new tree which shares common structure with
the old one.
\item
\texttt{BinaryNode} implements a node in the binary tree.
\item
\texttt{BinaryNodeRef} is a specialised \texttt{ValueRef} which
knows how to serialise and deserialise a \texttt{BinaryNode}.
\end{aosaitemize}
\item
\texttt{physical.py} defines the physical layer. The \texttt{Storage}
class provides persistent, (mostly) append-only record storage.
\end{aosaitemize}
These modules grew from attempting to give each class a single
responsibility. In other words, each class should have only one reason
to change.
\aosasectii{Reading a Value}\label{reading-a-value}
We'll start with the simplest case: reading a value from the database.
Let's see what happens when we try to get the value associated with key
\texttt{foo} in \texttt{example.db}:
\begin{verbatim}
$ python -m dbdb.tool example.db get foo
\end{verbatim}
This runs the \texttt{main()} function from module \texttt{dbdb.tool}:
\begin{verbatim}
# dbdb/tool.py
def main(argv):
    if not (4 <= len(argv) <= 5):
        usage()
        return BAD_ARGS
    dbname, verb, key, value = (argv[1:] + [None])[:4]
    if verb not in {'get', 'set', 'delete'}:
        usage()
        return BAD_VERB
    db = dbdb.connect(dbname)          # CONNECT
    try:
        if verb == 'get':
            sys.stdout.write(db[key])  # GET VALUE
        elif verb == 'set':
            db[key] = value
            db.commit()
        else:
            del db[key]
            db.commit()
    except KeyError:
        print("Key not found", file=sys.stderr)
        return BAD_KEY
    return OK
\end{verbatim}
The \texttt{connect()} function opens the database file (possibly
creating it, but never overwriting it) and returns an instance of
\texttt{DBDB}:
\begin{verbatim}
# dbdb/__init__.py
def connect(dbname):
    try:
        f = open(dbname, 'r+b')
    except IOError:
        fd = os.open(dbname, os.O_RDWR | os.O_CREAT)
        f = os.fdopen(fd, 'r+b')
    return DBDB(f)
\end{verbatim}
\begin{verbatim}
# dbdb/interface.py
class DBDB(object):

    def __init__(self, f):
        self._storage = Storage(f)
        self._tree = BinaryTree(self._storage)
\end{verbatim}
We see right away that \texttt{DBDB} has a reference to an instance of
\texttt{Storage}, but it also shares that reference with
\texttt{self.\_tree}. Why? Can't \texttt{self.\_tree} completely manage
access to the storage by itself?
The question of which objects ``own'' a resource is often an important
one in a design, because it gives us hints about what changes might be
unsafe. Let's keep that question in mind as we move on.
Once we have a DBDB instance, getting the value at \texttt{key} is done
via a dictionary lookup (\texttt{db{[}key{]}}), which causes the Python
interpreter to call \texttt{DBDB.\_\_getitem\_\_()}.
\begin{verbatim}
# dbdb/interface.py
class DBDB(object):
    # ...
    def __getitem__(self, key):
        self._assert_not_closed()
        return self._tree.get(key)

    def _assert_not_closed(self):
        if self._storage.closed:
            raise ValueError('Database closed.')
\end{verbatim}
\texttt{\_\_getitem\_\_()} ensures that the database is still open by
calling \texttt{\_assert\_not\_closed}. Aha! Here we see at least one
reason why \texttt{DBDB} needs direct access to our \texttt{Storage}
instance: so it can enforce preconditions. (Do you agree with this
design? Can you think of a different way that we could do this?)
DBDB then retrieves the value associated with \texttt{key} on the
internal \texttt{\_tree} by calling \texttt{\_tree.get()}, which is
provided by \texttt{LogicalBase}:
\begin{verbatim}
# dbdb/logical.py
class LogicalBase(object):
    # ...
    def get(self, key):
        if not self._storage.locked:
            self._refresh_tree_ref()
        return self._get(self._follow(self._tree_ref), key)
\end{verbatim}
\texttt{get()} checks if we have the storage locked. We're not 100\%
sure \emph{why} there might be a lock here, but we can guess that it
probably exists to allow writers to serialize access to the data. What
happens if the storage isn't locked?
\begin{verbatim}
# dbdb/logical.py
class LogicalBase(object):
    # ...
    def _refresh_tree_ref(self):
        self._tree_ref = self.node_ref_class(
            address=self._storage.get_root_address())
\end{verbatim}
\texttt{\_refresh\_tree\_ref} resets the tree's ``view'' of the data
with what is currently on disk, allowing us to perform a completely
up-to-date read.
What if storage \emph{is} locked when we attempt a read? This means that
some other process is probably changing the data we want to read right
now; our read is not likely to be up-to-date with the current state of
the data. This is generally known as a ``dirty read''. This pattern
allows many readers to access data without ever worrying about blocking,
at the expense of being slightly out-of-date.
For now, let's take a look at how we actually retrieve the data:
\begin{verbatim}
# dbdb/binary_tree.py
class BinaryTree(LogicalBase):
    # ...
    def _get(self, node, key):
        while node is not None:
            if key < node.key:
                node = self._follow(node.left_ref)
            elif node.key < key:
                node = self._follow(node.right_ref)
            else:
                return self._follow(node.value_ref)
        raise KeyError
\end{verbatim}
This is a standard binary tree search, following refs to their nodes. We
know from reading the \texttt{BinaryTree} documentation that
\texttt{Node}s and \texttt{NodeRef}s are value objects: they are
immutable and their contents never change. \texttt{Node}s are created
with an associated key and value, and left and right children. Those
associations also never change. The content of the whole
\texttt{BinaryTree} only visibly changes when the root node is replaced.
This means that we don't need to worry about the contents of our tree
being changed while we are performing the search.
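The chapter's code never shows \texttt{BinaryNode} itself, but from the
calls we've seen (the five-argument constructor and
\texttt{BinaryNode.from\_node(node, ...)}) we can sketch what such a
value object plausibly looks like; the real implementation may differ,
for example in how it maintains \texttt{length}:
\begin{verbatim}
# Sketch of an immutable node: attributes are assigned once at
# construction and never rebound afterwards.
class BinaryNode(object):
    def __init__(self, left_ref, key, value_ref, right_ref, length):
        self.left_ref = left_ref
        self.key = key
        self.value_ref = value_ref
        self.right_ref = right_ref
        self.length = length

    @classmethod
    def from_node(cls, node, **kwargs):
        # "Copy with changes": keep every field of the old node
        # except those explicitly overridden by the caller.
        return cls(
            left_ref=kwargs.get('left_ref', node.left_ref),
            key=kwargs.get('key', node.key),
            value_ref=kwargs.get('value_ref', node.value_ref),
            right_ref=kwargs.get('right_ref', node.right_ref),
            length=kwargs.get('length', node.length),
        )
\end{verbatim}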
Once the associated value is found, it is written to \texttt{stdout} by
\texttt{main()} without adding any extra newlines, to preserve the
user's data exactly.
\aosasectii{Inserting and Updating}\label{inserting-and-updating}
Now we'll set key \texttt{foo} to value \texttt{bar} in
\texttt{example.db}:
\begin{verbatim}
$ python -m dbdb.tool example.db set foo bar
\end{verbatim}
Again, this runs the \texttt{main()} function from module
\texttt{dbdb.tool}. Since we've seen this code before, we'll just
highlight the important parts:
\begin{verbatim}
# dbdb/tool.py
def main(argv):
    ...
    db = dbdb.connect(dbname)          # CONNECT
    try:
        ...
        elif verb == 'set':
            db[key] = value            # SET VALUE
            db.commit()                # COMMIT
        ...
    except KeyError:
        ...
\end{verbatim}
This time we set the value with \texttt{db{[}key{]} = value} which calls
\texttt{DBDB.\_\_setitem\_\_()}.
\begin{verbatim}
# dbdb/interface.py
class DBDB(object):
    # ...
    def __setitem__(self, key, value):
        self._assert_not_closed()
        return self._tree.set(key, value)
\end{verbatim}
\texttt{\_\_setitem\_\_} ensures that the database is still open and
then stores the association from \texttt{key} to \texttt{value} on the
internal \texttt{\_tree} by calling \texttt{\_tree.set()}.
\texttt{\_tree.set()} is provided by \texttt{LogicalBase}:
\begin{verbatim}
# dbdb/logical.py
class LogicalBase(object):
    # ...
    def set(self, key, value):
        if self._storage.lock():
            self._refresh_tree_ref()
        self._tree_ref = self._insert(
            self._follow(self._tree_ref), key, self.value_ref_class(value))
\end{verbatim}
\texttt{set()} first checks the storage lock:
\begin{verbatim}
# dbdb/physical.py
class Storage(object):
    ...
    def lock(self):
        if not self.locked:
            portalocker.lock(self._f, portalocker.LOCK_EX)
            self.locked = True
            return True
        else:
            return False
\end{verbatim}
There are two important things to note here:
\begin{aosaitemize}
\item
Our lock is provided by a third-party file-locking library called
\href{https://pypi.python.org/pypi/portalocker}{portalocker}.
\item
\texttt{lock()} returns \texttt{False} if the database was already
locked, and \texttt{True} otherwise.
\end{aosaitemize}
Returning to \texttt{\_tree.set()}, we can now understand why it checked
the return value of \texttt{lock()} in the first place: it lets us call
\texttt{\_refresh\_tree\_ref} for the most recent root node reference so
we don't lose updates that another process may have made since we last
refreshed the tree from disk. Then it replaces the root tree node with a
new tree containing the inserted (or updated) key/value.
Inserting or updating the tree doesn't mutate any nodes, because
\texttt{\_insert()} returns a new tree. The new tree shares unchanged
parts with the previous tree to save on memory and execution time. It's
natural to implement this recursively:
\begin{verbatim}
# dbdb/binary_tree.py
class BinaryTree(LogicalBase):
    # ...
    def _insert(self, node, key, value_ref):
        if node is None:
            new_node = BinaryNode(
                self.node_ref_class(), key, value_ref, self.node_ref_class(), 1)
        elif key < node.key:
            new_node = BinaryNode.from_node(
                node,
                left_ref=self._insert(
                    self._follow(node.left_ref), key, value_ref))
        elif node.key < key:
            new_node = BinaryNode.from_node(
                node,
                right_ref=self._insert(
                    self._follow(node.right_ref), key, value_ref))
        else:
            new_node = BinaryNode.from_node(node, value_ref=value_ref)
        return self.node_ref_class(referent=new_node)
\end{verbatim}
Notice how we always return a new node (wrapped in a \texttt{NodeRef}).
Instead of updating a node to point to a new subtree, we make a new node
which shares the unchanged subtree. This is what makes this binary tree
an immutable data structure.
You may have noticed something strange here: we haven't made any changes
to anything on disk yet. All we've done is manipulate our view of the
on-disk data by moving tree nodes around.
In order to actually write these changes to disk, we need an explicit
call to \texttt{commit()}, which we saw as the second part of our
\texttt{set} operation in \texttt{tool.py} at the beginning of this
section.
Committing involves writing out all of the dirty state in memory, and
then saving the disk address of the tree's new root node.
Starting from the API:
\begin{verbatim}
# dbdb/interface.py
class DBDB(object):
    # ...
    def commit(self):
        self._assert_not_closed()
        self._tree.commit()
\end{verbatim}
The implementation of \texttt{\_tree.commit()} comes from
\texttt{LogicalBase}:
\begin{verbatim}
# dbdb/logical.py
class LogicalBase(object):
    # ...
    def commit(self):
        self._tree_ref.store(self._storage)
        self._storage.commit_root_address(self._tree_ref.address)
\end{verbatim}
All \texttt{NodeRef}s know how to serialise themselves to disk by first
asking their children to serialise via \texttt{prepare\_to\_store()}:
\begin{verbatim}
# dbdb/logical.py
class ValueRef(object):
    # ...
    def store(self, storage):
        if self._referent is not None and not self._address:
            self.prepare_to_store(storage)
            self._address = storage.write(self.referent_to_string(self._referent))
\end{verbatim}
\texttt{self.\_tree\_ref} in \texttt{LogicalBase} is actually a
\texttt{BinaryNodeRef} (a subclass of \texttt{ValueRef}) in this case,
so the concrete implementation of \texttt{prepare\_to\_store()} is:
\begin{verbatim}
# dbdb/binary_tree.py
class BinaryNodeRef(ValueRef):
    def prepare_to_store(self, storage):
        if self._referent:
            self._referent.store_refs(storage)
\end{verbatim}
The \texttt{BinaryNode} in question, \texttt{\_referent}, asks its refs
to store themselves:
\begin{verbatim}
# dbdb/binary_tree.py
class BinaryNode(object):
    # ...
    def store_refs(self, storage):
        self.value_ref.store(storage)
        self.left_ref.store(storage)
        self.right_ref.store(storage)
\end{verbatim}
This recurses all the way down for any \texttt{NodeRef} which has
unwritten changes (i.e., no \texttt{\_address}).
Now we're back up the stack in \texttt{ValueRef}'s \texttt{store} method
again. The last step of \texttt{store()} is to serialise this node and
save its storage address:
\begin{verbatim}
# dbdb/logical.py
class ValueRef(object):
    # ...
    def store(self, storage):
        if self._referent is not None and not self._address:
            self.prepare_to_store(storage)
            self._address = storage.write(self.referent_to_string(self._referent))
\end{verbatim}
At this point the \texttt{NodeRef}'s \texttt{\_referent} is guaranteed
to have addresses available for all of its own refs, so we serialise it
by creating a bytestring representing this node:
\begin{verbatim}
# dbdb/binary_tree.py
class BinaryNodeRef(ValueRef):
    # ...
    @staticmethod
    def referent_to_string(referent):
        return pickle.dumps({
            'left': referent.left_ref.address,
            'key': referent.key,
            'value': referent.value_ref.address,
            'right': referent.right_ref.address,
            'length': referent.length,
        })
\end{verbatim}
Updating the address in the \texttt{store()} method is technically a
mutation of the \texttt{ValueRef}. Because it has no effect on the
user-visible value, we can consider it to be immutable.
Once \texttt{store()} on the root \texttt{\_tree\_ref} is complete (in
\texttt{LogicalBase.commit()}), we know that all of the data are written
to disk. We can now commit the root address by calling:
\begin{verbatim}
# dbdb/physical.py
class Storage(object):
    # ...
    def commit_root_address(self, root_address):
        self.lock()
        self._f.flush()
        self._seek_superblock()
        self._write_integer(root_address)
        self._f.flush()
        self.unlock()
\end{verbatim}
We ensure that the file handle is flushed (so that the OS knows we want
all the data saved to stable storage like an SSD) and write out the
address of the root node. We know this last write is atomic because we
store the disk address on a sector boundary. It's the very first thing
in the file, so this is true regardless of sector size, and
single-sector disk writes are guaranteed atomic by the disk hardware.
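The chapter doesn't show \texttt{\_write\_integer}, but the usual way to
get this behaviour (and a reasonable guess at the implementation) is to
pack the address as a fixed-width integer and write it at offset 0 with
\texttt{struct}; the names below are illustrative:
\begin{verbatim}
# Sketch: write the root address as one small, sector-aligned write.
import os
import struct

INTEGER_FORMAT = '!Q'  # 8-byte unsigned big-endian integer

def write_root_address(f, root_address):
    f.seek(0)  # the superblock lives at the very start of the file
    f.write(struct.pack(INTEGER_FORMAT, root_address))
    f.flush()
    os.fsync(f.fileno())  # ask the OS to push it to stable storage
\end{verbatim}
An 8-byte write starting at offset 0 can never straddle a sector
boundary, which is what makes the single-write atomicity argument hold.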
Because the root node address has either the old or new value (never a
bit of old and a bit of new), other processes can read from the database
without getting a lock. An external process might see the old or the new
tree, but never a mix of the two. In this way, commits are atomic.
Because we write the new data to disk and call the \texttt{fsync}
syscall\footnote{Calling \texttt{fsync} on a file descriptor asks the
operating system and hard drive (or SSD) to write all buffered data
immediately. Operating systems and drives don't usually write
everything immediately in order to improve performance.} before we
write the root node address, uncommitted data are unreachable.
Conversely, once the root node address has been updated, we know that
all the data it references are also on disk. In this way, commits are
also durable.
We're done!
\aosasectii{How NodeRefs Save Memory}\label{how-noderefs-save-memory}
To avoid keeping the entire tree structure in memory at the same time,
when a logical node is read in from disk, the disk addresses of its left
and right children (as well as of its value) are loaded into memory.
Accessing children and their values requires one extra function call to
\texttt{NodeRef.get()} to dereference (``really get'') the data.
All we need to construct a \texttt{NodeRef} is an address:
\begin{verbatim}
+---------+
| NodeRef |
| ------- |
| addr=3  |
| get()   |
+---------+
\end{verbatim}
Calling \texttt{get()} on it will return the concrete node, along with
that node's references as \texttt{NodeRef}s:
\begin{verbatim}
+---------+     +---------+     +---------+
| NodeRef |     | Node    |     | NodeRef |
| ------- |     | ------- | +-> | ------- |
| addr=3  |     | key=A   | |   | addr=1  |
| get() ------> | value=B | |   +---------+
+---------+     | left -----+
                | right ----+   +---------+
                +---------+ |   | NodeRef |
                            +-> | ------- |
                                | addr=2  |
                                +---------+
\end{verbatim}
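In code, that dereference is just a lazy read. The chapter doesn't
reproduce \texttt{ValueRef.get()}, but a plausible minimal version
(assuming \texttt{string\_to\_referent} is the deserialising counterpart
of \texttt{referent\_to\_string}) looks like:
\begin{verbatim}
# dbdb/logical.py (sketch; details may differ from the real code)
class ValueRef(object):
    # ...
    def get(self, storage):
        # Read and deserialise the referent from disk on first use only.
        if self._referent is None and self._address:
            self._referent = self.string_to_referent(
                storage.read(self._address))
        return self._referent
\end{verbatim}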
When changes to the tree are not committed, they exist in memory with
references from the root down to the changed leaves. The changes aren't
saved to disk yet, so the changed nodes contain concrete keys and values
and no disk addresses. The process doing the writing can see uncommitted
changes and can make more changes before issuing a commit, because
\texttt{NodeRef.get()} will return the uncommitted value if it has one;
there is no difference between committed and uncommitted data when
accessed through the API. All the updates will appear atomically to
other readers because changes aren't visible until the new root node
address is written to disk. Concurrent updates are blocked by a lockfile
on disk. The lock is acquired on first update, and released after
commit.
\aosasectii{Exercises for the Reader}\label{exercises-for-the-reader}
DBDB allows many processes to read the same database at once without
blocking; the tradeoff is that readers can sometimes retrieve stale
data. What if we needed to be able to read some data consistently? A
common use case is reading a value and then updating it based on that
value. How would you write a method on \texttt{DBDB} to do this? What
tradeoffs would you have to incur to provide this functionality?
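One possible shape for such a method, sketched under the assumption that
taking the writer lock up front is acceptable: lock before reading, so
the value cannot change between the read and the write. The cost is that
the lock is held across the caller's function, blocking other writers
for longer, and a production version would also need to release the lock
if anything raises:
\begin{verbatim}
# Hypothetical read-modify-write method on DBDB (one possible answer).
class DBDB(object):
    # ...
    def update(self, key, fn):
        self._assert_not_closed()
        if self._storage.lock():            # block other writers
            self._tree._refresh_tree_ref()  # see the latest committed root
        self._tree.set(key, fn(self._tree.get(key)))
        self._tree.commit()                 # commit releases the lock
\end{verbatim}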
The algorithm used to update the data store can be completely changed
out by replacing the string \texttt{BinaryTree} in
\texttt{interface.py}. Data stores tend to use more complex types of
search trees such as B-trees, B+ trees, and others to improve the
performance. While a balanced binary tree (and this one isn't) needs to
do $O(\log_2 n)$ random node reads to find a value, a B+ tree needs many
fewer, for example $O(\log_{32} n)$, because each node splits 32 ways
instead of just 2. This makes a huge difference in practice, since
looking through 4 billion entries would go from $\log_2(2^{32}) = 32$ to
$\log_{32}(2^{32}) \approx 6.4$ lookups. Each lookup is a random access,
which is incredibly expensive for hard disks with spinning platters.
SSDs help with the latency, but the savings in I/O still stand.
By default, values are stored by \texttt{ValueRef} which expects bytes
as values (to be passed directly to \texttt{Storage}). The binary tree
nodes themselves are just a subclass of \texttt{ValueRef}. Storing
richer data via JSON or msgpack is a matter of writing your own and
setting it as the \texttt{value\_ref\_class}. \texttt{BinaryNodeRef} is
an example of using
\href{https://docs.python.org/3.4/library/pickle.html}{pickle} to
serialise data.
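For instance, a JSON-speaking ref only needs to override the two
serialisation hooks (assuming, as \texttt{BinaryNodeRef} suggests, that
\texttt{string\_to\_referent} is the inverse of
\texttt{referent\_to\_string}); the classes below are hypothetical:
\begin{verbatim}
# A hypothetical ValueRef subclass that stores values as JSON.
import json

class JsonValueRef(ValueRef):
    @staticmethod
    def referent_to_string(referent):
        return json.dumps(referent).encode('utf-8')

    @staticmethod
    def string_to_referent(string):
        return json.loads(string.decode('utf-8'))

class JsonBinaryTree(BinaryTree):
    value_ref_class = JsonValueRef  # plug it into the logical layer
\end{verbatim}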
Database compaction is another interesting exercise. Compacting can be
done via an infix-of-median traversal of the tree, writing things out as
you go. It's probably best if the tree nodes all go together, since
they're what's traversed to find any piece of data. Packing as many
intermediate nodes as possible into a disk sector should improve read
performance, at least right after compaction. There are some subtleties
here (for example, memory usage) if you try to implement this yourself.
And remember: always benchmark performance enhancements before and
after! You'll often be surprised by the results.
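Measuring "before and after" can be as simple as timing a burst of reads
with \texttt{timeit} (the file and key names here are illustrative):
\begin{verbatim}
# Sketch: time repeated reads against a database file.
import timeit

import dbdb

db = dbdb.connect('example.db')
elapsed = timeit.timeit(lambda: db['foo'], number=10000)
print('10000 gets: %.3f seconds' % elapsed)
\end{verbatim}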
\aosasectii{Patterns and Principles}\label{patterns-and-principles}
Test interfaces, not implementation. As part of developing DBDB, I wrote
a number of tests that described how I wanted to be able to use it. The
first tests ran against an in-memory version of the database, then I
extended DBDB to persist to disk, and even later added the concept of
NodeRefs. Most of the tests didn't have to change, which gave me
confidence that things were still working as expected.
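A test in that style exercises only the public dictionary API, which is
why it survived those internal rewrites. A sketch (the test and file
names are illustrative):
\begin{verbatim}
# Interface-level test: nothing here mentions trees, refs, or storage.
import dbdb

def test_committed_value_is_visible_to_a_new_connection():
    db = dbdb.connect('test.db')
    db['answer'] = '42'
    db.commit()

    fresh = dbdb.connect('test.db')  # a second connection re-reads disk
    assert fresh['answer'] == '42'
\end{verbatim}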
Respect the Single Responsibility Principle. Classes should have at most
one reason to change. That's not strictly the case with DBDB, but there
are multiple avenues of extension with only localised changes required.
Refactoring as I added features was a pleasure!
\aosasectii{Summary}\label{summary}
DBDB is a simple database that makes simple guarantees, and yet things
still became complicated in a hurry. The most important thing I did to
manage this complexity was to implement an ostensibly mutable object
with an immutable data structure. I encourage you to consider this
technique the next time you find yourself in the middle of a tricky
problem that seems to have more edge cases than you can keep track of.
\end{aosachapter}