This is my learning project for B+Tree (and also my first Kotlin experience). It is heavily inspired by InnoDB. So far, the features below are implemented in a total of 963 lines of code:
- Basic get/put/delete/scan
- Split page at the middle
- Read page from memory if it exists, otherwise read from the file
- Write page to the file synchronously
I stopped here since I had learned a lot. The list below is left as TODOs:
- Merge pages
- Logical delete
- Split by different strategy
- Asynchronous write + WAL
- MVCC
- etc.
- B+Tree index structures in InnoDB – Jeremy Cole
  - This is the bible for this project.
- Part 7 - Introduction to the B-Tree | Let’s Build a Simple Database
  - This series of posts is great for understanding the basics of B+Tree.
This B+Tree implementation is mostly generic/schema-free. You can use any byte-array representation for keys and values, but I use Apache Avro for demo purposes since it is sortable even in byte-array form (i.e., without decoding).
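For what it's worth, Avro can compare records directly in their serialized form: org.apache.avro.io.BinaryData.compare walks the encoded bytes using the schema instead of deserializing them, which is likely what "sortable without decoding" refers to. A minimal sketch, assuming a key schema shaped like the sample keys used in the session below (the actual schema lives under the /sample directory):

```kotlin
import org.apache.avro.Schema
import org.apache.avro.io.BinaryData

// Hypothetical key schema, shaped like the sample keys used in the session below;
// the real schema lives under the /sample directory.
val keySchema: Schema = Schema.Parser().parse(
    """
    {"type": "record", "name": "Key", "fields": [
        {"name": "name", "type": "string"},
        {"name": "timestamp", "type": "long"}
    ]}
    """.trimIndent()
)

// Compare two Avro-encoded keys directly in their binary form,
// without deserializing them into objects.
fun compareEncodedKeys(a: ByteArray, b: ByteArray): Int =
    BinaryData.compare(a, 0, b, 0, keySchema)
```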
First, initialize the db file:
$ ./runsample init foo.db
Then, you can get/put/delete/scan the B+Tree (the default schema is under the /sample directory):
$ ./runsample open foo.db
COMMAND: put
KEY: {"name": "foo", "timestamp": 1}
VALUE: {"value": "bbb"}
COMMAND: get
KEY: {"name": "foo", "timestamp": 1}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
COMMAND: put
KEY: {"name": "foo", "timestamp": 10}
VALUE: {"value": "bbb"}
COMMAND: scan
START KEY: {"name": "foo", "timestamp": 0}
END KEY: {"name": "foo", "timestamp": 100}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
{"name": "foo", "timestamp": 10} {"value": "bbb"}
COMMAND: put
KEY: {"name": "bar", "timestamp": 10}
VALUE: {"value": "bbb"}
COMMAND: put
KEY: {"name": "bar", "timestamp": 1}
VALUE: {"value": "bbb"}
COMMAND: scan
START KEY: {"name": "foo", "timestamp": 0}
END KEY: {"name": "foo", "timestamp": 100}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
{"name": "foo", "timestamp": 10} {"value": "bbb"}
COMMAND: scan
START KEY: {"name": "bar", "timestamp": 0}
END KEY: {"name": "foo", "timestamp": 100}
{"name": "bar", "timestamp": 1} {"value": "bbb"}
{"name": "bar", "timestamp": 10} {"value": "bbb"}
{"name": "foo", "timestamp": 1} {"value": "bbb"}
{"name": "foo", "timestamp": 10} {"value": "bbb"}
This is the class for the B+Tree itself. It supports the operations below (a simplified sketch of the lookup descent follows the list):
- get(key)
  - Find the appropriate LeafNode
  - Get the KeyValue from the LeafNode
- put(key, value)
  - Find the appropriate LeafNode
  - Put the KeyValue into the LeafNode and sync with the file
  - If the Page of the LeafNode is full, split it
    - Add the new LeafNode to the same parent
    - If any InternalNode is also full, split it
      - Add the new InternalNode to the same parent
    - Sync changed pages with the file
  - If the RootNode is full, split it
    - Add the new Nodes as children
      - The left node's minimum key is the logical minimum key (0 bytes)
    - Sync changed pages with the file
- delete(key)
  - Find the appropriate LeafNode
  - Delete the KeyValue from the LeafNode and sync with the file
- scan(startKey, endKey)
  - Find the appropriate LeafNode for the startKey and endKey
  - Filter the KeyValues inside the LeafNode between startKey and endKey
  - Traverse the linked list of LeafNodes until the LeafNode for the endKey
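The descent shared by these operations can be sketched as follows; the types here are simplified stand-ins (String keys, no Page or file layer) rather than the project's actual classes:

```kotlin
// A self-contained sketch of the descent shared by get/put/delete/scan. These
// types are simplified stand-ins for the project's Node/LeafNode/InternalNode
// classes: String keys instead of encoded byte arrays, and no Page or file layer.
sealed class NodeSketch
class LeafSketch(val records: Map<String, String>) : NodeSketch() {
    fun get(key: String): String? = records[key]
}
class InternalSketch(val children: List<Pair<String, NodeSketch>>) : NodeSketch() {
    // children are (minimumKey, child) pairs sorted by minimumKey;
    // the leftmost child uses the logical minimum key "".
    fun findChild(key: String): NodeSketch =
        children.last { (minKey, _) -> minKey <= key }.second
}

fun get(root: NodeSketch, key: String): String? {
    var node = root
    // Walk down InternalNodes until the appropriate LeafNode is reached.
    while (node is InternalSketch) {
        node = node.findChild(key)
    }
    return (node as LeafSketch).get(key)
}
```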
This is a set of classes for the Nodes of the B+Tree.
This class stores all KeyValues on its backing Page (a simplified sketch of put() follows the list).
- get(key)
  - Loop over the KeyValue list on the Page
  - Find the exactly matching KeyValue
- put(key, value)
  - Loop over the KeyValue list on the Page
  - If there is an exactly matching KeyValue, update it
  - If not, insert it in sorted order
- delete(key)
  - Loop over the KeyValue list on the Page
  - Delete the exactly matching KeyValue
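A simplified sketch of the put() flow above, assuming the KeyValue list on the Page is kept sorted; String pairs stand in for the project's KeyValue type:

```kotlin
// A simplified sketch of the put() flow above: the KeyValue list on the Page is
// kept sorted, and an existing key is updated in place. String pairs stand in for
// the project's KeyValue type.
class LeafPutSketch(private val records: MutableList<Pair<String, String>>) {
    fun put(key: String, value: String) {
        // Find the first existing key that is >= the new key.
        val index = records.indexOfFirst { it.first >= key }
        when {
            index == -1 -> records.add(key to value)                      // largest key so far: append
            records[index].first == key -> records[index] = key to value  // exact match: update in place
            else -> records.add(index, key to value)                      // insert, keeping sorted order
        }
    }
}
```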
This class stores the minimum key of each child and its Page id as a KeyValue (a sketch of the lookup rule follows the list).
- findChildPageId(key)
  - Loop over the KeyValue list on the Page
  - Find a child for the key
    - If k1 <= key < k2, return the value of k1
    - If key < k-left-edge, return the value of k-left-edge
- addChildNode(node)
  - Fetch the minimum key of the given node
  - Put the minimum key and the node id as a KeyValue
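A sketch of the findChildPageId() rule above, with String keys and Int page ids standing in for the encoded types:

```kotlin
// A sketch of findChildPageId() under the rule above: each entry stores a child's
// minimum key, and the lookup returns the last child whose minimum key is <= key.
// String keys and Int page ids stand in for the encoded types.
class InternalLookupSketch(
    // (minimumKey, childPageId) pairs sorted by minimumKey; the left edge uses the
    // logical minimum key "" (0 bytes in the real implementation).
    private val entries: List<Pair<String, Int>>
) {
    fun findChildPageId(key: String): Int {
        // k1 <= key < k2 -> child of k1; key < left edge -> leftmost child.
        val match = entries.lastOrNull { (minKey, _) -> minKey <= key }
        return (match ?: entries.first()).second
    }
}
```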
This class is for the RootNode. The RootNode is a LeafNode at first, then becomes an InternalNode after the first split, so it inherits both (a sketch of splitRoot() follows the list).
- splitRoot()
  - Create a new Page for the left LeafNode and move all KeyValues there
  - Split the left LeafNode, i.e. a new right LeafNode is created
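A sketch of splitRoot() under the description above; String pairs stand in for KeyValues, and the real code operates on Pages via the PageManager:

```kotlin
// A sketch of splitRoot() as described above: move everything from the root into a
// new left leaf, then split that leaf at the middle into a new right leaf. String
// pairs stand in for KeyValues; the real code operates on Pages via the PageManager.
fun splitRoot(
    rootRecords: MutableList<Pair<String, String>>
): Pair<MutableList<Pair<String, String>>, MutableList<Pair<String, String>>> {
    // Move all KeyValues from the root into a new left leaf.
    val left = ArrayList(rootRecords)
    rootRecords.clear()
    // Split the left leaf at the middle, producing a new right leaf.
    val mid = left.size / 2
    val right = ArrayList(left.subList(mid, left.size))
    left.subList(mid, left.size).clear()   // keep only the first half in the left leaf
    return left to right
}
```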
This is the class for a Page of the B+Tree. A Page is the backing data structure for all Node classes. PageData is an Avro record:
{
  "name": "PageData",
  "type": "record",
  "fields": [
    {"name": "id", "type": "PageId"},
    {"name": "nodeType", "type": "NodeType"},
    {"name": "previousId", "type": "PageId"},
    {"name": "nextId", "type": "PageId"},
    {"name": "records", "type": {"type": "array", "items": "KeyValue"}}
  ]
}
So, a Page is encoded/decoded in Avro's binary format (single-object encoding).
It supports the mutational operations below and recalculates the encoded byte size to check whether the Page is full (a sketch follows the list):
- insert(index, keyvalue)
- update(index, keyvalue)
- delete(index)
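A sketch of these mutations plus the fullness check; `encode` is a placeholder for the Avro single-object encoding step, and the 256-byte page size matches the file layout described in the next section:

```kotlin
// A sketch of the mutations plus the fullness check. `encode` is a placeholder for
// the Avro single-object encoding step; the 256-byte page size matches the file
// layout described in the next section.
class PageSketch(
    private val records: MutableList<ByteArray>,
    private val pageSize: Int = 256,
    private val encode: (List<ByteArray>) -> ByteArray   // assumed encoder (e.g. Avro)
) {
    fun insert(index: Int, keyValue: ByteArray) = records.add(index, keyValue)
    fun update(index: Int, keyValue: ByteArray) { records[index] = keyValue }
    fun delete(index: Int) { records.removeAt(index) }

    // The Page is "full" when its encoded form no longer fits into its fixed-size slot.
    fun isFull(): Boolean = encode(records).size > pageSize
}
```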
This class manages file reads and writes. The file starts with fixed-size metadata (128 bytes), followed by each Page (256 bytes) stored ordered by Page id (a sketch of the layout follows the list).
- allocate()
  - Fetch the next free Page id and update it
    - So far, this is always the one after the last Page. Once merge is implemented, merged pages can be reclaimed.
  - Create a new Page for the id in memory
- read(id)
  - Seek to the first byte for the Page id
  - Read all bytes for the Page (256 bytes) and return the decoded Page object
- write(page)
  - Seek to the first byte for the Page id
  - Write the encoded Page
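A sketch of this layout with java.io.RandomAccessFile; the Avro encode/decode step is elided and the names here are illustrative:

```kotlin
import java.io.RandomAccessFile

// A sketch of the read()/write() layout described above: a 128-byte metadata header
// followed by fixed 256-byte Page slots ordered by Page id. The Avro encode/decode
// step is elided, and the names here are illustrative.
class FileManagerSketch(path: String) {
    private val file = RandomAccessFile(path, "rw")
    private val metadataSize = 128L
    private val pageSize = 256

    fun readPageBytes(id: Int): ByteArray {
        file.seek(metadataSize + id.toLong() * pageSize)   // first byte of the Page's slot
        val buffer = ByteArray(pageSize)
        file.readFully(buffer)
        return buffer                                      // decoded into a Page elsewhere
    }

    fun writePageBytes(id: Int, encoded: ByteArray) {
        require(encoded.size <= pageSize) { "Page overflows its $pageSize-byte slot" }
        file.seek(metadataSize + id.toLong() * pageSize)
        file.write(encoded)                                // synchronous write
    }
}
```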
This class manages the Page lifecycle. It has an in-memory pool for pages (currently just a Map).
- get(id)
  - If it is in the pool, return it
  - If not, read it from the FileManager and also cache it
- allocate()
  - Allocate a new Page from the FileManager and cache it
- split(page)
  - Find the split point (currently the middle of the encoded size; see the sketch after this list)
  - Delete the right-side KeyValue list
  - Allocate a new Page with the deleted KeyValue list
  - Connect the linked list between the Nodes
  - Write to the file
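A sketch of the "middle of size" split point used by split(); recordSize is a placeholder for the per-record encoded byte size:

```kotlin
// A sketch of the "middle of size" split point used by split(): walk the records
// and cut where the accumulated encoded size first reaches half of the total.
// recordSize is a placeholder for the per-record encoded byte size.
fun findSplitIndex(records: List<ByteArray>, recordSize: (ByteArray) -> Int = { it.size }): Int {
    val total = records.sumOf { recordSize(it) }
    var accumulated = 0
    records.forEachIndexed { index, record ->
        accumulated += recordSize(record)
        if (accumulated >= total / 2) return index + 1   // records[0..index] stay on the left page
    }
    return records.size
}
```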
Currently, there is only one split strategy: the middle of the encoded size. This works well for random inserts, since new KeyValues land in random places and the half-filled pages get filled later.
However, if inserts arrive in order (either ascending or descending), it is not optimal: new KeyValues are always inserted into one node, and the rest of the nodes are left half full.
InnoDB has an optimization to solve this insertion-order problem. See this great post. The dumps below show the resulting trees for random, ascending, and descending insertion order.
RootNode(id:0 size=93 records=14)
key=
LeafNode(id:1 size=163 keys=[00:01, 00:02, 00:03])
key=00:04
LeafNode(id:11 size=163 keys=[00:04, 00:05, 00:06])
key=00:07
LeafNode(id:7 size=209 keys=[00:07, 00:08, 00:09, 00:0a])
key=00:0b
LeafNode(id:9 size=163 keys=[00:0b, 00:0c, 00:0d])
key=00:0e
LeafNode(id:5 size=209 keys=[00:0e, 00:0f, 00:10, 00:11])
key=00:12
LeafNode(id:10 size=163 keys=[00:12, 00:13, 00:14])
key=00:15
LeafNode(id:3 size=163 keys=[00:15, 00:16, 00:17])
key=00:18
LeafNode(id:14 size=163 keys=[00:18, 00:19, 00:1a])
key=00:1b
LeafNode(id:2 size=255 keys=[00:1b, 00:1c, 00:1d, 00:1e, 00:1f])
key=00:20
LeafNode(id:12 size=209 keys=[00:20, 00:21, 00:22, 00:23])
key=00:24
LeafNode(id:4 size=163 keys=[00:24, 00:25, 00:26])
key=00:27
LeafNode(id:13 size=163 keys=[00:27, 00:28, 00:29])
key=00:2a
LeafNode(id:6 size=209 keys=[00:2a, 00:2b, 00:2c, 00:2d])
key=00:2e
LeafNode(id:8 size=255 keys=[00:2e, 00:2f, 00:30, 00:31, 00:32])
RootNode(id:0 size=103 records=16)
key=
LeafNode(id:1 size=163 keys=[00:01, 00:02, 00:03])
key=00:04
LeafNode(id:2 size=163 keys=[00:04, 00:05, 00:06])
key=00:07
LeafNode(id:3 size=163 keys=[00:07, 00:08, 00:09])
key=00:0a
LeafNode(id:4 size=163 keys=[00:0a, 00:0b, 00:0c])
key=00:0d
LeafNode(id:5 size=163 keys=[00:0d, 00:0e, 00:0f])
key=00:10
LeafNode(id:6 size=163 keys=[00:10, 00:11, 00:12])
key=00:13
LeafNode(id:7 size=163 keys=[00:13, 00:14, 00:15])
key=00:16
LeafNode(id:8 size=163 keys=[00:16, 00:17, 00:18])
key=00:19
LeafNode(id:9 size=163 keys=[00:19, 00:1a, 00:1b])
key=00:1c
LeafNode(id:10 size=163 keys=[00:1c, 00:1d, 00:1e])
key=00:1f
LeafNode(id:11 size=163 keys=[00:1f, 00:20, 00:21])
key=00:22
LeafNode(id:12 size=163 keys=[00:22, 00:23, 00:24])
key=00:25
LeafNode(id:13 size=163 keys=[00:25, 00:26, 00:27])
key=00:28
LeafNode(id:14 size=163 keys=[00:28, 00:29, 00:2a])
key=00:2b
LeafNode(id:15 size=163 keys=[00:2b, 00:2c, 00:2d])
key=00:2e
LeafNode(id:16 size=255 keys=[00:2e, 00:2f, 00:30, 00:31, 00:32])
RootNode(id:0 size=103 records=16)
key=
LeafNode(id:1 size=255 keys=[00:01, 00:02, 00:03, 00:04, 00:05])
key=00:06
LeafNode(id:16 size=163 keys=[00:06, 00:07, 00:08])
key=00:09
LeafNode(id:15 size=163 keys=[00:09, 00:0a, 00:0b])
key=00:0c
LeafNode(id:14 size=163 keys=[00:0c, 00:0d, 00:0e])
key=00:0f
LeafNode(id:13 size=163 keys=[00:0f, 00:10, 00:11])
key=00:12
LeafNode(id:12 size=163 keys=[00:12, 00:13, 00:14])
key=00:15
LeafNode(id:11 size=163 keys=[00:15, 00:16, 00:17])
key=00:18
LeafNode(id:10 size=163 keys=[00:18, 00:19, 00:1a])
key=00:1b
LeafNode(id:9 size=163 keys=[00:1b, 00:1c, 00:1d])
key=00:1e
LeafNode(id:8 size=163 keys=[00:1e, 00:1f, 00:20])
key=00:21
LeafNode(id:7 size=163 keys=[00:21, 00:22, 00:23])
key=00:24
LeafNode(id:6 size=163 keys=[00:24, 00:25, 00:26])
key=00:27
LeafNode(id:5 size=163 keys=[00:27, 00:28, 00:29])
key=00:2a
LeafNode(id:4 size=163 keys=[00:2a, 00:2b, 00:2c])
key=00:2d
LeafNode(id:3 size=163 keys=[00:2d, 00:2e, 00:2f])
key=00:30
LeafNode(id:2 size=163 keys=[00:30, 00:31, 00:32])