bgDB is a tiny relational database that aims to make end-to-end database development more approachable.
The starting point is that database development is fun and that it can help us learn many fundamental computer science concepts. To name a few:
- Parser/Lexer/Compiler development
- Data Structures
- Asynchronous programming
- Concurrency
- Performance
- Working with hardware
bgDB doesn't aim to be the fastest, nor to provide features that other RDBMSs have offered for decades. Its main goal is to be approachable and easily extendable. The language of choice is C# for most of the engine and F# for the Parser and Lexer. Even though managed languages are not the best choice for database development (garbage collection can induce unwanted performance problems, and in many places a developer would want more control), they serve the purpose for what we aim to do here.
After cloning the enlistment you will need .NET Core 3.1 or above. It can be found here.
The engine itself doesn't have any dependencies besides .NET Core. Tests and perf benchmarks will pull additional NuGet packages. For running perf tests and updating plots you will need to install R. Instructions are in the perf readme.md.
To try it out you can start bgdbRepl (a read-eval-print loop) by running the project in the bgdbRepl folder. From the enlistment root, run the following:
cd bgdbRepl
dotnet run
Booted Page Manager with file name repl.db. File size 10MB. Creating db file from scratch.
Booted Log Manager. Using in memory stream.
Booted Metadata Manager.
Query entry gate ready.
Transactions are implicit for now. Every command will be executed in separate transaction.
====================
>CREATE TABLE T1 (TYPE_INT a, TYPE_DOUBLE b, TYPE_STRING(20) c)
Total rows returned 0
Inserting rows
>INSERT INTO T1 VALUES (1, 1.1, 'somerandomstring1')
Total rows returned 0
>INSERT INTO T1 VALUES (2, 2.2, 'somerandomstring1')
Total rows returned 0
>INSERT INTO T1 VALUES (3, 2.2, 'somerandomstring2')
Total rows returned 0
>INSERT INTO T1 VALUES (5, 14.2, 'somerandomstring2')
Total rows returned 0
Querying
SELECT MAX(a), MIN(b), c FROM T1 WHERE a > 1 GROUP BY c
| a_Max | b_Min | c |
-------------------------------------------------
| 2 | 2.2 | somerandomstring1 |
-------------------------------------------------
| 5 | 2.2 | somerandomstring2 |
-------------------------------------------------
Total rows returned 2
Or let us create one more table and test out joins.
CREATE TABLE T2 (TYPE_INT a, TYPE_STRING(10) c)
INSERT INTO T2 VALUES (1, 'somerandomstring2')
INSERT INTO T2 VALUES (100, 'somerandomstring2')
A needlessly complex query that illustrates the data flow:
SELECT MAX(T1.a), MIN(T1.b), T2.c
FROM T1
JOIN T2 ON T1.c = T2.c
WHERE T2.a = 100
GROUP BY T2.c
| TR1.A_Max | TR1.B_Min | TR2.C |
---------------------------------------------
| 5 | 2.2 | somerandomstring2 |
---------------------------------------------
Total rows returned 1
To experiment with slightly larger datasets you can also load the Titanic dataset by passing the --set_load_path argument:
dotnet run --set_load_path .\datasets\titanic-passengers.csv
>SELECT TOP 10 * FROM Passengers
|PASSENGERID |SURVIVED | CLASS | NAME | SEX | AGE | SIBLINGS | PARENTS |EMBARKEDPORT |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 343 | No | 2 | Collander, Mr. Erik Gustaf | male | 28 | 0 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 76 | No | 3 | Moen, Mr. Sigurd Hansen | male | 25 | 0 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 641 | No | 3 | Jensen, Mr. Hans Peder | male | 20 | 0 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 568 | No | 3 | Palsson, Mrs. Nils (Alma Cornelia Berglund) | female | 29 | 0 | 4 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 672 | No | 1 | Davidson, Mr. Thornton | male | 31 | 1 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 105 | No | 3 | Gustafsson, Mr. Anders Vilhelm | male | 37 | 2 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 576 | No | 3 | Patchett, Mr. George | male | 19 | 0 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 382 | Yes | 3 | "Nakid, Miss. Maria (""Mary"")" | female | 1 | 0 | 2 | C |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 228 | No | 3 | "Lovell, Mr. John Hall (""Henry"")" | male | 20.5 | 0 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| 433 | Yes | 2 | Louch, Mrs. Charles Alexander (Alice Adelaide Slow) | female | 42 | 1 | 0 | S |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total rows returned 10
Or something a bit more complex.
>SELECT COUNT(PassengerId), Sex, Survived, MIN(Age), MAX(Age) FROM Passengers WHERE EmbarkedPort = 'S' AND Class = 3 GROUP BY Sex, Survived
|PassengerId_Count | Sex |Survived | Age_Min | Age_Max |
----------------------------------------------------------------
| 183 | male | No | 1 | 74 |
----------------------------------------------------------------
| 45 | female | No | 2 | 48 |
----------------------------------------------------------------
| 31 | female | Yes | 1 | 63 |
----------------------------------------------------------------
| 30 | male | Yes | 1 | 45 |
----------------------------------------------------------------
Total rows returned 4
If you want to play with more data you can start the REPL with the --rep_load_count N argument, which will duplicate the input dataset N times. E.g.:
dotnet run --set_load_path .\datasets\titanic-passengers.csv --rep_load_count 1000
This will load ~800k rows.
bgDB also supports querying the raw file system:
SELECT Extension, SUM(FileSize) FROM FILESYSTEM('./assets') GROUP BY Extension
Hint: for file system operations start bgdbRepl with the --use_list_format option. This way output with long strings is more readable:
cd bgdbRepl
dotnet run --use_list_format
One idea for the future of bgDB is support for extracting information from data types such as image, video and audio.
Currently there is support for basic video querying. You can either list info about a video file or split it into chunks for further processing:
SELECT * FROM VIDEO_CHUNKER(300, SELECT * FROM FILESYSTEM('E:/bgdb_video_examples') WHERE Extension = '.mkv')
This example will split all the *.mkv files into 300-second chunks:
FilePath -> E:\bgdb_video_examples\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
FileName -> Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
Extension -> .mkv
FileSize -> 313687893
chunk_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\d8ec89d7-50a9-41d0-8a94-091e12934cab\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000.mkv
NbStreams -> 2
NbPrograms -> 0
StartTimeInSeconds -> 0.067
DurationInSeconds -> 304.337
FormatName -> matroska,webm
---------------------
FilePath -> E:\bgdb_video_examples\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
FileName -> Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
Extension -> .mkv
FileSize -> 313687893
chunk_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\d8ec89d7-50a9-41d0-8a94-091e12934cab\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special001.mkv
NbStreams -> 2
NbPrograms -> 0
StartTimeInSeconds -> 304.28
DurationInSeconds -> 602.101
FormatName -> matroska,webm
---------------------
FilePath -> E:\bgdb_video_examples\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
FileName -> Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
Extension -> .mkv
FileSize -> 313687893
chunk_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\d8ec89d7-50a9-41d0-8a94-091e12934cab\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special002.mkv
NbStreams -> 2
NbPrograms -> 0
StartTimeInSeconds -> 602.05
DurationInSeconds -> 900.967
FormatName -> matroska,webm
...
Note that all the temporary files (e.g. video chunks) live only as long as the transaction is open. On commit/rollback all the files are purged.
bgDB supports image classification through the CLASSIFY_IMAGE function:
SELECT CLASSIFY_IMAGE(FilePath), FilePath, FileName FROM FILESYSTEM('../tests/E2EQueryExecutionTests/assets/pics') WHERE EXTENSION = '.jpg' OR EXTENSION = '.jfif'
---------------------
Object_Classification_Result -> basketball
FilePath -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\tests\E2EQueryExecutionTests\assets\pics\basketball.jpg
FileName -> basketball.jpg
---------------------
Object_Classification_Result -> hippopotamus
FilePath -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\tests\E2EQueryExecutionTests\assets\pics\hippo.jfif
FileName -> hippo.jfif
---------------------
You can also combine image classification with video operations. The following example prints the classification result for every frame, with frames extracted every 60s from video chunked into 300s pieces:
SELECT chunk_path, FilePath, frame_path, CLASSIFY_IMAGE(frame_path)
FROM VIDEO_TO_IMAGE(
1, 60, SELECT * FROM VIDEO_CHUNKER(300, SELECT * FROM FILESYSTEM('E:/bgdb_video_examples')))
Output:
---------------------
chunk_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\ff4862cd-97fa-445a-b42b-2b5762336314\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000.mkv
FilePath -> E:\bgdb_video_examples\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
frame_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\f3b4bb86-c687-4560-957a-d5065e45c985\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000001.bmp
Object_Classification_Result -> panpipe
---------------------
chunk_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\ff4862cd-97fa-445a-b42b-2b5762336314\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000.mkv
FilePath -> E:\bgdb_video_examples\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
frame_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\f3b4bb86-c687-4560-957a-d5065e45c985\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000002.bmp
Object_Classification_Result -> oboe
---------------------
chunk_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\ff4862cd-97fa-445a-b42b-2b5762336314\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000.mkv
FilePath -> E:\bgdb_video_examples\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special.mkv
frame_path -> C:\Users\Aleksandar\projects\bgdb_github\bgdb\bgdbRepl\bin\Debug\net6.0\temp\f3b4bb86-c687-4560-957a-d5065e45c985\Anthony.Bourdain.No.Reservations.S04E00.Holiday.Special000003.bmp
Object_Classification_Result -> American lobster
---------------------
...
The REPL is currently pretty limited. There is also no support for transactions in the parser layer (transactions are implicit and scoped to a single command; BEGIN/COMMIT/ROLLBACK TRAN support will be added). To get a feeling for how things work under the hood it is best to take a look at the end-to-end tests.
For example:
[Test]
public async Task MultiTableGroupBy()
{
    await using (ITransaction tran = this.logManager.CreateTransaction(pageManager))
    {
        await this.queryEntryGate.Execute("CREATE TABLE T1 (TYPE_INT A, TYPE_INT B)", tran).AllResultsAsync();
        await this.queryEntryGate.Execute("CREATE TABLE T2 (TYPE_INT A, TYPE_INT B)", tran).AllResultsAsync();
        await this.queryEntryGate.Execute("INSERT INTO T1 VALUES (1, 1)", tran).AllResultsAsync();
        await this.queryEntryGate.Execute("INSERT INTO T1 VALUES (1, 2)", tran).AllResultsAsync();
        await this.queryEntryGate.Execute("INSERT INTO T2 VALUES (2, 3)", tran).AllResultsAsync();
        await this.queryEntryGate.Execute("INSERT INTO T2 VALUES (2, 4)", tran).AllResultsAsync();
        await tran.Commit();
    }

    await using (ITransaction tran = this.logManager.CreateTransaction(pageManager, isReadOnly: true, "SELECT"))
    {
        RowHolder[] result = await this.queryEntryGate.Execute("SELECT MAX(B), A FROM T1 GROUP BY A", tran).ToArrayAsync();
        Assert.AreEqual(1, result.Length);
        Assert.AreEqual(2, result[0].GetField<int>(0));
        Assert.AreEqual(1, result[0].GetField<int>(1));

        result = await this.queryEntryGate.Execute("SELECT MAX(B), A FROM T2 GROUP BY A", tran).ToArrayAsync();
        Assert.AreEqual(1, result.Length);
        Assert.AreEqual(4, result[0].GetField<int>(0));
        Assert.AreEqual(2, result[0].GetField<int>(1));
    }
}
E2E tests can be found here.
At this point the list of features is rather limited but, hopefully, it will keep growing:
- CREATE TABLE
- INSERT INTO TABLE
- Filters (WHERE statement with basic arithmetic)
- GROUP BY statement
- Aggregates (MAX, MIN, SUM, COUNT)
- Support for wildcard select (SELECT * FROM)
- Support for TOP clause (SELECT TOP N * FROM)
- Support for JOIN clause (only INNER JOIN for now)
- Support for functions (SELECT CONCAT(str1, str2) FROM WHERE ADD(x, y) > 10)
- Support for nested subqueries (SELECT * FROM (SELECT a, b FROM T) WHERE b > 42)
- Support for filesystem operations (SELECT * FROM FILESYSTEM('./assets') WHERE Extension = '.txt')
- Support for image classification (SELECT CLASSIFY_IMAGE(FilePath), FilePath, FileName FROM FILESYSTEM('./assets/pics') WHERE EXTENSION = '.jpg' OR EXTENSION = '.jfif')
- Support for video operations (SELECT * FROM VIDEO_CHUNKER(10, SELECT * FROM FILESYSTEM('./assets/videos') WHERE Extension = '.mkv'))
Supported types:
- TYPE_INT (32-bit signed)
- TYPE_DOUBLE (64-bit)
- TYPE_STRING(SIZE) (strings of fixed length)
- TYPE_STRING(MAX) (work in progress, strings of unlimited length)
The engine currently supports page-level locking and read committed isolation. On startup a pool of locks is created, and a page id maps to a lock id through simple modular arithmetic. The Lock Manager does deadlock detection and rolls back deadlock victims.
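As a rough illustration of that mapping, a lock pool with modular page-to-lock assignment could look like the sketch below. The pool size, the names, and the use of SemaphoreSlim are illustrative assumptions, not the engine's actual implementation:

using System.Threading;

// Hypothetical sketch: many pages share a fixed set of locks.
public sealed class LockPool
{
    private readonly SemaphoreSlim[] locks;

    public LockPool(int poolSize = 1024)
    {
        this.locks = new SemaphoreSlim[poolSize];
        for (int i = 0; i < poolSize; i++)
        {
            this.locks[i] = new SemaphoreSlim(1, 1);
        }
    }

    // A page id maps to a lock id via modular arithmetic. Collisions only cause
    // extra blocking between unrelated pages, never incorrect results.
    public SemaphoreSlim LockForPage(ulong pageId)
        => this.locks[(int)(pageId % (ulong)this.locks.Length)];
}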
Logging is traditional Write-Ahead Logging. To ramp up, the recommendation is to take a look at the PageManager constructor.
public PageManager(uint defaultPageSize, IPersistedStream persistedStream, IBufferPool bufferPool, ILockManager lockManager, InstrumentationInterface logger)
The Log Manager simply sits on top of a .NET stream that is used for logging.
public LogManager(BinaryWriter storage)
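As a minimal usage sketch (engine namespaces and the rest of the wiring omitted), the log can be backed by an in-memory stream, matching the "Using in memory stream" message the REPL prints; swapping in a FileStream would persist the log on disk:

using System.IO;

// Illustrative wiring only: construct the log manager over an in-memory stream.
// Replace MemoryStream with a FileStream for a durable log.
var logStream = new MemoryStream();
var logManager = new LogManager(new BinaryWriter(logStream));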
The Lock Manager currently supports only Read and Write locks, with a policy that prioritizes writers.
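The gist of a writer-prioritizing read/write lock is that new readers are held back as soon as any writer is waiting. Below is a minimal sketch of that policy using Monitor-based signaling; the class and member names are illustrative and not the engine's Lock Manager API:

using System.Threading;

// Illustrative reader/writer lock that prioritizes writers:
// readers are blocked as soon as any writer is waiting.
public sealed class WriterPriorityLock
{
    private readonly object gate = new object();
    private int activeReaders;
    private int waitingWriters;
    private bool writerActive;

    public void AcquireRead()
    {
        lock (gate)
        {
            // New readers wait while a writer is active or queued.
            while (writerActive || waitingWriters > 0)
            {
                Monitor.Wait(gate);
            }
            activeReaders++;
        }
    }

    public void ReleaseRead()
    {
        lock (gate)
        {
            if (--activeReaders == 0)
            {
                // Last reader out: wake waiting writers (or readers).
                Monitor.PulseAll(gate);
            }
        }
    }

    public void AcquireWrite()
    {
        lock (gate)
        {
            waitingWriters++;
            while (writerActive || activeReaders > 0)
            {
                Monitor.Wait(gate);
            }
            waitingWriters--;
            writerActive = true;
        }
    }

    public void ReleaseWrite()
    {
        lock (gate)
        {
            writerActive = false;
            // Wake everyone; waiting writers win because readers re-check waitingWriters.
            Monitor.PulseAll(gate);
        }
    }
}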
Tables can be organized in two data structures:
- linked list
- [btree](https://github.com/dasatomic/bgdb/blob/master/DataStructures/BTreeCollection.cs)
The query tree is currently assembled through a set of rules that can be found here. When work on the Query Optimizer starts this will have to change.
Operators follow this simple interface:
public interface IPhysicalOperator<T>
{
    IAsyncEnumerable<T> Iterate(ITransaction tran);
}
The caller gets the root operator and keeps draining iterators that are placed deeper in the tree. On the leaf nodes there is a Scan operator (or, in the future, Seek, when support for indexes comes).
Operators work on RowHolder instances, which represent a single row fetched from the Storage Engine.
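To make the iterator model concrete, here is a hypothetical filter operator built on the interface above: the parent drains its child operator and yields only the rows that pass a predicate. The FilterOperator name and constructor are illustrative, not necessarily how the engine's own filter is written:

using System;
using System.Collections.Generic;

// Illustrative operator: wraps a child operator and filters its rows.
public class FilterOperator<T> : IPhysicalOperator<T>
{
    private readonly IPhysicalOperator<T> source;
    private readonly Func<T, bool> predicate;

    public FilterOperator(IPhysicalOperator<T> source, Func<T, bool> predicate)
    {
        this.source = source;
        this.predicate = predicate;
    }

    public async IAsyncEnumerable<T> Iterate(ITransaction tran)
    {
        // Drain the child operator lazily and pass through matching rows only.
        await foreach (T row in this.source.Iterate(tran))
        {
            if (this.predicate(row))
            {
                yield return row;
            }
        }
    }
}

Draining the root operator therefore pulls rows on demand all the way down to the leaf Scan.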
Currently we use F# and the FsLexYacc library. The grammar can be easily understood by looking at the type definitions and the Parser.
Tests are written using the NUnit framework. You can run them either through Visual Studio or by simply running dotnet test from the root folder.
For benchmarking we use the excellent BenchmarkDotNet. Benchmarks and results can be found here.
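For reference, a BenchmarkDotNet benchmark is just a class with [Benchmark] methods run through BenchmarkRunner; the skeleton below is a generic example, not one of the project's actual benchmarks:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class StringConcatBenchmark
{
    // The benchmark is executed for each value of N.
    [Params(10, 100)]
    public int N;

    [Benchmark]
    public string Concat()
    {
        string s = string.Empty;
        for (int i = 0; i < this.N; i++)
        {
            s += i;
        }
        return s;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<StringConcatBenchmark>();
}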
There are also simple stress tests in the E2E folder.
The project also has a Continuous Integration pipeline set up on GitHub that runs unit tests on every push.
You can build a Docker container by using the provided Dockerfile.
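For example, assuming the Dockerfile sits at the repository root (the image tag is arbitrary):
docker build -t bgdb .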