MyDataStore is a leaderless distributed database designed for research and learning purposes, drawing inspiration from "Designing Data-Intensive Applications" by Martin Kleppmann. Key advantages include:
- High Availability and Fault Tolerance: The absence of a central leader node eliminates a single point of failure, and because data is distributed across multiple nodes, the system remains operational even if some nodes fail.
- Leaderless Consensus: An implementation of the Raft consensus algorithm is used to agree on cluster organization.
- Variable Consistency: Read and write quorums are configurable, allowing a trade-off between consistency, faster reads, and faster writes.
MyDataStore implements several core concepts from the book "Designing Data-Intensive Applications". These include:
Node Discovery:
- Node discovery is implemented using a gossip protocol, which enables efficient and scalable membership validation and communication.
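As a rough illustration of gossip-based membership (not MyDataStore's actual implementation, which may use its own protocol), the sketch below uses HashiCorp's memberlist library; the node name and seed address are hypothetical.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/memberlist"
)

func main() {
	// Start a gossip member with default LAN settings. memberlist is used
	// here only to illustrate gossip-based membership; MyDataStore's own
	// protocol may be implemented differently.
	cfg := memberlist.DefaultLANConfig()
	cfg.Name = "node-1" // hypothetical node name

	list, err := memberlist.Create(cfg)
	if err != nil {
		log.Fatalf("failed to start gossip: %v", err)
	}

	// Join the cluster through any known seed node (hypothetical address).
	if _, err := list.Join([]string{"10.0.0.2:7946"}); err != nil {
		log.Printf("join failed, continuing to gossip: %v", err)
	}

	// The membership view converges via gossip: every node eventually sees
	// the same member set, which can then feed the hashring.
	for _, m := range list.Members() {
		fmt.Printf("member: %s %s\n", m.Name, m.Addr)
	}
}
```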
Consistent Hashing:
- A hashring technique is used to evenly distribute data into partitions and distribute partitions across the cluster.
- The cluster leader determines the member list, and the other members replicate its hashring, ensuring consistent hashing across the cluster.
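The hashring itself is not shown in this README; as a hedged sketch, the code below assumes a SHA-1-based ring with virtual nodes and shows how keys could be routed to their owning nodes. Node names and the virtual-node count are illustrative.

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
	"sort"
)

// ringEntry is one position on the hashring, owned by a node.
type ringEntry struct {
	hash uint64
	node string
}

// HashRing distributes keys (and hence partitions) across nodes
// via consistent hashing.
type HashRing struct {
	entries []ringEntry
}

func hashOf(s string) uint64 {
	sum := sha1.Sum([]byte(s))
	return binary.BigEndian.Uint64(sum[:8])
}

// Add places a node on the ring at several virtual points so that data
// spreads evenly (the virtual-node count here is an assumption).
func (r *HashRing) Add(node string, vnodes int) {
	for i := 0; i < vnodes; i++ {
		r.entries = append(r.entries, ringEntry{hashOf(fmt.Sprintf("%s#%d", node, i)), node})
	}
	sort.Slice(r.entries, func(i, j int) bool { return r.entries[i].hash < r.entries[j].hash })
}

// Owner returns the node responsible for a key: the first ring entry
// clockwise from the key's hash.
func (r *HashRing) Owner(key string) string {
	h := hashOf(key)
	i := sort.Search(len(r.entries), func(i int) bool { return r.entries[i].hash >= h })
	if i == len(r.entries) {
		i = 0 // wrap around the ring
	}
	return r.entries[i].node
}

func main() {
	ring := &HashRing{}
	for _, n := range []string{"node-1", "node-2", "node-3"} {
		ring.Add(n, 64)
	}
	fmt.Println(ring.Owner("user:42")) // key routed to its owning node
}
```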
Quorums and Read Repairs:
- The system uses a read quorum and a write quorum for operations.
- Here, N is the number of replicated partitions, W is the minimum number of nodes written to, and R is the minimum number of nodes read from.
- R + W > N is used to ensure consistency and no stale reads.
- A lower W can be used for faster writes.
- A lower R can be used for faster reads.
- Read repairs write the latest value back to any node that returns stale data during a read operation (see the sketch below).
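To make the R/W/N relationship and read repair concrete, here is a minimal sketch; the `Replica` interface, `Versioned` type, and in-memory replica are hypothetical stand-ins rather than MyDataStore's actual API.

```go
package main

import "fmt"

// Versioned is a value tagged with the timestamp used for conflict resolution.
type Versioned struct {
	Value     string
	Timestamp int64
}

// Replica is a hypothetical interface over one node holding a copy of a partition.
type Replica interface {
	Read(key string) (Versioned, bool)
	Write(key string, v Versioned) error
}

// QuorumRead reads from r replicas out of those available, returns the newest
// value, and repairs any replica that returned stale (older-timestamped) data.
func QuorumRead(replicas []Replica, key string, r int) (Versioned, error) {
	if r > len(replicas) {
		return Versioned{}, fmt.Errorf("R=%d exceeds N=%d", r, len(replicas))
	}

	type reply struct {
		rep   Replica
		val   Versioned
		found bool
	}
	var replies []reply
	for _, rep := range replicas[:r] { // a real system issues these reads in parallel
		v, ok := rep.Read(key)
		replies = append(replies, reply{rep, v, ok})
	}

	// Pick the newest version among the R responses.
	var newest Versioned
	seen := false
	for _, res := range replies {
		if res.found && (!seen || res.val.Timestamp > newest.Timestamp) {
			newest, seen = res.val, true
		}
	}
	if !seen {
		return Versioned{}, fmt.Errorf("key %q not found on %d replicas", key, r)
	}

	// Read repair: write the newest version back to any stale replica.
	for _, res := range replies {
		if !res.found || res.val.Timestamp < newest.Timestamp {
			_ = res.rep.Write(key, newest)
		}
	}
	return newest, nil
}

// memReplica is a toy in-memory replica used only for this demonstration.
type memReplica struct{ data map[string]Versioned }

func (m *memReplica) Read(key string) (Versioned, bool)   { v, ok := m.data[key]; return v, ok }
func (m *memReplica) Write(key string, v Versioned) error { m.data[key] = v; return nil }

func main() {
	fresh := &memReplica{data: map[string]Versioned{"k": {"new", 2}}}
	stale := &memReplica{data: map[string]Versioned{"k": {"old", 1}}}

	v, _ := QuorumRead([]Replica{fresh, stale}, "k", 2) // R=2, N=2 in this toy setup
	fmt.Println(v.Value, "| stale replica after repair:", stale.data["k"].Value)
}
```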
Timestamps for Conflict Resolution:
- Timestamp-based conflict resolution is used to handle conflicts in the system.
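The README does not spell out the resolution rule, so the sketch below assumes last-write-wins by timestamp with a deterministic tie-break; the actual implementation may differ.

```go
package main

import "fmt"

// write pairs a value with the wall-clock timestamp it was written at.
type write struct {
	value string
	ts    int64 // e.g. unix nanoseconds captured at write time
}

// resolve picks the winner of two conflicting writes by timestamp
// (last-write-wins). Tie-breaking on the value keeps the choice
// deterministic across nodes; the exact tie-break rule is an assumption.
func resolve(a, b write) write {
	if a.ts > b.ts || (a.ts == b.ts && a.value > b.value) {
		return a
	}
	return b
}

func main() {
	fmt.Println(resolve(write{"from-node-1", 100}, write{"from-node-2", 101}).value) // from-node-2
}
```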
Using Raft for Consensus:
- The Raft algorithm is used to elect a cluster leader through a democratic election process.
- The elected leader maintains a log that is verified by the consensus algorithm to ensure data integrity and consistency.
- In this project, the log is used to keep track of the members currently in the cluster and the current epoch of the cluster, providing a reliable source of truth for the system state.
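The shape of the replicated state is only described in prose above; as an assumed illustration, the sketch below models the member list and epoch as a small state machine driven by committed Raft log entries (field and operation names are invented).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ClusterState is the state replicated through the Raft log: the current
// member list and the current epoch (field names are assumptions).
type ClusterState struct {
	Members []string `json:"members"`
	Epoch   uint64   `json:"epoch"`
}

// LogEntry is one command appended to the Raft log by the leader.
type LogEntry struct {
	Op     string `json:"op"` // "add_member", "remove_member", "bump_epoch"
	Member string `json:"member,omitempty"`
}

// Apply is called once an entry is committed by a quorum of Raft nodes,
// so every node derives the same cluster state in the same order.
func (s *ClusterState) Apply(raw []byte) error {
	var e LogEntry
	if err := json.Unmarshal(raw, &e); err != nil {
		return err
	}
	switch e.Op {
	case "add_member":
		s.Members = append(s.Members, e.Member)
	case "remove_member":
		kept := s.Members[:0]
		for _, m := range s.Members {
			if m != e.Member {
				kept = append(kept, m)
			}
		}
		s.Members = kept
	case "bump_epoch":
		s.Epoch++
	default:
		return fmt.Errorf("unknown op %q", e.Op)
	}
	return nil
}

func main() {
	s := &ClusterState{}
	for _, cmd := range []string{
		`{"op":"add_member","member":"node-1"}`,
		`{"op":"add_member","member":"node-2"}`,
		`{"op":"bump_epoch"}`,
	} {
		_ = s.Apply([]byte(cmd))
	}
	fmt.Printf("%+v\n", *s) // {Members:[node-1 node-2] Epoch:1}
}
```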
Periodic Partition Verification and Replication:
- Periodic replication and verification ensure data consistency and integrity across the distributed system.
- The Raft leader periodically increments a variable called Epoch.
- The current epoch is stored alongside each written value.
- Upon an epoch increment, each node builds a Merkle tree over (partition, epoch, data) and compares its tree hash with those of the other members.
- Additional details:
- Merkle trees organize data into buckets hashed by key.
- Validating nodes construct a view of the remote tree from only its bucket hashes, not the underlying data.
- This avoids checking all of the data, which would be redundant and slow.
- BFS over the local and remote trees finds the buckets that are out of sync.
- A node can then request only the out-of-sync bucket range instead of the whole partition, which speeds up replication when a partition has drifted. For a given partition and epoch, each bucket range covers roughly 1/64 of the values, so only that slice needs to be synced (see the sketch below).
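The sketch below illustrates the bucket-hash comparison with a flat, two-level tree of 64 buckets (the bucket count is an assumption chosen to match the 1/64 figure above); the real implementation builds a deeper Merkle tree and walks it with BFS, but the narrowing effect is the same.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

const numBuckets = 64 // assumption, chosen to match the 1/64 figure above

// bucketOf assigns a key to one of the hashed buckets.
func bucketOf(key string) int {
	sum := sha256.Sum256([]byte(key))
	return int(sum[0]) % numBuckets
}

// partitionTree holds the bucket hashes and root hash for one
// (partition, epoch) snapshot of the data.
type partitionTree struct {
	buckets [numBuckets][32]byte
	root    [32]byte
}

// buildTree hashes every key/value into its bucket, then hashes the bucket
// digests together into a root. A real Merkle tree has more levels; comparing
// level by level (BFS) narrows the diff in the same way.
func buildTree(data map[string]string) partitionTree {
	var t partitionTree
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order so replicas hash identically

	bucketInput := make([][]byte, numBuckets)
	for _, k := range keys {
		b := bucketOf(k)
		bucketInput[b] = append(bucketInput[b], []byte(k+"="+data[k])...)
	}
	rootInput := make([]byte, 0, numBuckets*32)
	for i := 0; i < numBuckets; i++ {
		t.buckets[i] = sha256.Sum256(bucketInput[i])
		rootInput = append(rootInput, t.buckets[i][:]...)
	}
	t.root = sha256.Sum256(rootInput)
	return t
}

// outOfSync compares a local tree with a remote tree (only hashes are
// exchanged) and returns the bucket indexes whose contents differ.
func outOfSync(local, remote partitionTree) []int {
	if local.root == remote.root {
		return nil // partition is fully in sync for this epoch
	}
	var stale []int
	for i := 0; i < numBuckets; i++ {
		if local.buckets[i] != remote.buckets[i] {
			stale = append(stale, i)
		}
	}
	return stale
}

func main() {
	local := buildTree(map[string]string{"a": "1", "b": "2", "c": "3"})
	remote := buildTree(map[string]string{"a": "1", "b": "2", "c": "stale"})
	fmt.Println("buckets to re-sync:", outOfSync(local, remote))
}
```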
Data Storage Engine and Indexing:
- The system works with Badger and LevelDB as its data storage engines.
- Both Badger and LevelDB are based on Log-Structured Merge-trees (LSM trees), which provide high write throughput.
- Indexes:
  - (epoch, partition, key)
  - (key)
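The key encoding behind these indexes is not specified here; as an illustrative assumption, the sketch below stores values in Badger under a big-endian (epoch, partition, key) composite key so that one (epoch, partition) prefix can be scanned during verification. The path and values are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

// indexKey builds the composite (epoch, partition, key) index key.
// Big-endian integers keep keys byte-sortable, so one (epoch, partition)
// prefix can be scanned during verification. The layout is an assumption.
func indexKey(epoch uint64, partition uint32, key string) []byte {
	buf := make([]byte, 12, 12+len(key))
	binary.BigEndian.PutUint64(buf[0:8], epoch)
	binary.BigEndian.PutUint32(buf[8:12], partition)
	return append(buf, key...)
}

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/mydatastore-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Write a value under the composite index.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set(indexKey(7, 3, "user:42"), []byte("hello"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Scan everything stored for (epoch=7, partition=3).
	prefix := indexKey(7, 3, "")
	err = db.View(func(txn *badger.Txn) error {
		it := txn.NewIterator(badger.DefaultIteratorOptions)
		defer it.Close()
		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			val, err := it.Item().ValueCopy(nil)
			if err != nil {
				return err
			}
			fmt.Printf("%q -> %q\n", it.Item().Key()[12:], val)
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```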
Kubernetes Operator:
- Assists with autoscaling.
- Bazel: Bazel is a build tool required for building and testing MyDataStore. You can install Bazel by following the instructions on the official Bazel website.
A dev script is provided to help with common commands. The script is written in bash and can be run from the command line. Here is a brief explanation of the commands supported by the script:
- `dc`: Starts a new tmuxinator session with the configuration specified in `tmuxinator-dc.yaml`.
- `dc!`: Kills the `dc_dev` tmux session.
- `k8-init`: Deletes and restarts minikube, enables the metrics-server addon, and runs the `k8-r` command.
- `k8-r`: Deletes the Kubernetes configuration specified in `./operator/config/samples/`, deploys the operator, and deploys Kubernetes.
- `k8-operator`: Deploys the operator.
- `k8`: Starts a new tmuxinator session with the configuration specified in `tmuxinator-k8.yaml`.
- `k8!`: Kills the `k8_dev` tmux session.
- `k8-e2e`: Runs end-to-end tests on Kubernetes.
- `dc-e2e`: Runs end-to-end tests using k6.
- `test`: Executes Bazel tests once. Accepts a test name as an argument for filtering.
- `rtest`: Uses `ibazel` to rerun tests on file change. Also accepts a test name as an argument for filtering.
- `deps`: Updates dependencies using `update_deps.sh`.
- `dc-pprof`: Starts a pprof profiling session.
- `scale`: Scales the number of replicas for the store. The number of replicas must be specified as an argument.
To run the script, use the following command:
`./dev.sh <command> [arguments]`
Replace `<command>` with one of the commands listed above, and `[arguments]` with any arguments required by the command.