Commit 963720e: update README
zh217 committed May 2, 2023 (parent: 6468d28)
1 changed file (README.md) with 68 additions and 45 deletions.

5. [Status of the project](#Status-of-the-project)
6. [Licensing and contributing](#Licensing-and-contributing)

## 🎉🎉🎉 New versions 🎉🎉🎉

Version v0.7: following HNSW vector search in v0.6, v0.7 brings you MinHash-LSH for near-duplicate search, full-text
search, JSON value support and more! See [here](https://docs.cozodb.org/en/latest/releases/v0.7.html) for more details.
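
The core trick behind MinHash-based near-duplicate search can be illustrated outside the database. Below is a minimal, self-contained Python sketch (all names are ours, not CozoDB's API): documents whose signatures agree on many slots are likely near-duplicates, and LSH banding then turns this into an index lookup rather than an all-pairs comparison.

```python
import hashlib

def _h(seed: int, token: str) -> int:
    # One 64-bit hash function per seed, derived from blake2b.
    d = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def minhash_signature(tokens, num_hashes=32):
    # For each seeded hash function, keep the minimum hash over the token set.
    return [min(_h(s, t) for t in tokens) for s in range(num_hashes)]

def estimated_jaccard(sig_a, sig_b):
    # The fraction of agreeing slots estimates Jaccard similarity.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash_signature("the quick brown fox jumps over the lazy dog".split())
b = minhash_signature("the quick brown fox jumped over the lazy dog".split())
c = minhash_signature("completely different words entirely".split())
print(estimated_jaccard(a, b) > estimated_jaccard(a, c))  # near-duplicates agree on more slots
```

Because the signature is a fixed-size summary, similar documents can be bucketed by bands of signature slots and only candidates sharing a bucket need to be compared.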

---

Version v0.6 released! This version brings vector search with HNSW indices inside Datalog, which can be integrated
seamlessly with powerful features like ad-hoc joins, recursive Datalog and classical whole-graph algorithms. This
significantly expands the range of possibilities of CozoDB.

Highlights:

* You can now create HNSW (hierarchical navigable small world) indices on relations containing vectors.
* You can create multiple HNSW indices for the same relation by specifying filters dictating which rows should be
indexed, or which vector(s) should be indexed for each row if the row contains multiple vectors.
* The vector search functionality is integrated within Datalog, meaning that you can use vectors (either explicitly
given or coming from another relation) as pivots to perform unification into the indexed relations (roughly equivalent
to table joins in SQL).
* Unification with vector search is semantically no different from regular unification, meaning that you can even use
vector search in recursive Datalog, enabling extremely complex query logic.
* The HNSW index is no more than a hierarchy of proximity graphs. As an open, competent graph database, CozoDB exposes
these graphs to the end user to be used as regular graphs in your query, so that all the usual techniques for dealing
with them can now be applied, especially: community detection and other classical whole-graph algorithms.
* As with all mutations in CozoDB, the index is protected from corruption in the face of concurrent writes by using
Multi-Version Concurrency Control (MVCC), and you can use multi-statement transactions for complex workflows.
* The index resides on disk as a regular relation (unless you use the purely in-memory storage option, of course).
During querying, close to the absolute minimum amount of memory is used, and memory is freed as soon as the processing
is done (thanks to Rust's RAII), so it can run on memory-constrained systems.
* The HNSW functionality is available for CozoDB on all platforms: in the server as a standalone service, in your
  Python, NodeJS, or Clojure programs in embedded or client mode, on your phone in embedded mode, even in the browser
with the WASM backend.
* HNSW vector search in CozoDB is performant: we have optimized the index to the point where basic vector operations
themselves have become a limiting factor (along with memcpy), and we are constantly finding ways to improve our new
implementation of the HNSW algorithm further.
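
Stripped of the layer hierarchy and candidate lists, the search on each HNSW layer is a greedy walk over a proximity graph. A toy Python sketch of that walk (hand-built 1-D graph for illustration; this is not CozoDB's implementation):

```python
import math

def greedy_search(graph, coords, entry, query):
    """Greedy walk on one proximity-graph layer: repeatedly hop to the
    neighbor closest to the query; stop when no neighbor improves."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: math.dist(coords[n], query))
        if math.dist(coords[best], query) >= math.dist(coords[current], query):
            return current
        current = best

# Toy 1-D "layer": ten points on a line, each linked to its neighbors.
coords = {i: (float(i),) for i in range(10)}
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(greedy_search(graph, coords, entry=0, query=(7.2,)))  # 7
```

The full algorithm runs this walk from coarse layers down to the densest one, which is why exposing the layers as ordinary graphs is natural.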

See [here](https://docs.cozodb.org/en/latest/releases/v0.6.html) for more details.


## Introduction

CozoDB is a general-purpose, transactional, relational database
that uses **Datalog** for query, is **embeddable** but can also handle huge amounts of data and concurrency,
and focuses on **graph** data and algorithms.
It supports **time travel** and it is **performant**!

### What does _embeddable_ mean here?

Being _embeddable_ means, at the extreme, that you can use it on a phone which _never_ connects to any network
(this situation is not as unusual as you might think). SQLite is embedded. MySQL/Postgres/Oracle are client-server.

> A database is _embedded_ if it runs in the same process as your main program.
> This is in contradistinction to _client-server_ databases, where your program connects to
> a database server (maybe running on a separate machine) via a client library. Embedded databases
> generally require no setup and can be used in a much wider range of environments.
>
> We say CozoDB is _embeddable_ instead of _embedded_ since you can also use it in client-server
> mode, which can make better use of server resources and allow much more concurrency than
> in embedded mode.

### Why _graphs_?

Because data are inherently interconnected. Most insights about data can only be obtained if
you take this interconnectedness into account.

> Most existing _graph_ databases start by requiring you to shoehorn your data into the labelled-property graph model.
> We don't go this route because we think the traditional relational model is much easier to work with for
> storing data, much more versatile, and can deal with graph data just fine. Even more importantly,
> the most piercing insights about data usually come from graph structures _implicit_ several levels deep
> in your data. The relational model, being an _algebra_, can deal with it just fine. The property graph model,
> not so much, since that model is not very composable.

### What is so cool about _Datalog_?

You can build your queries piece by piece.
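
The fixpoint semantics behind recursive Datalog rules can be sketched in a few lines. Below is a naive bottom-up evaluation of a transitive-closure rule in Python (toy data; a real engine like CozoDB uses semi-naive evaluation and indexes):

```python
def transitive_closure(edges):
    """Naive bottom-up evaluation of the recursive rules
        reach(x, z) :- edge(x, z).
        reach(x, z) :- reach(x, y), edge(y, z).
    New facts are derived until a fixpoint is reached."""
    reach = set(edges)
    while True:
        new = {(x, z) for (x, y) in reach for (y2, z) in edges if y == y2}
        if new <= reach:
            return reach
        reach |= new

edges = {("FRA", "YUL"), ("YUL", "YVO"), ("YVO", "YKQ")}
print(("FRA", "YKQ") in transitive_closure(edges))  # True
```

Each rule is an independent piece; composing more rules simply adds more derivation steps to the same fixpoint loop.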
### Time travel?

Time travel in the database setting means
tracking changes to data over time
and allowing queries to be logically executed at a point in time
to get a historical view of the data.

> In a sense, this makes your database _immutable_,
> since nothing is really deleted from the database ever.
>
> In Cozo, instead of having all data automatically support
> time travel, we let you decide if you want the capability
> for each of your relations. Every extra functionality comes
> with its cost, and you don't want to pay the price if you don't use it.
>
> For the reason why you might want time travel for your data,
> we have written a [short story](https://docs.cozodb.org/en/latest/releases/v0.4.html).
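
As an illustration of the idea (a simplified toy, not CozoDB's actual storage layout or syntax), a store where every write carries a timestamp and reads are made "as of" a time might look like:

```python
import bisect

class TemporalKV:
    """Toy time-travel store: every write appends (timestamp, value);
    a read "as of" time t sees the latest value written at or before t."""
    def __init__(self):
        self.history = {}  # key -> sorted list of (timestamp, value)

    def put(self, key, value, ts):
        versions = self.history.setdefault(key, [])
        versions.append((ts, value))
        versions.sort()

    def get_asof(self, key, ts):
        versions = self.history.get(key, [])
        i = bisect.bisect_right(versions, (ts, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

db = TemporalKV()
db.put("address", "Copenhagen", ts=1)
db.put("address", "Aarhus", ts=5)
print(db.get_asof("address", 3), db.get_asof("address", 9))  # Copenhagen Aarhus
```

Nothing is overwritten, so any past state remains queryable; the cost is the extra version storage, which is why opting in per relation matters.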

### How performant?

On a 2020 Mac Mini with the RocksDB persistent storage engine (CozoDB supports many storage engines):

* Running OLTP queries for a relation with 1.6M rows, you can expect around 100K QPS (queries per second) for mixed
read/write/update transactional queries, and more than 250K QPS for read-only queries, with database peak memory usage
around 50MB.
* Speed for backup is around 1M rows per second, for restore is around 400K rows per second, and is insensitive to
relation (table) size.
* For OLAP queries, it takes around 1 second (within a factor of 2, depending on the exact operations) to scan a table
with 1.6M rows. The time a query takes scales roughly with the number of rows the query touches, with memory usage
determined mainly by the size of the return set.
* Two-hop graph traversal completes in less than 1ms for a graph with 1.6M vertices and 31M edges.
* The Pagerank algorithm completes in around 50ms for a graph with 10K vertices and 120K edges, around 1 second for a
graph with 100K vertices and 1.7M edges, and around 30 seconds for a graph with 1.6M vertices and 32M edges.

For more numbers and further details, we have a writeup
about performance [here](https://docs.cozodb.org/en/latest/releases/v0.3.html).

## Getting started
How many airports are directly connected to `FRA`?
|------------------|
| 310 |


How many airports are reachable from `FRA` by one stop?

```
shortest_paths[to, shortest(path)] := shortest_paths[stop, prev_path],
:limit 2
```

| to | path | p_len |
|-----|-----------------------------------------------------|-------|
| YPO | `["FRA","YYZ","YTS","YMO","YFA","ZKE","YAT","YPO"]` | 8 |
| BVI | `["FRA","AUH","BNE","ISA","BQL","BEU","BVI"]` | 7 |

What is the shortest path between `FRA` and `YPO`, by actual distance travelled?

```
end[] <- [['YPO']]
?[src, dst, distance, path] <~ ShortestPathDijkstra(*route[], start[], end[])
```

| src | dst | distance | path |
|-----|-----|----------|-----------------------------------------------------------|
| FRA | YPO | 4544.0 | `["FRA","YUL","YVO","YKQ","YMO","YFA","ZKE","YAT","YPO"]` |
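
For reference, the kind of computation a `ShortestPathDijkstra`-style algorithm performs can be sketched in Python (the adjacency list and distances below are made up for illustration, not real route data):

```python
import heapq

def dijkstra(adj, start, goal):
    """Return (total_distance, path) for the shortest weighted path,
    or None if the goal is unreachable."""
    pq = [(0.0, start, [start])]  # (distance so far, node, path taken)
    done = set()
    while pq:
        d, node, path = heapq.heappop(pq)
        if node == goal:
            return d, path
        if node in done:
            continue
        done.add(node)
        for nxt, w in adj.get(node, []):
            if nxt not in done:
                heapq.heappush(pq, (d + w, nxt, path + [nxt]))
    return None

# Made-up distances for illustration only.
adj = {"FRA": [("YUL", 5.0), ("AUH", 4.0)],
       "YUL": [("YPO", 3.0)],
       "AUH": [("YPO", 9.0)]}
print(dijkstra(adj, "FRA", "YPO"))  # (8.0, ['FRA', 'YUL', 'YPO'])
```

In CozoDB the same computation is invoked declaratively, with the relations supplying the graph.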

CozoDB attempts to provide nice error messages when you make mistakes.
Follow the links in the table below:
| [NodeJS](./cozo-lib-nodejs) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| [Web browser](./cozo-lib-wasm) | Modern browsers supporting [web assembly](https://developer.mozilla.org/en-US/docs/WebAssembly#browser_compatibility) | M |
| [Java (JVM)](https://github.com/cozodb/cozo-lib-java) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| [Clojure (JVM)](https://github.com/cozodb/cozo-clj) | Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64) | MQR |
| [Android](https://github.com/cozodb/cozo-lib-android) | Android (ARM64, ARMv7, x86_64, x86) | MQ |
| [iOS/MacOS (Swift)](./cozo-lib-swift) | iOS (ARM64, simulators), Mac (ARM64, x86_64) | MQ |
| [Rust](https://docs.rs/cozo/) | Source only, usable on any [platform](https://doc.rust-lang.org/nightly/rustc/platform-support.html) with `std` support | MQRST |
from within your database directory, and use that as a base for your customizations.
If you are not an expert on RocksDB, we suggest you limit your changes to adjusting the numerical
options of which you at least have a vague understanding.


## Architecture

CozoDB consists of three layers stacked on top of each other,
custom backend.

The storage engine also defines a _row-oriented_ binary data format, which the storage
engine implementation does not need to know anything about.
This format contains an implementation of the
[memcomparable format](https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#memcomparable-format)
used for the keys, which enables the storage of rows of data as binary blobs
that, when sorted lexicographically, give the correct order.
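
The defining property of a memcomparable format is that comparing encoded bytes lexicographically gives the same order as comparing the decoded values. A minimal Python sketch of this idea for signed 64-bit integers (a generic technique; CozoDB's actual format covers all of its value types):

```python
import struct

def encode_i64(n: int) -> bytes:
    # Flipping the sign bit maps the signed range onto the unsigned range,
    # so that big-endian byte order matches numeric order.
    return struct.pack(">Q", (n + (1 << 63)) % (1 << 64))

def decode_i64(b: bytes) -> int:
    return struct.unpack(">Q", b)[0] - (1 << 63)

vals = [42, -5, 0, 3, -1]
# Sorting the encoded keys as raw bytes recovers numeric order.
print([decode_i64(k) for k in sorted(encode_i64(v) for v in vals)])  # [-5, -1, 0, 3, 42]
```

This is what lets the storage engine treat rows as opaque binary blobs while still iterating them in key order.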
