Introduce database import and export protocol messages #224


Merged: 15 commits, Jun 9, 2025

Conversation

@farost (Member) commented May 8, 2025

Release notes: usage and product changes

Add migration protocol messages for use in database import and database export operations:

  • a unidirectional database_export stream from a TypeDB server to export a specific database, similar to TypeDB 2.x;
  • a bidirectional databases_import stream between a client and a server to import an exported 2.x/3.x TypeDB database into a TypeDB 3.x server from a client.

The format of migration items used for these operations is an extended version of TypeDB 2.x's migration items, so it is backward compatible with 2.x database files. Important: it's not intended to import 3.x databases into 2.x servers.

Implementation

Add Migration { Item } message. The format is an extended version of the 2.x protocol, so it contains "outdated" fields for compatibility with old databases.

Add Migration { Export } message. This operation consists of a single client Req { database } and multiple streamed server responses:

  1. An initial response with the schema.
  2. An unlimited number of migration items (multiple messages, each carrying multiple items, to allow batching optimizations).
  3. A Done message to signal that the server is ready to close the stream without errors. It could be substituted by a silent stream closure, but I preferred explicitness here.
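The export response sequence above (schema first, then item batches, then a terminal Done) can be sketched as follows. The `Schema`, `ItemBatch`, and `Done` shapes and the `consume_export_stream` helper are illustrative assumptions, not actual protocol or driver types.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    typeql: str          # the database's full schema as one define query

@dataclass
class ItemBatch:
    items: list          # several migration items per message, for batching

@dataclass
class Done:
    pass                 # explicit clean-close signal

def consume_export_stream(responses):
    """Collect the schema and all items; stop cleanly only on Done."""
    it = iter(responses)
    first = next(it)
    assert isinstance(first, Schema), "stream must start with the schema"
    items = []
    for msg in it:
        if isinstance(msg, Done):
            return first.typeql, items
        items.extend(msg.items)
    raise RuntimeError("stream ended without Done")
```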

Add Migration { Import } message. This operation consists of a stream of client requests:

  1. An initial request with the name of the database and its schema string.
  2. An unlimited number of migration items.
  3. A Done message to signal that the client is finished without errors, and the server can perform the final validation. This Done message is required and cannot be removed because the client has to check whether there were finalization errors or not.

and a stream of server responses (in practice, either a single Done or a single error; the stream exists so that errors can be returned at any stage of the communication).
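The client-side request ordering described above (one initial request, any number of item batches, then a mandatory Done) can be sketched as a generator. The tuple shapes are illustrative assumptions, not the actual protocol messages.

```python
def import_requests(database_name, schema, item_batches):
    """Yield the import request sequence in the required order."""
    # initial request: database name plus its schema string
    yield ("initial", database_name, schema)
    for batch in item_batches:
        yield ("items", batch)
    # Done is mandatory: the client must wait for the server's final
    # validation result rather than silently closing the stream
    yield ("done",)
```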

@farost farost changed the title Add database import and export protocol messages Introduce database import and export protocol messages May 27, 2025
message Res {
DatabaseReplicas database = 1;
}
}

message Import {
farost (Member, Author):
I've placed all migration-related messages in a single file, so it's "encapsulated". However, this introduces an additional layer of optionals to unpack on both the client and the server in Rust, which is a little irritating. Not sure if it's worth it or not; I like this design more, I guess.

Reviewer (Member):

I like what you have here, very readable


// This is an emulation of the google ErrorDetails message. Generally, ErrorDetails are submitted via the GRPC error
// mechanism, but errors must be sent manually in streams
message Error {
farost (Member, Author):

I was planning to reuse this error, but I realized that I don't need non-terminal errors in my protocol. Still, I think it's reasonable to generalize error messages for the whole TypeDB protocol.

UNSPECIFIED = 0;
VERSION = 6;
VERSION = 7;
farost (Member, Author):

@flyingsilverfin We need to discuss the new versioning approach. I may roll it back.

@farost farost marked this pull request as ready for review May 29, 2025 08:40
@farost farost requested a review from haikalpribadi as a code owner May 29, 2025 08:40
@dmitrii-ubskii (Member) left a comment:

Looks good, a couple of notes.

message Migration {
message Export {
message Req {
string database = 1;
Reviewer (Member):
This and `name` on L45 should have the same name. Either `name` or `database_name`, I think.

farost (Member, Author):

Fair. I thought there were more exceptions, but it's called `string database` only in the transaction opening request.

int64 relation_count = 3;
int64 role_count = 4;
int64 ownership_count = 5;
// 6 was deleted and cannot be used until a breaking change occurs
Reviewer (Member):

`6` should be `reserved` then.

farost (Member, Author):

This was copied from the 2.x implementation, but I didn't even know there was such a feature. Cool, done!

bool boolean = 2;
int64 integer = 3;
double double = 4;
int64 datetime_millis = 5; // reserved for 2.x, time since epoch
Reviewer (Member):

Suggested change:
- int64 datetime_millis = 5; // reserved for 2.x, time since epoch
+ int64 datetime_millis = 5; // compatibility with 2.x, milliseconds since epoch

Comment on lines +64 to +66
// ATTENTION: the messages below are used to import multiple versions of TypeDB.
// DO NOT reorder or delete existing and reserved indices. Be careful while extending this.
//
Reviewer (Member):

Bigger please!! Like we had it in the server - this is very dangerous.

farost (Member, Author):

ASCII art here we go???

//               _  _____ _____ _____ _   _ _____ ___ ___  _   _ _
//              / \|_   _|_   _| ____| \ | |_   _|_ _/ _ \| \ | | |
//             / _ \ | |   | | |  _| |  \| | | |  | | | | |  \| | |
//            / ___ \| |   | | | |___| |\  | | |  | | |_| | |\  |_|
//           /_/   \_\_|   |_| |_____|_| \_| |_| |___\___/|_| \_(_)

Reviewer (Member):

let's do it :D

@flyingsilverfin (Member) left a comment:

LGTM

@farost farost merged commit f6528be into typedb:master Jun 9, 2025
@farost farost deleted the 3.x-export-import branch June 9, 2025 16:15
farost added a commit to typedb/typedb that referenced this pull request Jun 10, 2025
## Product change and motivation
Add database export and database import operations. Unlike in TypeDB
2.x, these operations are run through TypeDB **GRPC clients** such as
the **Rust driver** or **TypeDB Console**, which solves a number of
issues with networking and encryption, especially relevant for users of
TypeDB Enterprise. With this, it becomes an official part of the [TypeDB
GRPC protocol](typedb/typedb-protocol#224).
Both operations are performed through the network, but the server and
the client can be used on the same host.

Each TypeDB database can be represented as two files:
1. A text file with its TypeQL schema description: a complete `define`
query for the whole schema.
2. A binary file with its data.

This format is an extension of TypeDB 2.x's export format. See below for
version compatibility details.

### Database export
Database export allows a client to download a database's schema and data
files from a TypeDB server for future import into the same or a newer
TypeDB version.

The files are created on the client side. While the database is being
exported, parallel queries are allowed, but none of them will affect the
exported data thanks to TypeDB's transactionality.
However, the database will not be available for operations such as
deletion.

**Exported TypeDB 3.x databases cannot be imported into servers of older
versions.**

### Database import
Database import allows a client to upload previously exported database
schema and data files to a TypeDB server. It is possible to assign any
new name to the imported database.

While the database is being imported, it is not recommended to perform
parallel operations against it: interfering actions may lead to import
errors or database corruption.

**Import supports all exported TypeDB 2.x and TypeDB 3.x databases.** It
can be used to migrate between TypeDB versions with breaking changes.
Please visit [our docs](https://typedb.com/docs/manual/migration/2_to_3)
for more information.

## Implementation
Implement [the new
protocol](typedb/typedb-protocol#224).

The two operations are implemented as two separate services running in
`tokio` tasks, similar to transaction services.

### Database export 
This operation is simple. After establishing a stream, we just export
the database's schema (the same operation as schema retrieval) and then
send a number of items containing the header, the database's data
(encoded concepts), and data checksums at the end.

### Database import
This operation is trickier. It is executed in steps through a couple of
"states":
1. The database's name and schema are expected first. Without them, we
can't continue; skipping this step leads to an error, signaling that the
client is probably implemented incorrectly.
2. After the schema is received, it is executed and committed as is, to
check that the provided schema is correct. If not, a user error is
returned, so the user can rewrite the schema and try again.
3. After the schema is persisted, it is relaxed. This involves:
a) substituting default cardinalities and card/key annotations with
`@card(0..)` for all capabilities in the schema;
b) making all attributes independent (`@independent`) so that attributes
whose owners have not yet been received are not cleaned up between
transactions;
c) making all relations independent (via a new `system independent`
property, not exposed to users) so that relations without role players
are not cleaned up between transactions.
All errors at this stage are considered internal errors and bugs.
4. After the schema is prepared, we receive items, decode them into
concepts, and persist them in the database. A transaction buffer size is
used to execute commits from time to time, reducing memory consumption
and final commit time; this optimization requires 3b and 3c.
One of the expected and required items is the checksums item. Another,
optional, item is the header (we don't really use it outside of logs, so
it's not required; not sure if we need to force it).
5. After the `Done` message is received, signaling that the stream of
items is complete, we perform the final data commit and unrelax the
schema (undoing step 3).
6. If everything is good, a verification `Done` response is sent.

An error can be returned at any point. To make this possible, the
protocol introduces streaming from the server, although the server
remains silent until the end unless there are errors in the provided
messages.
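The server-side steps above can be mirrored by a toy state machine. The real implementation runs as a Rust `tokio` task; the class, message tuples, and buffer size here are illustrative assumptions only.

```python
class ImportSession:
    """Toy model of the import states: schema first, then items, then done."""

    BUFFER_SIZE = 3  # stand-in for the real transaction buffer size

    def __init__(self):
        self.state = "awaiting_schema"  # step 1: name + schema must come first
        self.committed = 0
        self._buffer = []

    def handle(self, msg):
        kind = msg[0]
        if self.state == "awaiting_schema":
            if kind != "schema":
                # step 1: skipping the schema means the client is broken
                raise RuntimeError("client error: schema must come first")
            # steps 2-3: commit the schema as-is, then relax it
            self.state = "receiving_items"
        elif self.state == "receiving_items":
            if kind == "item":
                # step 4: buffer items and commit periodically
                self._buffer.append(msg[1])
                if len(self._buffer) >= self.BUFFER_SIZE:
                    self._commit()
            elif kind == "done":
                # steps 5-6: final commit, unrelax the schema, reply Done
                self._commit()
                self.state = "done"
            else:
                raise RuntimeError("unexpected message: " + kind)
        else:
            raise RuntimeError("session already finished")

    def _commit(self):
        self.committed += len(self._buffer)
        self._buffer.clear()
```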

While being imported, databases are not accessible through
`database_manager`: they are owned by `database_importer`. If a server
crashes, the incomplete databases will be cleaned up on the next
bootup.

To avoid overflowing memory, `InstanceIDMapping` uses `SpilloverCache`,
a new component combining a `HashMap` and RocksDB to spill excess data
to disk when it does not fit in memory.
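The spillover idea above can be illustrated with a toy cache: a bounded in-memory map that evicts its oldest entries into a secondary store once it is full. A plain dict stands in for RocksDB here; the class and its methods are illustrative assumptions, not the server's actual component.

```python
class SpilloverCache:
    """Toy spillover cache: bounded hot dict, overflow goes to 'disk'."""

    def __init__(self, capacity, disk=None):
        self.capacity = capacity
        self.hot = {}   # in-memory portion; dicts preserve insertion order
        self.disk = disk if disk is not None else {}  # stand-in for RocksDB

    def put(self, key, value):
        self.hot[key] = value
        if len(self.hot) > self.capacity:
            # spill the oldest in-memory entry to the secondary store
            oldest = next(iter(self.hot))
            self.disk[oldest] = self.hot.pop(oldest)

    def get(self, key):
        # check memory first, then fall back to the spilled data
        if key in self.hot:
            return self.hot[key]
        return self.disk.get(key)
```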
farost added a commit to typedb/typedb-driver that referenced this pull request Jun 10, 2025
## Usage and product changes
Introduce interfaces to export databases into schema definition and data
files and to import databases using these files. Database import
supports files exported from both TypeDB 2.x and TypeDB 3.x.

Both operations are blocking and may take a significant amount of time
to execute for large databases. Use parallel connections to continue
operating with the server and its other databases.

Usage examples in Rust:
```rust
// export
let db = driver.databases().get(db_name).await.unwrap();
db.export_to_file(schema_file_path, data_file_path).await.unwrap();

// import
let schema = read_to_string(schema_file_path).unwrap();
driver.databases().import_from_file(db_name2, schema, data_file_path).await.unwrap();
```

Usage examples in Python:
```py
# export
database = driver.databases.get(db_name)
database.export_to_file(schema_file_path, data_file_path)

# import
with open(schema_file_path, 'r', encoding='utf-8') as f:
    schema = f.read()
driver.databases.import_from_file(db_name2, schema, data_file_path)
```

Usage examples in Java:
```java
// export
Database database = driver.databases().get(dbName);
database.exportToFile(schemaFilePath, dataFilePath);

// import
String schema = Files.readString(Path.of(schemaFilePath));
driver.databases().importFromFile(dbName2, schema, dataFilePath);
```

## Implementation
Implemented the updated
[protocol](typedb/typedb-protocol#224).

As both operations work with streaming, the implementation is similar to
transactions. The behavior is split into the file processing logic and
networking (specialized for sync and async modes). The exposed
interfaces present only the file-based versions, but additional
interfaces for direct work with streams can be presented in future
updates.

In Rust, paths are accepted as Rust `Path`s. In the other languages,
which work through the C interface, paths are passed as plain strings
across the C layer.

### Database export 
Implemented through the `database` interface; accepts two target files
for the export. No specific naming format is required (the files don't
have to be `.typeql` or `.typedb`). If any of the target files already
exists, an error is returned.

The export operation consists of these steps:
* prepare the output files
* open a unidirectional GRPC stream from the server to the client
* "block" on server response listening until an error or a "done" is
received (blocking is implemented through a loop which resolves a
`listen` promise presented by the network layer)
* if there is a schema message, write it to the schema file and flush it
right away
* if there is a data items message, encode the items and write them to
the data file
* in case of an error, the output files are deleted (we own them as we
create them at the beginning)
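The export steps above, including the refuse-to-overwrite check and the cleanup of partial output on error, can be sketched as follows. The function name and the `(kind, payload)` message tuples are illustrative assumptions, not the driver's actual API.

```python
import os

def export_to_files(schema_path, data_path, stream):
    """Toy version of the client-side export loop described above."""
    # refuse to overwrite existing files, per the description above
    for p in (schema_path, data_path):
        if os.path.exists(p):
            raise FileExistsError(p)
    try:
        with open(schema_path, "w") as sf, open(data_path, "wb") as df:
            for kind, payload in stream:
                if kind == "schema":
                    sf.write(payload)
                    sf.flush()          # flush the schema right away
                elif kind == "items":
                    df.write(payload)   # already-encoded item bytes
                elif kind == "done":
                    return              # clean close
        raise RuntimeError("stream closed without done")
    except BaseException:
        # we own the files (we created them), so remove partial output
        for p in (schema_path, data_path):
            if os.path.exists(p):
                os.remove(p)
        raise
```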

The network layer is basically just a task listening for the GRPC stream
and transmitting the converted messages to the processing loop.

### Database import
Implemented through the `database_manager` interface; accepts a
database name, a schema definition query string (which can be read from
the exported schema file), and an exported data file. Again, there are
no naming requirements.

The import operation consists of these steps:
* open the input file
* open a bidirectional GRPC stream between the server and the client,
send the initial request with the database's name and schema
* eagerly start reading and decoding data items from the input file one
by one, storing up to 250 items in a buffer (this number can easily be
changed)
* once the buffer is full, attempt to send the items: this first checks
for a potential early error signal from the server, then sends the
batch and returns to processing the rest of the file
* once the file is read, send a "done" message and block until the
server responds with either an error or its "done" message
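The batching described above can be sketched in a few lines. `send` stands in for the gRPC stream write, and 250 mirrors the buffer size mentioned in the text; everything else is an illustrative assumption.

```python
def send_in_batches(items, send, batch_size=250):
    """Buffer decoded items and flush them to the stream in batches."""
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= batch_size:
            send(list(buffer))   # flush a full batch to the stream
            buffer.clear()
    if buffer:
        send(list(buffer))       # flush the final partial batch
    send("done")                 # required terminal message
```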

The network layer consists of a blocking task for client-side requests
and a listening task waiting for a one-shot signal from the server
(either an error or a "done" message). Errors can be received at any
point during processing, while "done" is expected only after the
client-side "done" request. When a response is received, either an
async or a sync sink receives the message, which should be checked
before any client-side network operation to ensure proper interruption.