
How to scale cubestore for reliability? #10410

@yanchith

Description


I am researching whether Cube Store can be scaled for better availability. So far my conclusion is that it cannot easily be scaled within a single cluster: adding machines actually makes the cluster more fragile, because of how the nodes communicate internally (the RPC traffic between router and workers). The documentation supports this (https://cube.dev/docs/product/caching/running-in-production#replication-and-high-availability).

The natural follow-up question is whether Cube Store could instead be scaled by replicating entire clusters. The documentation somewhat addresses this in the cloud section, but the architecture diagrams in https://cube.dev/docs/product/administration/deployment/deployment-types#production-cluster all seem to imply that if part of a multi-cluster deployment goes down, at least some tenants will experience an outage.

I ran tests with multiple clusters side by side, comparing their state. The state they persist to the filesystem diverged almost instantly: not just file names and sizes, but also file counts. This makes me worry that these clusters would give different answers to the same queries, which isn't great.
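For reference, the side-by-side test ran two fully isolated routers against the same schema, roughly like this (the `CUBESTORE_*` variable names are from the Cube Store docs; the server names, ports, and paths are just my test values):

```shell
# Cluster A: its own local data dir and its own remote (durable) dir.
CUBESTORE_SERVER_NAME=router-a:9999 \
CUBESTORE_DATA_DIR=/var/cubestore-a/data \
CUBESTORE_REMOTE_DIR=/var/cubestore-a/remote \
cubestored &

# Cluster B: identical configuration, completely separate storage.
CUBESTORE_SERVER_NAME=router-b:9999 \
CUBESTORE_DATA_DIR=/var/cubestore-b/data \
CUBESTORE_REMOTE_DIR=/var/cubestore-b/remote \
cubestored &
```

With this setup I then diffed the contents of the two `remote` directories while feeding both clusters the same pre-aggregation builds.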

I then re-ran the same test with the two clusters sharing the filesystem (perhaps they can share it safely?). This setup immediately ran into internal errors, such as:

Error during Metastore upload: CubeError { message: "Operation failed. Try again.: Create a new iterator to fetch the new tail.", backtrace: " 
0: std::backtrace::Backtrace::create
1: cubestore::CubeError::from_error
2: <cubestore::CubeError as core::convert::From<rocksdb::Error>>::from
3: cubestore::config::CubeServices::spawn_processing_loops::{{closure}}::{{closure}}
4: tokio::runtime::task::core::Core<T,S>::poll
5: tokio::runtime::task::harness::Harness<T,S>::poll
6: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
7: tokio::runtime::scheduler::multi_thread::worker::Context::run
8: tokio::runtime::context::scoped::Scoped<T>::set
9: tokio::runtime::context::runtime::enter_runtime
10: tokio::runtime::scheduler::multi_thread::worker::run
11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
12: tokio::runtime::task::core::Core<T,S>::poll
13: tokio::runtime::task::harness::Harness<T,S>::poll
14: tokio::runtime::blocking::pool::Inner::run
15: std::sys::backtrace::__rust_begin_short_backtrace
16: core::ops::function::FnOnce::call_once{{vtable.shim}}
17: std::sys::pal::unix::thread::Thread::new::thread_start
18: <unknown>
19: <unknown>
", cause: Internal }

Please advise on the recommended way to scale Cube Store for availability.
