Description
I am researching whether Cube Store can be scaled for better availability. So far my conclusion is that it cannot easily be scaled within a single cluster, because the cluster actually becomes more fragile as the number of machines increases, due to how the nodes communicate internally (the whole RPC business). The documentation supports this (https://cube.dev/docs/product/caching/running-in-production#replication-and-high-availability).
The natural follow-up question is whether Cube Store could instead be scaled by replicating entire clusters. The documentation somewhat addresses this in the cloud section, but the architectures in the diagrams at https://cube.dev/docs/product/administration/deployment/deployment-types#production-cluster all seem to imply that if part of a multi-cluster deployment goes down, at least some tenants will experience an outage.
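For context, this is the topology I mean by "the cluster": one router plus N workers, wired together with the environment variables from the Cube Store docs. A minimal sketch (service names, ports, and paths here are illustrative, not from a real deployment):

```yaml
# Sketch of a single Cube Store cluster: one router, two workers.
# Every node must know the full worker list, and workers must reach the
# router's metastore; this is the internal RPC mesh referenced above.
services:
  cubestore_router:
    image: cubejs/cubestore
    environment:
      - CUBESTORE_SERVER_NAME=cubestore_router:9999
      - CUBESTORE_META_PORT=9999
      - CUBESTORE_WORKERS=cubestore_worker_1:10001,cubestore_worker_2:10002
      - CUBESTORE_REMOTE_DIR=/cube/data
  cubestore_worker_1:
    image: cubejs/cubestore
    environment:
      - CUBESTORE_SERVER_NAME=cubestore_worker_1:10001
      - CUBESTORE_WORKER_PORT=10001
      - CUBESTORE_META_ADDR=cubestore_router:9999
      - CUBESTORE_WORKERS=cubestore_worker_1:10001,cubestore_worker_2:10002
      - CUBESTORE_REMOTE_DIR=/cube/data
  cubestore_worker_2:
    image: cubejs/cubestore
    environment:
      - CUBESTORE_SERVER_NAME=cubestore_worker_2:10002
      - CUBESTORE_WORKER_PORT=10002
      - CUBESTORE_META_ADDR=cubestore_router:9999
      - CUBESTORE_WORKERS=cubestore_worker_1:10001,cubestore_worker_2:10002
      - CUBESTORE_REMOTE_DIR=/cube/data
```

Adding a worker means adding it to `CUBESTORE_WORKERS` on every node, which is why growing the cluster also grows the internal coupling.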
I ran tests with multiple clusters side by side and compared their state. Their state (at least what they store on the filesystem) diverged almost instantly: not just file names and sizes, but also file counts. This makes me worry that these clusters would give different answers to the same queries, which isn't great.
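For reference, this is roughly how I compared the clusters' on-disk state (a sketch; the directory paths are hypothetical, substitute each cluster's actual data directory):

```shell
# Sketch: compare the on-disk state of two side-by-side Cube Store clusters.
# DIR_A and DIR_B are hypothetical paths; point them at each cluster's data dir.
DIR_A="/var/cubestore-a/.cubestore/data"
DIR_B="/var/cubestore-b/.cubestore/data"

list_files() {
  # Print "relative-path size" for every file under $1, sorted for a stable diff.
  (cd "$1" 2>/dev/null && find . -type f -printf '%p %s\n' | sort)
}

list_files "$DIR_A" > /tmp/state_a.txt
list_files "$DIR_B" > /tmp/state_b.txt

echo "cluster A: $(wc -l < /tmp/state_a.txt) files"
echo "cluster B: $(wc -l < /tmp/state_b.txt) files"

# A non-empty diff means the clusters' stored state has diverged.
diff /tmp/state_a.txt /tmp/state_b.txt || echo "state diverged"
```

With this comparison, the listings stopped matching almost immediately after startup.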
So I re-ran the same test with the two clusters sharing the filesystem (perhaps they can share it safely?). This setup immediately ran into internal errors, such as:
```
Error during Metastore upload: CubeError { message: "Operation failed. Try again.: Create a new iterator to fetch the new tail.", backtrace: "
   0: std::backtrace::Backtrace::create
   1: cubestore::CubeError::from_error
   2: <cubestore::CubeError as core::convert::From<rocksdb::Error>>::from
   3: cubestore::config::CubeServices::spawn_processing_loops::{{closure}}::{{closure}}
   4: tokio::runtime::task::core::Core<T,S>::poll
   5: tokio::runtime::task::harness::Harness<T,S>::poll
   6: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
   7: tokio::runtime::scheduler::multi_thread::worker::Context::run
   8: tokio::runtime::context::scoped::Scoped<T>::set
   9: tokio::runtime::context::runtime::enter_runtime
  10: tokio::runtime::scheduler::multi_thread::worker::run
  11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  12: tokio::runtime::task::core::Core<T,S>::poll
  13: tokio::runtime::task::harness::Harness<T,S>::poll
  14: tokio::runtime::blocking::pool::Inner::run
  15: std::sys::backtrace::__rust_begin_short_backtrace
  16: core::ops::function::FnOnce::call_once{{vtable.shim}}
  17: std::sys::pal::unix::thread::Thread::new::thread_start
  18: <unknown>
  19: <unknown>
", cause: Internal }
```
Please advise: what is the recommended way to scale Cube Store for availability?