
How to scale cubestore for reliability? #10410

@yanchith

Description


I am researching whether Cube Store can be scaled for better availability. So far my conclusion is that it cannot easily be scaled within a single cluster: adding machines actually makes the cluster more fragile, because of how the nodes communicate internally (the RPC traffic between router and workers). The documentation supports this (https://cube.dev/docs/product/caching/running-in-production#replication-and-high-availability).

The natural follow-up question is whether Cube Store could instead be scaled by replicating entire clusters. The documentation somewhat addresses this in the cloud section, but the architecture diagrams in https://cube.dev/docs/product/administration/deployment/deployment-types#production-cluster all seem to imply that if part of a multi-cluster deployment goes down, at least some tenants will experience an outage.

I ran tests with multiple clusters side by side, comparing their state. The state they persist to the filesystem diverged almost instantly: not just file names and sizes, but also file counts. This makes me worry that these clusters would give different answers to the same queries, which isn't great.
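For reference, the side-by-side test ran two fully isolated routers against the same schema, roughly like this (the `CUBESTORE_*` variable names are from the Cube Store docs; the server names, ports, and paths are just my test values):

```shell
# Cluster A: its own local data dir and its own remote (durable) dir.
CUBESTORE_SERVER_NAME=router-a:9999 \
CUBESTORE_DATA_DIR=/var/cubestore-a/data \
CUBESTORE_REMOTE_DIR=/var/cubestore-a/remote \
cubestored &

# Cluster B: identical configuration, completely separate storage.
CUBESTORE_SERVER_NAME=router-b:9999 \
CUBESTORE_DATA_DIR=/var/cubestore-b/data \
CUBESTORE_REMOTE_DIR=/var/cubestore-b/remote \
cubestored &
```

With this setup I then diffed the contents of the two `remote` directories while feeding both clusters the same pre-aggregation builds.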

I then re-ran the same test with the two clusters sharing the filesystem (perhaps they can share it safely?). This setup immediately ran into internal errors, such as:

Error during Metastore upload: CubeError { message: "Operation failed. Try again.: Create a new iterator to fetch the new tail.", backtrace: " 
0: std::backtrace::Backtrace::create
1: cubestore::CubeError::from_error
2: <cubestore::CubeError as core::convert::From<rocksdb::Error>>::from
3: cubestore::config::CubeServices::spawn_processing_loops::{{closure}}::{{closure}}
4: tokio::runtime::task::core::Core<T,S>::poll
5: tokio::runtime::task::harness::Harness<T,S>::poll
6: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
7: tokio::runtime::scheduler::multi_thread::worker::Context::run
8: tokio::runtime::context::scoped::Scoped<T>::set
9: tokio::runtime::context::runtime::enter_runtime
10: tokio::runtime::scheduler::multi_thread::worker::run
11: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
12: tokio::runtime::task::core::Core<T,S>::poll
13: tokio::runtime::task::harness::Harness<T,S>::poll
14: tokio::runtime::blocking::pool::Inner::run
15: std::sys::backtrace::__rust_begin_short_backtrace
16: core::ops::function::FnOnce::call_once{{vtable.shim}}
17: std::sys::pal::unix::thread::Thread::new::thread_start
18: <unknown>
19: <unknown>
", cause: Internal }

Please advise on the recommended way to scale Cube Store for availability.
