[Discussion] How should Cassandra and ScyllaDB (API-compatible with Cassandra) be benchmarked?

_Disclaimer: I am not an expert in these two systems. I am just sharing my short analysis and thoughts. If my analysis contains errors, I would be glad if someone points them out!_

The README lists [Cassandra and ScyllaDB (API-compatible with Cassandra)](https://github.com/ClickHouse/ClickBench?tab=readme-ov-file#systems-included) in the `Systems Included` section, but there are no benchmarks for these systems available yet.

**TL;DR**    

How can Cassandra and ScyllaDB, despite their limitations, be benchmarked on ClickBench?

**Long version**   

I am wondering how Cassandra and ScyllaDB could be benchmarked on ClickBench in a way that the results are comparable to other systems? 

While there are workarounds for some limitations, e.g., replacing the `NOT EQUAL` operator with `NOT IN` in predicates, and replacing arithmetic functions and the `REGEX_REPLACE` function by user-defined functions (Lua, WASM), I do think that many queries _cannot_ be ran like intended.

> **Issue 1:** There seems to be no support for 
> - [`COUNT(DISTINCT <column>)`](https://opensource.docs.scylladb.com/stable/cql/dml/select.html)
> - `OFFSET <offset>` 

This affects queries Q5, Q6, Q9, Q10, Q11, Q12, Q14, Q23 and Q39, Q40, Q41, Q42, Q43, respectively.

Furthermore, there are also limitations on `ORDER BY`.

> **Issue 2:** Queries containing an `ORDER BY` need to be restricted by an `EQ` or `IN` on the _partition key_. 


Some queries use derived fields in the `ORDER BY`. Hence, these queries _may_ not be runnable on these systems at all, or at least not in the intended way with on-the-fly calculation of the derived fields. This affects queries Q8, Q9, Q10, Q11, Q12, Q13, Q14, Q15, Q16, Q17, Q19, Q22, Q28, Q29, Q31, Q32, Q33, Q34, Q35, Q36, Q37, Q38, Q39, Q40, Q41, Q42, Q43.

In general, Cassandra and ScyllaDB seem to heavily rely on the primary/partition/clustering key for data querying. Data modeling should be (has to be? 🤔) done according to the expected query patterns. In case of heterogeneous query patterns, tables need to be duplicated and stored with different primary/partition/clustering keys. 

In addition to the restrictions on `ORDER BY`, there are also restrictions on `GROUP BY`.

> **Issue 3:** [`GROUP BY` arguments must form a prefix of the primary key](https://opensource.docs.scylladb.com/branch-5.1/cql/dml.html#grouping-results).

For example, Q17 and Q18 (`GROUP BY UserID, SearchPhrase`) could be ran on the same table by setting the primary key accordingly (at least in theory, because the combination of `UserID, SearchPhrase` might not be unique). However, then it is impossible to run Q9 and Q10 (`GROUP BY RegionID`), Q11 (`GROUP BY MobilePhoneModel`), Q12 (`GROUP BY MobilePhone, MobilePhoneModel`), etc. on the same table.

---

Any thoughts on this? Am I missing something?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Discussion] How should Cassandra and ScyllaDB (API-compatible with Cassandra) be benchmarked? #384

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Discussion] How should Cassandra and ScyllaDB (API-compatible with Cassandra) be benchmarked? #384

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions