Tags: scylladb/scylla-bench

v0.1.24

Merge pull request #152 from dimakr/hostname_validation_140

fix(host-verification): switch to gocql builtin hostname verification

v0.1.23

Merge pull request #150 from vponomaryov/partition-offset-for-more-workloads

Make all write-capable workloads support partition offset

v0.1.22

Create counter table only in counter read/update modes

v0.1.21

Merge pull request #141 from vponomaryov/gocql-v1.14.1

Bump gocql version to "v1.14.1"

v0.1.20

tablet-aware-gocql-with-fixed-issue-134

v0.1.19

tablet-aware-gocql

v0.1.18

Fix DoBatchedWrites to behave properly when batchSize == 0

v0.1.17

Reduce error message size for batch write queries

v0.1.16

Add query retries at the scylla-bench level

Add the ability to handle query retries at the scylla-bench level,
to avoid various 'gocql' bugs in this area, and enable it by default.
To change the retry handler, use the following new option:

    -retry-handler=gocql
    -retry-handler=sb

Only one approach is used at a time; 'sb' and 'gocql' are the only
supported values for the new option.
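
For example, a run that keeps retries on the scylla-bench side could look
like this (the -workload, -mode, and -nodes flags are pre-existing
scylla-bench options and the node address is a placeholder; they are shown
only for context):

    scylla-bench -workload sequential -mode write -nodes 192.168.100.1 -retry-handler=sb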

v0.1.15

modes: account for lag when reporting partial results

scylla-bench spawns a bunch of worker goroutines that are responsible
for executing CQL operations and a single collector goroutine that
collects partial results from the workers. Each worker goroutine has
its own separate channel that it uses to send partial results to the
collector. The collector goroutine, in a loop, fetches a single partial
result from each channel, merges them, and prints the combined result.
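
As a rough, illustrative sketch of that fan-in pattern (simplified names
and types, not the actual scylla-bench code):

```
package sketch

import "fmt"

type PartialResult struct {
    Operations int // the real partial results also carry an hdrhistogram
}

// collect pops exactly one partial result per worker per round and merges
// them before reporting, so each round is paced by the slowest worker.
func collect(workerChannels []chan PartialResult) {
    for {
        merged := PartialResult{}
        for _, ch := range workerChannels {
            pr := <-ch // blocks until this worker's next partial result
            merged.Operations += pr.Operations
        }
        fmt.Printf("merged %d operations this round\n", merged.Operations)
    }
}
```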

There are some problems with this setup:

- The channel capacities are very large (10k elements each),
- The size of a partial result can be large (each one carries
  an hdrhistogram),
- The intervals that the workers use to send partial results are prone
  to skew, so workers might go out of sync.

More specifically, the algorithm that the worker uses looks like this:

```
last_send_timestamp := now
loop:
    do_an_operation()
    if now - last_send_timestamp >= log_interval: # log_interval == 1s
        send_partial_results()
        last_send_timestamp := now
```

Notice that when the condition for sending partial results is met, more
than one second might have elapsed. Because the loop just resets
last_send_timestamp to the current timestamp, the worker does not try to
catch up on the lag in any way.

Because the lag depends on operation latency, which is random from our
point of view, the difference between the least-lagged and the most-lagged
worker will keep increasing. The collector takes a single value from each
channel in each round, so it is bottlenecked on the slowest worker. If the
difference in lag between the slowest worker and some other worker is
large enough, then the other worker's channel will stay non-empty and will
always contain a number of partial results proportional to that worker's
lag.

Because of the above, memory consumption may grow over time. In addition,
the results might become inaccurate, because the collector will merge the
current results of the slow workers with outdated results of the fast
workers from several seconds earlier.

This commit fixes the problem in a very simple way: instead of being reset
to the current timestamp, last_send_timestamp is simply advanced by the
log interval. This ensures that the worker goroutines try to stay
synchronized with each other.
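
A minimal Go-style sketch of the adjusted loop (identifier names are
illustrative, not taken from the actual code):

```
package sketch

import "time"

// runWorker shows the fixed reporting loop: when the send deadline is hit,
// lastSend is advanced by logInterval instead of being reset to time.Now(),
// so accumulated lag is worked off and workers stay in step.
func runWorker(logInterval time.Duration, doOperation, sendPartialResults func()) {
    lastSend := time.Now()
    for {
        doOperation()
        if time.Since(lastSend) >= logInterval {
            sendPartialResults()
            // before the fix: lastSend = time.Now()  (the lag was dropped)
            lastSend = lastSend.Add(logInterval) // after the fix: catch up
        }
    }
}
```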

Might be related to: #112
This is the likely cause, although I don't have definitive proof.
In my one-hour test I observed that memory use looked correlated with the
number of items sitting in the channels, and it was indeed higher before
the fix, but only by about 100MB, and the number of items fluctuated
instead of growing steadily. Perhaps a longer test, like in the original
issue, would be needed to reproduce this.