New ConsistencyScan (apple#10265)
* Remove duplicate getRange() for DB handles and update existing GetRange to accept DB handles.

* Initial progress checkpoint on new ConsistencyScan role.

* Updated TODOs, finished most if not all state updates.

* placeholder

* Add more TODOs, documentation and comment improvements.

* Checkpoint round state to avoid advancing progress if commit fails.

* Bug fix, check is supposed to be for overlap, not lack of overlap.

* Added more TODOs and added faked read results / exceptions and faked DB size retrieval to prove the consistencyScanCore logic works.

* Update JSON schemas and command help.

* Add comment about lifetime stats reset.

* More TODO comments and some renames for clarity, some bug fixes.

* properly stopping consistency scan in simulation so that it doesn't run forever and cause quiet database to fail

* removing trailing comma from consistency_scan json schema

* Making CC inconsistency not an error if it's intentional tss corruption

* consistency scan actually reads storage locations

* added check that consistency scan actually completes a round in simulation, fixed bug and added debugging around consistency scan getting stuck

* made consistency scan properly fetch database size

* refactoring data check to be used in both consistency scan and consistency check

* checking that consistency scan always completes at least one round and doesn't get stuck

* cleanup

* fixing ide build

* consistencyscan fdbcli command wasn't actually changing db state

* consistencyscan fdbcli command always said enabled even when it wasn't

---------

Co-authored-by: Steve Atherton <steve.atherton@snowflake.com>
sfc-gh-jslocum and sfc-gh-satherton authored May 18, 2023
1 parent 8194997 commit 2916a11
Showing 13 changed files with 1,400 additions and 675 deletions.
17 changes: 9 additions & 8 deletions documentation/sphinx/source/command-line-interface.rst
@@ -138,24 +138,25 @@ This command controls a native data consistency scan role that is automatically

The syntax is

``consistencyscan [ off | on [maxRate <RATE>] [targetInterval <INTERVAL>] [restart <RESTART>] ]``
``consistencyscan [on|off] [restart] [maxRate <BYTES_PER_SECOND>] [targetInterval <SECONDS>]``

* ``off`` will disable the consistency scan
* ``on`` enables the scan.

* ``on`` will enable the scan and can be accompanied by additional options shown above
* ``off`` disables the scan but keeps the current cycle's progress so it will resume later if enabled again.

* ``RATE`` - sets the maximum read speed of the scan in bytes/s.
* ``restart`` will end the current scan cycle. A new cycle will start if the scan is enabled, or later when it is re-enabled.

* ``INTERVAL`` - sets the target completion time, in seconds, for each full pass over all data in the cluster. Scan speed will target this interval with a hard limit of RATE.
* ``maxRate <BYTES_PER_SECOND>`` sets the maximum scan read speed rate to BYTES_PER_SECOND, post-replication.

* ``RESTART`` - a 1 or 0 and controls whether the process should restart from the beginning of userspace on startup or not. This should normally be set to 0 which will resume progress from the last time the scan was running.
* ``targetInterval <SECONDS>`` sets the target interval for the scan to SECONDS. The scan will adjust speed to attempt to complete in that amount of time but it will not exceed BYTES_PER_SECOND.

The consistency scan role publishes its configuration and metrics in Status JSON under the path ``.cluster.consistency_scan_info``.
The consistency scan role publishes its configuration and metrics in Status JSON under the path ``.cluster.consistency_scan``.
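The relationship between ``targetInterval`` and ``maxRate`` described above can be illustrated with a small worked example. This is only a sketch of the stated rule (aim to finish a round within the interval, never exceed the rate cap); the function name ``plannedScanRate`` and its parameters are hypothetical and are not part of FoundationDB, and the real role's pacing logic is not shown in this section and may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper, NOT FoundationDB code: estimates the read rate (bytes/s)
// needed to cover `replicatedBytes` within `targetIntervalSeconds`, capped at
// `maxRateBytesPerSecond`. Bytes are counted post-replication, matching the
// documented meaning of maxRate.
inline int64_t plannedScanRate(int64_t replicatedBytes,
                               int64_t targetIntervalSeconds,
                               int64_t maxRateBytesPerSecond) {
    assert(replicatedBytes >= 0 && targetIntervalSeconds > 0 && maxRateBytesPerSecond >= 0);
    // Round up so that a round at this rate fits inside the interval.
    int64_t needed = (replicatedBytes + targetIntervalSeconds - 1) / targetIntervalSeconds;
    // The target interval is a goal; the rate cap is a hard limit.
    return std::min(needed, maxRateBytesPerSecond);
}
```

With the example values from the status schema in this commit (``target_interval_seconds`` of 2592000, i.e. 30 days, and ``max_rate_bytes_per_second`` of 50000000), 1 TB of replicated data would need only about 386 KB/s, well under the cap. If the data were large enough to require more than 50 MB/s, the cap would win and the round would take longer than the target interval.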

consistencycheck
----------------

Note: This command exists for backward compatibility, it is suggested to use the ``consistencyscan`` command to control FDB's internal consistency scan role instead.
.. note::
This command exists for backward compatibility, it is suggested to use the ``consistencyscan`` command above to control FDB's internal consistency scan role instead.

This command controls a key which controls behavior of any externally configured consistency check roles. You must be running an ``fdbserver`` process with the ``consistencycheck`` role to perform consistency checking.

47 changes: 46 additions & 1 deletion documentation/sphinx/source/mr-status-json-schemas.rst.inc
@@ -37,6 +37,51 @@
"finished_wiggle": 1
}
},
"consistency_scan" : {
"configuration" : {
"enabled" : true,
"max_rate_bytes_per_second" : 50000000,
"min_interval_seconds" : 2592000,
"min_start_version" : 0,
"round_history_days" : 90,
"target_interval_seconds" : 2592000
},
"current_round" : {
"complete" : false,
"end_datetime" : "1970-01-01 00:00:00.000 +0000",
"end_timestamp" : 0,
"end_version" : 0,
"errors" : 0,
"last_end_key" : "",
"logical_bytes_scanned" : 0,
"replicated_bytes_scanned" : 0,
"skippedRanges" : 0,
"start_datetime" : "1970-01-01 00:00:00.000 +0000",
"start_timestamp" : 0,
"start_version" : 0
},
"lifetime_stats" : {
"errors" : 0,
"logical_bytes_scanned" : 0,
"replicated_bytes_scanned" : 0
},
"previous_rounds" : [
{
"complete" : false,
"end_datetime" : "1970-01-01 00:00:00.000 +0000",
"end_timestamp" : 0,
"end_version" : 0,
"errors" : 0,
"last_end_key" : "",
"logical_bytes_scanned" : 0,
"replicated_bytes_scanned" : 0,
"skippedRanges" : 0,
"start_datetime" : "1970-01-01 00:00:00.000 +0000",
"start_timestamp" : 0,
"start_version" : 0
}
]
},
"layers":{
"_valid":true,
"_error":"some error description"
@@ -548,7 +593,7 @@
"primary_dc_missing",
"fetch_primary_dc_timeout",
"fetch_storage_wiggler_stats_timeout",
"fetch_consistency_scan_info_timeout",
"fetch_consistency_scan_status_timeout",
"metacluster_metrics_missing"
]
},
140 changes: 68 additions & 72 deletions fdbcli/ConsistencyScanCommand.actor.cpp
@@ -18,105 +18,101 @@
* limitations under the License.
*/

#include <boost/lexical_cast.hpp>
#include <list>
#include "fdbcli/fdbcli.actor.h"

#include "fdbclient/FDBOptions.g.h"
#include "fdbclient/IClientApi.h"

#include "flow/Arena.h"
#include "flow/FastRef.h"
#include "flow/ThreadHelper.actor.h"
#include "fdbclient/ReadYourWrites.h"
#include "fdbclient/RunTransaction.actor.h"
#include "fdbclient/ConsistencyScanInterface.actor.h"

#include "flow/actorcompiler.h" // This must be the last #include.

namespace fdb_cli {

ACTOR Future<bool> consistencyScanCommandActor(Database db, std::vector<StringRef> tokens) {
// Skip the command token so start at begin+1
state std::list<StringRef> args(tokens.begin() + 1, tokens.end());

state ConsistencyScanState cs = ConsistencyScanState();
state Reference<ReadYourWritesTransaction> tr = makeReference<ReadYourWritesTransaction>(db);
// Here we do not proceed in a try-catch loop since the transaction is always supposed to succeed.
// If not, the outer loop catch block(fdbcli.actor.cpp) will handle the error and print out the error message
state int usageError = 0;
state ConsistencyScanInfo csInfo = ConsistencyScanInfo();
tr->setOption(FDBTransactionOptions::SPECIAL_KEY_SPACE_ENABLE_WRITES);
tr->setOption(FDBTransactionOptions::PRIORITY_SYSTEM_IMMEDIATE);
state bool error = false;

// Get the exisiting consistencyScanInfo object if present
state Optional<Value> consistencyScanInfo = wait(ConsistencyScanInfo::getInfo(tr));
wait(tr->commit());
if (consistencyScanInfo.present())
csInfo = ObjectReader::fromStringRef<ConsistencyScanInfo>(consistencyScanInfo.get(), IncludeVersion());
tr->reset();
loop {
try {
SystemDBWriteLockedNow(db.getReference())->setOptions(tr);

if (tokens.size() == 1) {
printf("Consistency Scan Info: %s\n", csInfo.toString().c_str());
} else if ((tokens.size() == 2) && tokencmp(tokens[1], "off")) {
csInfo.consistency_scan_enabled = false;
wait(ConsistencyScanInfo::setInfo(tr, csInfo));
wait(tr->commit());
} else if ((tokencmp(tokens[1], "on") && tokens.size() > 2)) {
csInfo.consistency_scan_enabled = true;
state std::vector<StringRef>::iterator t;
for (t = tokens.begin() + 2; t != tokens.end(); ++t) {
if (tokencmp(t->toString(), "restart")) {
if (++t != tokens.end()) {
if (tokencmp(t->toString(), "0")) {
csInfo.restart = false;
} else if (tokencmp(t->toString(), "1")) {
csInfo.restart = true;
} else {
usageError = 1;
}
} else {
usageError = 1;
}
} else if (tokencmp(t->toString(), "maxRate")) {
if (++t != tokens.end()) {
char* end;
csInfo.max_rate = std::strtod(t->toString().data(), &end);
if (!std::isspace(*end) && (*end != '\0')) {
fprintf(stderr, "ERROR: %s failed to parse.\n", t->toString().c_str());
return false;
state ConsistencyScanState::Config config = wait(ConsistencyScanState().config().getD(tr));

if (args.empty()) {
printf(
"%s\n",
json_spirit::write_string(json_spirit::mValue(config.toJSON()), json_spirit::pretty_print).c_str());
break;
}

// TODO: Expose/document additional configuration options
// TODO: Range configuration.
while (!error && !args.empty()) {
auto next = args.front();
args.pop_front();
if (next == "on") {
config.enabled = true;
} else if (next == "off") {
config.enabled = false;
} else if (next == "restart") {
config.minStartVersion = tr->getReadVersion().get();
} else if (next == "maxRate") {
error = args.empty();
if (!error) {
config.maxReadByteRate = boost::lexical_cast<int>(args.front().toString());
args.pop_front();
}
} else {
usageError = 1;
}
} else if (tokencmp(t->toString(), "targetInterval")) {
if (++t != tokens.end()) {
char* end;
csInfo.target_interval = std::strtod(t->toString().data(), &end);
if (!std::isspace(*end) && (*end != '\0')) {
fprintf(stderr, "ERROR: %s failed to parse.\n", t->toString().c_str());
return false;
} else if (next == "targetInterval") {
error = args.empty();
if (!error) {
config.targetRoundTimeSeconds = boost::lexical_cast<int>(args.front().toString());
args.pop_front();
}
} else {
usageError = 1;
}
} else {
usageError = 1;
}
}

if (!usageError) {
wait(ConsistencyScanInfo::setInfo(tr, csInfo));
if (error) {
break;
}
cs.config().set(tr, config);
wait(tr->commit());
break;
} catch (Error& e) {
wait(tr->onError(e));
}
} else {
usageError = 1;
}

if (usageError) {
if (error) {
printUsage(tokens[0]);
return false;
}

return true;
}

CommandFactory consistencyScanFactory(
"consistencyscan",
CommandHelp("consistencyscan <on|off> <restart 0|1> <maxRate val> <targetInterval val>",
"enables or disables consistency scan",
"Calling this command with `on' enables the consistency scan process to run the scan with given "
"arguments and `off' will halt the scan. "
"Calling this command with no arguments will display if consistency scan is currently enabled.\n"));
CommandHelp(
// TODO: Expose/document additional configuration options
"consistencyscan [on|off] [restart] [maxRate <BYTES_PER_SECOND>] [targetInterval <SECONDS>]",
"Enables, disables, or sets options for the Consistency Scan role which repeatedly scans "
"shard replicas for consistency.",
"`on' enables the scan.\n\n"
"`off' disables the scan but keeps the current cycle's progress so it will resume later if enabled again.\n\n"
"`restart' will end the current scan cycle. A new cycle will start if the scan is enabled, or later when "
"it is enabled.\n\n"
"`maxRate <BYTES_PER_SECOND>' sets the maximum scan read speed rate to BYTES_PER_SECOND, post-replication.\n\n"
"`targetInterval <SECONDS>' sets the target interval for the scan to SECONDS. The scan will adjust speed "
"to attempt to complete in that amount of time but it will not exceed BYTES_PER_SECOND\n\n"
"The consistency scan role publishes its configuration and metrics in Status JSON under the path "
"`.cluster.consistency_scan'\n"
// TODO: Syntax hint generator
));

} // namespace fdb_cli
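The token loop in the new `consistencyScanCommandActor` can be sketched outside of Flow as follows. This is a simplified, standalone illustration, not the code from the commit: `ScanOptions`, `parseCount`, and `parseScanArgs` are hypothetical names, and `restart` only records a flag here, whereas the diff sets `minStartVersion` from the transaction's read version. One deliberate difference: the diff parses numbers with `boost::lexical_cast<int>`, which throws `boost::bad_lexical_cast` on malformed input; this sketch swaps in `std::from_chars` (C++17) so that a malformed number is reported as an ordinary usage error.

```cpp
#include <charconv>
#include <cstdint>
#include <list>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical stand-in for the configuration fields the command modifies.
struct ScanOptions {
    std::optional<bool> enabled;          // unset if neither on nor off was given
    bool restart = false;                 // the real code records a start version instead
    std::optional<int64_t> maxRate;       // bytes per second
    std::optional<int64_t> targetInterval; // seconds
};

// Parses the whole token as a non-negative integer; nullopt on any failure.
inline std::optional<int64_t> parseCount(const std::string& s) {
    int64_t value = 0;
    const char* first = s.data();
    const char* last = s.data() + s.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc() || ptr != last || value < 0)
        return std::nullopt;
    return value;
}

// Consumes tokens front to back, as the actor does; nullopt signals a usage error.
inline std::optional<ScanOptions> parseScanArgs(std::list<std::string> args) {
    ScanOptions opts;
    while (!args.empty()) {
        std::string next = args.front();
        args.pop_front();
        if (next == "on") {
            opts.enabled = true;
        } else if (next == "off") {
            opts.enabled = false;
        } else if (next == "restart") {
            opts.restart = true;
        } else if (next == "maxRate" || next == "targetInterval") {
            if (args.empty())
                return std::nullopt; // option given without its value
            std::optional<int64_t> value = parseCount(args.front());
            args.pop_front();
            if (!value)
                return std::nullopt; // value is not a non-negative integer
            (next == "maxRate" ? opts.maxRate : opts.targetInterval) = value;
        } else {
            return std::nullopt; // unknown token
        }
    }
    return opts;
}
```

Because options are consumed in order and none depends on another, a call such as `parseScanArgs({"on", "maxRate", "50000000"})` and one with the tokens reordered yield the same result, which mirrors the `[on|off] [restart] [maxRate ...] [targetInterval ...]` syntax where every part is optional.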
58 changes: 45 additions & 13 deletions fdbclient/Schemas.cpp
@@ -584,7 +584,7 @@ const KeyRef JSONSchemas::statusSchema = R"statusSchema(
"primary_dc_missing",
"fetch_primary_dc_timeout",
"fetch_storage_wiggler_stats_timeout",
"fetch_consistency_scan_info_timeout",
"fetch_consistency_scan_status_timeout",
"metacluster_metrics_missing"
]
},
@@ -871,18 +871,50 @@ const KeyRef JSONSchemas::statusSchema = R"statusSchema(
"cluster_aware"
]}
},
"consistency_scan_info":{
"consistency_scan_enabled":false,
"restart":false,
"max_rate":0,
"target_interval":0,
"bytes_read_prev_round":0,
"last_round_start_datetime":"2022-04-20 00:05:05.123 +0000",
"last_round_finish_datetime":"1970-01-01 00:00:00.000 +0000",
"last_round_start_timestamp":1648857905.123,
"last_round_finish_timestamp":0,
"smoothed_round_seconds":1,
"finished_rounds":1
"consistency_scan" : {
"configuration" : {
"enabled" : true,
"max_rate_bytes_per_second" : 50000000,
"min_interval_seconds" : 2592000,
"min_start_version" : 0,
"round_history_days" : 90,
"target_interval_seconds" : 2592000
},
"current_round" : {
"complete" : false,
"end_datetime" : "1970-01-01 00:00:00.000 +0000",
"end_timestamp" : 0,
"end_version" : 0,
"errors" : 0,
"last_end_key" : "",
"logical_bytes_scanned" : 0,
"replicated_bytes_scanned" : 0,
"skippedRanges" : 0,
"start_datetime" : "1970-01-01 00:00:00.000 +0000",
"start_timestamp" : 0,
"start_version" : 0
},
"lifetime_stats" : {
"errors" : 0,
"logical_bytes_scanned" : 0,
"replicated_bytes_scanned" : 0
},
"previous_rounds" : [
{
"complete" : false,
"end_datetime" : "1970-01-01 00:00:00.000 +0000",
"end_timestamp" : 0,
"end_version" : 0,
"errors" : 0,
"last_end_key" : "",
"logical_bytes_scanned" : 0,
"replicated_bytes_scanned" : 0,
"skippedRanges" : 0,
"start_datetime" : "1970-01-01 00:00:00.000 +0000",
"start_timestamp" : 0,
"start_version" : 0
}
]
},
"data":{
"least_operating_space_bytes_log_server":0,
2 changes: 0 additions & 2 deletions fdbclient/SystemData.cpp
@@ -970,8 +970,6 @@ const KeyRef perpetualStorageWigglePrefix("\xff/storageWiggle/"_sr);

const KeyRef triggerDDTeamInfoPrintKey("\xff/triggerDDTeamInfoPrint"_sr);

const KeyRef consistencyScanInfoKey = "\xff/consistencyScanInfo"_sr;

const KeyRef encryptionAtRestModeConfKey("\xff/conf/encryption_at_rest_mode"_sr);
const KeyRef tenantModeConfKey("\xff/conf/tenant_mode"_sr);
