
kafka: improve CreateTopics error and boolean handling #23682

Merged

Conversation

@WillemKauf (Contributor) commented Oct 8, 2024

This PR covers three cases:

1. Invalid boolean properties

Previously, invalid boolean property values were silently ignored while handling a CreateTopics request. For example:

$ rpk topic create tapioca -c redpanda.remote.read=yes_this_is_true
TOPIC    STATUS
tapioca  OK
$ rpk topic describe tapioca
CONFIGS
=======
KEY                   VALUE   SOURCE
redpanda.remote.read  false   DEFAULT_CONFIG

Here the topic tapioca is created instead of an error being returned, and its value for redpanda.remote.read silently falls back to the cluster default (false in the example above).

This is unlike Kafka, in which an invalid configuration error is propagated back to the user:

$ ./kafka-topics.sh --create --topic tapioca --bootstrap-server localhost:9092 --config preallocate=yes_this_is_true
Error while executing topic command : Invalid value yes_this_is_true for configuration preallocate: Expected value to be either true or false
[2024-10-08 12:39:32,885] ERROR org.apache.kafka.common.config.ConfigException: Invalid value yes_this_is_true for configuration preallocate: Expected value to be either true or false
	at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:706)
	at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:531)
	at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:524)
	at org.apache.kafka.storage.internals.log.LogConfig.validate(LogConfig.java:692)
	at org.apache.kafka.storage.internals.log.LogConfig.validate(LogConfig.java:684)
	at org.apache.kafka.tools.TopicCommand.parseTopicConfigsToBeAdded(TopicCommand.java:176)
	at org.apache.kafka.tools.TopicCommand.access$000(TopicCommand.java:81)
	at org.apache.kafka.tools.TopicCommand$CommandTopicPartition.<init>(TopicCommand.java:267)
	at org.apache.kafka.tools.TopicCommand$TopicService.createTopic(TopicCommand.java:460)
	at org.apache.kafka.tools.TopicCommand.execute(TopicCommand.java:105)
	at org.apache.kafka.tools.TopicCommand.mainNoExit(TopicCommand.java:90)
	at org.apache.kafka.tools.TopicCommand.main(TopicCommand.java:85)
 (org.apache.kafka.tools.TopicCommand)

Fix this inconsistency by returning a kafka::error_code::invalid_config when an invalid boolean value is provided.

$ rpk topic create tapioca -c redpanda.remote.read=yes_this_is_true
TOPIC    STATUS
tapioca  INVALID_CONFIG: Configuration is invalid

2. Case-insensitive boolean properties

Previously, the only boolean values considered valid in a CreateTopics request were true and false. Any other casing, such as True, TRUE, or fAlSe, again fell into the "silently ignored" state described above. For example:

$ rpk topic create tapioca -c redpanda.remote.read=True
TOPIC    STATUS
tapioca  OK
$ rpk topic describe tapioca
CONFIGS
=======
KEY                   VALUE   SOURCE
redpanda.remote.read  false   DEFAULT_CONFIG

This is unlike Kafka, in which the boolean property's value is case-insensitive:

$ ./kafka-topics.sh --create --topic tapioca --bootstrap-server localhost:9092 --config preallocate=True
Created topic tapioca.

(Kafka topic has set `preallocate=true`)

Fix this inconsistency by making boolean property values in redpanda's CreateTopics requests case-insensitive.

$ rpk topic create tapioca -c redpanda.remote.read=tRuE
TOPIC    STATUS
tapioca  OK
$ rpk topic describe tapioca
CONFIGS
=======
KEY                   VALUE   SOURCE
redpanda.remote.read  true    DYNAMIC_TOPIC_CONFIG
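For illustration, a minimal standalone sketch of the kind of parsing described in cases 1 and 2 (the helper name is hypothetical; the PR itself uses the in-tree string_switch utility shown later in the review thread): lower-case the value, accept only true/false, and treat anything else as an invalid configuration.

#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical helper: parse a boolean topic property as described above.
// "true"/"false" in any casing are accepted; anything else yields
// std::nullopt, which the caller would map to kafka::error_code::invalid_config.
std::optional<bool> parse_topic_bool(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    if (value == "true") {
        return true;
    }
    if (value == "false") {
        return false;
    }
    return std::nullopt;
}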

3. Other invalid topic configuration error handling

A few functions used unprotected boost::lexical_cast calls to convert input parameters to their required type.

Before, when such a call failed and threw a bad_lexical_cast exception, the exception was not caught and instead propagated up to the request_context layer, causing the request to be retried until it eventually timed out.

$ rpk topic create tapioca -c retention.bytes=this_should_be_an_integer

redpanda output:

WARN  2024-10-09 13:24:15,329 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN  2024-10-09 13:24:15,626 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN  2024-10-09 13:24:16,081 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN  2024-10-09 13:24:16,914 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN  2024-10-09 13:24:19,145 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN  2024-10-09 13:24:21,654 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)
WARN  2024-10-09 13:24:24,164 [shard  1:main] kafka - connection_context.cc:872 - Error processing request: boost::wrapexcept<boost::bad_lexical_cast> (bad lexical cast: source type value could not be interpreted as target)

rpk output:

unable to create topics [tapioca]: broker closed the connection immediately after a request was issued, which happens when SASL is required but not provided: is SASL missing?

This also had the effect of "all-or-nothing" topic creation, in which a single bad configuration resulted in none of the requested topics being created, unlike Kafka's behaviour.

Fix this inconsistency by catching these exceptions and gracefully returning error codes for invalid topic configurations instead.
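A minimal sketch of the shape of that fix, assuming a hypothetical helper name (the actual change wraps the existing conversion paths inside the CreateTopics handler):

#include <boost/lexical_cast.hpp>

#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper: convert a numeric topic property, but translate a
// parse failure into an empty result instead of letting bad_lexical_cast
// escape to the connection layer. The handler would map std::nullopt to an
// INVALID_CONFIG response for that topic.
std::optional<int64_t> parse_retention_bytes(const std::string& value) {
    try {
        return boost::lexical_cast<int64_t>(value);
    } catch (const boost::bad_lexical_cast&) {
        return std::nullopt;
    }
}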

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Improvements

  • Improve handling of boolean property values during a CreateTopics request by making parsing case-insensitive.
  • Improve handling of boolean values during a CreateTopics request by no longer silently ignoring an invalid value, instead throwing a configuration error.
  • Improve handling of certain invalid topic configuration parameters that would lead to a timeout failure instead of a graceful error code during a CreateTopics request.

@dotnwat requested review from twmb, pgellert and r-vasquez on October 8, 2024 17:50
@vbotbuildovich (Collaborator) commented Oct 8, 2024

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56026#01926d55-7886-4ec7-9d34-7c7b67f77be9:

"rptest.tests.retention_policy_test.ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention.cluster_remote_write=False.topic_remote_write=-1.cloud_storage_type=CloudStorageType.ABS"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56026#01926d55-7883-4cfd-9b69-d1abb3340db3:

"rptest.tests.retention_policy_test.ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention.cluster_remote_write=True.topic_remote_write=-1.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56026#01926d55-788a-4412-8a4e-6e65644fc108:

"rptest.tests.retention_policy_test.ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention.cluster_remote_write=False.topic_remote_write=-1.cloud_storage_type=CloudStorageType.S3"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/56026#01926d55-788d-4fc3-9720-49e098f057eb:

"rptest.tests.retention_policy_test.ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention.cluster_remote_write=True.topic_remote_write=-1.cloud_storage_type=CloudStorageType.ABS"

@vbotbuildovich (Collaborator) commented Oct 8, 2024

Retry command for Build#56026

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/retention_policy_test.py::ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention@{"cloud_storage_type":2,"cluster_remote_write":false,"topic_remote_write":"-1"}
tests/rptest/tests/retention_policy_test.py::ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention@{"cloud_storage_type":1,"cluster_remote_write":true,"topic_remote_write":"-1"}
tests/rptest/tests/retention_policy_test.py::ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention@{"cloud_storage_type":1,"cluster_remote_write":false,"topic_remote_write":"-1"}
tests/rptest/tests/retention_policy_test.py::ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention@{"cloud_storage_type":2,"cluster_remote_write":true,"topic_remote_write":"-1"}

@WillemKauf force-pushed the create_topics_boolean_handling branch from b4dee11 to 9aec1e9 on October 8, 2024 19:42
@WillemKauf (Contributor Author) commented:

Force push to:

  • Fix test bugs in retention_policy_test.py::ShadowIndexingLocalRetentionTest.test_shadow_indexing_default_local_retention.

@dotnwat (Member) left a comment:

generally looks good. the first commit could be split into 3 commits (maybe four)

Comment on lines +250 to +261

+        config = {"cleanup.policy": TopicSpec.CLEANUP_DELETE}
+
+        # None case means use cluster default for topic creation.
+        if topic_remote_write is not None:
+            config |= {"redpanda.remote.write": topic_remote_write}
+
         self.rpk.create_topic(topic=self.topic_name,
                               partitions=1,
                               replicas=1,
-                              config={
-                                  "cleanup.policy": TopicSpec.CLEANUP_DELETE,
-                                  "redpanda.remote.write": topic_remote_write
-                              })
+                              config=config)

-        if topic_remote_write != "-1":
+        if topic_remote_write is not None:
@dotnwat (Member) commented:

  1. is this change related to the handling of bools, or is it a cleanup?
  2. prior to this change -1 was the actual value passed in as the value for redpanda.remote.write, but that seems to no longer be the case. was that a value that was being tested for explicitly for a particular expected behavior?

@WillemKauf (Contributor Author) replied:

This test fails with the changes in this PR if the test changes are not made.

You are correct in seeing that topic create topic -c redpanda.remote.write=-1 would be the original command executed in this test, which previously would succeed because we would hit the "silently ignore" case discussed in the cover letter. That was the author's intention, to use the cluster default, but this behavior is not supported in Kafka and therefore shouldn't be supported in redpanda.

Using -1 as the value for a boolean returns an error in Kafka:

$ ./kafka-topics.sh --create --topic tapioca --bootstrap-server localhost:9092 --config preallocate=-1
Error while executing topic command : Invalid value -1 for configuration preallocate: Expected value to be either true or false
[2024-10-08 15:31:09,189] ERROR org.apache.kafka.common.config.ConfigException: Invalid value -1 for configuration preallocate: Expected value to be either true or false
	at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:706)
	at org.apache.kafka.common.config.ConfigDef.parseValue(ConfigDef.java:531)
	at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:524)
	at org.apache.kafka.storage.internals.log.LogConfig.validate(LogConfig.java:692)
	at org.apache.kafka.storage.internals.log.LogConfig.validate(LogConfig.java:684)
	at org.apache.kafka.tools.TopicCommand.parseTopicConfigsToBeAdded(TopicCommand.java:176)
	at org.apache.kafka.tools.TopicCommand.access$000(TopicCommand.java:81)
	at org.apache.kafka.tools.TopicCommand$CommandTopicPartition.<init>(TopicCommand.java:267)
	at org.apache.kafka.tools.TopicCommand$TopicService.createTopic(TopicCommand.java:460)
	at org.apache.kafka.tools.TopicCommand.execute(TopicCommand.java:105)
	at org.apache.kafka.tools.TopicCommand.mainNoExit(TopicCommand.java:90)
	at org.apache.kafka.tools.TopicCommand.main(TopicCommand.java:85)
 (org.apache.kafka.tools.TopicCommand)

The test was written with this assumed behavior for -1 meaning "cluster default" in mind, and they were accidentally correct.

I fixed the test here by just not setting a value for redpanda.remote.write, which achieves the same thing as originally intended.

Comment on lines +120 to +125
    return string_switch<std::optional<bool>>(str_value)
      .match("true", true)
      .match("false", false);
} catch (const std::runtime_error&) {
    throw boost::bad_lexical_cast();
@dotnwat (Member) commented:

why not leave default_match and then throw bad_lexical_cast if std::nullopt is returned? this formulation seems to assume that runtime_error can only be thrown due to a bad match. that's probably the case, but i'm not sure string_switch has any sort of strict contract.
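For reference, a sketch of that suggested alternative, assuming the in-tree string_switch header and an illustrative function name:

#include "utils/string_switch.h"  // include path assumed

#include <boost/lexical_cast.hpp>

#include <optional>
#include <string_view>

std::optional<bool> get_bool_value_alternative(std::string_view str_value) {
    // Use default_match so string_switch never throws on a fall-through,
    // then signal the malformed value explicitly.
    std::optional<bool> parsed = string_switch<std::optional<bool>>(str_value)
                                   .match("true", true)
                                   .match("false", false)
                                   .default_match(std::nullopt);
    if (!parsed) {
        throw boost::bad_lexical_cast();
    }
    return parsed;
}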

@WillemKauf (Contributor Author) replied Oct 8, 2024:

I was following the same pattern I was familiar with from:

} catch (const std::runtime_error&) {
    // Our callers expect this exception type on malformed values
    throw boost::bad_lexical_cast();
}

but we can certainly implement it in the way you described if you feel strongly. What do you think?

@WillemKauf (Contributor Author) added:

I don't think string_switch has a strict contract per se, but if

constexpr operator R() { // NOLINT
    if (!result) {
        throw std::runtime_error(
          std::string(
            "Fell off the end of a string-switch while matching: ")
          + std::string(view));
    }
    return std::move(*result);
}

were to change, I think it would break a few existing spots in the code.

@nvartolomei (Contributor) left a comment:

Excellent find and fix! 🙇‍♂️

@WillemKauf force-pushed the create_topics_boolean_handling branch from 9aec1e9 to 8f9cfcf on October 9, 2024 14:37
@WillemKauf (Contributor Author) commented:

Force push to:

  • Split commits up per Noah's request
  • Add WARN level logs around caught exceptions due to bad configurations per Nicolae's request

Commit messages from the pushed commits:
In the case that some topic configurations were invalid while others
were valid, the `create_topics_response` vector may have had results
pushed back to it in an order that is different than the order in
which the topics were supplied in the request.

Add a new function `sort_topic_response()` that sorts the response's
`topics` vector based on the order of the `create_topics_request`
provided before returning it.
This commit changes the behavior of `to_cluster_type()`, as well
as the `create_topics` handler as a whole.

Before, if any invalid topic configuration in a `create_topics_request`
led to a function called from `to_cluster_type()` throwing an exception,
the request would continually be retried until a timeout was hit.

This had the effect of "all-or-nothing" behavior with topic creation,
in which just one invalid topic config would result in none of the topics
being created.

By catching and handling the exception in `to_cluster_type()`, we now change
the behavior to allow for partial creation of topics, in which those topics
with valid configurations are successfully created, while the invalid topics
have their errors logged and returned.

An additional iterator is passed in order to generate any necessary
errors in the `create_topics_response`, which is returned to the
`kafka::request_context`.
To ensure a dry run `CreateTopics` request reflects the proper
result, try parsing the passed config parameters. Any errors
found in the configuration will be logged and returned.
Invalid boolean property values will no longer be silently ignored
due to the previous use of `default_match`. An exception is handled
and a `kafka::error_code::invalid_config` is returned to the user.
Boolean property values are now case-insensitive, just like in Kafka.
This means any of `true`, `True`, `TRUE`, `tRuE` are all evaluated
the same way when creating a topic.
@WillemKauf force-pushed the create_topics_boolean_handling branch from 8f9cfcf to fa9a760 on October 9, 2024 20:14
@WillemKauf (Contributor Author) commented:

Force push to:

  • Amend the commit message for commit `kafka: fixes and interface change to to_cluster_type()` to indicate a bug fix beyond a simple interface change for CreateTopics requests with a mixture of valid and invalid configurations.
  • Add commit `kafka: add sort_topic_response() in create_topics.cc`, which sorts the `topic_create_response`'s vector of topics per the initial order of topic names in the `topic_create_request`.
  • Add test cases create_multiple_topics_mixed_invalid and create_multiple_topics_all_invalid to src/v/kafka/server/tests/create_topics_test.cc.

@WillemKauf changed the title from "kafka: improve CreateTopics boolean handling" to "kafka: improve CreateTopics error and boolean handling" on Oct 9, 2024
@pgellert (Contributor) left a comment:

lgtm

// order as the topics were passed. This would be out of order in the case that
// some topics were not successfully created (error response returned) while
// others were.
static void sort_topic_response(
A reviewer (Contributor) commented:

Would be useful to note why we are sorting these. Do some clients expect these to be ordered? Observed behavior with kafka? Us being worried about changing the behavior?

@WillemKauf (Contributor Author) replied Oct 10, 2024:

  1. Test cases can be written more easily knowing that the topic responses are returned in a sane ordering (the same order as the request).
  2. Kafka seems to also respond with this ordering.
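For illustration, a self-contained sketch of the reordering idea described above; the real sort_topic_response() operates on create_topics_request/create_topics_response and uses an absl::flat_hash_map keyed by model::topic, so the types and names below are simplified stand-ins:

#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct topic_result {
    std::string name;
    int error_code;
};

// Restore request order in the response after valid and invalid topics were
// appended to it at different points in the handler. Assumes every result
// name came from the request.
void sort_by_request_order(
  const std::vector<std::string>& requested, std::vector<topic_result>& results) {
    std::unordered_map<std::string, size_t> pos_by_name;
    for (size_t i = 0; i < requested.size(); ++i) {
        pos_by_name.emplace(requested[i], i);
    }
    std::sort(
      results.begin(),
      results.end(),
      [&](const topic_result& a, const topic_result& b) {
          return pos_by_name.at(a.name) < pos_by_name.at(b.name);
      });
}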

// others were.
static void sort_topic_response(
const create_topics_request& req, create_topics_response& resp) {
absl::flat_hash_map<model::topic, size_t> topic_name_order_map;
A reviewer (Contributor) commented:

This is fine. Consider in the future naming maps something like pos_by_name. Code reads more fluently with names like these I think.

@@ -111,7 +111,14 @@ static std::optional<bool>
get_bool_value(const config_map_t& config, std::string_view key) {
@nvartolomei (Contributor) commented Oct 10, 2024:

Looks like not all bools are treated equally 😠. We can create another ticket for write.caching. No need to address it in this PR.

→ ~/redpanda/vbuild/go/linux/amd64/bin/rpk topic create baz -c write.caching=TruE
TOPIC STATUS
baz INVALID_CONFIG: Unsupported write caching configuration.

(This comment is edited because initially i used wrong commands to test).

@WillemKauf (Contributor Author) replied Oct 10, 2024:

write.caching is an enum, disguised as a bool. We don't actually make calls to get_bool_value() for this property. This INVALID_CONFIG in particular comes from the validate_write_caching check:

auto mode = model::write_caching_mode_from_string(it->value.value());
// Failed to parse the value.
if (!mode) {
return false;

I'm glad it's getting caught here though, because write_caching_mode_from_string() also has a default match case...:

write_caching_mode_from_string(std::string_view s) {
    return string_switch<std::optional<write_caching_mode>>(s)
      .match(
        model::write_caching_mode_to_string(
          model::write_caching_mode::default_true),
        model::write_caching_mode::default_true)
      .match(
        model::write_caching_mode_to_string(
          model::write_caching_mode::default_false),
        model::write_caching_mode::default_false)
      .match(
        model::write_caching_mode_to_string(
          model::write_caching_mode::disabled),
        model::write_caching_mode::disabled)
      .default_match(std::nullopt);
}

Agreed that this should be cleaned up in the future as well.

@nvartolomei (Contributor) left a comment:

👍

@WillemKauf merged commit d8732b4 into redpanda-data:dev on Oct 10, 2024
21 checks passed