
feat(slices): Make consumers "slice-aware" #3259

Merged — 41 commits merged from sliced-consumers-cli into master on Nov 14, 2022
Conversation

ayirr7 (Member) commented Oct 14, 2022

This PR:

  • Passes in slice id as a CLI argument to snuba consumer
  • When relevant, converts the default logical topic to the physical topic (specific to the slice) using sliced topic configurations in settings. This mapping is performed for the input consumption topic and the commit log topic.
  • Mainly modifies the ConsumerBuilder logic to take into account any slice id that is passed in
  • Builds upon the changes in the "Topic enum to class" PR, and uses register_topic() to register slice-specific physical topics that are defined in SLICED_KAFKA_TOPIC_MAP in settings
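For illustration, the logical-to-physical mapping described above can be sketched as follows. The shape of SLICED_KAFKA_TOPIC_MAP and the "<topic>-<slice>" naming convention are assumptions made for this sketch, not necessarily Snuba's actual settings format:

```python
# Hypothetical sketch of the logical -> physical topic mapping described
# in the PR summary. The map's key/value shape and the "-<slice>" suffix
# are illustrative assumptions, not Snuba's real settings.
from typing import Mapping, Optional, Tuple

# Maps (logical topic name, slice id) -> physical topic name.
SLICED_KAFKA_TOPIC_MAP: Mapping[Tuple[str, int], str] = {
    ("snuba-generic-metrics", 1): "snuba-generic-metrics-1",
}

def get_physical_topic(logical_topic: str, slice_id: Optional[int]) -> str:
    """Resolve the slice-specific physical topic, falling back to the
    logical topic name when no slice id or no override is configured."""
    if slice_id is None:
        return logical_topic
    return SLICED_KAFKA_TOPIC_MAP.get((logical_topic, slice_id), logical_topic)
```

In this sketch the same resolution would be applied to both the input consumption topic and the commit log topic.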

Next steps:

  • Add some sort of verification that allows us to check whether the slice id being passed in is valid for the specific storage
  • Further testing

@ayirr7 ayirr7 requested a review from a team as a code owner October 14, 2022 05:58
@ayirr7 ayirr7 changed the base branch from master to enoch/remove-topics-enum October 14, 2022 06:00
codecov-commenter commented Oct 14, 2022

Codecov Report

Base: 92.93% // Head: 21.81% // Decreases project coverage by 71.12% ⚠️

Coverage data is based on head (dde5c5a) compared to base (54e8a43).
Patch coverage: 6.94% of modified lines in pull request are covered.

Additional details and impacted files

```
@@             Coverage Diff             @@
##           master    #3259       +/-   ##
===========================================
- Coverage   92.93%   21.81%   -71.12%
===========================================
  Files         702      663       -39
  Lines       32256    31249     -1007
===========================================
- Hits        29976     6817    -23159
- Misses       2280    24432    +22152
```

| Impacted Files | Coverage | Δ |
| --- | --- | --- |
| snuba/cli/consumer.py | 0.00% <0.00%> | -96.08% ⬇️ |
| snuba/cli/test_consumer.py | 0.00% <ø> | -71.43% ⬇️ |
| snuba/consumers/consumer.py | 0.00% <0.00%> | -84.40% ⬇️ |
| snuba/consumers/consumer_builder.py | 0.00% <0.00%> | -91.90% ⬇️ |
| tests/datasets/test_table_storage.py | 0.00% <0.00%> | ø |
| tests/utils/streams/test_kafka_config.py | 0.00% <0.00%> | -100.00% ⬇️ |
| snuba/utils/streams/configuration_builder.py | 47.82% <16.66%> | -52.18% ⬇️ |
| snuba/datasets/table_storage.py | 67.93% <28.57%> | -26.47% ⬇️ |
| snuba/datasets/partitioning.py | 38.88% <33.33%> | -61.12% ⬇️ |
| tests/base.py | 0.00% <0.00%> | -100.00% ⬇️ |
| … and 650 more | | |


☔ View full report at Codecov.

onewland (Contributor) commented Oct 14, 2022

Your change currently does not select the correct ClickHouse cluster for the given slice, so that needs to be fixed before a merge is possible (you can't rely on the changes we made to the query path).

Given that the TableWriter gives you access to the Kafka topic (indirectly via the stream loader) AND selects the cluster, I wonder if you could pass the slice ID through to the TableWriter's creation or to get_table_writer(), and hide more of the slice awareness from the consumer builder:

```python
return get_cluster(self.__storage_set).get_batch_writer(
    metrics,
    insert_statement,
    encoding=None,
    options=options,
    chunk_size=chunk_size,
    buffer_size=0,
)
```
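A minimal sketch of that suggestion, with illustrative stand-in names (for simplicity, get_cluster here returns a plain string rather than a real ClickHouse cluster object):

```python
# Illustrative sketch of threading the slice id through TableWriter
# creation so that cluster selection stays hidden from the consumer
# builder. All names and return types are stand-ins, not Snuba's API.
from typing import Optional

def get_cluster(storage_set: str, slice_id: Optional[int] = None) -> str:
    # Stand-in for real cluster lookup: a sliced storage set resolves to
    # a slice-specific cluster, an unsliced one to the default cluster.
    if slice_id is None:
        return f"{storage_set}-cluster"
    return f"{storage_set}-cluster-{slice_id}"

class TableWriter:
    def __init__(self, storage_set: str, slice_id: Optional[int] = None) -> None:
        self.__storage_set = storage_set
        self.__slice_id = slice_id  # populated at creation, as suggested

    def get_batch_writer(self) -> str:
        # Cluster selection is slice-aware here; callers such as the
        # consumer builder never need to see the slice id.
        return get_cluster(self.__storage_set, self.__slice_id)
```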

@enochtangg enochtangg force-pushed the enoch/remove-topics-enum branch from e7ee9ab to 311b3f8 Compare October 14, 2022 16:59
ayirr7 (Member, Author) commented Oct 14, 2022

> Given that the TableWriter gives you access to the Kafka topic (indirectly via the stream loader) AND selects the cluster, I wonder if you could pass through the slice ID to TableWriters creation or to get_table_writer() and hide more slice awareness from the consumer builder

In order to facilitate cluster selection using slice id, I modified get_table_writer() and the associated superclass to take in slice id. This populates a TableWriter object's slice id attribute, which I added. Now, we can get a slice- and cluster-specific BatchWriter from the TableWriter object.

One alternative to this would have been to modify the WritableTableStorage class to take in slice id upon initialization. However, it seemed to me that this would be a more invasive/refactoring-heavy change.

I'll be making more organizational changes to the code next (e.g. putting the logical to physical topic mapping into a function, etc.)

Base automatically changed from enoch/remove-topics-enum to master October 17, 2022 14:16
@ayirr7 ayirr7 changed the base branch from master to enoch/remove-topics-enum October 17, 2022 23:03
@ayirr7 ayirr7 requested a review from onewland October 18, 2022 20:10
onewland (Contributor) left a comment:

Pending my comments, I think I'm fine with this approach.

Have you tried testing locally by creating a sliced cluster/topic, manually starting a consumer, and seeing if you can write to slice 1?

ayirr7 (Member, Author) commented Oct 20, 2022

> Have you tried testing locally by creating a sliced cluster/topic, manually starting a consumer, and seeing if you can write to slice 1?

A few updates to this PR:

Tested consumer behavior using the snuba-generic-metrics topic with slice_id = 1 (physical topic snuba-generic-metrics-1). I was able to successfully produce to and consume from snuba-generic-metrics-1, with the storage being generic_metrics_distributions_raw. I've written up the configuration changes and commands I used for testing in a Notion doc, which I am happy to share.

Removed the logic for mapping logical Snuba topics to physical/sliced topics. Snuba topics are only used for retrieving Kafka broker config in the ConsumerBuilder, and broker config can be looked up with just a (logical topic name, slice id) pair.

Kept the logic for mapping logical Arroyo topics to physical/sliced topics, and passed the sliced topic into the StreamProcessor logic so that the consumer connects to the correct topic (snuba-generic-metrics-1 in the example above).

Finally, fixed what may be a small typo in the "Topic enum to class" PR (since this PR builds on it). Will update here as necessary.

Next steps include addressing other comments above regarding KafkaTopicSpec, etc.
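The broker-config lookup described above (keyed by a logical topic name and slice id) might look roughly like this; DEFAULT_BROKER_CONFIG and SLICED_KAFKA_BROKER_CONFIG are hypothetical names invented for this sketch:

```python
# Sketch of retrieving Kafka broker config from a (logical topic name,
# slice id) pair, per the comment above. The settings names and their
# shapes are assumptions for illustration, not Snuba's real settings.
from typing import Mapping, Optional, Tuple

DEFAULT_BROKER_CONFIG: Mapping[str, str] = {
    "bootstrap.servers": "127.0.0.1:9092",
}

# Slice-specific broker overrides, keyed by (logical topic name, slice id).
SLICED_KAFKA_BROKER_CONFIG: Mapping[Tuple[str, int], Mapping[str, str]] = {
    ("snuba-generic-metrics", 1): {"bootstrap.servers": "127.0.0.1:9093"},
}

def get_broker_config(
    logical_topic: str, slice_id: Optional[int]
) -> Mapping[str, str]:
    """Return the sliced broker config when one is defined for the pair,
    otherwise fall back to the default broker config."""
    if slice_id is not None:
        sliced = SLICED_KAFKA_BROKER_CONFIG.get((logical_topic, slice_id))
        if sliced is not None:
            return sliced
    return DEFAULT_BROKER_CONFIG
```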

onewland (Contributor) left a comment:

Looks like we're going in the right direction to me. Thanks for doing more in-depth testing! It might be annoying, but it's definitely worse to find errors post-deploy.

@ayirr7 ayirr7 changed the base branch from enoch/remove-topics-enum to master October 20, 2022 18:17
@ayirr7 ayirr7 requested a review from nikhars October 31, 2022 07:03
lynnagara (Member) left a comment:

Thanks, I think this is looking better. My main comments are around consistency. We need to make the slices behave the same as slice 0; currently there are too many subtle differences.

  • Passing slice=0 should work, and it should behave the same as slice=None (i.e., not passing a slice ID). Currently it behaves in pretty unexpected ways, and you will probably get something broken if you pass slice=0.
  • We need to support topic overrides, and the same features in general, for slices. Why can't you override a topic when using a slice? Code for the various slices should behave the same; these small differences are very unexpected.
  • Validation should be done in the shared KafkaTopicSpec so it runs automatically every time we fetch the slice data, not implemented and called independently in every separate entrypoint, of which there will be many.
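The consistency points above could look roughly like this in a shared spec. KafkaTopicSpec here is a simplified stand-in for Snuba's class, and SLICE_COUNT is a hypothetical settings entry:

```python
# Illustrative sketch of the review feedback: slice_id=0 resolves the
# same as slice_id=None, and slice validation lives in the shared spec
# rather than in each entrypoint. Names are stand-ins, not Snuba's API.
from typing import Optional

SLICE_COUNT = {"snuba-generic-metrics": 2}  # hypothetical per-topic slice counts

class KafkaTopicSpec:
    def __init__(self, logical_topic: str) -> None:
        self.logical_topic = logical_topic

    def get_physical_topic_name(self, slice_id: Optional[int] = None) -> str:
        # Validation runs on every fetch, so individual entrypoints
        # (consumer CLI, etc.) don't need to reimplement it.
        if slice_id is not None:
            count = SLICE_COUNT.get(self.logical_topic, 1)
            if not 0 <= slice_id < count:
                raise ValueError(
                    f"invalid slice {slice_id} for {self.logical_topic}"
                )
        # Slice 0 (or no slice) behaves like the unsliced default.
        if slice_id is None or slice_id == 0:
            return self.logical_topic
        return f"{self.logical_topic}-{slice_id}"
```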

```diff
@@ -109,29 +129,42 @@ def __init__(

 stream_loader = self.storage.get_table_writer().get_stream_loader()

-self.raw_topic: Topic
+self.raw_topic: ArroyoTopic
```
lynnagara (Member):

Personal preference, I'd rather you just left all these names.

ayirr7 (Member, Author):

The issue with leaving the names is the potential confusion between the (Snuba) Topic and the (Arroyo) Topic. If both are named Topic (as they originally were), the code becomes confusing to work with in future iterations.

lynnagara (Member):

I think the issue is that the other topic (SnubaTopic) is being imported here, but that should never be the case. It should always be accessed via the stream loader and KafkaTopicSpec mechanism to ensure that topic resolution is done correctly.

ayirr7 (Member, Author):

This makes sense. I no longer have those imports. Shall I change ArroyoTopic back to Topic?

lynnagara (Member) commented Nov 5, 2022:

Yes, that seems good to me. It feels a little more consistent with what the other files do.

lynnagara (Member):

Bumping this. Can you please revert the change?

nikhars (Member) left a comment:

LGTM

@ayirr7 ayirr7 enabled auto-merge (squash) November 14, 2022 18:08
@ayirr7 ayirr7 requested a review from lynnagara November 14, 2022 18:09
@ayirr7 ayirr7 merged commit 3603555 into master Nov 14, 2022
@ayirr7 ayirr7 deleted the sliced-consumers-cli branch November 14, 2022 19:32