Shovel: handle `connection.(un)blocked` messages #6224

gomoripeti · 2022-10-21T21:34:56Z

Proposed Changes

This is a follow-up of #5715 to allow a shovel to be blocked not just by credit-flow but also by an explicit connection.blocked message. This is in hope so that less messages would be in flight in buffers (which can be lost if a connection is closed) when the destination cluster has resource alarms. This mostly targets on-publish mode, but works for the other ack-modes as well.

(EDIT: the below part has been moved out of the scope of this PR

Also now that shovels can be blocked for different reasons this is also reflected on the UI shovel status page.

Currently there is a known limitation that if the shovel is idle after
it was unblocked by credit flow, then the status won't get updated until
the next message. I would appreciate some advice if this needs to be fixed and how.

Historically why does credit_flow:state/0 have a state-change-interval delay? Is it because of performance optimisation or is it for UX so that even a very short period of flow state won't get unnoticed on the mgmt UI?

As a solution I can imagine

starting a timer every time a shovel transitions from flow to running and reporting the status after 1 second (I'm concerned about the performance impact of starting and canceling a lot of timers)

don't use credit_flow:state/0 but store credit_blocked_at in the shovel status table, and replicate the logic at lookup time (this sounds more efficient but duplicates code)

do nothing, keep the limitation)

TODO make sure docs are up-to-date

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

Bug fix (non-breaking change which fixes issue #NNNN)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause an observable behavior change in existing systems)
Documentation improvements (corrections, new content, etc)
Cosmetic change (whitespace, formatting, etc)
Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

I have read the CONTRIBUTING.md document
I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
I have added tests that prove my fix is effective or that my feature works
All tests pass locally with my changes
If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

gomoripeti · 2022-10-21T21:52:44Z

Indeed in a mixed version cluster a new node can return a shovel status (flow or blocked) that an old node cannot format.

There is also a failing test case in the shovel_management plugin suite that I need to fix.

michaelklishin

Builds fail because rabbit_amqp091_shovel:reset_pending/1 is unused and our build system treats warnings in the RabbitMQ's own code as errors.

Once that is fixed, there are Dialyzer errors.

michaelklishin · 2022-10-23T07:53:19Z

compile: warnings being treated as errors
rabbit_amqp091_shovel.erl:406:1: function reset_pending/1 is unused

rabbit_shovel_worker.erl:59:10: Record construction
          #state{type :: 'dynamic' | 'static',
                 last_reported_status :: 'undefined'} violates the declared type of field last_reported_status ::
          'blocked' | 'flow' | 'running'

gomoripeti · 2022-10-23T11:05:18Z

I will move the "report flow/blocked status" part to a separate PR as there are aspects of it I did not think of

should work in mixed version clusters
should address the "state-change-interval delay" of credit flow
should not send shovel_worker_status rabbit events for every flow/blocked/running status change

gomoripeti · 2022-10-23T15:04:01Z

the dynamic_SUITE fails across multiple jobs, so it must be a legitimate failure, but it does not fail for me if I just run this suite locally with make -C deps/rabbitmq_shovel ct-dynamic. Is there any way to see some more details about the failures from the CI?

michaelklishin · 2022-10-23T20:56:11Z

CI uses Bazel.

bazel test --config=local //deps/rabbitmq_shovel:dynamic_SUITE

bazel test --config=local //deps/rabbitmq_shovel:all

My user.bazelrc in the repo looks like this:

build:local --@rules_erlang//:erlang_home=/path/to/otp/25.1
build:local --@rules_erlang//:erlang_version=25.1
build:local --//:elixir_home=/path/to/.kiex/elixirs/elixir-1.13.4/lib/elixir/

# rabbitmqctl wait shells out to 'ps', which is broken in the bazel macOS
# sandbox (https://github.com/bazelbuild/bazel/issues/7448)
# adding "--spawn_strategy=local" to the invocation is a workaround
build --spawn_strategy=local

# --experimental_strict_action_env breaks memory size detection on macOS,
# so turn it off for local runs
build --noexperimental_strict_action_env
build:buildbuddy --experimental_strict_action_env

# don't re-run flakes automatically on the local machine
build --flaky_test_attempts=1

build:buildbuddy --remote_header=x-buildbuddy-api-key=YOUR_API_KEY

# cross compile for linux (if on macOS) with rbe
# build:rbe --host_cpu=k8
# build:rbe --cpu=k8

build --@io_bazel_rules_docker//transitions:enable=false

michaelklishin · 2022-10-23T21:13:37Z

You can download build artifacts from the Build Buddy build results page.

It takes a while to figure out what goes where at first but here are the target run results: for dynamic_SUITE:

RabbitMQ node(s) in directory /buildbuddy/remotebuilds/4a322cf0-42f5-4381-b303-da3689b2ce4b/bazel-out/k8-fastbuild/testlogs/deps/rabbitmq_shovel/dynamic_SUITE/test.outputs/ct_run.ct-rabbitmq_shovel-dynamic_SUITE@localhost.2022-10-23_11.26.43/deps.rabbitmq_shovel.dynamic_SUITE.logs/run.2022-10-23_11.26.43/log_private
Waiting for pid file '/buildbuddy/remotebuilds/4a322cf0-42f5-4381-b303-da3689b2ce4b/bazel-out/k8-fastbuild/testlogs/deps/rabbitmq_shovel/dynamic_SUITE/test.outputs/ct_run.ct-rabbitmq_shovel-dynamic_SUITE@localhost.2022-10-23_11.26.43/deps.rabbitmq_shovel.dynamic_SUITE.logs/run.2022-10-23_11.26.43/log_private//rmq-ct-dynamic_SUITE-1-21000@localhost/rmq-ct-dynamic_SUITE-1-21000@localhost.pid' to appear
pid is 289
Waiting for erlang distribution on node 'rmq-ct-dynamic_SUITE-1-21000@localhost' while OS process '289' is running
Waiting for applications 'rabbit_and_plugins' to start on node 'rmq-ct-dynamic_SUITE-1-21000@localhost'
Applications 'rabbit_and_plugins' are running on node 'rmq-ct-dynamic_SUITE-1-21000@localhost'
Exit code: 0 (pid <0.113.0>)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
rabbit_top_util:obtain_name failed
Reason: {undef,[{rabbit_top_util,...},{...}|...]}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 14 of 19 ***
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
gen_server:call failed on line 382
Reason: {{shutdown,{server_initiated_close,...}},{gen_server,call,...}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 15 of 19 ***
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
gen_server:call failed on line 382
Reason: {{shutdown,{server_initiated_close,...}},{gen_server,call,...}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 16 of 19 ***
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
gen_server:call failed on line 382
Reason: {{shutdown,{server_initiated_close,...}},{gen_server,call,...}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 17 of 19 ***

and then there are CT logs to download.

michaelklishin · 2022-10-23T21:15:20Z

Finally, if you want to switch to Bazel from erlang.mk, you have to either gmake distclean then wipe deps and then git reset --hard or use a separate clone.

Bazel does not assume that external dependencies would be stored under deps/.

michaelklishin · 2022-10-23T21:57:13Z

The undef about rabbit_top_util is due to a missing runtime dependency in deps/rabbitmq_shovel/BUILD.bazel:

# this is WRONG and just an example, we want to add a test dependency of some kind
RUNTIME_DEPS = [
    "//deps/amqp10_client:erlang_app",
    "//deps/rabbitmq_top:erlang_app",
]

fixed it for me.

gomoripeti · 2022-10-23T23:06:34Z

thank you very much for all the pointers and details, they are very helpful

the use of rabbit_top_util:obtain_name is really not necessary, and could be considered a leftover from my debugging the test case which I should remove.

michaelklishin · 2022-10-24T05:00:37Z

Hm, the idea to add rabbitmq_top to runtime dependencies in deps/rabbitmq_shovel/BUILD.bazel is obvious a terrible one. We should add a test dependency equivalent.

Also rework shovel credit_flow testcase to be more deterministic.

The pending queue can grow up to prefetch count (which is 1000 by default and can be more) while it is forwarded in chunks every time a bump_credit grants more credit to the shovel process (which is 200 by default). This change avoids trying to forward all pending entries just to put back most of them in the pending queue.

gomoripeti · 2022-10-24T18:38:07Z

I removed the rabbit_top_util call from the test suite and now all checks pass 🎉

michaelklishin · 2022-10-25T08:11:36Z

The change delay must have to do with the management UI minimum refresh rate but also the fact that credit flow can go in and out of effect dozens of times a second.

I don't think we should discuss what to do with it in comments in a PR that intentionally ignores this part of Shovel behavior :)

Shovel: handle `connection.(un)blocked` messages (backport #6224)

michaelklishin requested changes Oct 22, 2022

View reviewed changes

gomoripeti force-pushed the shovel_connection_blocked branch from 6d7c296 to 246bd62 Compare October 23, 2022 11:19

gomoripeti changed the title ~~Shovel: handle connection.(un)blocked messages and report flow/blocked status~~ Shovel: handle connection.(un)blocked messages Oct 23, 2022

gomoripeti added 2 commits October 24, 2022 19:06

Shovel: handle connection.(un)blocked messages from dest broker

836dfc4

Also rework shovel credit_flow testcase to be more deterministic.

gomoripeti force-pushed the shovel_connection_blocked branch from 246bd62 to 68e931a Compare October 24, 2022 17:14

michaelklishin approved these changes Oct 25, 2022

View reviewed changes

michaelklishin added the backport-v3.11.x label Oct 25, 2022

michaelklishin merged commit f4a7d2f into rabbitmq:main Oct 25, 2022

mergify bot mentioned this pull request Oct 25, 2022

Shovel: handle connection.(un)blocked messages (backport #6224) #6235

Merged

michaelklishin added this to the 3.11.3 milestone Oct 25, 2022

michaelklishin added a commit that referenced this pull request Oct 25, 2022

Merge pull request #6235 from rabbitmq/mergify/bp/v3.11.x/pr-6224

bdbed63

Shovel: handle `connection.(un)blocked` messages (backport #6224)

the-mikedavis mentioned this pull request Oct 1, 2025

Block direct shovel publishes during alarms #14657

Closed

Shovel: handle connection.(un)blocked messages #6224

Shovel: handle connection.(un)blocked messages #6224

Uh oh!

Conversation

gomoripeti commented Oct 21, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed Changes

Types of Changes

Checklist

Further Comments

Uh oh!

gomoripeti commented Oct 21, 2022

Uh oh!

michaelklishin left a comment

Choose a reason for hiding this comment

Uh oh!

michaelklishin commented Oct 23, 2022

Uh oh!

gomoripeti commented Oct 23, 2022

Uh oh!

gomoripeti commented Oct 23, 2022

Uh oh!

michaelklishin commented Oct 23, 2022

Uh oh!

michaelklishin commented Oct 23, 2022

Uh oh!

michaelklishin commented Oct 23, 2022

Uh oh!

michaelklishin commented Oct 23, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gomoripeti commented Oct 23, 2022

Uh oh!

michaelklishin commented Oct 24, 2022

Uh oh!

gomoripeti commented Oct 24, 2022

Uh oh!

michaelklishin commented Oct 25, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Shovel: handle `connection.(un)blocked` messages #6224

Shovel: handle `connection.(un)blocked` messages #6224

gomoripeti commented Oct 21, 2022 •

edited

Loading

michaelklishin commented Oct 23, 2022 •

edited

Loading