Skip to content

Conversation

@gomoripeti
Copy link
Contributor

@gomoripeti gomoripeti commented Oct 21, 2022

Proposed Changes

This is a follow-up of #5715 to allow a shovel to be blocked not just by credit-flow but also by an explicit connection.blocked message. This is in hope so that less messages would be in flight in buffers (which can be lost if a connection is closed) when the destination cluster has resource alarms. This mostly targets on-publish mode, but works for the other ack-modes as well.

(EDIT: the below part has been moved out of the scope of this PR

Also now that shovels can be blocked for different reasons this is also reflected on the UI shovel status page.

Currently there is a known limitation that if the shovel is idle after
it was unblocked by credit flow, then the status won't get updated until
the next message. I would appreciate some advice if this needs to be fixed and how.

Historically why does credit_flow:state/0 have a state-change-interval delay? Is it because of performance optimisation or is it for UX so that even a very short period of flow state won't get unnoticed on the mgmt UI?

As a solution I can imagine

  • starting a timer every time a shovel transitions from flow to running and reporting the status after 1 second (I'm concerned about the performance impact of starting and canceling a lot of timers)
  • don't use credit_flow:state/0 but store credit_blocked_at in the shovel status table, and replicate the logic at lookup time (this sounds more efficient but duplicates code)
  • do nothing, keep the limitation)

TODO make sure docs are up-to-date

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

@gomoripeti
Copy link
Contributor Author

Indeed in a mixed version cluster a new node can return a shovel status (flow or blocked) that an old node cannot format.

There is also a failing test case in the shovel_management plugin suite that I need to fix.

Copy link
Collaborator

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Builds fail because rabbit_amqp091_shovel:reset_pending/1 is unused and our build system treats warnings in the RabbitMQ's own code as errors.

Once that is fixed, there are Dialyzer errors.

@michaelklishin
Copy link
Collaborator

compile: warnings being treated as errors
rabbit_amqp091_shovel.erl:406:1: function reset_pending/1 is unused
rabbit_shovel_worker.erl:59:10: Record construction
          #state{type :: 'dynamic' | 'static',
                 last_reported_status :: 'undefined'} violates the declared type of field last_reported_status ::
          'blocked' | 'flow' | 'running'

@gomoripeti
Copy link
Contributor Author

I will move the "report flow/blocked status" part to a separate PR as there are aspects of it I did not think of

  • should work in mixed version clusters
  • should address the "state-change-interval delay" of credit flow
  • should not send shovel_worker_status rabbit events for every flow/blocked/running status change

@gomoripeti gomoripeti force-pushed the shovel_connection_blocked branch from 6d7c296 to 246bd62 Compare October 23, 2022 11:19
@gomoripeti gomoripeti changed the title Shovel: handle connection.(un)blocked messages and report flow/blocked status Shovel: handle connection.(un)blocked messages Oct 23, 2022
@gomoripeti
Copy link
Contributor Author

the dynamic_SUITE fails across multiple jobs, so it must be a legitimate failure, but it does not fail for me if I just run this suite locally with make -C deps/rabbitmq_shovel ct-dynamic. Is there any way to see some more details about the failures from the CI?

@michaelklishin
Copy link
Collaborator

CI uses Bazel.

bazel test --config=local //deps/rabbitmq_shovel:dynamic_SUITE
bazel test --config=local //deps/rabbitmq_shovel:all

My user.bazelrc in the repo looks like this:

build:local --@rules_erlang//:erlang_home=/path/to/otp/25.1
build:local --@rules_erlang//:erlang_version=25.1
build:local --//:elixir_home=/path/to/.kiex/elixirs/elixir-1.13.4/lib/elixir/

# rabbitmqctl wait shells out to 'ps', which is broken in the bazel macOS
# sandbox (https://github.com/bazelbuild/bazel/issues/7448)
# adding "--spawn_strategy=local" to the invocation is a workaround
build --spawn_strategy=local

# --experimental_strict_action_env breaks memory size detection on macOS,
# so turn it off for local runs
build --noexperimental_strict_action_env
build:buildbuddy --experimental_strict_action_env

# don't re-run flakes automatically on the local machine
build --flaky_test_attempts=1

build:buildbuddy --remote_header=x-buildbuddy-api-key=YOUR_API_KEY

# cross compile for linux (if on macOS) with rbe
# build:rbe --host_cpu=k8
# build:rbe --cpu=k8

build --@io_bazel_rules_docker//transitions:enable=false

@michaelklishin
Copy link
Collaborator

You can download build artifacts from the Build Buddy build results page.

It takes a while to figure out what goes where at first but here are the target run results: for dynamic_SUITE:

RabbitMQ node(s) in directory /buildbuddy/remotebuilds/4a322cf0-42f5-4381-b303-da3689b2ce4b/bazel-out/k8-fastbuild/testlogs/deps/rabbitmq_shovel/dynamic_SUITE/test.outputs/ct_run.ct-rabbitmq_shovel-dynamic_SUITE@localhost.2022-10-23_11.26.43/deps.rabbitmq_shovel.dynamic_SUITE.logs/run.2022-10-23_11.26.43/log_private
Waiting for pid file '/buildbuddy/remotebuilds/4a322cf0-42f5-4381-b303-da3689b2ce4b/bazel-out/k8-fastbuild/testlogs/deps/rabbitmq_shovel/dynamic_SUITE/test.outputs/ct_run.ct-rabbitmq_shovel-dynamic_SUITE@localhost.2022-10-23_11.26.43/deps.rabbitmq_shovel.dynamic_SUITE.logs/run.2022-10-23_11.26.43/log_private//rmq-ct-dynamic_SUITE-1-21000@localhost/rmq-ct-dynamic_SUITE-1-21000@localhost.pid' to appear
pid is 289
Waiting for erlang distribution on node 'rmq-ct-dynamic_SUITE-1-21000@localhost' while OS process '289' is running
Waiting for applications 'rabbit_and_plugins' to start on node 'rmq-ct-dynamic_SUITE-1-21000@localhost'
Applications 'rabbit_and_plugins' are running on node 'rmq-ct-dynamic_SUITE-1-21000@localhost'
Exit code: 0 (pid <0.113.0>)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
rabbit_top_util:obtain_name failed
Reason: {undef,[{rabbit_top_util,...},{...}|...]}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 14 of 19 ***
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
gen_server:call failed on line 382
Reason: {{shutdown,{server_initiated_close,...}},{gen_server,call,...}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 15 of 19 ***
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
gen_server:call failed on line 382
Reason: {{shutdown,{server_initiated_close,...}},{gen_server,call,...}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 16 of 19 ***
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
gen_server:call failed on line 382
Reason: {{shutdown,{server_initiated_close,...}},{gen_server,call,...}}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Testing deps.rabbitmq_shovel.dynamic_SUITE: *** FAILED test case 17 of 19 ***

and then there are CT logs to download.

@michaelklishin
Copy link
Collaborator

Finally, if you want to switch to Bazel from erlang.mk, you have to either gmake distclean then wipe deps and then git reset --hard or use a separate clone.

Bazel does not assume that external dependencies would be stored under deps/.

@michaelklishin
Copy link
Collaborator

michaelklishin commented Oct 23, 2022

The undef about rabbit_top_util is due to a missing runtime dependency in deps/rabbitmq_shovel/BUILD.bazel:

# this is WRONG and just an example, we want to add a test dependency of some kind
RUNTIME_DEPS = [
    "//deps/amqp10_client:erlang_app",
    "//deps/rabbitmq_top:erlang_app",
]

fixed it for me.

@gomoripeti
Copy link
Contributor Author

thank you very much for all the pointers and details, they are very helpful

the use of rabbit_top_util:obtain_name is really not necessary, and could be considered a leftover from my debugging the test case which I should remove.

@michaelklishin
Copy link
Collaborator

Hm, the idea to add rabbitmq_top to runtime dependencies in deps/rabbitmq_shovel/BUILD.bazel is obvious a terrible one. We should add a test dependency equivalent.

Also rework shovel credit_flow testcase to be more deterministic.
The pending queue can grow up to prefetch count (which is 1000 by
default and can be more) while it is forwarded in chunks every time a
bump_credit grants more credit to the shovel process (which is 200 by
default). This change avoids trying to forward all pending entries just
to put back most of them in the pending queue.
@gomoripeti gomoripeti force-pushed the shovel_connection_blocked branch from 246bd62 to 68e931a Compare October 24, 2022 17:14
@gomoripeti
Copy link
Contributor Author

I removed the rabbit_top_util call from the test suite and now all checks pass 🎉

@michaelklishin
Copy link
Collaborator

The change delay must have to do with the management UI minimum refresh rate but also the fact that credit flow can go in and out of effect dozens of times a second.

I don't think we should discuss what to do with it in comments in a PR that intentionally ignores this part of Shovel behavior :)

@michaelklishin michaelklishin merged commit f4a7d2f into rabbitmq:main Oct 25, 2022
@michaelklishin michaelklishin added this to the 3.11.3 milestone Oct 25, 2022
michaelklishin added a commit that referenced this pull request Oct 25, 2022
Shovel: handle `connection.(un)blocked` messages (backport #6224)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants