rpk: add rpk debug remote-bundle; collect a cluster-wide bundle #23986

Merged 2 commits on Nov 26, 2024

r-vasquez
Contributor

@r-vasquez r-vasquez commented Nov 1, 2024

This PR adds a new command, rpk debug remote-bundle, which lets the user collect a set of debug bundles from each node in the cluster. It uses the Admin API to do so. To collect the bundle, we have created 4 new commands:

  • rpk debug remote-bundle start
  • rpk debug remote-bundle download
  • rpk debug remote-bundle status
  • rpk debug remote-bundle cancel

Examples:

These commands are interactive by default. Each interactive command has its own --no-confirm flag to skip the confirmation prompt.

rpk debug remote-bundle start

BROKER
127.0.0.1:11644
127.0.0.1:10644
127.0.0.1:9644
? Confirm debug bundle collection from these brokers? Yes
BROKER           JOB-ID
127.0.0.1:10644  7fe70e0d-472b-48a7-bc8a-046d04658074  
127.0.0.1:11644  7fe70e0d-472b-48a7-bc8a-046d04658074  
127.0.0.1:9644   7fe70e0d-472b-48a7-bc8a-046d04658074  

The debug bundle collection process has started with Job-ID 7fe70e0d-472b-48a7-bc8a-046d04658074, to check the 
status, run:
  rpk debug remote-bundle status

rpk debug remote-bundle download

BROKER           STATUS   JOB-ID
127.0.0.1:9644   success  7fe70e0d-472b-48a7-bc8a-046d04658074
127.0.0.1:11644  success  7fe70e0d-472b-48a7-bc8a-046d04658074
127.0.0.1:10644  success  7fe70e0d-472b-48a7-bc8a-046d04658074
? Confirm debug bundle download from these brokers? Yes
BROKER           DOWNLOADED
127.0.0.1:9644   true  
127.0.0.1:10644  true  
127.0.0.1:11644  true  

Successfully downloaded remote debug bundle to 1730484634-remote-bundle.zip

rpk debug remote-bundle status

BROKER           STATUS   JOB-ID
127.0.0.1:9644   running  7fe70e0d-472b-48a7-bc8a-046d04658074
127.0.0.1:11644  running  7fe70e0d-472b-48a7-bc8a-046d04658074
127.0.0.1:10644  running  7fe70e0d-472b-48a7-bc8a-046d04658074

After the process is completed, you may retrieve the debug bundle using:
  rpk debug remote-bundle download
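
The status and download outputs suggest a simple readiness rule: download once every broker reports `success`. The sketch below shows that check under this assumption. The `BrokerStatus` type and `readyToDownload` function are illustrative names, not rpk's own, and how rpk treats partially successful jobs is not stated in this PR.

```go
package main

import "fmt"

// BrokerStatus mirrors one row of the status table shown above.
// The type and field names are illustrative, not rpk's own.
type BrokerStatus struct {
	Broker string
	Status string // e.g. "running" or "success", as seen in the output
	JobID  string
}

// readyToDownload reports whether every broker finished successfully,
// and returns the brokers that are not yet in the "success" state.
// An empty input is treated as not ready.
func readyToDownload(rows []BrokerStatus) (bool, []string) {
	var pending []string
	for _, r := range rows {
		if r.Status != "success" {
			pending = append(pending, r.Broker)
		}
	}
	return len(rows) > 0 && len(pending) == 0, pending
}

func main() {
	rows := []BrokerStatus{
		{"127.0.0.1:9644", "success", "7fe70e0d-472b-48a7-bc8a-046d04658074"},
		{"127.0.0.1:11644", "running", "7fe70e0d-472b-48a7-bc8a-046d04658074"},
	}
	ok, pending := readyToDownload(rows)
	fmt.Println(ok, pending) // false [127.0.0.1:11644]
}
```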

rpk debug remote-bundle cancel

BROKER           STATUS   JOB-ID
127.0.0.1:10644  running  863d7e4f-f581-4f0e-9f3f-40b73dc7dcce
127.0.0.1:9644   running  863d7e4f-f581-4f0e-9f3f-40b73dc7dcce
127.0.0.1:11644  running  863d7e4f-f581-4f0e-9f3f-40b73dc7dcce
? Confirm debug bundle cancel from these brokers? Yes
BROKER           CANCELED
127.0.0.1:10644  true  
127.0.0.1:11644  true  
127.0.0.1:9644   true  

Additional work will follow to collect everything in a single command, and to let the user clean up the current cluster.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Features

  • rpk: Introduce rpk debug remote-bundle to gather a debug bundle from a remote cluster.

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 1, 2024

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/57483#0192e917-5e91-479f-8c4c-ab61be9a455b:

"rptest.tests.control_character_flag_test.ControlCharacterPermittedAfterUpgrade.test_upgrade_from_pre_v23_2.initial_version=.22.2.9"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/57483#0192e945-a73d-4fa9-8cce-8cec8bf9bc8a:

"rptest.tests.cluster_config_test.ClusterConfigAliasTest.test_aliasing_with_upgrade.wipe_cache=False.prop_set=PropertyAliasData.primary_name=.log_retention_ms.aliased_name=.delete_retention_ms.redpanda_version=.23.3.test_values=.1000000.300000.500000.expect_restart=False"

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58586#01935535-e046-4f01-824d-6d44fde8f537:

"rptest.tests.rpk_debug_bundle_test.RpkDebugBundleTest.test_debug_bundle" (listed 14 times)

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58586#01935535-e047-4325-a0d6-ce6629d58d5d:

"rptest.tests.rpk_debug_bundle_test.RpkDebugBundleTest.test_debug_bundle" (listed 10 times)

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58586#01935535-e047-47c9-8964-6eb7305ce2e1:

"rptest.tests.rpk_debug_bundle_test.RpkDebugBundleTest.test_debug_bundle" (listed 7 times)

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58586#01935539-a0d5-43a5-9a85-e64ac24cc400:

"rptest.tests.rpk_debug_bundle_test.RpkDebugBundleTest.test_debug_bundle" (listed 13 times)

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58586#01935539-a0d5-4042-b034-25f13450ca49:

"rptest.tests.rpk_debug_bundle_test.RpkDebugBundleTest.test_debug_bundle" (listed 10 times)

non flaky failures in https://buildkite.com/redpanda/redpanda/builds/58586#01935539-a0d6-4f6b-b276-7505301c6581:

"rptest.tests.rpk_debug_bundle_test.RpkDebugBundleTest.test_debug_bundle" (listed 11 times)

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 1, 2024

Retry command for Build#57483

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/control_character_flag_test.py::ControlCharacterPermittedAfterUpgrade.test_upgrade_from_pre_v23_2@{"initial_version":[22,2,9]}
tests/rptest/tests/cluster_config_test.py::ClusterConfigAliasTest.test_aliasing_with_upgrade@{"prop_set":["log_retention_ms","delete_retention_ms",[23,3],[1000000,300000,500000],false],"wipe_cache":false}

@vbotbuildovich
Collaborator

vbotbuildovich commented Nov 1, 2024

// InstallFlags installs the debug bundle flags that fills the debug bundle
// options.
func (o *DebugBundleSharedOptions) InstallFlags(f *pflag.FlagSet) {
f.StringVar(&o.ControllerLogsSizeLimit, "controller-logs-size-limit", "132MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")

Suggested change
- f.StringVar(&o.ControllerLogsSizeLimit, "controller-logs-size-limit", "132MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
+ f.StringVar(&o.ControllerLogsSizeLimit, "controller-logs-size-limit", "132MB", "The size limit of the controller logs that can be stored in the bundle. For example: 3MB, 1GiB.")

Contributor Author

Fixed, except for the trailing period: we don't add periods to our flag help text, which we found to be a common pattern in other CLIs.

f.StringVar(&o.ControllerLogsSizeLimit, "controller-logs-size-limit", "132MB", "The size limit of the controller logs that can be stored in the bundle (e.g. 3MB, 1GiB)")
f.DurationVar(&o.CPUProfilerWait, "cpu-profiler-wait", 30*time.Second, "For how long to collect samples for the CPU profiler (e.g. 30s, 1.5m). Must be higher than 15s")
f.StringVar(&o.LogsSizeLimit, "logs-size-limit", "100MiB", "Read the logs until the given size is reached (e.g. 3MB, 1GiB)")
f.StringVar(&o.LogsSince, "logs-since", "yesterday", "Include logs dated from specified date onward; (journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'). Refer to journalctl documentation for more options")

Suggested change
- f.StringVar(&o.LogsSince, "logs-since", "yesterday", "Include logs dated from specified date onward; (journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'). Refer to journalctl documentation for more options")
+ f.StringVar(&o.LogsSince, "logs-since", "yesterday", "Include logs dated from specified date onward. For example: journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'. See the journalctl documentation for more options.")

Contributor Author

What do you think of keeping just "journalctl date format"? It is the only accepted format, not an example of how to pass the flag.

f.StringVar(&o.LogsSizeLimit, "logs-size-limit", "100MiB", "Read the logs until the given size is reached (e.g. 3MB, 1GiB)")
f.StringVar(&o.LogsSince, "logs-since", "yesterday", "Include logs dated from specified date onward; (journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'). Refer to journalctl documentation for more options")
f.StringVar(&o.LogsUntil, "logs-until", "", "Include logs older than the specified date; (journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'). Refer to journalctl documentation for more options")
f.DurationVar(&o.MetricsInterval, "metrics-interval", 10*time.Second, "Interval between metrics snapshots (e.g. 30s, 1.5m)")

Suggested change
- f.DurationVar(&o.MetricsInterval, "metrics-interval", 10*time.Second, "Interval between metrics snapshots (e.g. 30s, 1.5m)")
+ f.DurationVar(&o.MetricsInterval, "metrics-interval", 10*time.Second, "Interval between metrics snapshots. For example: 30s, 1.5m.")

Contributor Author

Changed.

One comment, though, on "e.g." vs. "For example" that is worth taking into consideration:

We try to keep the flag help text as short as possible because some commands are already cramped. That's why we used both "e.g." and parentheses, so the examples are easier to spot. Take this one (rpk debug bundle) as an example:

Now:
[image]

After the changes:
[image]

Contributor

I think this is one of the cases where the CLI style guide conflicts with the docs style guide. In this case, stay consistent with your own style guide. cc @micheleRP

f.StringVar(&o.LogsSince, "logs-since", "yesterday", "Include logs dated from specified date onward; (journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'). Refer to journalctl documentation for more options")
f.StringVar(&o.LogsUntil, "logs-until", "", "Include logs older than the specified date; (journalctl date format: YYYY-MM-DD, 'yesterday', or 'today'). Refer to journalctl documentation for more options")
f.DurationVar(&o.MetricsInterval, "metrics-interval", 10*time.Second, "Interval between metrics snapshots (e.g. 30s, 1.5m)")
f.IntVar(&o.MetricsSampleCount, "metrics-samples", 2, "Number of metrics samples to take (at the interval of --metrics-interval). Must be >= 2")

Suggested change
- f.IntVar(&o.MetricsSampleCount, "metrics-samples", 2, "Number of metrics samples to take (at the interval of --metrics-interval). Must be >= 2")
+ f.IntVar(&o.MetricsSampleCount, "metrics-samples", 2, "Number of metrics samples to take (at the interval of --metrics-interval). Must be >= 2.")

Contributor Author

Same comment as above regarding periods.
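
The help text in this thread states two constraints: --cpu-profiler-wait "must be higher than 15s" and --metrics-samples "must be >= 2". Here is a minimal sketch of checking both before starting a bundle. `validateBundleOptions` is a hypothetical function, and whether these checks are enforced in rpk or by the Admin API is not stated here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validateBundleOptions is a hypothetical check for the two constraints
// stated in the help text: --cpu-profiler-wait must be higher than 15s
// and --metrics-samples must be >= 2. It reports every violation at once.
func validateBundleOptions(cpuProfilerWait time.Duration, metricsSamples int) error {
	var errs []error
	if cpuProfilerWait <= 15*time.Second {
		errs = append(errs, fmt.Errorf("--cpu-profiler-wait must be higher than 15s, got %v", cpuProfilerWait))
	}
	if metricsSamples < 2 {
		errs = append(errs, fmt.Errorf("--metrics-samples must be >= 2, got %d", metricsSamples))
	}
	// errors.Join returns nil when no errors were collected.
	return errors.Join(errs...)
}

func main() {
	// The defaults from the diff (30s, 2 samples) pass validation.
	fmt.Println(validateBundleOptions(30*time.Second, 2))
	// Both constraints violated: two error lines are printed.
	fmt.Println(validateBundleOptions(10*time.Second, 1))
}
```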

'rpk debug remote-bundle status' and download when is ready with
'rpk debug remote-bundle download'.

The flag '--no-confirm' can be used to avoid the confirmation prompt.

Suggested change
- The flag '--no-confirm' can be used to avoid the confirmation prompt.
+ Use the flag '--no-confirm' to avoid the confirmation prompt.

gene-redpanda
gene-redpanda previously approved these changes Nov 4, 2024
Contributor

@gene-redpanda gene-redpanda left a comment

LGTM

@r-vasquez
Contributor Author

@dotnwat Thanks, good catch! Indeed, I had added the Bazel changes to the last commit. Fixed 👍

metricsSampleCount int
cpuProfilerWait time.Duration
timeout time.Duration
opts common.DebugBundleSharedOptions
Contributor

I like this model, it would definitely make sense for topic stuff as well.
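
The model being praised here is a shared options struct that installs its own flags, so several commands can reuse the same definitions and defaults. The sketch below illustrates the idea using the standard library `flag` package as a stand-in for pflag, to keep it self-contained. `SharedOptions` is a trimmed, illustrative version of `common.DebugBundleSharedOptions`, and `parse` is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// SharedOptions is a trimmed stand-in for common.DebugBundleSharedOptions,
// keeping only three of the fields visible in the diff.
type SharedOptions struct {
	LogsSizeLimit      string
	MetricsInterval    time.Duration
	MetricsSampleCount int
}

// InstallFlags registers the shared flags on any flag set, so several
// commands can reuse the same definitions and defaults.
func (o *SharedOptions) InstallFlags(f *flag.FlagSet) {
	f.StringVar(&o.LogsSizeLimit, "logs-size-limit", "100MiB", "Read the logs until the given size is reached (e.g. 3MB, 1GiB)")
	f.DurationVar(&o.MetricsInterval, "metrics-interval", 10*time.Second, "Interval between metrics snapshots (e.g. 30s, 1.5m)")
	f.IntVar(&o.MetricsSampleCount, "metrics-samples", 2, "Number of metrics samples to take (at the interval of --metrics-interval). Must be >= 2")
}

// parse is a hypothetical helper: it builds a throwaway flag set for a
// named command, installs the shared flags, and parses args.
func parse(cmd string, args []string) (SharedOptions, error) {
	var opts SharedOptions
	fs := flag.NewFlagSet(cmd, flag.ContinueOnError)
	opts.InstallFlags(fs)
	err := fs.Parse(args)
	return opts, err
}

func main() {
	// Two different commands share one set of flag definitions.
	local, _ := parse("bundle", nil)
	remote, _ := parse("remote-bundle start", []string{"--metrics-interval", "30s"})
	fmt.Println(local.MetricsInterval, remote.MetricsInterval) // 10s 30s
}
```

With this shape, adding a flag to the struct makes it available to every command that calls InstallFlags, which is presumably why the reviewer sees it as a fit for topic commands too.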

bojand
bojand previously approved these changes Nov 21, 2024
Contributor

@michael-redpanda michael-redpanda left a comment

In your ducktape tests, please provide a human-readable error message for every assert.

tests/rptest/tests/rpk_debug_bundle_test.py (outdated, resolved)
@r-vasquez
Contributor Author

Force Push:

  • Add assert failure messages

bojand
bojand previously approved these changes Nov 21, 2024
twmb
twmb previously approved these changes Nov 22, 2024
@r-vasquez r-vasquez dismissed michael-redpanda’s stale review November 22, 2024 17:05

Fixed in the last push: #23986 (comment), but GitHub didn't reset the review request.

@r-vasquez
Contributor Author

/ci-repeat 3
skip-units
dt-repeat=20
tests/rptest/tests/rpk_debug_bundle_test.py::RpkDebugBundleTest

@redpanda-data redpanda-data deleted a comment from vbotbuildovich Nov 22, 2024
@r-vasquez
Contributor Author

/ci-repeat 3
skip-units
dt-repeat=20
tests/rptest/tests/rpk_debug_bundle_test.py::RpkDebugBundleTest.test_remote_debug_bundle_default

These are options that are suitable for sharing
with Debug Remote Bundle
@r-vasquez r-vasquez dismissed stale reviews from twmb and bojand via fc3cd21 November 22, 2024 23:02
@r-vasquez
Contributor Author

r-vasquez commented Nov 22, 2024

Force Push

  • I found an issue in the test that we moved from cluster_test.py to this new file: the bundle name is now generated from a random string to avoid possible races.
  • Rebased onto dev to avoid the build-bazel error.
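
The first point, using a random string for the bundle name so that concurrent or repeated runs do not collide, can be illustrated as follows. The actual fix lives in the ducktape (Python) test; this is only a Go sketch of the same idea, and `randomBundleName`, the "bundle-" prefix, and the suffix length are arbitrary choices made for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// randomBundleName returns a name of the form "bundle-<16 hex chars>.zip".
// The "bundle-" prefix and 8-byte suffix are arbitrary choices for this sketch.
func randomBundleName() (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("reading random bytes: %w", err)
	}
	return "bundle-" + hex.EncodeToString(buf) + ".zip", nil
}

func main() {
	a, err := randomBundleName()
	if err != nil {
		panic(err)
	}
	b, err := randomBundleName()
	if err != nil {
		panic(err)
	}
	// Two runs get distinct names, so they cannot race on the same file.
	fmt.Println(a, b, a != b)
}
```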

Note:

The newly added test had a successful 60x run: https://buildkite.com/redpanda/redpanda/builds/58607#_. I'm going to do another 60x run after the normal CI passes.

Fixes DEVEX-44

This commit introduces the rpk debug remote bundle
command, which allows the user to request a debug
bundle using the Admin API.
@r-vasquez
Contributor Author

/ci-repeat 3
skip-units
dt-repeat=20
tests/rptest/tests/rpk_debug_bundle_test.py::RpkDebugBundleTest

@r-vasquez r-vasquez merged commit dc5e858 into redpanda-data:dev Nov 26, 2024
23 checks passed
@r-vasquez
Contributor Author

/backport v24.3.x

9 participants