
enable-rbd-metrics--test_ceph_rbd_metrics_available 4.15 #9506

Open

wants to merge 6 commits into base: master

Conversation

DanielOsypenko
Contributor

This PR was carried over to the master branch. It was originally tested against the old master branch, which was release-4.13.

copy from #8457

We need to solve a problem that frequently happens on External mode deployments: test_ceph_rbd_metrics_available fails because Ceph is not configured to enable RBD metrics.
My intention was to enable the metrics before the test and disable them afterwards.
More info about the problem is here -> https://bugzilla.redhat.com/show_bug.cgi?id=2237412
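The intended flow (enable the mgr Prometheus RBD settings before the test, restore them afterwards) can be sketched as follows. This is a minimal sketch that only builds the `oc rsh` command strings; the toolbox pod name and the helper names are assumptions for illustration, not the actual ocs-ci implementation. The config keys `mgr/prometheus/rbd_stats_pools` and `mgr/prometheus/exclude_perf_counters` are the ones discussed later in this PR.

```python
import shlex

NAMESPACE = "openshift-storage"
TOOLBOX_POD = "rook-ceph-tools"  # hypothetical pod name for illustration

def ceph_mgr_config_cmd(action, key, value=None):
    """Build the `oc rsh` command line that gets/sets/removes an mgr config key."""
    parts = ["oc", "-n", NAMESPACE, "rsh", TOOLBOX_POD,
             "ceph", "config", action, "mgr", key]
    if value is not None:
        parts.append(value)
    return shlex.join(parts)

def enable_rbd_stats_cmds():
    """Commands to enable per-image RBD stats for all pools ('*') and to
    keep perf counters exported."""
    return [
        ceph_mgr_config_cmd("set", "mgr/prometheus/rbd_stats_pools", "*"),
        ceph_mgr_config_cmd("set", "mgr/prometheus/exclude_perf_counters", "false"),
    ]

def restore_cmds(previous_value):
    """Restore the pre-test value, or drop the override if none was set."""
    if previous_value:
        return [ceph_mgr_config_cmd("set", "mgr/prometheus/rbd_stats_pools", previous_value)]
    return [ceph_mgr_config_cmd("rm", "mgr/prometheus/rbd_stats_pools")]
```

In a session-scoped pytest fixture, `enable_rbd_stats_cmds()` would run in setup and `restore_cmds()` in teardown, so the cluster is left as it was found.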

The test passes on an internal deployment -> http://pastebin.test.redhat.com/1109172

The test fails on my currently available External mode deployment:
I get an error when trying to run any 'ceph' command on the external mode cluster (for instance 'ceph -s').

via 'oc rsh' to external toolbox

{CommandFailed}Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-6ccd65499-cwplh ceph config get mgr mgr/prometheus/rbd_stats_pools --format json-pretty.
Error is:
2023-09-11T13:13:12.312+0000 7f815a6f3640 -1 auth: error parsing file /etc/ceph/keyring: error setting modifier for [client.admin] type=key val=admin-secret: Malformed input
2023-09-11T13:13:12.312+0000 7f815a6f3640 -1 auth: failed to load /etc/ceph/keyring: (5) Input/output error
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: error parsing file /etc/ceph/keyring: error setting modifier for [client.admin] type=key val=admin-secret: Malformed input
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: failed to load /etc/ceph/keyring: (5) Input/output error
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: error parsing file /etc/ceph/keyring: error setting modifier for [client.admin] type=key val=admin-secret: Malformed input
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: failed to load /etc/ceph/keyring: ...

via 'oc debug' to external toolbox

ceph -s
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vavuthuextdevyp1
Cluster Configuration:
PR Test Suite: tier1
PR Test Path: tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py::TestCephMonitoringAvailable
Additional Test Params:
OCP VERSION: 4.15
OCS VERSION: 4.15
tested against branch: master

Job UNSTABLE (some or all tests failed).


@DanielOsypenko
Contributor Author

List of metrics for the test test_ceph_metrics_available that are still unavailable:

'ceph_bluestore_state_aio_wait_lat_sum',
'ceph_paxos_store_state_latency_sum',
'ceph_osd_op_out_bytes',
'ceph_bluestore_txc_submit_lat_sum',
'ceph_paxos_commit',
'ceph_paxos_new_pn_latency_count',
'ceph_osd_op_r_process_latency_count',
'ceph_bluestore_txc_submit_lat_count',
'ceph_bluestore_kv_final_lat_sum',
'ceph_paxos_collect_keys_sum',
'ceph_paxos_accept_timeout',
'ceph_paxos_begin_latency_count',
'ceph_bluefs_wal_total_bytes',
'ceph_paxos_refresh',
'ceph_bluestore_read_lat_count',
'ceph_mon_num_sessions',
'ceph_bluefs_bytes_written_wal',
'ceph_mon_num_elections',
'ceph_rocksdb_compact',
'ceph_bluestore_kv_sync_lat_sum',
'ceph_osd_op_process_latency_count',
'ceph_osd_op_w_prepare_latency_count',
'ceph_paxos_begin_latency_sum',
'ceph_osd_op_r',
'ceph_osd_op_rw_prepare_latency_sum',
'ceph_paxos_new_pn',
'ceph_rocksdb_get_latency_count',
'ceph_paxos_commit_latency_count',
'ceph_bluestore_txc_throttle_lat_count',
'ceph_paxos_lease_ack_timeout',
'ceph_bluestore_txc_commit_lat_sum',
'ceph_paxos_collect_bytes_sum',
'ceph_osd_op_rw_latency_count',
'ceph_paxos_collect_uncommitted',
'ceph_osd_op_rw_latency_sum',
'ceph_paxos_share_state',
'ceph_osd_op_r_prepare_latency_sum',
'ceph_bluestore_kv_flush_lat_sum',
'ceph_osd_op_rw_process_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_count',
'ceph_paxos_collect_latency_count',
'ceph_osd_op_rw_prepare_latency_count',
'ceph_paxos_collect_latency_sum',
'ceph_rocksdb_rocksdb_write_delay_time_count',
'ceph_paxos_begin_bytes_sum',
'ceph_osd_numpg',
'ceph_osd_stat_bytes',
'ceph_rocksdb_submit_sync_latency_sum',
'ceph_rocksdb_compact_queue_merge',
'ceph_paxos_collect_bytes_count',
'ceph_osd_op',
'ceph_paxos_commit_keys_sum',
'ceph_osd_op_rw_in_bytes',
'ceph_osd_op_rw_out_bytes',
'ceph_bluefs_bytes_written_sst',
'ceph_osd_op_rw_process_latency_count',
'ceph_rocksdb_compact_queue_len',
'ceph_bluestore_txc_throttle_lat_sum',
'ceph_bluefs_slow_used_bytes',
'ceph_osd_op_r_latency_sum',
'ceph_bluestore_kv_flush_lat_count',
'ceph_rocksdb_compact_range',
'ceph_osd_op_latency_sum',
'ceph_mon_session_add',
'ceph_paxos_share_state_keys_count',
'ceph_paxos_collect',
'ceph_osd_op_w_in_bytes',
'ceph_osd_op_r_process_latency_sum',
'ceph_paxos_start_peon',
'ceph_mon_session_trim',
'ceph_rocksdb_get_latency_sum',
'ceph_osd_op_rw',
'ceph_paxos_store_state_keys_count',
'ceph_rocksdb_rocksdb_write_delay_time_sum',
'ceph_osd_recovery_ops',
'ceph_bluefs_logged_bytes',
'ceph_bluefs_db_total_bytes',
'ceph_osd_op_w_latency_count',
'ceph_bluestore_txc_commit_lat_count',
'ceph_bluestore_state_aio_wait_lat_count',
'ceph_paxos_begin_bytes_count',
'ceph_paxos_start_leader',
'ceph_mon_election_call',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_count',
'ceph_mon_session_rm',
'ceph_paxos_store_state',
'ceph_paxos_store_state_bytes_count',
'ceph_osd_op_w_latency_sum',
'ceph_rocksdb_submit_latency_count',
'ceph_paxos_commit_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_sum',
'ceph_paxos_share_state_bytes_sum',
'ceph_osd_op_process_latency_sum',
'ceph_paxos_begin_keys_sum',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum',
'ceph_bluefs_wal_used_bytes',
'ceph_rocksdb_rocksdb_write_wal_time_sum',
'ceph_osd_op_wip',
'ceph_paxos_lease_timeout',
'ceph_osd_op_r_out_bytes',
'ceph_paxos_begin_keys_count',
'ceph_bluestore_kv_sync_lat_count',
'ceph_osd_op_prepare_latency_count',
'ceph_bluefs_bytes_written_slow',
'ceph_rocksdb_submit_latency_sum',
'ceph_osd_op_r_latency_count',
'ceph_paxos_share_state_keys_sum',
'ceph_paxos_store_state_bytes_sum',
'ceph_osd_op_latency_count',
'ceph_paxos_commit_bytes_count',
'ceph_paxos_restart',
'ceph_bluefs_slow_total_bytes',
'ceph_paxos_collect_timeout',
'ceph_osd_op_w_process_latency_sum',
'ceph_paxos_collect_keys_count',
'ceph_paxos_share_state_bytes_count',
'ceph_osd_op_w_prepare_latency_sum',
'ceph_bluestore_read_lat_sum',
'ceph_osd_stat_bytes_used',
'ceph_paxos_begin',
'ceph_mon_election_win',
'ceph_osd_op_w_process_latency_count',
'ceph_rocksdb_rocksdb_write_wal_time_count',
'ceph_paxos_store_state_keys_sum',
'ceph_osd_numpg_removing',
'ceph_paxos_commit_keys_count',
'ceph_paxos_new_pn_latency_sum',
'ceph_osd_op_in_bytes',
'ceph_paxos_store_state_latency_count',
'ceph_paxos_refresh_latency_count',
'ceph_osd_op_r_prepare_latency_count',
'ceph_bluefs_num_files',
'ceph_mon_election_lose',
'ceph_osd_op_prepare_latency_sum',
'ceph_bluefs_db_used_bytes',
'ceph_bluestore_kv_final_lat_count',
'ceph_paxos_refresh_latency_sum',
'ceph_osd_recovery_bytes',
'ceph_osd_op_w',
'ceph_paxos_commit_bytes_sum',
'ceph_bluefs_log_bytes',
'ceph_rocksdb_submit_sync_latency_count'

Consulting with Awan Thakkar


@pytest.fixture(scope="session")
def enable_rbd_metrics(request):
    ct_pod = pod.get_ceph_tools_pod()
Contributor

Shouldn't we add a condition for external mode only?

Contributor Author

Good question. Even though the fixture restores the values of exclude_perf_counters and rbd_stats_pools afterwards, changing them could mask a regression bug where these values are not configured by default.
I will add a skip for this fixture on non-external mode clusters.
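A gate of that shape might look like the following sketch. The config-dict layout is a hypothetical stand-in for illustration; ocs-ci keeps deployment details in its own framework config, so the real check would read from there:

```python
def is_external_mode(cluster_config):
    """Return True when the cluster was deployed in external mode.
    The dict shape here is an assumed stand-in for the framework config."""
    return bool(cluster_config.get("DEPLOYMENT", {}).get("external_mode", False))

def maybe_enable_rbd_metrics(cluster_config):
    """Only touch mgr settings on external mode clusters; on internal mode
    ODF manages these defaults itself, so overriding them could mask a
    regression where the defaults are wrong."""
    if not is_external_mode(cluster_config):
        return "skipped: internal mode, defaults managed by ODF"
    return "enabled"
```

In pytest terms this would back a `skipif` marker or an early `pytest.skip()` inside the fixture, so internal mode runs never modify the mgr configuration.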


This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the lifecycle/stale No recent activity label Jun 29, 2024
@DanielOsypenko DanielOsypenko removed the lifecycle/stale No recent activity label Jul 1, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the lifecycle/stale No recent activity label Sep 29, 2024

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

@github-actions github-actions bot closed this Oct 29, 2024
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
…unters=false

Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
…unters=false 0.1

Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
@DanielOsypenko DanielOsypenko force-pushed the enable-rbd-metrics--test_ceph_rbd_metrics_available-415 branch from 998712d to 4eea528 Compare November 19, 2024 15:23
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
@DanielOsypenko
Contributor Author

DanielOsypenko commented Nov 19, 2024

Even after configuring the external Ceph cluster to report metrics, we still see that some of them are unavailable:

2024-11-19 17:38:46  >       assert list_of_metrics_without_results == [], msg
2024-11-19 17:38:46  E       AssertionError: OCS Monitoring should provide some value(s) for tested rbd metrics, so that the list of metrics without results is empty.
2024-11-19 17:38:46  E       assert ['ceph_rbd_wr...atency_count'] == []
2024-11-19 17:38:46  E         Left contains 6 more items, first extra item: 'ceph_rbd_write_ops'
2024-11-19 17:38:46  E         Full diff:
2024-11-19 17:38:46  E           [
2024-11-19 17:38:46  E         -  ,
2024-11-19 17:38:46  E         +  'ceph_rbd_write_ops',
2024-11-19 17:38:46  E         +  'ceph_rbd_read_ops',
2024-11-19 17:38:46  E         +  'ceph_rbd_write_bytes',
2024-11-19 17:38:46  E         +  'ceph_rbd_read_bytes',
2024-11-19 17:38:46  E         +  'ceph_rbd_write_latency_sum',
2024-11-19 17:38:46  E         +  'ceph_rbd_write_latency_count',
2024-11-19 17:38:46  E           ]
2024-11-19 17:38:46  
2024-11-19 17:38:46  tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py:143: AssertionError

https://url.corp.redhat.com/bbe0c24

Hello @fbalak
I remember we discussed metrics-related tests on an external cluster; this was my attempt, which is not fully successful.
I think I'd rather not invest more time in this and would add skip_if_external_mode for the following reasons:

In the past I talked with Awan Thakkar and he did not have a quick answer, being unsure whether it is even possible to make all metrics available. He also stated that we do not show metrics to external users: it is not supported by ODF and has never been a part of the ODF product.

I also think that on an internal mode cluster ODF manages all mgr settings by default so that the Ceph cluster broadcasts metrics. Trying to make the Ceph storage expose metrics through our own manual actions means:

  1. being dependent on the Ceph version, and not easily maintainable;
  2. doubtful benefit, since it is not user behavior and not a part of the ODF product;
  3. doubts whether any bug could be opened based on such metrics, since they would rely on odf-qe custom settings.

Question: what if I add skip_if_external_mode on the external mode metrics tests? They account for approximately 80% of the test failures in my test ownership.

@github-actions github-actions bot removed the lifecycle/stale No recent activity label Nov 19, 2024
@fbalak
Contributor

fbalak commented Nov 25, 2024

Ok, we can add those markers until we resolve how it should work consistently.


openshift-ci bot commented Nov 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: DanielOsypenko, fbalak

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
