
enable-rbd-metrics--test_ceph_rbd_metrics_available 4.15 #9506

Open

wants to merge 6 commits into base: master

Conversation

DanielOsypenko
Contributor

This PR was carried over to the master branch. It was originally tested against the old master branch, which was release-4.13.

copy from #8457

We need to solve a problem that frequently happens on External mode deployments: test_ceph_rbd_metrics_available fails because Ceph is not configured to enable RBD metrics.
My intention was to enable the metrics before the test and disable them afterwards.
More info about the problem is here -> https://bugzilla.redhat.com/show_bug.cgi?id=2237412
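The intended flow (enable the mgr Prometheus RBD settings before the test, restore them afterwards) can be sketched as follows. This is a minimal sketch that only builds the `oc rsh` command strings; the toolbox pod name and the helper names are assumptions for illustration, not the actual ocs-ci implementation. The config keys `mgr/prometheus/rbd_stats_pools` and `mgr/prometheus/exclude_perf_counters` are the ones discussed later in this PR.

```python
import shlex

NAMESPACE = "openshift-storage"
TOOLBOX_POD = "rook-ceph-tools"  # hypothetical pod name for illustration

def ceph_mgr_config_cmd(action, key, value=None):
    """Build the `oc rsh` command line that gets/sets/removes an mgr config key."""
    parts = ["oc", "-n", NAMESPACE, "rsh", TOOLBOX_POD,
             "ceph", "config", action, "mgr", key]
    if value is not None:
        parts.append(value)
    return shlex.join(parts)

def enable_rbd_stats_cmds():
    """Commands to enable per-image RBD stats for all pools ('*') and to
    keep perf counters exported."""
    return [
        ceph_mgr_config_cmd("set", "mgr/prometheus/rbd_stats_pools", "*"),
        ceph_mgr_config_cmd("set", "mgr/prometheus/exclude_perf_counters", "false"),
    ]

def restore_cmds(previous_value):
    """Restore the pre-test value, or drop the override if none was set."""
    if previous_value:
        return [ceph_mgr_config_cmd("set", "mgr/prometheus/rbd_stats_pools", previous_value)]
    return [ceph_mgr_config_cmd("rm", "mgr/prometheus/rbd_stats_pools")]
```

In a session-scoped pytest fixture, `enable_rbd_stats_cmds()` would run in setup and `restore_cmds()` in teardown, so the cluster is left as it was found.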

The test passes on an internal deployment -> http://pastebin.test.redhat.com/1109172

The test fails on my currently available External mode deployment:
I get an error when trying to run any 'ceph' command on the external mode cluster (for instance 'ceph -s').

via 'oc rsh' to external toolbox

{CommandFailed}Error during execution of command: oc -n openshift-storage rsh rook-ceph-tools-6ccd65499-cwplh ceph config get mgr mgr/prometheus/rbd_stats_pools --format json-pretty.
Error is:
2023-09-11T13:13:12.312+0000 7f815a6f3640 -1 auth: error parsing file /etc/ceph/keyring: error setting modifier for [client.admin] type=key val=admin-secret: Malformed input
2023-09-11T13:13:12.312+0000 7f815a6f3640 -1 auth: failed to load /etc/ceph/keyring: (5) Input/output error
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: error parsing file /etc/ceph/keyring: error setting modifier for [client.admin] type=key val=admin-secret: Malformed input
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: failed to load /etc/ceph/keyring: (5) Input/output error
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: error parsing file /etc/ceph/keyring: error setting modifier for [client.admin] type=key val=admin-secret: Malformed input
2023-09-11T13:13:12.317+0000 7f815a6f3640 -1 auth: failed to load /etc/ceph/keyring: ...

via 'oc debug' to external toolbox

ceph -s
Error initializing cluster client: ObjectNotFound('RADOS object not found (error calling conf_read_file)')

@ocs-ci ocs-ci left a comment

PR validation on existing cluster

Cluster Name: vavuthuextdevyp1
Cluster Configuration:
PR Test Suite: tier1
PR Test Path: tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py::TestCephMonitoringAvailable
Additional Test Params:
OCP VERSION: 4.15
OCS VERSION: 4.15
tested against branch: master

Job UNSTABLE (some or all tests failed).


@DanielOsypenko
Contributor Author

List of metrics for the test test_ceph_metrics_available that are still unavailable:

'ceph_bluestore_state_aio_wait_lat_sum',
'ceph_paxos_store_state_latency_sum',
'ceph_osd_op_out_bytes',
'ceph_bluestore_txc_submit_lat_sum',
'ceph_paxos_commit',
'ceph_paxos_new_pn_latency_count',
'ceph_osd_op_r_process_latency_count',
'ceph_bluestore_txc_submit_lat_count',
'ceph_bluestore_kv_final_lat_sum',
'ceph_paxos_collect_keys_sum',
'ceph_paxos_accept_timeout',
'ceph_paxos_begin_latency_count',
'ceph_bluefs_wal_total_bytes',
'ceph_paxos_refresh',
'ceph_bluestore_read_lat_count',
'ceph_mon_num_sessions',
'ceph_bluefs_bytes_written_wal',
'ceph_mon_num_elections',
'ceph_rocksdb_compact',
'ceph_bluestore_kv_sync_lat_sum',
'ceph_osd_op_process_latency_count',
'ceph_osd_op_w_prepare_latency_count',
'ceph_paxos_begin_latency_sum',
'ceph_osd_op_r',
'ceph_osd_op_rw_prepare_latency_sum',
'ceph_paxos_new_pn',
'ceph_rocksdb_get_latency_count',
'ceph_paxos_commit_latency_count',
'ceph_bluestore_txc_throttle_lat_count',
'ceph_paxos_lease_ack_timeout',
'ceph_bluestore_txc_commit_lat_sum',
'ceph_paxos_collect_bytes_sum',
'ceph_osd_op_rw_latency_count',
'ceph_paxos_collect_uncommitted',
'ceph_osd_op_rw_latency_sum',
'ceph_paxos_share_state',
'ceph_osd_op_r_prepare_latency_sum',
'ceph_bluestore_kv_flush_lat_sum',
'ceph_osd_op_rw_process_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_count',
'ceph_paxos_collect_latency_count',
'ceph_osd_op_rw_prepare_latency_count',
'ceph_paxos_collect_latency_sum',
'ceph_rocksdb_rocksdb_write_delay_time_count',
'ceph_paxos_begin_bytes_sum',
'ceph_osd_numpg',
'ceph_osd_stat_bytes',
'ceph_rocksdb_submit_sync_latency_sum',
'ceph_rocksdb_compact_queue_merge',
'ceph_paxos_collect_bytes_count',
'ceph_osd_op',
'ceph_paxos_commit_keys_sum',
'ceph_osd_op_rw_in_bytes',
'ceph_osd_op_rw_out_bytes',
'ceph_bluefs_bytes_written_sst',
'ceph_osd_op_rw_process_latency_count',
'ceph_rocksdb_compact_queue_len',
'ceph_bluestore_txc_throttle_lat_sum',
'ceph_bluefs_slow_used_bytes',
'ceph_osd_op_r_latency_sum',
'ceph_bluestore_kv_flush_lat_count',
'ceph_rocksdb_compact_range',
'ceph_osd_op_latency_sum',
'ceph_mon_session_add',
'ceph_paxos_share_state_keys_count',
'ceph_paxos_collect',
'ceph_osd_op_w_in_bytes',
'ceph_osd_op_r_process_latency_sum',
'ceph_paxos_start_peon',
'ceph_mon_session_trim',
'ceph_rocksdb_get_latency_sum',
'ceph_osd_op_rw',
'ceph_paxos_store_state_keys_count',
'ceph_rocksdb_rocksdb_write_delay_time_sum',
'ceph_osd_recovery_ops',
'ceph_bluefs_logged_bytes',
'ceph_bluefs_db_total_bytes',
'ceph_osd_op_w_latency_count',
'ceph_bluestore_txc_commit_lat_count',
'ceph_bluestore_state_aio_wait_lat_count',
'ceph_paxos_begin_bytes_count',
'ceph_paxos_start_leader',
'ceph_mon_election_call',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_count',
'ceph_mon_session_rm',
'ceph_paxos_store_state',
'ceph_paxos_store_state_bytes_count',
'ceph_osd_op_w_latency_sum',
'ceph_rocksdb_submit_latency_count',
'ceph_paxos_commit_latency_sum',
'ceph_rocksdb_rocksdb_write_memtable_time_sum',
'ceph_paxos_share_state_bytes_sum',
'ceph_osd_op_process_latency_sum',
'ceph_paxos_begin_keys_sum',
'ceph_rocksdb_rocksdb_write_pre_and_post_time_sum',
'ceph_bluefs_wal_used_bytes',
'ceph_rocksdb_rocksdb_write_wal_time_sum',
'ceph_osd_op_wip',
'ceph_paxos_lease_timeout',
'ceph_osd_op_r_out_bytes',
'ceph_paxos_begin_keys_count',
'ceph_bluestore_kv_sync_lat_count',
'ceph_osd_op_prepare_latency_count',
'ceph_bluefs_bytes_written_slow',
'ceph_rocksdb_submit_latency_sum',
'ceph_osd_op_r_latency_count',
'ceph_paxos_share_state_keys_sum',
'ceph_paxos_store_state_bytes_sum',
'ceph_osd_op_latency_count',
'ceph_paxos_commit_bytes_count',
'ceph_paxos_restart',
'ceph_bluefs_slow_total_bytes',
'ceph_paxos_collect_timeout',
'ceph_osd_op_w_process_latency_sum',
'ceph_paxos_collect_keys_count',
'ceph_paxos_share_state_bytes_count',
'ceph_osd_op_w_prepare_latency_sum',
'ceph_bluestore_read_lat_sum',
'ceph_osd_stat_bytes_used',
'ceph_paxos_begin',
'ceph_mon_election_win',
'ceph_osd_op_w_process_latency_count',
'ceph_rocksdb_rocksdb_write_wal_time_count',
'ceph_paxos_store_state_keys_sum',
'ceph_osd_numpg_removing',
'ceph_paxos_commit_keys_count',
'ceph_paxos_new_pn_latency_sum',
'ceph_osd_op_in_bytes',
'ceph_paxos_store_state_latency_count',
'ceph_paxos_refresh_latency_count',
'ceph_osd_op_r_prepare_latency_count',
'ceph_bluefs_num_files',
'ceph_mon_election_lose',
'ceph_osd_op_prepare_latency_sum',
'ceph_bluefs_db_used_bytes',
'ceph_bluestore_kv_final_lat_count',
'ceph_paxos_refresh_latency_sum',
'ceph_osd_recovery_bytes',
'ceph_osd_op_w',
'ceph_paxos_commit_bytes_sum',
'ceph_bluefs_log_bytes',
'ceph_rocksdb_submit_sync_latency_count'

Consulting with Awan Thakkar


@pytest.fixture(scope="session")
def enable_rbd_metrics(request):
    ct_pod = pod.get_ceph_tools_pod()
Contributor

Shouldn't we add a condition for external mode only?

Contributor Author

Good question. Even though the fixture restores the values of exclude_perf_counters and rbd_stats_pools afterwards, changing them could mask a regression bug where these values are not configured by default.
I will add a skip for this fixture on non-external mode clusters.
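A gate of that shape might look like the following sketch. The config-dict layout is a hypothetical stand-in for illustration; ocs-ci keeps deployment details in its own framework config, so the real check would read from there:

```python
def is_external_mode(cluster_config):
    """Return True when the cluster was deployed in external mode.
    The dict shape here is an assumed stand-in for the framework config."""
    return bool(cluster_config.get("DEPLOYMENT", {}).get("external_mode", False))

def maybe_enable_rbd_metrics(cluster_config):
    """Only touch mgr settings on external mode clusters; on internal mode
    ODF manages these defaults itself, so overriding them could mask a
    regression where the defaults are wrong."""
    if not is_external_mode(cluster_config):
        return "skipped: internal mode, defaults managed by ODF"
    return "enabled"
```

In pytest terms this would back a `skipif` marker or an early `pytest.skip()` inside the fixture, so internal mode runs never modify the mgr configuration.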


This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the lifecycle/stale No recent activity label Jun 29, 2024
@DanielOsypenko DanielOsypenko removed the lifecycle/stale No recent activity label Jul 1, 2024

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the lifecycle/stale No recent activity label Sep 29, 2024

This pull request has been automatically closed due to inactivity. Please re-open if these changes are still required.

@github-actions github-actions bot closed this Oct 29, 2024
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
…unters=false

Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
…unters=false 0.1

Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
@DanielOsypenko DanielOsypenko force-pushed the enable-rbd-metrics--test_ceph_rbd_metrics_available-415 branch from 998712d to 4eea528 Compare November 19, 2024 15:23
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
Signed-off-by: Daniel Osypenko <dosypenk@redhat.com>
@DanielOsypenko
Contributor Author

DanielOsypenko commented Nov 19, 2024

Even after configuring the external Ceph cluster to report metrics, we still see that some of them are unavailable:

2024-11-19 17:38:46  >       assert list_of_metrics_without_results == [], msg
2024-11-19 17:38:46  E       AssertionError: OCS Monitoring should provide some value(s) for tested rbd metrics, so that the list of metrics without results is empty.
2024-11-19 17:38:46  E       assert ['ceph_rbd_wr...atency_count'] == []
2024-11-19 17:38:46  E         Left contains 6 more items, first extra item: 'ceph_rbd_write_ops'
2024-11-19 17:38:46  E         Full diff:
2024-11-19 17:38:46  E           [
2024-11-19 17:38:46  E         -  ,
2024-11-19 17:38:46  E         +  'ceph_rbd_write_ops',
2024-11-19 17:38:46  E         +  'ceph_rbd_read_ops',
2024-11-19 17:38:46  E         +  'ceph_rbd_write_bytes',
2024-11-19 17:38:46  E         +  'ceph_rbd_read_bytes',
2024-11-19 17:38:46  E         +  'ceph_rbd_write_latency_sum',
2024-11-19 17:38:46  E         +  'ceph_rbd_write_latency_count',
2024-11-19 17:38:46  E           ]
2024-11-19 17:38:46  
2024-11-19 17:38:46  tests/functional/monitoring/prometheus/metrics/test_monitoring_defaults.py:143: AssertionError

https://url.corp.redhat.com/bbe0c24

Hello @fbalak
I remember we discussed metrics-related tests on an external cluster; this was my attempt, which is not fully successful.
I think I'd rather not invest more time in this and would add skip_if_external_mode for the following reasons:

In the past I talked with Awan Thakkar and he did not have a quick answer, being unsure whether it is even possible to make all metrics available. He also stated that we do not show metrics to external users: it is not supported by ODF and has never been a part of the ODF product.

I also think that on an internal mode cluster ODF manages all mgr settings by default so that the Ceph cluster broadcasts metrics. Trying to make the Ceph storage expose metrics through our own manual actions means:

  1. being dependent on the Ceph version, and not easily maintainable;
  2. doubtful benefit, since it is not user behavior and not a part of the ODF product;
  3. doubts whether any bug could be opened based on such metrics, since they would rely on odf-qe custom settings.

Question: what if I add skip_if_external_mode on the external mode metrics tests? They account for approximately 80% of the test failures in my test ownership.

@github-actions github-actions bot removed the lifecycle/stale No recent activity label Nov 19, 2024
@fbalak
Contributor

fbalak commented Nov 25, 2024

Ok, we can add those markers until we resolve how it should work consistently.


openshift-ci bot commented Nov 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: DanielOsypenko, fbalak

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
