ceph input plugin cluster metrics broken on Luminous #5277

Closed
dvx76 opened this issue Jan 10, 2019 · 3 comments · Fixed by #5466
Labels
bug unexpected problem or unintended behavior

dvx76 commented Jan 10, 2019

Relevant telegraf.conf:

[[inputs.ceph]]
  gather_cluster_stats = true
  gather_admin_socket_stats = false

System info:

$ telegraf --version
Telegraf 1.9.1 (git: HEAD 20636091)

$ ceph --version
ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)

Steps to reproduce:

  1. telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter ceph --test | grep ceph_usage
  2. All reported field values are 0
  3. Compare with the output of $ ceph df --format json

E.g.

$ telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter ceph --test | grep ceph_usage
2019-01-10T14:59:39Z I! Starting Telegraf 1.9.1
> ceph_usage,host=myhost.org total_avail=0,total_space=0,total_used=0 1547132380000000000

$ ceph df --format json
{"stats":{"total_bytes":7181923516416,"total_used_bytes":20532953088,"total_avail_bytes":7161390563328},"pools":[]}

Expected behavior:

ceph_usage measurement is reported with the correct values from the ceph df output.

Actual behavior:

All reported field values are 0

Additional info:

It looks like this is a consequence of #4721. Perhaps that change was only tested against Ceph Mimic (13.x).

That new code expects the ceph df JSON to contain the fields total_space, total_used and total_avail, but in Ceph Luminous they are named total_bytes, total_used_bytes and total_avail_bytes, e.g.

$ ceph df --format json

{"stats":{"total_bytes":7181923516416,"total_used_bytes":20532953088,"total_avail_bytes":7161390563328},"pools":[]}

Note that this also changes the field names of the reported ceph_usage measurement. The current README for the ceph plugin also still shows the old output.

Oddly, when this new code unmarshals the JSON above it doesn't error, i.e. no decode failure is logged.
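
That is consistent with how Go's encoding/json behaves: unknown JSON keys are silently ignored and unmatched struct fields are left at their zero value, so a struct keyed on the old names decodes Luminous output without any error. A minimal standalone sketch (the cephDfStats struct below is hypothetical and only mirrors the field names described in this issue, not the plugin's actual types):

package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical struct using the field names the post-#4721 code expects;
// the real plugin types may differ.
type cephDfStats struct {
	TotalSpace float64 `json:"total_space"`
	TotalUsed  float64 `json:"total_used"`
	TotalAvail float64 `json:"total_avail"`
}

func main() {
	// Luminous-style `ceph df --format json` output from this issue.
	luminous := []byte(`{"stats":{"total_bytes":7181923516416,"total_used_bytes":20532953088,"total_avail_bytes":7161390563328},"pools":[]}`)

	var out struct {
		Stats cephDfStats `json:"stats"`
	}
	// Unknown keys are ignored and missing fields stay at zero, so this
	// returns a nil error and all totals equal to 0.
	err := json.Unmarshal(luminous, &out)
	fmt.Println(err, out.Stats) // <nil> {0 0 0}
}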

cc @spjmurray

danielnelson (Contributor) commented

@fdevaux Can you check if this worked in 1.8.x? There have been other reports of issues with Luminous #3387.

danielnelson added the bug label Jan 10, 2019
dvx76 (Author) commented Jan 11, 2019

@danielnelson It does indeed still work in 1.8. I didn't check which release #4721 shipped in; I only looked at the date of the PR.

1.9 is definitely broken.

# telegraf --version
Telegraf v1.7.4 (git: release-1.7 578db7ef)

# telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter ceph --test | grep ceph_usage
> ceph_usage,host=monitor1 total_avail_bytes=10690818048,total_bytes=10725863424,total_used_bytes=35045376 1547230678000000000

# telegraf --version
Telegraf 1.8.3 (git: HEAD f2979106)

# telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter ceph --test | grep ceph_usage
> ceph_usage,host=monitor1 total_avail_bytes=10690818048,total_bytes=10725863424,total_used_bytes=35045376 1547230802000000000

# telegraf --version
Telegraf 1.9.2 (git: HEAD dda80799)

# telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter ceph --test | grep ceph_usage
2019-01-11T18:21:31Z I! Starting Telegraf 1.9.2
> ceph_usage,host=monitor1 total_avail=0,total_space=0,total_used=0 1547230892000000000

FWIW, regarding #3387: pool_stats seems to be partially working, with the same problem of field name mismatches. E.g. read_bytes_sec exists, so it gets populated, but op_per_sec does not exist, so its value is always 0 (Luminous reports read_op_per_sec and write_op_per_sec instead).

So basically I think the code in the Telegraf ceph plugin can't make too many assumptions about the exact JSON format, since it can change across Ceph releases.
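
One way to tolerate the renamed counters would be to decode client_io_rate into a map and fall back to summing the split counters when the combined one is absent. This is only an illustrative sketch using a Luminous-style entry like the one in the output further below, not what the plugin currently does:

package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Luminous-style entry from `ceph osd pool stats --format json`.
	entry := []byte(`{"pool_name":"cinder","client_io_rate":{"read_bytes_sec":13981182,"write_bytes_sec":198101251,"read_op_per_sec":2400,"write_op_per_sec":5376}}`)

	var pool struct {
		Name         string             `json:"pool_name"`
		ClientIORate map[string]float64 `json:"client_io_rate"`
	}
	if err := json.Unmarshal(entry, &pool); err != nil {
		panic(err)
	}

	// Prefer the older combined counter; fall back to the split counters
	// that Luminous reports instead.
	opPerSec, ok := pool.ClientIORate["op_per_sec"]
	if !ok {
		opPerSec = pool.ClientIORate["read_op_per_sec"] + pool.ClientIORate["write_op_per_sec"]
	}
	fmt.Println(pool.Name, opPerSec) // cinder 7776
}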

$ telegraf --config /etc/telegraf/telegraf.conf --config-directory /etc/telegraf/telegraf.d --input-filter ceph --test | grep ceph_pool_stats
2019-01-11T18:30:30Z I! Starting Telegraf 1.9.1
> ceph_pool_stats,host=monitor1,name=cinder op_per_sec=0,read_bytes_sec=11896469,recovering_bytes_per_sec=0,recovering_keys_per_sec=0,recovering_objects_per_sec=0,write_bytes_sec=151387551 1547231431000000000
> ceph_pool_stats,host=monitor1,name=glance op_per_sec=0,read_bytes_sec=58009735,recovering_bytes_per_sec=0,recovering_keys_per_sec=0,recovering_objects_per_sec=0,write_bytes_sec=0 1547231431000000000

$ ceph osd pool stats --format json | jq
[
  {
    "pool_name": "cinder",
    "pool_id": 1,
    "recovery": {},
    "recovery_rate": {},
    "client_io_rate": {
      "read_bytes_sec": 13981182,
      "write_bytes_sec": 198101251,
      "read_op_per_sec": 2400,
      "write_op_per_sec": 5376
    }
  },
  {
    "pool_name": "glance",
    "pool_id": 2,
    "recovery": {},
    "recovery_rate": {},
    "client_io_rate": {
      "read_bytes_sec": 24735412,
      "read_op_per_sec": 587,
      "write_op_per_sec": 0
    }
  },

danielnelson added this to the 1.9.3 milestone Jan 11, 2019
danielnelson modified the milestones: 1.9.3, 1.9.4 Jan 22, 2019
danielnelson modified the milestones: 1.9.4, 1.10.0 Feb 4, 2019
glinton (Contributor) commented Feb 20, 2019

So basically I think the code in the Telegraf ceph plugin can't make too many assumptions about the exact JSON format, since it can change across Ceph releases.

Agreed, and luckily the output format is documented by Ceph. We'll have to revert some of the changes in #4721 to be more generic again, or determine how to handle each version's output.
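
A sketch of the more generic direction, assuming the simplest approach of decoding the stats block into a map and accepting whichever key a given Ceph release emits. The pickTotal helper is hypothetical and this is not necessarily what #5466 ended up doing:

package main

import (
	"encoding/json"
	"fmt"
)

// pickTotal is a hypothetical helper: it returns the first of the given
// keys present in the stats map, so one decode path can serve both the
// Luminous names (total_bytes, ...) and the total_space/total_used/
// total_avail names the current code expects.
func pickTotal(stats map[string]float64, keys ...string) float64 {
	for _, k := range keys {
		if v, ok := stats[k]; ok {
			return v
		}
	}
	return 0
}

func main() {
	luminous := []byte(`{"stats":{"total_bytes":7181923516416,"total_used_bytes":20532953088,"total_avail_bytes":7161390563328},"pools":[]}`)

	var df struct {
		Stats map[string]float64 `json:"stats"`
	}
	if err := json.Unmarshal(luminous, &df); err != nil {
		panic(err)
	}

	// Prints the total, used, and available bytes regardless of which
	// naming scheme the JSON used.
	fmt.Println(
		pickTotal(df.Stats, "total_bytes", "total_space"),
		pickTotal(df.Stats, "total_used_bytes", "total_used"),
		pickTotal(df.Stats, "total_avail_bytes", "total_avail"),
	)
}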
