Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RabbitMQ Input Plugin v1.19.0 unmarshal errors #9383

Closed
aslgithub opened this issue Jun 16, 2021 · 26 comments · Fixed by #9443
Closed

RabbitMQ Input Plugin v1.19.0 unmarshal errors #9383

aslgithub opened this issue Jun 16, 2021 · 26 comments · Fixed by #9443
Assignees
Labels
area/rabbitmq bug unexpected problem or unintended behavior regression something that used to work, but is now broken

Comments

@aslgithub
Copy link

aslgithub commented Jun 16, 2021

Relevant telegraf.conf:

[[inputs.rabbitmq]]
  ## Management Plugin url. (default: http://localhost:15672)
   url = "http://1.2.3.4:15672"
   username = "username"
   password = "password"

System info:

Windows Server 2019
Telegraf v1.19.0
RabbitMQ 3.8.14
Erlang 23.2.7

Steps to reproduce:

  1. Configure inputs.rabbitmq
  2. Run a test collection
    telegraf.exe --config "C:\Program Files\Telegraf\telegraf.conf" --test --input-filter rabbitmq --debug

Expected behavior:

Collection should complete without errors.

Actual behavior:

There are unmarshal errors within the collection

2021-06-16T16:17:01Z E! [inputs.rabbitmq] Error in plugin: json: cannot unmarshal object into Go value of type []rabbitmq.FederationLink
2021-06-16T16:17:01Z E! [inputs.rabbitmq] Error in plugin: json: cannot unmarshal number 0.0 into Go struct field Node.io_read_avg_time of type int64
2021-06-16T16:17:01Z E! [telegraf] Error running agent: input plugins recorded 2 errors

Additional info:

Telegraf v1.18.1 on the same system does not show the unmarshal errors.

@aslgithub aslgithub added the bug unexpected problem or unintended behavior label Jun 16, 2021
@ghost
Copy link

ghost commented Jun 18, 2021

I can confirm same thing is happening on Centos 7 with telegraf v1.19.0-1

@aslgithub
Copy link
Author

@srebhan Tagging you in on this to see if you can help please?

@srebhan
Copy link
Member

srebhan commented Jun 22, 2021

@aslgithub it seems like the output of RabbitMQ changed. 0.0 is not an integer value so we would have to change the struct field to float in the plugin. The other issue points to a change in the JSON structure (field added/removed).
A quick search did not reveal a changelog or anything that documents what changed in RabbitMQ and when. However, this is a required information to make the plugin work with both the former and current version... Do you know where to find such a document?

@aslgithub
Copy link
Author

@srebhan The output of RabbitMQ is not changing, I'm simply running the two different versions of Telegraf against the same version of RabbitMQ .
If this is RabbitMQ related then as you say I'm not sure how to track this.
For me it has to be some change in how Telegraf is handling {or not} the conversion maybe one of the go updates or #8969 ?

Taking one error as the example,
[inputs.rabbitmq] Error in plugin: json: cannot unmarshal number 0.0 into Go struct field Node.io_read_avg_time of type int64

v1.19.0 is failing to bring back rabbitmq_node
This in the source appears to be defined as part of the structure
IoReadAvgTime int64 json:"io_read_avg_time"

And is read from the /api/nodes
"io_read_avg_time": node.IoReadAvgTime,

My raw json output from the api is
"io_read_avg_time":0.0,

But Telegraf v1.18.0 was happy with this, but v1.19.0 obviously isn't happy..

The API is defined
https://rawcdn.githack.com/rabbitmq/rabbitmq-server/v3.8.17/deps/rabbitmq_management/priv/www/api/index.html

I've not found anywhere the field types are defined..yet..

@aslgithub
Copy link
Author

I give up ! I cannot find the answer, question submitted on the RabbitMQ Githhub discussion
rabbitmq/rabbitmq-server#3135

@reimda reimda added the regression something that used to work, but is now broken label Jun 23, 2021
@akrantz01 akrantz01 self-assigned this Jun 24, 2021
@helenosheaa
Copy link
Member

@akrantz01 and I were just pairing on this we are also able to replicate the problem. If we change the type on the struct of io_read_avg_time to float64 there is then the same error message for io_write_avg_time. Once thats updated the same error message for the memory.total field:

2021-06-24T21:14:20Z E! [inputs.rabbitmq] Error in plugin: json: cannot unmarshal object into Go value of type []rabbitmq.FederationLink
2021-06-24T21:14:20Z E! [inputs.rabbitmq] Error in plugin: json: cannot unmarshal object into Go struct field Memory.memory.total of type int64

I found this description here for the type of nodes.io_write_avg_time:

rabbitmq.nodes.io_write_avg_time | average wall time (milliseconds) for each disk write operation in the last statistics interval

I can confirm we're not seeing this problem on Telegraf v1.18.1.

Alex is now looking into what may have changed to cause these output types to change from RabbitMQ.

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

@helenosheaa I think the problem is here as the code changed

-       json.NewDecoder(resp.Body).Decode(target)
-
-       return nil
+       return json.NewDecoder(resp.Body).Decode(target)

now checking the returned error, while before those errors went unnoticed.

Can anyone of you @aslgithub or @helenosheaa provide some sample data from /api/nodes and /api/federation-links so I can fix the underlying problem(s)?

@aslgithub
Copy link
Author

@srebhan sure!
Starting with the easy one for me, we don't use Federation Links.
This data was not included in v1.18.1 output, but v1.19.0 is giving the error trying to include it.

/api/federation-links
{"error":"Object Not Found","reason":"Not Found"}

This data was included in v.1.18.1 output as rabbitmq_node without apparent errors, but v1.19.0 is giving the unmarshal error.
This is just a single node/server within the /nodes, also sorry its not very busy, will see if I can find a better one if you need it.

/api/nodes
[{"partitions":[],"os_pid":"8268","fd_total":65536,"sockets_total":58893,"mem_limit":1717546188,"mem_alarm":false,"disk_free_limit":50000000,"disk_free_alarm":false,"proc_total":1048576,"rates_mode":"basic","uptime":4136807904,"run_queue":1,"processors":4,"exchange_types":[{"name":"topic","description":"AMQP topic exchange, as per the AMQP specification","enabled":true},{"name":"fanout","description":"AMQP fanout exchange, as per the AMQP specification","enabled":true},{"name":"direct","description":"AMQP direct exchange, as per the AMQP specification","enabled":true},{"name":"headers","description":"AMQP headers exchange, as per the AMQP specification","enabled":true}],"auth_mechanisms":[{"name":"PLAIN","description":"SASL PLAIN authentication mechanism","enabled":true},{"name":"AMQPLAIN","description":"QPid AMQPLAIN mechanism","enabled":true},{"name":"RABBIT-CR-DEMO","description":"RabbitMQ Demo challenge-response authentication mechanism","enabled":false}],"applications":[{"name":"amqp_client","description":"RabbitMQ AMQP Client","version":"3.8.14"},{"name":"asn1","description":"The Erlang ASN1 compiler version 5.0.14","version":"5.0.14"},{"name":"aten","description":"Erlang node failure detector","version":"0.5.5"},{"name":"compiler","description":"ERTS CXC 138 10","version":"7.6.6"},{"name":"cowboy","description":"Small, fast, modern HTTP server.","version":"2.8.0"},{"name":"cowlib","description":"Support library for manipulating Web protocols.","version":"2.9.1"},{"name":"credentials_obfuscation","description":"Helper library that obfuscates sensitive values in process state","version":"2.4.0"},{"name":"crypto","description":"CRYPTO","version":"4.8.3"},{"name":"cuttlefish","description":"cuttlefish configuration abstraction","version":"2.6.0"},{"name":"gen_batch_server","description":"Generic batching server","version":"0.8.4"},{"name":"goldrush","description":"Erlang event stream processor","version":"0.1.9"},{"name":"inets","description":"INETS CXC 138 49","version":"7.3.2"},{"name":"jsx","description":"a streaming, evented json parsing toolkit","version":"2.11.0"},{"name":"kernel","description":"ERTS CXC 138 10","version":"7.2.1"},{"name":"lager","description":"Erlang logging framework","version":"3.8.2"},{"name":"mnesia","description":"MNESIA CXC 138 12","version":"4.18.1"},{"name":"observer_cli","description":"Visualize Erlang Nodes On The Command Line","version":"1.6.1"},{"name":"os_mon","description":"CPO CXC 138 46","version":"2.6.1"},{"name":"public_key","description":"Public key infrastructure","version":"1.9.2"},{"name":"ra","description":"Raft library","version":"1.1.8"},{"name":"rabbit","description":"RabbitMQ","version":"3.8.14"},{"name":"rabbit_common","description":"Modules shared by rabbitmq-server and rabbitmq-erlang-client","version":"3.8.14"},{"name":"rabbitmq_management","description":"RabbitMQ Management Console","version":"3.8.14"},{"name":"rabbitmq_management_agent","description":"RabbitMQ Management Agent","version":"3.8.14"},{"name":"rabbitmq_prelaunch","description":"RabbitMQ prelaunch setup","version":"3.8.14"},{"name":"rabbitmq_web_dispatch","description":"RabbitMQ Web Dispatcher","version":"3.8.14"},{"name":"ranch","description":"Socket acceptor pool for TCP protocols.","version":"1.7.1"},{"name":"recon","description":"Diagnostic tools for production use","version":"2.5.1"},{"name":"sasl","description":"SASL CXC 138 11","version":"4.0.1"},{"name":"ssl","description":"Erlang/OTP SSL application","version":"10.2.4"},{"name":"stdlib","description":"ERTS CXC 138 10","version":"3.14"},{"name":"stdout_formatter","description":"Tools to format paragraphs, lists and tables as plain text","version":"0.2.4"},{"name":"syntax_tools","description":"Syntax tools","version":"2.4"},{"name":"sysmon_handler","description":"Rate-limiting system_monitor event handler","version":"1.3.0"},{"name":"tools","description":"DEVTOOLS CXC 138 16","version":"3.4.3"},{"name":"xmerl","description":"XML parser","version":"1.3.26"}],"contexts":[{"description":"RabbitMQ Management","path":"/","cowboy_opts":"[{sendfile,false}]","port":"15672"}],"log_files":["c:/Users/user/AppData/Roaming/RabbitMQ/log/rabbit@rmqserver.log","c:/Users/user/AppData/Roaming/RabbitMQ/log/rabbit@rmqserver_upgrade.log"],"db_dir":"c:/Users/user/AppData/Roaming/RabbitMQ/db/rabbit@rmqserver-mnesia","config_files":["c:/Users/user/AppData/Roaming/RabbitMQ/advanced.config"],"net_ticktime":60,"enabled_plugins":["rabbitmq_management"],"mem_calculation_strategy":"rss","ra_open_file_metrics":{"ra_log_wal":1,"ra_log_segment_writer":0},"name":"rabbit@rmqserver","type":"disc","running":true,"mem_used":386596864,"mem_used_details":{"rate":0.0},"fd_used":84,"fd_used_details":{"rate":0.2},"sockets_used":43,"sockets_used_details":{"rate":0.0},"proc_used":1134,"proc_used_details":{"rate":0.2},"disk_free":25087606784,"disk_free_details":{"rate":1638.4},"gc_num":328408882,"gc_num_details":{"rate":160.2},"gc_bytes_reclaimed":13607991394016,"gc_bytes_reclaimed_details":{"rate":10160481.6},"context_switches":971502146,"context_switches_details":{"rate":322.0},"io_read_count":1,"io_read_count_details":{"rate":0.0},"io_read_bytes":1,"io_read_bytes_details":{"rate":0.0},"io_read_avg_time":0.0,"io_read_avg_time_details":{"rate":0.0},"io_write_count":45,"io_write_count_details":{"rate":0.0},"io_write_bytes":193066,"io_write_bytes_details":{"rate":0.0},"io_write_avg_time":0.0,"io_write_avg_time_details":{"rate":0.0},"io_sync_count":45,"io_sync_count_details":{"rate":0.0},"io_sync_avg_time":0.0,"io_sync_avg_time_details":{"rate":0.0},"io_seek_count":31,"io_seek_count_details":{"rate":0.0},"io_seek_avg_time":0.0,"io_seek_avg_time_details":{"rate":0.0},"io_reopen_count":0,"io_reopen_count_details":{"rate":0.0},"mnesia_ram_tx_count":2257,"mnesia_ram_tx_count_details":{"rate":0.0},"mnesia_disk_tx_count":103,"mnesia_disk_tx_count_details":{"rate":0.0},"msg_store_read_count":0,"msg_store_read_count_details":{"rate":0.0},"msg_store_write_count":1,"msg_store_write_count_details":{"rate":0.0},"queue_index_journal_write_count":165,"queue_index_journal_write_count_details":{"rate":0.0},"queue_index_write_count":0,"queue_index_write_count_details":{"rate":0.0},"queue_index_read_count":0,"queue_index_read_count_details":{"rate":0.0},"io_file_handle_open_attempt_count":882,"io_file_handle_open_attempt_count_details":{"rate":0.0},"io_file_handle_open_attempt_avg_time":0.05442176870748299,"io_file_handle_open_attempt_avg_time_details":{"rate":0.0},"connection_created":2310,"connection_created_details":{"rate":0.0},"connection_closed":2268,"connection_closed_details":{"rate":0.0},"channel_created":2310,"channel_created_details":{"rate":0.0},"channel_closed":2267,"channel_closed_details":{"rate":0.0},"queue_declared":144281,"queue_declared_details":{"rate":0.0},"queue_created":663,"queue_created_details":{"rate":0.0},"queue_deleted":629,"queue_deleted_details":{"rate":0.0},"cluster_links":[],"metrics_gc_queue_length":{"connection_closed":0,"channel_closed":0,"consumer_deleted":0,"exchange_deleted":0,"queue_deleted":0,"vhost_deleted":0,"node_node_deleted":0,"channel_consumer_deleted":0}}]

@aslgithub
Copy link
Author

I don't know if this helps, as its not the matching example for the above, but this is the output from Telegraf v.1.18.1 for the node line

> rabbitmq_node,host=telegrafsrv,node=rabbit@rabbitmqsrv,url=http://rabbitmqsrv:15672 disk_free=25119473664i,disk_free_alarm=0i,disk_free_limit=50000000i,fd_total=65536i,fd_used=77i,gc_bytes_reclaimed=11665854057096i,gc_bytes_reclaimed_rate=1586580.8,gc_num=286486021i,gc_num_rate=51.4,io_read_avg_time=0i,io_read_avg_time_rate=0,io_read_bytes=1i,io_read_bytes_rate=0,io_write_avg_time=0i,io_write_avg_time_rate=0,io_write_bytes=128571i,io_write_bytes_rate=0,mem_alarm=0i,mem_allocated_unused=62739856i,mem_atom=1458513i,mem_binary=184018400i,mem_code=32810827i,mem_connection_channels=309212i,mem_connection_other=5894812i,mem_connection_readers=1195184i,mem_connection_writers=72076i,mem_limit=1717546188i,mem_metrics=627588i,mem_mgmt_db=6746188i,mem_mnesia=396016i,mem_msg_index=32648i,mem_other_ets=3775616i,mem_other_proc=38226968i,mem_other_system=14288500i,mem_plugins=15349184i,mem_queue_procs=781564i,mem_queue_slave_procs=0i,mem_reserved_unallocated=0i,mem_total=0i,mem_used=362479616i,mnesia_disk_tx_count=98i,mnesia_disk_tx_count_rate=0,mnesia_ram_tx_count=2138i,mnesia_ram_tx_count_rate=0,proc_total=1048576i,proc_used=1127i,run_queue=1i,running=1i,sockets_total=58893i,sockets_used=43i,uptime=3637605584i

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

@aslgithub, can you please attach a file with the JSON next time? :-)

I will try to fix the problem. The errors were there before, but they were silently dropped rendering all incorrectly typed values zero. 8-(

Starting with the easy one for me, we don't use Federation Links.
This data was not included in v1.18.1 output, but v1.19.0 is giving the error trying to include it.

/api/federation-links
{"error":"Object Not Found","reason":"Not Found"}

Well this is actually the reason for getting this error 2021-06-16T16:17:01Z E! [inputs.rabbitmq] Error in plugin: json: cannot unmarshal object into Go value of type []rabbitmq.FederationLink as this is not a valid federation-link object...
I will try to also deal with this.

@aslgithub Stay tuned for testing... ;-P

@helenosheaa
Copy link
Member

@srebhan thanks for working on this! Also staying tuned for testing :)

I was in the same situation as @aslgithub with no object found for federation-links.

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

@helenosheaa did you find a document stating the datatypes of the returned values (e.g. float vs. int)? I can now use the example posted by @aslgithub but we still don't know for certain if we miss out on other values currently being int64 that actually are float.

@helenosheaa
Copy link
Member

helenosheaa commented Jun 28, 2021

@srebhan I can't find useful documentation outside of https://www.rabbitmq.com/monitoring.html#node-metrics which isn't enlightening on int vs float. However if it's json that it's returning and the numeric type is "number", which is both int and float we may have to explicitly set it to float (if it can ever be a float). I was seeing the same issue for Memory.memory.totalas well as both io_write_avg_time and io_read_avg_time. But like you say there may be others we are missing.

@aslgithub
Copy link
Author

The API returned fields are listed as tables on this document, but not the field types.
https://rawcdn.githack.com/rabbitmq/rabbitmq-server/v3.8.17/deps/rabbitmq_management/priv/www/api/index.html

I cannot find the field types documented anywhere, so I posted the question on the RabbitMQ GitHub
rabbitmq/rabbitmq-server#3135

But no response

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

@aslgithub can you please attach a file with the result of /api/nodes/rabbit@rmqserver/memory?

@aslgithub
Copy link
Author

@srebhan Please find attached..
..sorry I saw the add link the last time in the toolbar but no attach option..I totally missed the the note at the bottom I could attached by drag and drop..

/api/nodes/rabbit@rmqserver/memory
rabbitmq_memory_json.txt

srebhan pushed a commit to HRI-EU/telegraf that referenced this issue Jun 28, 2021
@aslgithub
Copy link
Author

@srebhan I think this is a complete set of an example json of every API call from one of my servers

rabbitmq_api_overview.txt
rabbitmq_api_nodes.txt
rabbitmq_api_nodes_memory.txt
rabbitmq_api_queues.txt
rabbitmq_api_exchanges.txt
rabbitmq_api_federation-links.txt

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

@aslgithub is it ok to add those files to the repository as test-cases?

@aslgithub
Copy link
Author

@srebhan Let me check..

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

Oh and strangely you have

    "total": {
      "erlang": 324522240,
      "rss": 385548288,
      "allocated": 385548288
    }

in memory where the old test-case has only a single value. So is the total memory the sum of the three values in your case?

@aslgithub
Copy link
Author

@srebhan , I'm not too familiar with this, but it seems that they are not the sum of the values.
erlang = runtime used
allocated = runtime allocated
rss = resident set size (RSS) reported by the OS

From https://www.rabbitmq.com/memory-use.html
It looks as though the memory calculation strategy could be legacy, rss and allocated.
I'm currently running as rss; although its supposed to default to allocated for Windows..

The screenshot in the article that appears to show this is titled
Memory and binary heap breakdowns can be expensive to calculate and are produced on demand when the Update button is pressed

But the next screenshot shows the curl example to json with only the single Total field ...

Might need someone else here to give a comparison

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

Point taken. I'm now cascading and trying the estimations in the following order "rss", "allocated", "erlang".

Is it ok to use the data as test-case? If so I'm, ready to commit. :-)

@aslgithub
Copy link
Author

Sorry, I was trying to see if I could switch the calculation strategy to get a different result..

I've double checked, I think I have suitably redacted; go for it..

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

Once #9443 built you can use the binaries to test. At least it works with the testcase you provided. Regarding the federate error you can now exclude that metric using the new metric_exclude option (maybe we need to rename that option) and get rid of the error message.

@srebhan
Copy link
Member

srebhan commented Jun 28, 2021

Yeah, here you go. :-) Let me know if it fixes the regression.

@aslgithub
Copy link
Author

@srebhan Looks good, I've set the metric_exclude and #9443 runs without collection errors.
As best as I can compare given different collection results between runs the output looks similar and mem_total is now populated.
Attached two samples from each version, prefixed at the start of the line.
nodes_output.txt

The rest of the overview, exchange and queue output is the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/rabbitmq bug unexpected problem or unintended behavior regression something that used to work, but is now broken
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants