forked from newrelic/newrelic-fluent-bit-output
-
Notifications
You must be signed in to change notification settings - Fork 0
/
troubleshooting-dashboard.json.template
312 lines (312 loc) · 12.1 KB
/
troubleshooting-dashboard.json.template
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
{
"name": "New Relic Fluent Bit output plugin troubleshooting",
"description": null,
"permissions": "PUBLIC_READ_WRITE",
"pages": [
{
"name": "FB plugin instrumentation",
"description": null,
"widgets": [
{
"title": "",
"layout": {
"column": 1,
"row": 1,
"width": 4,
"height": 9
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.markdown"
},
"rawConfiguration": {
"text": "# What is this dashboard for\nThe purpose of this dashboard is to allow you to troubleshoot a malfunctioning Fluent Bit installation that uses the [New Relic Fluent Bit output plugin](https://github.com/newrelic/newrelic-fluent-bit-output).\n\nPlease note that the **metrics used by this dashboard must not be considered as a stable API: they can change its naming or dimensions at any time in newer plugin versions**. That is, **no critical alerts or long-term dashboard should be created out of them**.\n\nNote that **your plugin needs to be configured with `sendMetrics=true`** in order for the metrics used by this dashboard to be emitted.\n\n\n# Basic naming conventions\n- Fluent Bit aggregates logs in batches, also referred as **chunks**. Each chunk therefore contains an unknown amount of logs.\n- Chunks are received sequentially at the New Relic Fluent Bit output plugin, which takes care of reading the logs they contain and splitting them into the so-called New Relic *payloads*.\n- Each **payload** is a compressed stream of bytes that can be [at most 1MB long](https://docs.newrelic.com/docs/logs/log-api/introduction-log-api/#limits), and follows the [data format required by the Logs API](https://docs.newrelic.com/docs/logs/log-api/introduction-log-api/#json-content).\n\n\n# Error-detection graphs and recommended actions\n\nThe following are the main graphs used to detect potential problems in your log forwarding setup. Refer to each section to learn the recommended actions for each graph.\n\n## Payload packaging errors\nRepresents the percentage of Fluent Bit chunks that threw an error when they were attempted to be packaged as New Relic payloads. Such errors are never expected to happen. Therefore, **any value greater than 0% should be thoroughly investigated**.\n\nIf you find errors in this graph, please open a support ticket and include a sample of your logs for further investigation.\n\n## Payload sending errors\nRepresents the percentage of New Relic payloads that threw an unexpected error when they were attempted to be sent to New Relic. Such errors can happen sporadically: timeouts due to poor network performance or sudden network changes can cause them from time to time. Observing **values greater than 0% can sometimes be normal, but any value above 10% should be considered as an annomalous situation and should be thoroughly investigated**.\n\nIf you find errors in this graph, please ensure that you don't have any weak spots in your network path to New Relic: are you using a proxy? Is it or any network hop introducing too much latency due to being saturated? If you can't find anything on you side, please open a support ticket and include as much information as possible from your network setup.\n\n## Payload send results\nRepresents the amount of API requests that were performed to send logs to New Relic. **Ideally, you should only observe 202 responses here**. Sometimes, intermediary CDN providers can introduce some errors (503 error codes) from time to time, in which case your logs will not be lost and reattempted to be sent.\n\nIf you find a considerable amount of non-202 responses in this graph, please open a customer support ticket.\n\n# Additional troubleshooting graphs\n\nThe following graphs include additional fine-grained information that will be useful for New Relic to troubleshoot your potential installation issues.\n\n## Average timings\nRepresents the average amount of time the plugin spent packaging the log payloads and sending them to New Relic, respectively.\n\n## Accumulated time per minute\nRepresents the amount of time per minute the plugin spent packaging the log payloads and sending them to New Relic, respectively.\n\n## Payload size\nRepresents the size in bytes of the individual compressed payloads sent to New Relic.\n\n## Payload packets per Fluent Bit chunk\nRepresents the amount of payloads sent to New Relic per each Fluent Bit chunk."
}
},
{
"title": "Payload packaging errors",
"layout": {
"column": 5,
"row": 1,
"width": 2,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "FROM Metric SELECT percentage(count(`logs.fb.packaging.time`), WHERE hasError = true) AS 'packaging errors'"
}
],
"platformOptions": {
"ignoreTimeRange": false
},
"thresholds": [
{
"alertSeverity": "CRITICAL",
"value": 0
}
]
}
},
{
"title": "Payload sending errors",
"layout": {
"column": 7,
"row": 1,
"width": 2,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.billboard"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "FROM Metric SELECT percentage(count(`logs.fb.payload.send.time`), WHERE hasError = true) AS 'send errors'"
}
],
"platformOptions": {
"ignoreTimeRange": false
},
"thresholds": [
{
"alertSeverity": "WARNING",
"value": 0
},
{
"alertSeverity": "CRITICAL",
"value": 0.1
}
]
}
},
{
"title": "Payload send results",
"layout": {
"column": 9,
"row": 1,
"width": 4,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.line"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "SELECT rate(count(logs.fb.payload.send.time), 1 minute) AS 'Status Code' FROM Metric FACET CASES(WHERE statusCode = 0 AS 'Send error') OR statusCode timeseries max"
}
],
"platformOptions": {
"ignoreTimeRange": false
},
"units": {
"unit": "REQUESTS_PER_MINUTE"
},
"yAxisLeft": {
"zero": true
},
"yAxisRight": {
"zero": true
}
}
},
{
"title": "Average timings",
"layout": {
"column": 5,
"row": 4,
"width": 4,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.line"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "SELECT average(logs.fb.payload.send.time) AS 'Payload sending', average(logs.fb.packaging.time) AS 'Payload packaging' FROM Metric timeseries max"
}
],
"nullValues": {
"nullValue": "zero"
},
"platformOptions": {
"ignoreTimeRange": false
},
"units": {
"unit": "MS"
},
"yAxisLeft": {
"zero": true
},
"yAxisRight": {
"zero": true
}
}
},
{
"title": "Accumulated time per minute",
"layout": {
"column": 9,
"row": 4,
"width": 4,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.area"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "SELECT rate(sum(logs.fb.total.send.time), 1 minute) AS 'Sending', rate(sum(logs.fb.packaging.time), 1 minute) AS 'Packaging' FROM Metric TIMESERIES max"
}
],
"nullValues": {
"nullValue": "zero"
},
"platformOptions": {
"ignoreTimeRange": false
},
"units": {
"unit": "MS"
}
}
},
{
"title": "Payload size",
"layout": {
"column": 5,
"row": 7,
"width": 4,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.line"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "SELECT min(logs.fb.payload.size) AS 'Minimum', average(logs.fb.payload.size) AS 'Average', max(logs.fb.payload.size) AS 'Maximum' FROM Metric timeseries MAX "
}
],
"nullValues": {
"nullValue": "default"
},
"platformOptions": {
"ignoreTimeRange": false
},
"units": {
"unit": "BYTES"
},
"yAxisLeft": {
"zero": true
},
"yAxisRight": {
"zero": true
}
}
},
{
"title": "Payload packets per Fluent Bit chunk",
"layout": {
"column": 9,
"row": 7,
"width": 4,
"height": 3
},
"linkedEntityGuids": null,
"visualization": {
"id": "viz.line"
},
"rawConfiguration": {
"facet": {
"showOtherSeries": false
},
"legend": {
"enabled": true
},
"nrqlQueries": [
{
"accountIds": [
YOUR_ACCOUNT_ID
],
"query": "SELECT min(logs.fb.payload.count) AS 'Minimum', average(logs.fb.payload.count) AS 'Average', max(logs.fb.payload.count) AS 'Maximum' FROM Metric timeseries MAX "
}
],
"platformOptions": {
"ignoreTimeRange": false
},
"units": {
"unit": "COUNT"
},
"yAxisLeft": {
"zero": true
},
"yAxisRight": {
"zero": true
}
}
}
]
}
],
"variables": []
}