docs: add Token SDK metrics reference and identify coverage gaps #1749
SuyashAlphaC wants to merge 2 commits
Catalog every metric the SDK emits (50 across driver services, the ttx transaction lifecycle, finality listener, versioned envelope sessions, the auditor service, token selection, certification, identity caches, and the Fabric-X finality queue), each with its type, labels, and description, in a new docs/metrics.md, linked from the monitoring guide.

Also add a coverage-gap section calling out the layers that are currently uninstrumented, ranked by impact:
- the storage/persistence layer (no metrics at all)
- the distributed auditor lock manager
- the standard Fabric network/approval path
- token-request validation/double-spend
- a transaction-level failure counter
- wallet/identity resolution

Closes LFDT-Panurus#1745

Signed-off-by: SuyashAlphaC <suyashagrawal862@gmail.com>
Hey @adecaro, could you please review this PR?
Recommended: request-approval and broadcast counters, durations, and error
counters for the Fabric network driver, mirroring the Fabric-X queue metrics.

### 4. Validation / double-spend
Do you mean the EndorserService in token/services/network/fabric/endorsement/provider.go?
Double-spending can only be enforced at commit time.

RequestApprovalView could also be instrumented. Remember, though, that the execution of views is itself instrumented in FSC directly.
Hi @SuyashAlphaC, I left some comments. Thanks for looking at this.
lifecycle, finality, auditing, selection). Several layers are currently
uninstrumented. The items below are ordered by impact.

### 1. Storage / persistence layer (highest priority)
Do you mean this in addition to what the DB can already provide?

Yes, on top of the DB-level metrics, not replacing them, for two reasons:
- Semantic labels the DB can't synthesize cheaply. postgres_exporter / pg_stat_statements give you INSERT INTO tokens row counts and query latency, but no per-store / per-operation domain labels (store=tokendb,operation=write vs store=ttxdb,operation=write) without per-query parsing rules you'd have to maintain by hand and re-tune on every schema change.
- sqlite / embedded backends. There's no exporter there, so SDK-level counters are the only option if you want any signal at all.

So it's complementary: DB exporter for IO/contention/query stats, SDK counters for the application-level semantics.
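To make the "semantic labels" point concrete, here is a minimal, stdlib-only sketch of a counter keyed by `store` and `operation`. The class `LabeledCounter` and the metric name `store_operations_total` are hypothetical and exist only for illustration; the SDK itself would use its own Go metrics provider rather than anything like this.

```python
from collections import Counter


class LabeledCounter:
    """Tiny illustrative counter with a fixed set of label names (hypothetical)."""

    def __init__(self, name, label_names):
        self.name = name
        self.label_names = tuple(label_names)
        self._values = Counter()

    def inc(self, amount=1, **labels):
        # Require exactly the declared labels so every series has the same shape.
        if set(labels) != set(self.label_names):
            raise ValueError(f"expected labels {self.label_names}, got {tuple(labels)}")
        key = tuple(labels[n] for n in self.label_names)
        self._values[key] += amount

    def render(self):
        # Emit one line per label combination, in Prometheus text-format style.
        lines = []
        for key in sorted(self._values):
            pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
            lines.append(f"{self.name}{{{pairs}}} {self._values[key]}")
        return lines


# Hypothetical metric name; the store/operation values are the ones from the comment.
store_ops = LabeledCounter("store_operations_total", ("store", "operation"))
store_ops.inc(store="tokendb", operation="write")
store_ops.inc(store="ttxdb", operation="write")
```

The application knows which store and operation it is in at the call site, which is exactly the information a DB exporter would have to reconstruct from query text.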
Review fixes on docs/metrics.md from @adecaro:
- Storage gap (section 1): clarify that the SDK-level counters complement, rather than duplicate, what postgres_exporter / pg_stat_statements already give. The SDK metrics add semantic labels the DB layer cannot infer without per-query parsing, and they are the only source of metrics when the backend is sqlite or another embedded store with no exporter.
- Endorser path (former sections 3 + 4 merged): drop the imprecise "validation / double-spend" framing; describe the actual EndorserService in token/services/network/fabric/endorsement/provider.go and the RequestApprovalView in fsc/initiator.go, and call out that double-spend is a commit-time concern enforced by Fabric / the token chaincode, not by an SDK-level metric.
- Add a global note that FSC instruments view execution at the platform layer, so the suggestions here stay domain-specific instead of duplicating it.

Also adds a ready-to-import Grafana dashboard covering all 50 catalogued metrics under docs/monitoring/grafana/, with multi-select template variables for network/channel/namespace/method, plus a README describing the panel layout and import steps. Linked from docs/metrics.md.

Signed-off-by: SuyashAlphaC <suyashagrawal862@gmail.com>
Hey @adecaro! Could you please review the recent changes I made to the docs in response to your comments? 🙏

Hey @adecaro, are there any further changes you'd like from my side on this PR?

Hi @SuyashAlphaC, thanks for this effort. Let me ask @AkramBitar if he can verify the Grafana dashboard in our deployment to see how it looks. Thanks @AkramBitar 🙏
| `finality_listener_confirmed_total` | Counter | Transactions confirmed on the ledger and committed to local storage |
| `finality_listener_deleted_total` | Counter | Transactions marked deleted due to an invalid ledger status or token-request hash mismatch |
| `finality_listener_hash_mismatch_total` | Counter | Transactions rejected because the committed token-request hash did not match the local one |
| `finality_listener_retry_exhausted_total` | Counter | Transactions abandoned after all finality-processing retries were exhausted |
For finality I found the following. As you can see:
- None of the metrics above are in the list.
- The list contains new metrics that are not described in the doc:
  fts_services_network_fabricx_finality_queue_finality_queue_pending_events
  fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds
  fts_services_network_fabricx_finality_queue_finality_queue_processing_errors_total
- The prefix is fts_services_network_fabricx_, followed by the metric itself. We need to verify whether all the metrics listed in the table above have the same prefix, since I do not see them in the Prometheus query.
# HELP fts_services_network_fabricx_finality_queue_finality_queue_pending_events Current number of finality events waiting in the queue buffer
# TYPE fts_services_network_fabricx_finality_queue_finality_queue_pending_events gauge
fts_services_network_fabricx_finality_queue_finality_queue_pending_events 0
# HELP fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds Histogram of successful event processing time in worker goroutines (seconds)
# TYPE fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds histogram
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.001"} 0
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.005"} 0
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.01"} 0
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.025"} 0
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.05"} 2
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.1"} 17
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.25"} 5327
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="0.5"} 6372
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="1"} 6387
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="2.5"} 6406
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="5"} 6408
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_bucket{le="+Inf"} 6409
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_sum 1324.906861666999
fts_services_network_fabricx_finality_queue_finality_queue_processing_duration_seconds_count 6409
# HELP fts_services_network_fabricx_finality_queue_finality_queue_processing_errors_total Total number of errors returned by event.Process in worker goroutines
# TYPE fts_services_network_fabricx_finality_queue_finality_queue_processing_errors_total counter
fts_services_network_fabricx_finality_queue_finality_queue_processing_errors_total 6393
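As a reading aid for the histogram samples above, the sketch below derives the mean processing time and an estimated quantile from the cumulative buckets. The function names are illustrative, and the interpolation is only similar in spirit to PromQL's `histogram_quantile` (linear within the bucket that contains the target rank), not a reimplementation of it.

```python
import math

# Cumulative bucket counts copied from the scrape above: (upper bound in seconds, count).
BUCKETS = [
    (0.001, 0), (0.005, 0), (0.01, 0), (0.025, 0), (0.05, 2), (0.1, 17),
    (0.25, 5327), (0.5, 6372), (1.0, 6387), (2.5, 6406), (5.0, 6408),
    (math.inf, 6409),
]
TOTAL_SUM = 1324.906861666999   # ..._duration_seconds_sum
TOTAL_COUNT = 6409              # ..._duration_seconds_count


def mean_seconds(total_sum, total_count):
    """Average observation: sum divided by count."""
    return total_sum / total_count


def estimate_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the matching bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No upper bound to interpolate towards; fall back to the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


mean = mean_seconds(TOTAL_SUM, TOTAL_COUNT)
median = estimate_quantile(0.5, BUCKETS)
```

With these samples the mean comes out at roughly 0.21 s and the estimated median at roughly 0.19 s, both inside the 0.1 to 0.25 s bucket where most observations fall.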

| Metric | Type | Description |
|--------|------|-------------|
| `endorsed_transactions` | Counter | Number of endorsed transactions |
All the metrics in the deployment environment carry a prefix before the metric name itself; for the metrics in this table the prefix is fts_services_ttx_.
For example:
**fts_services_ttx_**endorsed_transactions
**fts_services_ttx_**endorsement_duration_seconds{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10021", job="FSC.issuer", namespace="tokenchaincode", network="mytopos"}

| Metric | Type | Description |
|--------|------|-------------|
| `issue_service_operations_total` | Counter | Total IssueService method invocations |
All the metrics in the deployment environment start with the prefix fts_core_common_metrics_, followed by the metric name itself.
Example 1:
**fts_core_common_metrics_**issue_service_operations_total{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10021", job="FSC.issuer", method="DeserializeIssueAction", namespace="tokenchaincode", network="mytopos"} 7492
fts_core_common_metrics_issue_service_operations_total{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10021", job="FSC.issuer", method="Issue", namespace="tokenchaincode", network="mytopos"} 3706
fts_core_common_metrics_issue_service_operations_total{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", method="DeserializeIssueAction", namespace="tokenchaincode", network="mytopos"} 11118
fts_core_common_metrics_issue_service_operations_total{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", method="VerifyIssue", namespace="tokenchaincode", network="mytopos"}
Example 2:
fts_core_common_metrics_auditor_service_operations_total{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", method="AuditorCheck", namespace="tokenchaincode", network="mytopos"}
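One way to check the documented names against a live scrape, sketched here under assumptions: match each bare name from docs/metrics.md against the scraped names by suffix and report the prefix that was found. The two lists are small hand-picked samples taken from this thread, and `find_prefixes` is a hypothetical helper, not part of the SDK.

```python
# Bare names as they appear in docs/metrics.md (sample only).
DOCUMENTED = [
    "issue_service_operations_total",
    "cache_level",
    "finality_queue_pending_events",
    "ttx_envelope_sent_total",
]

# Full names as reported from the deployment in the review comments (sample only).
SCRAPED = [
    "fts_core_common_metrics_issue_service_operations_total",
    "fts_core_common_metrics_cache_level",
    "fts_core_common_metrics_recipient_data_cache_level",
    "fts_services_network_fabricx_finality_queue_finality_queue_pending_events",
]


def find_prefixes(documented, scraped):
    """Map each documented name to the prefix seen in the scrape, or None if absent."""
    prefixes = {}
    for name in documented:
        candidates = [s for s in scraped if s == name or s.endswith("_" + name)]
        if not candidates:
            prefixes[name] = None
            continue
        # Prefer the shortest match so `cache_level` does not pick up
        # `recipient_data_cache_level`, which shares the same suffix.
        full = min(candidates, key=len)
        prefixes[name] = full[: len(full) - len(name)]
    return prefixes


for name, prefix in find_prefixes(DOCUMENTED, SCRAPED).items():
    print(f"{name}: {prefix if prefix is not None else 'not found in scrape'}")
```

A name that maps to `None` is either not emitted in that deployment or is registered under a different name, which is the distinction the review comments ask to verify in the code.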
"options": {"legend": {"showLegend": true, "placement": "bottom"}},
"targets": [
{"refId": "issue", "expr": "sum(rate(issue_service_operations_total{network=~\"$network\",channel=~\"$channel\",namespace=~\"$namespace\",method=~\"$method\"}[$__rate_interval]))", "legendFormat": "issue"},
{"refId": "transfer", "expr": "sum(rate(transfer_service_operations_total{network=~\"$network\",channel=~\"$channel\",namespace=~\"$namespace\",method=~\"$method\"}[$__rate_interval]))", "legendFormat": "transfer"},
The metric names need to be fixed: they all have a prefix. See my comments on the docs/metrics.md file.
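The fix requested here could be applied mechanically to the dashboard expressions. The following is a rough sketch: `add_prefixes` is a hypothetical helper, and the prefix table is an assumption built from the review comments (the `fts_core_common_metrics_` prefix for `issue_service_operations_total` was observed in the deployment; the entry for `transfer_service_operations_total` is assumed by analogy and should be confirmed in the code).

```python
import re

# Bare metric name -> prefix. Only the first entry is confirmed by the scrape
# quoted in this thread; the second is an assumption.
PREFIXES = {
    "issue_service_operations_total": "fts_core_common_metrics_",
    "transfer_service_operations_total": "fts_core_common_metrics_",
}


def add_prefixes(expr, prefixes):
    """Prefix every bare metric name in a PromQL expression; already-prefixed names are left alone."""
    names = sorted(prefixes, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9_:])("
        + "|".join(re.escape(n) for n in names)
        + r")(?![A-Za-z0-9_:])"
    )
    return pattern.sub(lambda m: prefixes[m.group(1)] + m.group(1), expr)


expr = 'sum(rate(issue_service_operations_total{network=~"$network"}[$__rate_interval]))'
fixed = add_prefixes(expr, PREFIXES)
```

The lookbehind and lookahead keep the rewrite idempotent: a name that is already preceded by `_` from an existing prefix is not matched again, so running the helper twice over the dashboard JSON gives the same result.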
|--------|------|-------------|
| `auditor_audit_duration_seconds` | Histogram | Audit() processing time per transaction, including lock acquisition |
| `auditor_audit_lock_conflicts_total` | Counter | Audit() calls that failed to acquire enrollment-ID locks |
| `auditor_append_duration_seconds` | Histogram | Append() processing time per transaction |
These start with the fts_services_auditor_ prefix.
See fts_services_auditor_auditor_append_duration_seconds and the samples below:
fts_services_auditor_auditor_duration_count{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw"} 10108
fts_services_auditor_auditor_duration_sum{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw"} 1127.6164152429933
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.005"} 128
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.01"} 228
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.025"} 230
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.05"} 230
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.1"} 6447
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.25"} 9997
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="0.5"} 10069
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="1.0"} 10093
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="2.5"} 10104
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="5.0"} 10105
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="10.0"} 10108
fts_services_auditor_auditor_duration_bucket{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw", le="+Inf"} 10108
fts_services_auditor_auditor_operations{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw"} 10108
fts_services_auditor_auditor_releases_total{instance="dectrust20.vpc.cloud9.ibm.com:10031", job="FSC.dw"}

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `unspent_tokens_invocations` | Counter | `fetcher_type` | Number of unspent-token fetch invocations |
These start with the fts_services_selector_sherdlock_ prefix (this needs to be double-checked in the code).
See:
fts_services_selector_sherdlock_unspent_tokens_invocations{fetcher_type="eager", instance="dectrust20.vpc.cloud9.ibm.com:10021", job="FSC.issuer"} 3302
fts_services_selector_sherdlock_unspent_tokens_invocations{fetcher_type="eager", instance="dectrust21.vpc.cloud9.ibm.com:10041", job="FSC.banka"}

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `certified_tokens` | Counter | `network`, `channel`, `namespace` | Number of tokens certified |
I did not find these in the Prometheus query. Please check in the code which prefix they have.

| Metric | Type | Labels | Source | Description |
|--------|------|--------|--------|-------------|
| `cache_level` | Gauge | `network`, `channel`, `namespace` | `token/services/identity/idemix/cache/metrics.go` | Fill level of the Idemix credential cache |
These start with the fts_core_common_metrics_ prefix.
See:
fts_core_common_metrics_cache_level{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10021", job="FSC.issuer", namespace="tokenchaincode", network="mytopos"} 4
fts_core_common_metrics_recipient_data_cache_level{channel="arma", instance="dectrust20.vpc.cloud9.ibm.com:10021", job="FSC.issuer", namespace="tokenchaincode", network="mytopos"}

| Metric | Type | Description |
|--------|------|-------------|
| `finality_queue_pending_events` | Gauge | Finality events currently waiting in the queue buffer |
These have the fts_services_network_fabricx_ prefix; see:
fts_services_network_fabricx_finality_queue_finality_queue_pending_events{instance="dectrust21.vpc.cloud9.ibm.com:10041", job="FSC.banka"}

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ttx_envelope_sent_total` | Counter | `version`, `type` | Versioned envelopes sent |
I did not find these in the Prometheus query. Please check in the code which prefix they have.

Hello @SuyashAlphaC, thanks a lot for your effort on this PR. Following @adecaro's request, I checked the dashboard in our deployment and did not see any data on it (see picture). I think this is related to my review comments. Please have a look at them.
Thanks a lot,

Closes #1745

Adds docs/metrics.md, a complete reference of the 50 metrics the SDK emits (driver services, ttx lifecycle, finality, envelope sessions, auditor, selection, certification, identity caches, Fabric-X queue), each with type, labels, and description. Linked from the monitoring guide.

Also adds a coverage-gap section ranking the uninstrumented layers (storage, the auditor lock manager, the standard Fabric network path, validation, a transaction-failure counter, and wallet resolution), with concrete suggested metrics for follow-up.