
Use APM metrics to introduce low-fi data layer for space reduction #104

@roncohen

Description


We should use metrics transaction timing data to have two layers of data fidelity in the APM UI.

We'd have a low-fi layer and a hi-fi layer.

Motivation

Today, most graphs in the APM UI are querying transaction documents. This works because we're sending up all transactions, even unsampled.

As part of #78 we also started sending up transaction timing data as a metricset. Some of the data shown in the APM UI can be calculated using this new timing data instead of the transaction documents.

This would allow users to get rid of the transaction documents early, say after 7 days, but still be able to derive value from the APM UI beyond this timeframe. Setting a separate ILM policy for transactions is already supported through a bit of manual work.
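As a rough illustration of the manual work involved, a separate ILM policy for transaction documents could roll indices over daily and delete them after 7 days (the policy structure is standard ILM; the specific ages, sizes, and the choice to apply it only to transaction indices are illustrative):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

The metrics indices would keep their own, longer-lived policy, which is what makes the two-layer retention possible.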

User experience on low-fi data

The idea would be that the low-fi layer is calculated from the metrics data, while all the data that requires the (unsampled) transactions will be part of the hi-fi layer.

From the new metricset, we can show


  • transactions per minute for each transaction group and across each type. The current UI shows transactions per minute per result type (2xx, 3xx, etc.). Result type is not currently a dimension in the new timing data, but we could add it as one.


  • transaction list, without percentiles


We'd be unable to show the transaction distribution chart or any samples:


If agents eventually support histograms as a metric type, we could encode the transaction duration as a histogram and show the transaction distribution even with only the low-fi data. This shouldn't be a blocker at the moment.

Querying

To keep things simple, the APM UI could always use the new metrics data to draw everything it can. We'd then fire off separate queries for the "hi-fi" data (percentiles, distribution chart, actual transaction samples, etc.). If hi-fi data is available for the given time range, the percentile lines show up on the graphs and so on. If not, we only show the low-fi data.
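The fallback logic above could be sketched roughly like this (all names here — `fetchLowFi`, `fetchHiFi`, `ChartData` — are invented for illustration, not Kibana APIs):

```typescript
interface ChartData {
  tpmSeries: number[];          // transactions per minute, from metricsets
  percentileSeries?: number[];  // percentile durations, only when hi-fi data exists
}

// Fire both queries concurrently; low-fi always renders, hi-fi overlays
// (percentiles, samples, etc.) are only attached when documents came back.
async function loadCharts(
  fetchLowFi: () => Promise<number[]>,
  fetchHiFi: () => Promise<number[]>
): Promise<ChartData> {
  const [tpmSeries, hiFi] = await Promise.all([fetchLowFi(), fetchHiFi()]);
  return hiFi.length > 0 ? { tpmSeries, percentileSeries: hiFi } : { tpmSeries };
}
```

Since both queries run in parallel, the low-fi case costs one extra (cheap, empty) query but no extra latency.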

That means if you pick a time range that has both low-fi and hi-fi data for the full time range, you'll see exactly what you see today.

If you go back far enough in time, only low-fi data is available, and you won't see percentiles, the distribution chart, etc.

If you select a time range that includes hi-fi data for only part of the range, the percentile lines might appear partway through a graph. The distribution chart is a particular complication because, unlike the graphs, the visualization gives no indication that the underlying data is partial. Users will be able to deduce that fact by looking at the other graphs on the same page.

We could try to detect that the data is partial and show a note. Detection could work by comparing the number of transaction documents we have with the transaction count reported by the metricsets. Probably not a blocker for the first version.
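The detection heuristic might look something like this (a sketch; the function name and the tolerance for drift between the two counts are assumptions):

```typescript
// If the transaction documents account for clearly fewer events than the
// metricsets report for the same range, the hi-fi data is partial and we
// could show a note in the UI.
function hiFiIsPartial(
  transactionDocCount: number,  // docs counted from transaction indices
  metricsetEventCount: number,  // transaction count summed from metricsets
  tolerance = 0.05              // allow small drift between the two sources
): boolean {
  if (metricsetEventCount === 0) return false; // no low-fi data to compare against
  return transactionDocCount < metricsetEventCount * (1 - tolerance);
}
```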

Transaction group list

The transaction group list presents a special problem here, as it would require us to merge the low-fi and hi-fi data in the list. I don't think the merge can be done in Elasticsearch.

Due to pagination etc., we'd need to ensure that the low-fi and hi-fi queries return data for the same transaction groups, and then merge the results in Kibana. We could potentially do this by sorting both lists by average transaction time, calculated on the metricset and transaction data respectively, and then doing the merge in Kibana. I have more thoughts on this, but we should probably do a POC to investigate feasibility.
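Assuming both queries do return the same groups, the Kibana-side merge itself is straightforward; the hard part the POC needs to answer is the pagination alignment. A minimal sketch (all types and field names are invented for illustration):

```typescript
interface LowFiGroup { name: string; avgDurationUs: number; tpm: number }
interface HiFiGroup { name: string; p95Us: number }
interface MergedGroup extends LowFiGroup { p95Us?: number }

// Low-fi is the source of truth for which groups exist and their order;
// hi-fi fields (e.g. percentiles) are attached where a matching group
// came back, and left undefined otherwise.
function mergeGroupLists(lowFi: LowFiGroup[], hiFi: HiFiGroup[]): MergedGroup[] {
  const hiFiByName = new Map(hiFi.map((g): [string, HiFiGroup] => [g.name, g]));
  return lowFi.map((g) => ({ ...g, p95Us: hiFiByName.get(g.name)?.p95Us }));
}
```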

Rollups

Introducing the low-fi layer as described above allows users to delete transaction data and still see low-fi data. I expect that will be a significant storage reduction for users that want to keep hi-fi data for, say, one week, and low-fi data for two months. Some users will want to keep low-fi data for much longer. For those users, applying rollups to the low-fi data to decrease time granularity will allow them to further reduce storage costs. Supporting rollups isn't something we'd need to do in the first phase.

The rollup feature includes functionality to transparently rewrite queries to search regular documents and rolled-up data at the same time, so the queries for low-fi data should mostly just work on rolled-up data. There are some improvements to rollups coming which we should probably wait for before investigating further: elastic/elasticsearch#42720

Future

When elastic/elasticsearch#33214 arrives, agents could start sending up transaction duration histograms, and we'd be able to move the percentiles and distribution chart into the low-fi layer. We'd also be able to stop sending up unsampled transactions. The hi-fi layer would then consist only of actual transaction samples.
