Description
Context
#2954 introduces the new experimental Cloud output with a Protobuf-based protocol.
Memory usage
After the first iteration, memory usage is higher than required. Trend metrics in particular make it very easy to saturate the bandwidth, with payloads ranging from tens of kilobytes up to the remote limit (1 MB).
We also decided to denormalize some fields to reduce the workload and keep the implementation simple on the remote server. However, the load this generates on the client is high, so we should revisit this decision.
Fault tolerance
The current flush process could be more fault tolerant: it doesn't retry on failures.
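As a rough illustration of what retrying could look like, here is a minimal sketch of a retry loop with exponential backoff. The names flushFunc, flushWithRetry and errTemporary are hypothetical and are not taken from the k6 code base. The attempt count and delays are arbitrary, and a real implementation would likely also need to tell retryable from non-retryable errors and respect context cancellation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTemporary is a hypothetical error used only to simulate a failed flush.
var errTemporary = errors.New("temporary failure")

// flushFunc is a hypothetical stand-in for whatever sends one batch
// of samples to the remote service.
type flushFunc func() error

// flushWithRetry calls flush up to maxAttempts times and doubles the
// wait between failed attempts. It returns nil on the first success.
func flushWithRetry(flush flushFunc, maxAttempts int, baseDelay time.Duration) error {
	if maxAttempts < 1 {
		maxAttempts = 1 // always try at least once
	}
	var err error
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = flush(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("flush failed after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulated flush: fails twice, then succeeds.
	err := flushWithRetry(func() error {
		calls++
		if calls < 3 {
			return errTemporary
		}
		return nil
	}, 5, 10*time.Millisecond)
	fmt.Println(calls, err) // prints "3 <nil>"
}
```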
Validation
__name__ and test_run_id are reserved labels for the remote service. If a test also sets them, the conflict causes unexpected behavior for the user. A more developer-friendly UX should be implemented.
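One possible shape for this check is an early validation that rejects the reserved names with an explicit error, instead of letting the conflict surface later. The following is only a sketch: reservedTags and validateTags are hypothetical names, and where such a check would actually hook into the output is not decided here.

```go
package main

import (
	"fmt"
	"sort"
)

// reservedTags holds the two labels described above as reserved
// for the remote service.
var reservedTags = map[string]struct{}{
	"__name__":    {},
	"test_run_id": {},
}

// validateTags returns an error that names every reserved tag found
// in tags, or nil when none is present.
func validateTags(tags map[string]string) error {
	var found []string
	for name := range tags {
		if _, ok := reservedTags[name]; ok {
			found = append(found, name)
		}
	}
	if len(found) == 0 {
		return nil
	}
	sort.Strings(found) // deterministic error message
	return fmt.Errorf("reserved tag names are not allowed: %v", found)
}

func main() {
	fmt.Println(validateTags(map[string]string{"scenario": "default"}))
	fmt.Println(validateTags(map[string]string{"test_run_id": "123", "__name__": "x"}))
}
```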
Proposal
We identified some actions that should drive us to the goal:
- A more compact Protobuf representation for Histogram.
- Split the flush into multiple requests when the number of time series exceeds the MaxMetricSamplesPerPackage variable.
- Normalize the fields common across time series as MetricSet fields.
- Fault-tolerant flush operation.
- Exclude __name__ and test_run_id from the allowed tag names.
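The request-splitting item above could be sketched roughly as follows. TimeSeries and splitIntoBatches are hypothetical placeholders, not the actual types of the output, and the limit passed in stands for the value of MaxMetricSamplesPerPackage. The sketch only partitions the slice; sending each batch is left out.

```go
package main

import "fmt"

// TimeSeries is a hypothetical placeholder for one time series
// collected by the output.
type TimeSeries struct {
	Name string
}

// splitIntoBatches partitions series into consecutive batches of at
// most maxPerBatch items. A limit below 1 is treated as "no limit".
func splitIntoBatches(series []TimeSeries, maxPerBatch int) [][]TimeSeries {
	if len(series) == 0 {
		return nil
	}
	if maxPerBatch < 1 {
		return [][]TimeSeries{series}
	}
	batches := make([][]TimeSeries, 0, (len(series)+maxPerBatch-1)/maxPerBatch)
	for start := 0; start < len(series); start += maxPerBatch {
		end := start + maxPerBatch
		if end > len(series) {
			end = len(series)
		}
		// Each batch is a sub-slice that shares the backing array with series.
		batches = append(batches, series[start:end])
	}
	return batches
}

func main() {
	series := make([]TimeSeries, 5)
	for i := range series {
		series[i] = TimeSeries{Name: fmt.Sprintf("metric_%d", i)}
	}
	for i, batch := range splitIntoBatches(series, 2) {
		fmt.Printf("request %d: %d time series\n", i+1, len(batch))
	}
}
```

With five time series and a limit of two, this yields three requests carrying 2, 2 and 1 time series.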
Acceptance criteria
Change the Cloud output default version to 2.