Description
Problem description
Because Request and Dependency events are collected automatically in App Insights, customers have little to no control over the amount of telemetry their application produces. By default we collect all requests and dependency calls, and there is no way to reduce event rates when the application in question is fairly large.
With the static sampling feature shipped in the [2.0 beta] Sdk in sprint 90 and the filtering feature shipped in early 91, we do give customers a way to control volumes, but questions remain.
If a customer attempts to use sampling, the very first problem they run into is which value to set as the sampling percentage. This can be determined by looking at the total volume of events over a fairly long period of time, but that presents a different problem. Say the application is bursty: it produces a lot of volume in a short period of time, followed by a long “quiet” period. With static sampling, the burst period would be sampled down and would still produce statistically correct data, but the “quiet” period may generate so few telemetry events that static sampling captures only a few events per day/hour and makes the data incorrect (skewed).
Solution
The proposed solution is to develop an “adaptive sampling” mechanism that would vary the sampling percentage based on the observed rate of telemetry events generated by the application. As the rate increases (burst situation), the sampling percentage decreases, capturing fewer events. When the rate of telemetry events drops (“quiet period”), the sampling percentage increases again, capturing more events to preserve statistical correctness of the data.
Thus, the sampling percentage may “float” based on the rate of telemetry events produced by the application. Since we’re addressing this on the Sdk side, the rate of events is actually the rate of events on that box/device, not across the entire application.
This solution will work for the .Net server-side Sdk. We can potentially port it to other server-side Sdks later.
The adaptive sampling module can be set up in code or in the configuration file. The default configuration file will include the adaptive sampling module by default.
Design details
The solution uses the existing concept of a “telemetry processor” found in the 2.0 beta Sdk of AI.
We will employ the existing sampling module to do the actual sampling and create another telemetry processor to do the math that determines which sampling percentage is to be applied in a given situation.
To do that, we’ll calculate an exponential moving average (see: https://en.wikipedia.org/wiki/Moving_average) of the telemetry items sent to the AI data collector (the rate of events after sampling). This number is available for as long as the application runs. When the application has just started, the event rate calculation starts from a reset state. As the application continues to run, the moving average reflects the rate of telemetry events more precisely.
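To illustrate the moving average described above, here is a minimal Python sketch (not the Sdk implementation). The class and method names are hypothetical. It assumes that the counter is incremented once per item sent after sampling and that an interval is closed each time the evaluation timer fires. The default ratio of 0.25 matches the MovingAverageRatio default.

```python
from typing import Optional


class ExponentialMovingAverageCounter:
    """Hypothetical counter: tracks an exponential moving average of per-interval item counts."""

    def __init__(self, ratio: float = 0.25) -> None:
        if not 0.0 < ratio <= 1.0:
            raise ValueError("ratio must be in (0, 1]")
        self.ratio = ratio                      # weight given to the most recent interval
        self.current_count = 0                  # items seen in the interval still open
        self.average: Optional[float] = None    # None until the first interval closes

    def increment(self) -> None:
        """Record one telemetry item sent after sampling."""
        self.current_count += 1

    def start_new_interval(self) -> float:
        """Close the open interval, fold its count into the average, and return the average."""
        count = self.current_count
        self.current_count = 0
        if self.average is None:
            # Just started (or just reset): the first interval seeds the average.
            self.average = float(count)
        else:
            self.average = self.ratio * count + (1.0 - self.ratio) * self.average
        return self.average
```

The returned value is an average count per interval. Dividing it by the interval length in seconds gives an items-per-second rate.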
Initially, the process will keep its state in memory without any serialization, so a restart of the application resets the state to the initial one.
Given the average rate of events produced and the effective sampling rate (the sampling rate currently set on the sampling telemetry processor), we can determine the ‘ideal’ sampling rate for the target event rate set as a configuration value. If the ‘ideal’ sampling percentage differs from the currently effective one, we’ll change the corresponding parameter of the sampling telemetry processor to the new value.
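One way to read the calculation above, as a hedged sketch: estimate the rate before sampling from the observed after-sampling rate and the current percentage, then pick the percentage that would bring that rate down to the target. The function name and the handling of a zero observed rate are assumptions made for illustration, not details taken from the design.

```python
def ideal_sampling_percentage(observed_rate_after_sampling: float,
                              current_percentage: float,
                              target_rate: float) -> float:
    """Return the percentage that would bring the estimated raw rate to target_rate.

    Assumes current_percentage is greater than zero.
    """
    if observed_rate_after_sampling <= 0.0:
        # Assumption: with nothing observed, everything can be kept.
        return 100.0
    # Estimated rate of events before sampling was applied.
    rate_before_sampling = observed_rate_after_sampling * 100.0 / current_percentage
    # A percentage cannot exceed 100; configured min/max limits are applied separately.
    return min(100.0, target_rate / rate_before_sampling * 100.0)
```

For example, with a target of 5 items per second, observing 10 items per second at 100% suggests 50%, and observing 10 items per second at 50% suggests 25%.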
The sampling percentage will not be changed constantly. Timeouts (different ones) will be applied before the sampling percentage may be [further] decreased or increased. Changing the sampling rate very frequently may result in bad behavior where related request/rdd/events are not sampled in or out together because the sampling percentage changed in between; therefore a certain timeout is needed.
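The timeout rule can be sketched as follows. This is an illustration only: the function name is hypothetical, and the two constants are the defaults given for SamplingPercentageDecreaseTimeoutSeconds and SamplingPercentageIncreaseTimeoutSeconds in this document.

```python
from datetime import datetime, timedelta

DECREASE_TIMEOUT = timedelta(minutes=2)    # wait before lowering the percentage again
INCREASE_TIMEOUT = timedelta(minutes=15)   # wait before raising the percentage again


def can_change_percentage(current: float, suggested: float,
                          last_change: datetime, now: datetime) -> bool:
    """Return True if enough time has passed since the last change to apply `suggested`."""
    if suggested == current:
        return False  # nothing to change
    elapsed = now - last_change
    if suggested < current:
        return elapsed >= DECREASE_TIMEOUT
    return elapsed >= INCREASE_TIMEOUT
```

Using a shorter timeout for decreases than for increases lets the percentage drop quickly during a burst while returning to a higher value more cautiously.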
Process parameters
The table below contains the entire set of parameters used in the process. All of them may be set from code or via the configuration file. The default configuration file will have only the MaxTelemetryItemsPerSecond parameter set explicitly, to its default value.
In addition to these parameters, a callback can be set in code. It is invoked every time the sampling percentage algorithm runs (in case the customer wants to track/trace sampling percentage change events).
- InitialSamplingPercentage (default: 100%) - Sampling percentage to apply when the application code starts and no state of the estimation process is available.
- MaxTelemetryItemsPerSecond (default: 5) - Target maximum number of telemetry items generated by a single box/device per second. This parameter is the main driving factor of sampling percentage changes. Generally speaking, if this parameter is set to 5 and we observe 10 telemetry events generated per second, we’ll set sampling percentage to 50%.
- MinSamplingPercentage (default: 0.1%) - As sampling percentage varies, what is the minimum value we’re allowed to set.
- MaxSamplingPercentage (default: 100%) - As sampling percentage varies, what is the maximum value we’re allowed to set.
- EvaluationIntervalSeconds (default: 15sec) - How frequently do we run sampling percentage evaluation algorithm (along with moving average algorithm).
- MovingAverageRatio (default: 0.25) - When calculating moving average of telemetry events submitted per second, how much “emphasis” to put on the most recent values vs. historical values. With default value we put 25% “emphasis” on the most recent value and 75% on historical values.
- SamplingPercentageDecreaseTimeoutSeconds (default: 2min) - When sampling percentage value changes, how soon after are we allowed to lower sampling percentage again to capture less data.
- SamplingPercentageIncreaseTimeoutSeconds (default: 15min) - When sampling percentage value changes, how soon after are we allowed to increase sampling percentage again to capture more data.
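For reference, the parameters and defaults listed above can be collected into a single settings object. The sketch below is a Python illustration, not the Sdk type: the class name, field names, and the `clamp` helper are hypothetical, and the timeouts are expressed in seconds (2 min = 120, 15 min = 900).

```python
from dataclasses import dataclass


@dataclass
class AdaptiveSamplingSettings:
    """Hypothetical container mirroring the parameter list and its defaults."""
    initial_sampling_percentage: float = 100.0
    max_telemetry_items_per_second: float = 5.0
    min_sampling_percentage: float = 0.1
    max_sampling_percentage: float = 100.0
    evaluation_interval_seconds: float = 15.0
    moving_average_ratio: float = 0.25
    sampling_percentage_decrease_timeout_seconds: float = 120.0
    sampling_percentage_increase_timeout_seconds: float = 900.0

    def clamp(self, percentage: float) -> float:
        """Keep a suggested percentage within the configured min/max limits."""
        return max(self.min_sampling_percentage,
                   min(self.max_sampling_percentage, percentage))
```

A caller who only wants a different target rate could construct `AdaptiveSamplingSettings(max_telemetry_items_per_second=10.0)` and leave everything else at its default.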
Sdk Api design
Similar to static sampling, we will provide a simple building block that lets customers quickly set up adaptive sampling in code. The TelemetryChannelBuilder class, which builds a list of telemetry processors, will receive a new extension UseAdaptiveSampling() with the following overloads:
- No parameters. Enables adaptive sampling with all default values for algorithm parameters. This is to be used by new customers primarily to “kick the tires”;
- maxTelemetryItemsPerSecond parameter. Enables adaptive sampling with all default parameters of the algorithm but a custom target telemetry item rate. We expect this one to be used by more advanced customers who either have fewer boxes/servers in the application and are willing to capture more telemetry per box, or the other way around.
- settings, callback parameters. An overload for full customization of the algorithm that allows setting all parameters (here “settings” is a set of settable properties corresponding to all parameters outlined above for the estimation algorithm). The callback parameter allows the customer to set up code that is invoked when the sampling percentage evaluation algorithm runs. The following parameters will be provided to the callback:
  - After-sampling rate of telemetry observed by the algorithm;
  - Current sampling percentage that the algorithm assumes is applied by the sampling telemetry processor;
  - New sampling percentage to set for the sampling telemetry processor in order to make the rate of events “ideal”;
  - Whether or not the sampling percentage will be changed after this evaluation (even though the current and new sampling percentages may differ, the new one may not be applied immediately due to ‘timeout’ or ‘penalty box’ situations, in other words, in cases where the sampling percentage was changed recently);
  - Current settings of the algorithm.
A separate telemetry processor, AdaptiveSamplingTelemetryProcessor, will also be provided. It is used by the extensions and is itself a combination of two telemetry processors: the existing sampling processor and the new sampling percentage estimator processor.
AdaptiveSamplingTelemetryProcessor also contains code that reacts to a sampling percentage change recommendation made by the [internal] estimator processor by setting the recommended value as the sampling percentage property of the sampling processor.
Sampling percentage evaluation algorithm
SamplingPercentageEstimatorTelemetryProcessor, a new [internal] class, implements the sampling percentage evaluation algorithm.
It sets up a timer to evaluate the sampling percentage and follows this set of steps every time the timer fires:
- Close the next interval of the ‘moving average’ counter and get the average observed after-sampling telemetry event rate;
- Calculate the suggested sampling percentage so that the rate would be “just under” the ‘ideal’ target rate provided; adjust this suggested percentage if it is below the min or above the max;
- Reset the timer if the evaluation frequency parameter was changed;
- See if the sampling percentage needs to be changed;
- Call the evaluation callback, if provided, suppressing all exceptions (we’re on a timer thread here, and if the callback throws, the process would die);
- If the sampling percentage can be changed (the suggested value is different from the current one and we’re not in any kind of ‘penalty box’), assume the sampling percentage was changed by the sampling telemetry processor (this is enforced by the public container telemetry processor), record the current value and the date of change, and also reset the moving average counter, since previous values were taken with a different sampling percentage.
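The steps above can be sketched as a single evaluation function. This is a simplified Python illustration under stated assumptions, not the internal class: all names are hypothetical, the observed rate is passed in by the caller (standing in for closing the moving average interval), the timer reset step is omitted, and the callback receives only four of the listed values. The caller is expected to reset its moving average counter when `evaluate` returns True.

```python
from datetime import datetime, timedelta
from typing import Callable, Optional


class SamplingPercentageEstimator:
    """Hypothetical, simplified stand-in for the estimator's timer callback."""

    def __init__(self, target_rate: float = 5.0, min_pct: float = 0.1,
                 max_pct: float = 100.0, initial_pct: float = 100.0,
                 decrease_timeout: timedelta = timedelta(minutes=2),
                 increase_timeout: timedelta = timedelta(minutes=15),
                 callback: Optional[Callable[[float, float, float, bool], None]] = None) -> None:
        self.target_rate = target_rate
        self.min_pct = min_pct
        self.max_pct = max_pct
        self.current_pct = initial_pct
        self.decrease_timeout = decrease_timeout
        self.increase_timeout = increase_timeout
        self.callback = callback
        self.last_change: Optional[datetime] = None

    def _can_change(self, suggested: float, now: datetime) -> bool:
        if suggested == self.current_pct:
            return False
        if self.last_change is None:
            return True  # never changed before, so no 'penalty box' applies
        timeout = (self.decrease_timeout if suggested < self.current_pct
                   else self.increase_timeout)
        return now - self.last_change >= timeout

    def evaluate(self, observed_rate: float, now: datetime) -> bool:
        """Run one evaluation; return True if the sampling percentage was changed."""
        # Suggest a percentage that brings the rate to the target, then clamp it.
        if observed_rate > 0.0:
            suggested = self.current_pct * self.target_rate / observed_rate
        else:
            suggested = self.max_pct  # assumption: nothing observed, keep as much as allowed
        suggested = max(self.min_pct, min(self.max_pct, suggested))

        will_change = self._can_change(suggested, now)

        # Never let a user callback take down the timer thread.
        if self.callback is not None:
            try:
                self.callback(observed_rate, self.current_pct, suggested, will_change)
            except Exception:
                pass

        if will_change:
            self.current_pct = suggested
            self.last_change = now
        return will_change
```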