Description
Splitting out the processing as a separate task from the telemetry collection covered in #3056.
Processing
Because some of the records contain redundant data, structured queries can't generate the retention chart directly. Instead we'll run a daily scheduled web job to convert the records into a form that's easier to query.
As an example, using a rolling 4-day period (where `_` marks data outside the 4-day period), the following table shows the "real" user (not included in actual data) and corresponding logged activity. Each activity record lists the day it was submitted first, followed by the preceding days. The table also shows which records would be ignored due to being redundant with data submitted later.
Day | User | Activity Record | Redundant |
---|---|---|---|
1 | A | [1, _, _, _] | X |
1 | C | [1, _, _, _] | X |
2 | A | [1, 1, _, _] | |
2 | C | [1, 1, _, _] | X |
2 | D | [1, 0, _, _] | |
3 | B | [1, 0, 0, _] | X |
4 | A | [1, 0, 1, 1] | |
4 | B | [1, 1, 0, 0] | |
Alternatively, to show how the data aligns across days:

    1A          [1, _, _, _] X
    1C          [1, _, _, _] X
    2A       [1, 1, _, _]
    2C       [1, 1, _, _] X
    2D       [1, 0, _, _]
    3B    [1, 0, 0, _] X
    4A [1, 0, 1, 1]
    4B [1, 1, 0, 0]

Note that marking a record as redundant only means it matches the same usage pattern - it doesn't actually have to originate from the same user. Since record 4A ends in `1, 1`, it needs to cancel out a record from day 2 starting with `1, 1` and a record from day 1 starting with `1`. In this example the cancelled record from day 2 actually came from user C, but that's okay as balancing the numbers so each day's activity only gets counted once is what matters.
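The cancellation rule could be sketched in Python roughly as follows. This is only an illustration under assumptions, not the web job's actual implementation: the `(day, pattern)` tuple layout, the `mark_redundant` name, and the use of `None` for `_` are all hypothetical. Each surviving record cancels, for every earlier day on which it shows a `1`, one record submitted on that day whose leading entries match the rest of its pattern. Days are processed latest first, and the choice among identical candidates is arbitrary (here the last one listed, which happens to reproduce the X marks in the table).

```python
from collections import defaultdict

# Example data from the table: (day, pattern). pattern[0] is the submission
# day, pattern[k] is k days earlier, None stands for "_" (outside the period).
records = [
    (1, [1, None, None, None]),  # 1A
    (1, [1, None, None, None]),  # 1C
    (2, [1, 1, None, None]),     # 2A
    (2, [1, 1, None, None]),     # 2C
    (2, [1, 0, None, None]),     # 2D
    (3, [1, 0, 0, None]),        # 3B
    (4, [1, 0, 1, 1]),           # 4A
    (4, [1, 1, 0, 0]),           # 4B
]


def mark_redundant(records):
    """Return the indices of records made redundant by later records."""
    by_day = defaultdict(list)
    for idx, (day, _pattern) in enumerate(records):
        by_day[day].append(idx)

    redundant = set()
    # Latest day first, so a record is known to be redundant (or not)
    # before it gets a chance to cancel anything older.
    for day in sorted(by_day, reverse=True):
        for idx in by_day[day]:
            if idx in redundant:
                # The record that cancelled this one also covers its history.
                continue
            pattern = records[idx][1]
            for k in range(1, len(pattern)):
                if pattern[k] != 1:
                    continue  # inactive or unknown on that day: nothing to cancel
                suffix = pattern[k:]
                # Cancel one not-yet-cancelled record from day - k whose
                # leading entries equal the suffix (arbitrary pick: last listed).
                for cand in reversed(by_day.get(day - k, [])):
                    if cand in redundant:
                        continue
                    if records[cand][1][:len(suffix)] == suffix:
                        redundant.add(cand)
                        break
    return redundant


redundant = mark_redundant(records)
```

For the example data, `redundant` contains the indices of 1A, 1C, 2C, and 3B, leaving 2A, 2D, 4A, and 4B as the records that get counted.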
So to determine how many unique users were active for at least two days in this time period, we simply count how many non-redundant records have at least two `1`s within the four-day range. That's 2A, 4A, and 4B for a total of three unique users. The actual users who met this criterion were A, B, and C, but we don't need to know that - only how many of them there were.
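As a rough sketch of that count in Python, assuming the redundant records have already been filtered out (the `count_retained` name, the dictionary layout, and `None` standing in for `_` are illustrative choices, not part of the actual job):

```python
def count_retained(patterns, min_active_days=2):
    """Count records showing activity on at least min_active_days days."""
    # list.count(1) ignores 0 (inactive) and None (outside the period).
    return sum(1 for pattern in patterns if pattern.count(1) >= min_active_days)


# Non-redundant records from the example table.
non_redundant = {
    "2A": [1, 1, None, None],
    "2D": [1, 0, None, None],
    "4A": [1, 0, 1, 1],
    "4B": [1, 1, 0, 0],
}

retained = count_retained(non_redundant.values())
print(retained)  # 3 (records 2A, 4A, and 4B)
```

Raising `min_active_days` would give the count for other retention thresholds using the same set of non-redundant records.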