[fb-survey] bug in w*li raw and smoothed #47

Closed
@krivard

A known issue with the fb-survey is that if a user tagged for the survey forwards it to friends, their responses get tagged with the same identifier, but the fb-generated weight applies only to the response from the user originally selected for the survey.

Previously, we addressed this at the aggregation step by throwing out all but the earliest survey response for each identifier. This works, but is slow, since it requires loading the cumulative list of all tokens and their earliest known start dates.
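
For illustration, the keep-the-earliest rule can be sketched in a few lines of Python. This is not the pipeline's actual code: the field names `token`, `start`, and `weight`, and the function name `keep_earliest`, are assumptions made up for the example.

```python
from datetime import date


def keep_earliest(responses):
    """Return one response per token: the one with the earliest start date."""
    earliest = {}
    for resp in responses:
        token = resp["token"]
        # Strict "<" keeps the first-seen response when start dates tie.
        if token not in earliest or resp["start"] < earliest[token]["start"]:
            earliest[token] = resp
    return list(earliest.values())


# Hypothetical data: token "a1" was forwarded, so it appears twice.
responses = [
    {"token": "a1", "start": date(2020, 4, 6), "weight": 1.7},
    {"token": "a1", "start": date(2020, 4, 8), "weight": 1.7},  # forwarded copy
    {"token": "b2", "start": date(2020, 4, 8), "weight": 0.9},
]
deduped = keep_earliest(responses)
```

The cost noted above comes from `responses` having to cover the full history, not just the current day.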

While transitioning the system to pseudo-incremental processing (where we dump partially processed survey responses into a big bucket and store them for the next run, so that we only have to fully process roughly the last week's worth of data), I foolishly split off the step that generates the identifier list for a day in such a way that it is not anti-joined against the cumulative list. As a result, we have duplicate identifier-weight pairs for 33 surveys, going back to the very first week of the survey.

Recommended fix:

  • For past identifiers and weights, keep only the earliest identifier-weight pair in each set. This generates a bunch of edits that exceed our (arbitrary, but still) validation limits; details are below. Most of them are large enough that I'd be more comfortable noting them in the release notes than silently passing them through.
  • For future identifier lists, anti-join against the cumulative list.
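
As a rough sketch of the second bullet, an anti-join of one day's identifier-weight pairs against the cumulative list might look like the following in Python. The function name `new_identifiers` and the sample tokens are hypothetical, and the real pipeline may store these lists differently.

```python
def new_identifiers(todays_pairs, cumulative_tokens):
    """Anti-join: keep today's (token, weight) pairs whose token is not yet known."""
    seen = set(cumulative_tokens)
    fresh = []
    for token, weight in todays_pairs:
        if token in seen:
            continue
        seen.add(token)  # also drops repeats within the same day
        fresh.append((token, weight))
    return fresh


# Hypothetical data: "a1" already appeared on an earlier day.
cumulative = {"a1", "b2"}
today = [("a1", 1.7), ("c3", 1.2), ("c3", 1.2)]
fresh = new_identifiers(today, cumulative)

# Fold the new tokens into the cumulative list for the next run.
cumulative |= {token for token, _ in fresh}
```

Only the set of known tokens is needed for this check, so the daily step does not have to reload earlier responses in full.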

Objections?

Diffs by geo type:

  • hrr: 5 raw (below), 35 smoothed
  • msa: 2 raw (below), 14 smoothed
  • state: 4 raw (below), 22 smoothed
  • county: 3 raw (below), 15 smoothed
[1] "raw_wcli"
[1] "hrr"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-06    155 1.3361589 0.8031495           181                178.0729
2 2020-04-08    113 0.4751491 0.1059180          2820               1758.6626
3 2020-04-09    145 0.7352730 0.3614189           439                356.8198
4 2020-04-09    223 0.7999764 0.5034265          2179               1445.3862
5 2020-04-30     56 0.3355428 0.1329136          1157                865.6997
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 1.5321662 0.8219147           182                178.9988         TRUE
2 0.4728166 0.1057039          2821               1703.0832        FALSE
3 0.7287457 0.3597578           440                349.8120        FALSE
4 0.7975445 0.5019053          2180               1435.6523        FALSE
5 0.3336672 0.1327289          1158                851.6769        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                          FALSE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE
4       FALSE                FALSE                           TRUE
5       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "msa"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-08  47900 0.6446462 0.1258031      3912.972                2407.896
2 2020-04-30  31080 0.3344158 0.1191181      1557.844                1170.382
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.6423696 0.1254606      3913.972                2354.284        FALSE
2 0.3330240 0.1188767      1558.844                1156.206        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                           TRUE
2       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "state"
        date geo_id     val.x       se.x sample_size.x effective_sample_size.x
1 2020-04-08     md 0.5978971 0.10563249      5852.989               3858.5343
2 2020-04-09     md 0.6106332 0.23702405      4831.990               3285.0041
3 2020-04-10     nh 0.5424426 0.25401259       652.000                467.4285
4 2020-04-30     ca 0.3513494 0.06685219      8207.078               6220.2927
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.5965956 0.1054311      5853.989               3805.2662        FALSE
2 0.6097974 0.2366978      4832.990               3274.1975        FALSE
3 0.5608735 0.2585968       653.000                489.9669        FALSE
4 0.3510655 0.0668012      8208.078               6205.3173        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                           TRUE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE
4       FALSE                FALSE                           TRUE
[1] "raw_wcli"
[1] "county"
        date geo_id     val.x      se.x sample_size.x effective_sample_size.x
1 2020-04-06  17031 0.8046518 0.2241702     1166.0908                665.7235
2 2020-04-08  24021 0.3922447 0.2390339      324.7628                344.3633
3 2020-04-30  06037 0.3346056 0.1328722     1151.7269                864.2107
      val.y      se.y sample_size.y effective_sample_size.y val.mismatch
1 0.8248768 0.2249927     1167.0908                666.2835         TRUE
2 0.3815037 0.2223784      325.7628                382.5789        FALSE
3 0.3327233 0.1326897     1152.7269                850.0887        FALSE
  se.mismatch sample_size.mismatch effective_sample_size.mismatch
1       FALSE                FALSE                          FALSE
2       FALSE                FALSE                           TRUE
3       FALSE                FALSE                           TRUE

Labels

API change: Renames, large changes to calculations, large changes to affected regions
ready to 🚢: Deliver to marketing on ship day
