Bug 1635893 Implement DecryptAetIdentifiers #1304
Conversation
Supports Account Ecosystem Telemetry. This differs somewhat from the design in the open [draft docs PR](#1264), so those docs will be updated before deploying this.
Codecov Report
| Coverage Diff | master | #1304 | +/- |
|---|---:|---:|---:|
| Coverage | 80.34% | 83.44% | +3.09% |
| Complexity | 687 | 637 | -50 |
| Files | 93 | 70 | -23 |
| Lines | 4020 | 3104 | -916 |
| Branches | 407 | 322 | -85 |
| Hits | 3230 | 2590 | -640 |
| Misses | 645 | 386 | -259 |
| Partials | 145 | 128 | -17 |
```java
} else if (STRUCTURED_URI_PATTERN.matcher(uri).matches()) {
  processStructuredIngestionPayload(json);
} else {
  nonAetDocType.inc();
```
We pass through the message if it doesn't match any of the expected URI patterns. This shouldn't happen if the edge is routing messages correctly. Perhaps it would be better to err on the side of caution and raise IllegalAetPayloadException here so the payload gets sanitized and sent to error output. Or I suppose a new IllegalAetUriException would be more semantically appropriate.
Sending the message to the error stream may be more actionable to diagnose the problem. Knowing what documents passed through would be useful to determine why the counter goes up. On the other hand, once the document is redacted, it becomes useless for backfill.
I'm going to go ahead and change this to send to errors and redact the payload.
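A rough sketch of what that change could look like, purely for illustration; the surrounding method and the names TELEMETRY_URI_PATTERN, processTelemetryPayload, sanitizeJson, and IllegalAetUriException are assumptions, not the exact code in this PR:

```java
// Illustrative sketch only: route unexpected URIs to error output instead of passing
// the payload through.
private void routePayload(String uri, ObjectNode json) {
  if (TELEMETRY_URI_PATTERN.matcher(uri).matches()) {
    processTelemetryPayload(json);
  } else if (STRUCTURED_URI_PATTERN.matcher(uri).matches()) {
    processStructuredIngestionPayload(json);
  } else {
    // The edge should never route a non-AET URI here; rather than silently passing a
    // potentially sensitive payload through, redact it and raise so it lands in errors.
    sanitizeJson(json);
    throw new IllegalAetUriException("Unexpected URI for AET decryption: " + uri);
  }
}
```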
acmiyaguchi left a comment:
This looks good to me, and mostly straightforward. Is it correct to say that this new pipeline will handle all of the doctypes that contain AET identifiers now, instead of rerouting the pings back into the standard family pipelines?
| "type": "string" | ||
| } | ||
| }, | ||
| "ecosystemAnonId": false, |
This is to specify that this field must not exist?
That's correct. I wish I could add commentary in the schema to explain this method.
The `false` schema never validates, so if any of these fields is present, the payload will fail validation. I had previously used a `not` with required fields, but that only works for a single field: the payload would have to contain all of the listed required fields for the `not` to trigger. This seemed the most elegant way to define an arbitrary number of forbidden fields.
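A minimal illustration of the difference, using example field names rather than the full schema: each `false` schema rejects its field on its own, while the `not`/`required` form only rejects a payload that contains every listed field.

```json
{
  "type": "object",
  "properties": {
    "ecosystemAnonId": false,
    "ecosystemUserId": false
  }
}
```

versus

```json
{
  "not": { "required": ["ecosystemAnonId", "ecosystemUserId"] }
}
```

The first rejects a payload containing either field; the second only rejects one containing both.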
```java
// Note that we must work with the raw URI because ParseUri can raise errors that would
// route payloads containing anon_id values to error output; this transform must be placed
// before ParseUri so that we can handle redacting anon_id values.
```
However, if the values are redacted at this point in the pipeline, shouldn't it be fine to pass the URIs into ParseUri wholesale?
The input to DecryptAetIdentifiers is the raw payload. The idea is that either it's processed successfully within this transform such that all identifiers are properly decrypted, or we sanitize/drop the payload before passing to error output.
If we ran ParseUri first, we would have to make sure to process the errors from that transform to drop or redact the payload, which spreads responsibility around too much.
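To make the ordering concrete, here is a minimal sketch; the factory methods and arguments for DecryptAetIdentifiers and ParseUri are stand-ins, and the real Decoder wiring (including how failures are collected) differs.

```java
// Sketch of the intended transform ordering, not the actual Decoder job code.
pipeline
    .apply("ReadRaw", PubsubIO.readMessagesWithAttributes().fromSubscription(subscription))
    // Runs on the raw payload: identifiers are either decrypted successfully, or the
    // payload is sanitized/dropped before anything reaches error output.
    .apply("DecryptAetIdentifiers", DecryptAetIdentifiers.of(/* key config, etc. */))
    // Placed after decryption, so ParseUri failures can no longer leak anon_id values
    // into the error stream.
    .apply("ParseUri", ParseUri.of());
```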
I've updated this comment to hopefully be a bit more clear.
I will go ahead and add docs as part of this PR to make sure it's clear. AET is now a pipeline family; the edge will handle routing all AET doctypes to a raw topic for AET. There will be no raw BQ output for this pipeline family, and decryption will now happen within the Decoder job.
```json
  },
  "required": [
    "ecosystemClientId",
    "ecosystemDeviceId"
```
@jklukas I'm lightly embarrassed to be asking this question, but...what's the difference between "client id" and "device id" here?
I couldn't find a definition of "device id" in the "identifiers" spec doc, but it does confusingly talk about the device being identified by a "client id".
It's not obvious to me what the difference is here. I'll try to track down the history of docs to see where these two terms were introduced. I'm definitely open to removing ecosystemDeviceId if it's not meaningfully different from client id.
I think we should have just client_id and remove device_id, if that makes sense to you. That's certainly the only such value I had intended to submit with the first pings from Firefox desktop :-)
I have not been able to track down where ecosystem_device_id came from, so I'm inclined to go ahead and remove it. We can easily mark it as non-required immediately, and then I can follow up about recreating the tables to remove it from the schema completely.
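For concreteness, the immediate change would just drop ecosystemDeviceId from the required list, something like the following sketch (not the exact schema file):

```json
{
  "required": [
    "ecosystemClientId"
  ]
}
```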
Thanks, SGTM 👍