Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sensitive Data Redaction #255
base: main
Are you sure you want to change the base?
Sensitive Data Redaction #255
Changes from 2 commits
798e427
885bbb9
87fe972
281d6b8
32f62b1
95c1252
2d508f8
dd32fef
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 Nice summary @svrnm!
I suggest that we turn these into a list of principles and call it out clearly in this OTEP.
Here are some examples I could think of:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About 2: while we cannot guarantee all SDKs will provide a bug-free implementation of the redaction logic, we could offer a logic to be followed by all SDKs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @reyang, that's a good proposal, will incorporate that (+ the suggestion from @jpkrohling which I agree to, I'll try to find some wording to express that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added those at the top of the document. I used "MUST avoid" vs "MUST not provide" since as @jpkrohling said this is impossible to guarantee, but we can put mechanisms in place to avoid them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel metrics should take an exception here due to the nature of cardinality and performance. Do we know if there were credential/privacy leaks in metrics systems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would need to research on that, but I think performance will be an issue for all signals, I added that to the trade-off section already. So I wonder if we need to do some experiments eventually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metrics API is designed to be much faster than other APIs (e.g. it's not uncommon to have metrics APIs that are 20x faster than tracing APIs), so the perf impact would be bigger if measured by the % drop of calls/second.
Cardinality could also be an issue, for example, if the intention is to get the "unique count of users via email address", the result will be
1
if email address got redacted to "redacted@email.address".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarification, that's a signification performance difference indeed.
On cardinality I think there are 2 options to address that:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we considering Baggage as a signal here? And would we consider in-band data (context propagation) as well as out-of-band data?
I think open-telemetry/opentelemetry-specification#1633 may benefit from any spec changes related to being able to specify sensitivity of Baggage attributes to be propagated across system boundaries. I don't think this OTEP would solve the context propagation boundary issue (Propagators would still need to define what they consider a system boundary) but I wonder, providing that Baggage is considered here, if the
SensitivityConfig
should contain something related to defining if a Baggage attribute is safe to propagate or not. I'm not convinced either way myself just now, but thought I'd raise it as we're saying "Every API".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very interesting thought. Since baggage may carry sensitive data it's worth looking into it (we might later to move into "future considerations"). @lmolkova example with
client.address
in #255 (comment) already called out that it might be relevant to add contextual information to the redaction, the baggage propagation is another example of that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's probably more of a spec concern and we'll have more time to design and polish API, but I'd like to entertain the idea of
SensitivityConfig
instrumentation can request fromTracer
SensitivityConfig
can be configured by the SDK/distro/etc and can have different implementations for different regulationsSensitivityConfig
has convenience methods to redact common things like URLs. E.g.String SensitivityConfig.redactQueryParams(Uri uri)
client.address
on the server spans is potentially PII) and could be applied implicitly without instrumentation codeE.g.
or
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a suggestion as an alternative to what I suggested or an addition? It looks like an addition (and I think it's great!), but I wanted to verify first before discussing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's both :)
I'd like to be able to add redaction/sanitization without requiring instrumentation to change their code or worry about it.
I.e. the alternative proposal is to build a reasonable story within existing API surface first.
But the proposal you made in the OTEL will still be necessary - we'll need to redact/sanitize custom attributes or to optimize and fine-tune instrumentation code for those who want it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throwing in one more suggestion - we can auto-generate semantic convention code along with the redaction.
Today we generate constants like:
we could instead generate a method
We still need to have a central config and a way to pass it around and make it accessible on the API level, but we can hide a lot of boilerplate and be able to partially automate things across languages.
/cc @lquerel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's an important feature! With me suggestions for how to write that down in the semantic conventions this should be feasible. Let me add some wording to make this clear.
Looking at the solutions I would prefer the first one over the updated code generation for the semantic conventions, i.e.
setAttribute
should do the redaction internally, because having a method per attribute seems excessive to me (and people will forget to use it, vs not using the constants comes with less penalties) and people can configure their own redaction through configuration (let's say the use an application they do not have code access and this application emits "enduser.id" without redaction, they can update their sensitivity config and make sure that it is redacted)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Library authors do not have enough context to make reduction decisions. The best they can do is tag attributes with some semantic annotation saying "this is sensitive" (e.g. in my company we have a taxonomy of "purpose policy" annotations for data elements). If I deploy this library as a user, I should be able to decide if I actually want "sensitive" to be redacted or not - I may be debugging locally and need to see all data, or maybe in production I have different pathways where the data flows, some require reduction and others require raw data.
And to add these annotations it's better to go with the schemas approach, without changing the API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we are talking about the same thing, @yurishkuro: Indeed a library author can not make reduction decision, this is the responsibility of the application operator, but the library author needs capabilities to express that some data may be sensitive and provide redaction options similar to what we want to do in the semantic conventions. Here is an example: A library author writes a method that calls a payment provider HTTP endpoint, which has a unique query parameter to carry the token (let's say
acmeToken
). That library author now wants to say "when I call this endpoint, and a HTTP span is generated, make sure thaturl.query
is redacted in a way thatacmeToken
is treated like other potentially sensitive attributes (see #971). Ideally they can do something similar to what is proposed for the semantic conventions, e.g. a way to write down what to do whenDEFAULT
,STRICT
orSTRICTER
(or any other profile is selected). Then, the end user of that library and of opentelemetry can make the decision of redaction.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@svrnm yes, this is a pretty complex use case, but I would still err on the side of a centralized solution over federating these responsibilities to the call sites. It's conceivable that the semantic convention / schema for
url.query
could be extended with conditional clauses like "if I am in this instrumentation scope or the URL contains this domain name, these are the additional rules / annotations for the query elements". And I think this approach is going to be necessary in many cases anyway, because the specific HTTP instrumentation may not have any clue about this business use case, e.g. it could be a generic HTTP client interceptor that generates client spans and records URL, and only user code would have the theoretical capability of understanding the sensitivity of the query params.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's what is part of this proposal, I need to optimize the wording for that, but eventually the configuration for redaction needs to be that fine-grained.
Yes, centralized is the core of it, but library/application authors still need a way to provide a hint that a redaction needs to be applied. Rethinking my example, we actually have the following situation:
A library author is likely using a HTTP client library as dependency themselves to interact with a 3rd partyAPI, so in some pseudo code they might have something like the following:
How would they implement that? They need some ways to annotate that. Also those annotations need to be composable somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/IIN/Issuer Identification Number (IIN)/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
applied
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this apply to the SDK globally or can be applied on specific pipelines/processors/exporters? e.g. if the goal is to send audit logs (which have EUII such as email address and IP address) to destination A without performing any redaction, and send normal logs to destination B with redaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion here was globally, but I see the point you are making for having it split out by exporter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the not on purpose?
It seems like at least
ALWAYS
could be useful, for example as a temporary mitigation when a leakage is discovered, and before it can be properly fixed.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what I wanted to express here is that ALWAYS and NEVER can not be reconfigured through
<SENSITIVITY_DETAILS>
as outlined above in the document. Because if allowed it would break expectationsThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that statement was a bit confusing to me as well, maybe it can be rephrased to express what you said above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need to add this into the document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the direction this proposal is going toward, however, I'd like to challenge the notion that redaction is treated as an instrumentation issue, as opposed to a configuration issue. I wonder if it would be possible to treat redaction as a configuration issue, like we do sampling.
Reading through this document and arriving at this part, I was thinking about a solution that relies solely on SDK components (without any API changes needed), by providing SDK redactors and configuring them via "redaction profiles" (which consists of mappings from attribute names to redaction rules, similar to the example below). SDKs could provide some default redaction profiles (for example for redacting strict, stricter, or never), which are based on definitions in semantic conventions.
The main challenges of this approach would be considering multiple versions of semantic conventions, furthermore it wouldn't be that straightforward for authors of instrumentation libraries to add redaction rules.
However, it would give the benefit to have redaction rules defined at a central place. Given that the examples above (different regular expressions for different redaction levels), it might be error prone to implement those consistently across all usages of an attribute in different instrumentation libraries in different languages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 90% of the cases, I would say the answer to this is yes, but especially library authors need ways to express additional redaction requirements very locally, e.g. if they call a 3rd party service via a HTTP GET and they know that in this one specific case they need to apply additional redaction. Without a functionality for that they can either over-configure (e.g. apply it globally but potentially have other calls redacted without the need for that) or do their own logic on top of what we provide them. Both things are OK and we might decide against having the API for that (in a first version, or forever), but then we need to indiciate that somehow in the specification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the updated version, I moved towards leading with the configuration centric approach and mentioned API capabilities towards the end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll eventually need to have something similar to this as well: semconv attributes would need a marker, about the common sensitivity of the attribute's value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to review those 2 PRs in detail and incorporate whatever we want to include here as well. I added a little bit of wording on classification (see my credit card example)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like there should still be some stability definition.
I wouldn't want to see something that was redacted suddenly not be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, maybe "adding redactions" may be outside, but "removing redactions" may require additional guard rails. It is probably unlikely that redactions will be removed, the only case I can think of is that something is deprecated for a very long time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will leave the question open, I can add some words on adding/removing redaction