Sensitive Data Redaction #255
Can you have this proposal be against semconv YAML model (or the model in general) vs. the documentation?
The model is what we'll be generating code against and what we'll be advertising via SchemaURL. It'd be a lot easier to comment/design around it.
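For illustration, a sketch of what a sensitivity annotation could look like in the semconv YAML model. This is purely hypothetical: the `sensitivity` and `default_redaction` properties and their values are not part of the current model, they only show the shape such a proposal might take.

```yaml
# Hypothetical extension of a semconv attribute definition.
# `sensitivity` and `default_redaction` are illustrative only.
groups:
  - id: registry.user
    type: attribute_group
    brief: Attributes describing a user.
    attributes:
      - id: user.email
        type: string
        brief: User email address.
        sensitivity: pii          # hypothetical: marks the data class
        default_redaction: hash   # hypothetical: suggested default action
```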
Yes, I can do that!
Personally, I would expand this section dedicated to semconvs by including the concept of a telemetry schema, which could eventually override what is defined in a semconv, but specifically for a service or application and for a given deployment. In a way, the previously presented configuration file could be reorganized or seen as a subset of this telemetry schema for a specific context.
@lquerel my knowledge about telemetry schemas is very limited; can you provide some examples and pointers to learn more about this? Overall I am a fan of making one thing a "subset of another", but I would need to understand the impact of that, both technically and organizationally (e.g., what would be the extended timeline for that?)
A reasonable default (or strict mode?) for PII data may be anonymization. If we defined an algorithm, we would be able to apply it consistently across implementations and preserve some level of observability.
What algorithm would you have in mind for anonymization? Hashing is often used for that, but depending on the underlying data it is only pseudonymization (there is a paper on that from the EU that I need to dig up); e.g., email addresses can be guessed easily by tuning your dictionaries.
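To make the "guessable by tuning your dictionaries" point concrete, here is a minimal Python sketch (not part of any proposal; the function name is made up) of why plain hashing is only pseudonymization: the hash is deterministic, so anyone who can enumerate likely inputs can reverse it.

```python
import hashlib

def pseudonymize(value: str) -> str:
    """Hash-based pseudonymization: deterministic, so it is reversible
    by anyone who can enumerate likely inputs."""
    return hashlib.sha256(value.encode()).hexdigest()

# An attacker with a dictionary of candidate emails can reverse the
# hashes simply by hashing each guess and comparing.
redacted = pseudonymize("alice@example.com")
dictionary = ["bob@example.com", "alice@example.com", "carol@example.com"]
rainbow = {pseudonymize(guess): guess for guess in dictionary}
assert rainbow[redacted] == "alice@example.com"  # identity recovered
```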
Good question, I don't have a specific algorithm in mind. My intention was to have some form of ANONYMIZE mode listed, but the details can be out of scope for this OTEP.
If you believe there is no common and reliable algorithm, then I take my comment back.
https://ec.europa.eu/justice/article-29/documentation/opinion-recommendation/files/2014/wp216_en.pdf is the document I was referring to, they give a really good overview of different approaches to Anonymization (e.g. stripping octets from IPs would fall into "Generalization" or "Aggregation") and also discuss Pseudonymization (e.g. hashing).
I don't know if this is doable; my point is that if we provide ANONYMIZE as an algorithm it should do exactly that (choosing from the approaches listed in that document), and if we want to use hashing etc., an appropriate name should be used (PSEUDONYMIZE?)
It's worth exploring this, but before doing that we need to have the tooling for it in place.
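As a rough sketch of the "Generalization" approach mentioned above (coarsening a value instead of replacing it, so some observability such as subnet-level grouping survives), assuming IPv4 dotted-quad input; the function name and parameter are made up for illustration:

```python
def generalize_ip(ip: str, keep_octets: int = 2) -> str:
    """Generalization: keep the first `keep_octets` octets of an IPv4
    address and zero out the rest, trading precision for anonymity."""
    octets = ip.split(".")
    kept = octets[:keep_octets] + ["0"] * (4 - keep_octets)
    return ".".join(kept)

print(generalize_ip("198.51.100.7"))  # 198.51.0.0
```

How many octets to strip is exactly the kind of policy decision that would live in local configuration rather than in the semconv itself.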
The anonymization method often depends on the underlying data type. The byte-stripping technique you mentioned is applicable to IPs but not necessarily to all data types. This raises the question of what we can support/propose based on the metadata available or not for an attribute (e.g., semconv or schema), for example.
I am thinking about that as well right now; I'll try to come up with a suggestion in the updated proposal.
Maybe I am misinterpreting the proposal, but at the semconv level, I don't think we should have something defining the action to take on an attribute or a body. Instead, we should have a property characterizing the potential sensitivity of the data carried by this attribute. In my view, the decision on what to do with this type of sensitive data (based on, e.g., a partial or complete combination of name, data type, and sensitivity level) should be determined by the profile or the local configuration/schema.
You are right, I plan to change this. I am considering suggesting that we keep some kind of "suggested default redaction" if that is feasible, but beyond that it's something that needs to be decided locally.
Can you call out how this works across signals?
Right now we have no direct ties between Span/Metric/Trace via the SDK. They actually only communicate with each other via the Context API.
The honest answer is "I don't know". @jmacd made some suggestions on this; I need to take a look into them.
For the trace SDK -- yes, this is functionality that is similar to samplers. The Sampling SIG has been working on requirements for a V2 sampler API, and I would include redaction functionality there. See open-telemetry/opentelemetry-specification#4012.
For the metrics SDK -- there will be questions about whether redaction happens before or after aggregation. The present SDK specification doesn't support a processor that runs before aggregation, and the filtering that is possible through metrics Views doesn't exactly support redaction, since modifications to the attribute set need to be considered before aggregation (or else re-aggregation has to be performed following redaction). There has been a requested feature, which goes by the name MeasurementProcessor in past discussions, that would be the proper place to (a) redact / drop attributes, (b) extend attributes from context.
In the logs SDK, this form of processing is the subject of an open discussion, see open-telemetry/opentelemetry-specification#4010.
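To picture where such a pre-aggregation hook would sit, here is a minimal Python sketch. The `MeasurementProcessor` name comes from the discussion above, but the interface is entirely hypothetical; no such API exists in the OpenTelemetry SDKs today.

```python
from typing import Mapping

# Hypothetical interface, sketched after the MeasurementProcessor idea
# discussed above; not an actual OpenTelemetry SDK API.
class MeasurementProcessor:
    def on_measurement(self, value: float, attributes: Mapping[str, str]):
        """Called before aggregation, so attribute changes are applied
        exactly once and no re-aggregation is needed afterwards."""
        raise NotImplementedError

class RedactingProcessor(MeasurementProcessor):
    def __init__(self, sensitive_keys: set):
        self.sensitive_keys = sensitive_keys

    def on_measurement(self, value, attributes):
        # Drop sensitive attributes before the aggregator sees them.
        kept = {k: v for k, v in attributes.items()
                if k not in self.sensitive_keys}
        return value, kept
```

A usage sketch: `RedactingProcessor({"user.email"})` applied to a measurement with attributes `{"user.email": ..., "http.route": ...}` would forward the value unchanged with only `http.route` remaining.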
Thanks, I think I wrote "similar to a sampler" without giving it a lot of thought beyond "this needs to be a component that can be evolved independently (like samplers) and is also independent of the signal".
I'll review your feedback and try to incorporate it into the document
The identification of sensitive data could also be applied at a semantic level. For example, as a company, I might want to apply a redaction rule to ALL URLs, another rule to ALL IP addresses, and so on. One possible approach would be to introduce a new type of group in the semconv that would allow the definition of semantic types, which could be attached to existing attributes in addition to the "physical" data type that we already have. The redaction rules would then apply to the semantic types.
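To sketch that idea in the semconv YAML, assuming a hypothetical new group kind and a hypothetical `semantic_type` reference on attributes (neither exists in the current model):

```yaml
# Hypothetical: a new group kind declaring a semantic type, plus an
# attribute referencing it in addition to its "physical" data type.
groups:
  - id: semantic.url
    type: semantic_type            # hypothetical group kind
    brief: Any attribute carrying a URL.
  - id: registry.http
    type: attribute_group
    brief: HTTP attributes.
    attributes:
      - id: url.full
        type: string                   # physical type, as today
        semantic_type: semantic.url    # hypothetical reference
        brief: Absolute URL of the request.
```

A company-wide redaction rule could then target `semantic.url` and apply to every attribute carrying that semantic type, regardless of its name.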
Yes 🙌 , this was one of the things I wanted to include in my updated proposal, based on your and @jsuereth's suggestion
Another complementary option could be to support this type of sensitive data management at the code generation level with a tool like Weaver.
My understanding from the discussions so far is that if we integrate the sensitive data details into the semconv, a tool like Weaver could be part of the solution rather than an alternative?