Sensitive Data Redaction #255

Draft
wants to merge 8 commits into
base: main
276 changes: 276 additions & 0 deletions text/0000-sensitive-data-redaction.md
@@ -0,0 +1,276 @@
# Sensitive Data Redaction

This is a proposal for adding treatment of sensitive data to the OpenTelemetry (OTel) Specification and semantic conventions.

## Motivation

When collecting data from an application, there is always the possibility that the data contains information that shouldn’t be collected, because it either leaks (parts of) credentials (passwords, tokens, usernames, credit card information) or can be used to uniquely identify a person (name, IP, email, credit card information), which may be protected by certain regulations. An end-user who adds OTel to a library or instrumentation faces exactly this challenge: values of an [Attribute](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/README.md#attribute) may carry such sensitive data.

While it’s ultimately the responsibility of the legal entity operating an application to protect sensitive data, end-users of OpenTelemetry (developers and operators working for that entity) are turning to the authors of OpenTelemetry, or to those of libraries that implement OpenTelemetry such as the Azure SDK, for means to redact or filter sensitive data. Without that capability, they will raise security issues and/or eventually drop OpenTelemetry because it does not meet their security or legal requirements.

In this OTEP you will find a proposal for adding treatment of sensitive data to OpenTelemetry.

By adding the proposed features, OpenTelemetry will provide its end-users the tooling needed to make sure that sensitive data is treated according to their requirements.

This will make sure that these end-users can use OpenTelemetry within their security and legal requirements and that OpenTelemetry (and implementing libraries) can avoid vulnerabilities.
Member

@reyang reyang May 3, 2024

💯 Nice summary @svrnm!

I suggest that we turn these into a list of principles and call it out clearly in this OTEP.

Here are some examples I could think of:

  1. OpenTelemetry MUST allow the end-user to meet with their security/privacy/compliance requirements regarding the data being collected.
  2. OpenTelemetry MUST not provide redaction offerings that lead to bigger security issues such as https://en.wikipedia.org/wiki/ReDoS (e.g. the redaction logic is poorly implemented, so a hacker could forge certain input to DDoS the redaction engine itself).
  3. OpenTelemetry SHOULD allow the telemetry data to apply different redaction logic per telemetry pipeline/exporter in a single process.

Member

About 2: while we cannot guarantee all SDKs will provide a bug-free implementation of the redaction logic, we could offer a logic to be followed by all SDKs.

Member Author

Thanks @reyang, that's a good proposal, will incorporate that (+ the suggestion from @jpkrohling which I agree to, I'll try to find some wording to express that)

Member Author

I added those at the top of the document. I used "MUST avoid" vs "MUST not provide" since as @jpkrohling said this is impossible to guarantee, but we can put mechanisms in place to avoid them


## Explanation

### Overview

This proposal aims to provide the following features:

- methods for consumers of the OpenTelemetry API to enrich collected Attributes with sensitivity information and hooks to apply different levels of redaction.
- related to those methods, a similar way to enrich attributes in the semantic conventions with sensitivity information and ways of redaction.
- a consistent way to configure sensitivity requirements for end-users of the OpenTelemetry SDK and in instrumentation libraries, including predefined “configuration profiles” and ways for fine-grained configuration.
- a redactor that implements the logic to apply redaction and that owns predefined helpers for redaction in the SDK (URL parameter filtering, zeroing IPs, etc.).

The following limitations apply:

- Only Attributes are treated, although it is possible that sensitive data is contained in other data generated by OpenTelemetry as well (e.g. the span name, or the instrumentation scope name could contain sensitive data)

### Annotate attributes with sensitivity information

As a first building block, the OpenTelemetry API needs to provide the capability to enrich collected attributes with sensitivity details and with potential hooks to redact them. Those API changes can then also be used to describe sensitivity information in the semantic conventions.

#### API

Every API that sets an attribute consisting of a key and a value needs to be enhanced with additional functionality that allows adding details about the potential sensitivity of this data and hooks describing how it may be redacted. This can be an additional set of parameters for an existing method or, if adding parameters to a method signature would lead to a breaking change, a method that can be called after the attribute has been set.
Member

I feel metrics should take an exception here due to the nature of cardinality and performance. Do we know if there were credential/privacy leaks in metrics systems?

Member Author

I would need to research on that, but I think performance will be an issue for all signals, I added that to the trade-off section already. So I wonder if we need to do some experiments eventually.

Member

Metrics API is designed to be much faster than other APIs (e.g. it's not uncommon to have metrics APIs that are 20x faster than tracing APIs), so the perf impact would be bigger if measured by the % drop of calls/second.

Cardinality could also be an issue, for example, if the intention is to get the "unique count of users via email address", the result will be 1 if email address got redacted to "redacted@email.address".

Member Author

Thanks for the clarification, that's a significant performance difference indeed.

On cardinality I think there are 2 options to address that:

  • if the redaction happens late (before exporting data) it is still possible to create that metric properly
  • there are options to redact emails and still keep their uniqueness, e.g. by hashing the emails (or parts of them); note that hashing is a suboptimal solution since the hashed words are easy to guess and break (use a dictionary with common first and last names, etc.)

Member Author

I still need to add this into the document.

Are we considering Baggage as a signal here? And would we consider in-band data (context propagation) as well as out-of-band data?

I think open-telemetry/opentelemetry-specification#1633 may benefit from any spec changes related to being able to specify sensitivity of Baggage attributes to be propagated across system boundaries. I don't think this OTEP would solve the context propagation boundary issue (Propagators would still need to define what they consider a system boundary) but I wonder, providing that Baggage is considered here, if the SensitivityConfig should contain something related to defining if a Baggage attribute is safe to propagate or not. I'm not convinced either way myself just now, but thought I'd raise it as we're saying "Every API".

Member Author

That's a very interesting thought. Since baggage may carry sensitive data it's worth looking into it (we might later move it into "future considerations"). @lmolkova's example with client.address in #255 (comment) already called out that it might be relevant to add contextual information to the redaction; baggage propagation is another example of that.

Member Author

I still need to add this into the document.


Setting the sensitivity information together with a span attribute might look like the following:

```
span.setAttribute("url.query", url.toString(), <SENSITIVITY_DETAILS>);
Contributor

@lmolkova lmolkova Apr 30, 2024

It's probably more of a spec concern and we'll have more time to design and polish API, but I'd like to entertain the idea of

  • A SensitivityConfig that instrumentation can request from the Tracer
  • SensitivityConfig can be configured by the SDK/distro/etc and can have different implementations for different regulations
  • SensitivityConfig has convenience methods to redact common things like URLs. E.g. String SensitivityConfig.redactQueryParams(Uri uri)
  • It may recognize common attributes which are known to be problematic (e.g. client.address on the server spans is potentially PII) and could be applied implicitly without instrumentation code

E.g.

Span clientSpan = tracer.startSpan("GET", CLIENT);
String redactedUri = tracer.getSensitivityConfig().redactQueryParams(uri);
clientSpan.setAttribute("url.full", redactedUri);

or

Span serverSpan = tracer.startSpan("GET /foo", SERVER);

// the `SensitivityConfig` associated with the tracer knows that `client.address` 
// may be sensitive and would apply whatever it's configured to do
// (allow, drop, anonymize, etc.)
serverSpan.setAttribute("client.address", request.getClientIp()); 

Member Author

Is this a suggestion as an alternative to what I suggested or an addition? It looks like an addition (and I think it's great!), but I wanted to verify first before discussing.

Contributor

I think it's both :)

I'd like to be able to add redaction/sanitization without requiring instrumentation to change their code or worry about it.

I.e. the alternative proposal is to build a reasonable story within existing API surface first.
But the proposal you made in the OTEP will still be necessary: we'll need to redact/sanitize custom attributes or to optimize and fine-tune instrumentation code for those who want it.

Contributor

@lmolkova lmolkova May 1, 2024

Throwing in one more suggestion - we can auto-generate semantic convention code along with the redaction.

Today we generate constants like:

class UrlAttributes {
   public static final String URL_FULL = "url.full";
}

we could instead generate a method

class UrlAttributes {
   public static final String URL_FULL = "url.full";
   public static void recordUrlFull(Span span, Uri uri) {
       String redacted = getSensitivityConfig().redactQueryParams(uri);
       span.setAttribute(URL_FULL, redacted);
   }
}

We still need to have a central config and a way to pass it around and make it accessible on the API level, but we can hide a lot of boilerplate and be able to partially automate things across languages.

/cc @lquerel

Member Author

> I'd like to be able to add redaction/sanitization without requiring instrumentation to change their code or worry about it.

Yes, that's an important feature! With my suggestions for how to write that down in the semantic conventions this should be feasible. Let me add some wording to make this clear.

Looking at the solutions, I would prefer the first one over the updated code generation for the semantic conventions, i.e. setAttribute should do the redaction internally. Having a method per attribute seems excessive to me (and people will forget to use it, whereas not using the constants comes with fewer penalties), and people can configure their own redaction through configuration (let's say they use an application they do not have code access to, and this application emits "enduser.id" without redaction: they can update their sensitivity config and make sure that it is redacted).

Member

Library authors do not have enough context to make redaction decisions. The best they can do is tag attributes with some semantic annotation saying "this is sensitive" (e.g. in my company we have a taxonomy of "purpose policy" annotations for data elements). If I deploy this library as a user, I should be able to decide if I actually want "sensitive" to be redacted or not: I may be debugging locally and need to see all data, or maybe in production I have different pathways where the data flows, some of which require redaction and others raw data.

And to add these annotations it's better to go with the schemas approach, without changing the API.

Member

> And to add these annotations it's better to go with the schemas approach, without changing the API.

+1

Member Author

I think we are talking about the same thing, @yurishkuro: Indeed a library author cannot make redaction decisions; this is the responsibility of the application operator. But the library author needs capabilities to express that some data may be sensitive and to provide redaction options similar to what we want to do in the semantic conventions. Here is an example: A library author writes a method that calls a payment provider HTTP endpoint, which has a unique query parameter to carry the token (let's say acmeToken). That library author now wants to say "when I call this endpoint, and a HTTP span is generated, make sure that url.query is redacted in a way that acmeToken is treated like other potentially sensitive attributes" (see #971). Ideally they can do something similar to what is proposed for the semantic conventions, e.g. a way to write down what to do when DEFAULT, STRICT or STRICTER (or any other profile) is selected. Then, the end user of that library and of OpenTelemetry can make the redaction decision.

Member

@svrnm yes, this is a pretty complex use case, but I would still err on the side of a centralized solution over federating these responsibilities to the call sites. It's conceivable that the semantic convention / schema for url.query could be extended with conditional clauses like "if I am in this instrumentation scope or the URL contains this domain name, these are the additional rules / annotations for the query elements". And I think this approach is going to be necessary in many cases anyway, because the specific HTTP instrumentation may not have any clue about this business use case, e.g. it could be a generic HTTP client interceptor that generates client spans and records URL, and only user code would have the theoretical capability of understanding the sensitivity of the query params.

Member Author

> It's conceivable that the semantic convention / schema for url.query could be extended with conditional clauses like "if I am in this instrumentation scope or the URL contains this domain name, these are the additional rules / annotations for the query elements".

That's what is part of this proposal, I need to optimize the wording for that, but eventually the configuration for redaction needs to be that fine-grained.

> but I would still err on the side of a centralized solution over federating these responsibilities to the call sites.

Yes, centralized is the core of it, but library/application authors still need a way to provide a hint that a redaction needs to be applied. Rethinking my example, we actually have the following situation:

A library author is likely using an HTTP client library as a dependency themselves to interact with a 3rd party API, so in some pseudo code they might have something like the following:

function authenticateWithMyService($token) {
  // they want to let OpenTelemetry know that $token is sensitive, and when
  // it is injected into the `url.query` attribute certain redaction may be required.
  use3rdPartyHttpClient("https://myservice//?myAcmeToken=$token")
}

How would they implement that? They need some ways to annotate that. Also those annotations need to be composable somehow.

```

Or, if the `setAttribute` method cannot be extended:

```
span.setAttribute("url.query", url.toString());
span.redactAttribute("url.query", <SENSITIVITY_DETAILS>);
```

The content of `<SENSITIVITY_DETAILS>` may look like the following example:

```
{
<REDACTION_LEVEL>: <REDACTION_RULE>
}
```

The key `<REDACTION_LEVEL>` is one of multiple available pre-defined levels of redaction requirements that an end-user may choose, e.g.

- `DEFAULT`: the redaction that is applied to this value by default
- `STRICT`: a basic redaction that is applied to this value
- `STRICTER`: an advanced redaction that is applied to this value
- …

The value `<REDACTION_RULE>` can be one of the following:

- A regular expression (or sed expression?) which, when applied, turns parts of a string into their redacted version, e.g. `s/([0-9]+\.[0-9]+\.[0-9]+\.)[0-9]+/\10/` applied to an IP address replaces the last octet with 0: `1.2.3.4` becomes `1.2.3.0`
- A constant that represents pre-defined redaction methods (see below for Redaction helpers), e.g. `REDACT_INSECURE_URL_PARAM_VALUES` will apply what is required for [#971](https://github.com/open-telemetry/semantic-conventions/pull/971)
- A callback that will call a function on that value and apply advanced redaction

The recommendation should be to use either the expression or the constant, and to resort to a callback function only in rare circumstances.
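As a minimal sketch of how the three rule kinds could be dispatched, assuming Python and assuming that expressions use the sed `s/pattern/replacement/` form without escaped slashes or flags: the names `apply_rule`, `apply_sed` and `PREDEFINED` are hypothetical and not part of any OpenTelemetry API.

```python
import re
from typing import Callable, Union

Rule = Union[str, Callable[[str], str]]

# Hypothetical registry of predefined redaction methods, keyed by constant name.
PREDEFINED = {
    "DROP": lambda value: "REDACTED",
    "NONE": lambda value: value,
}


def apply_sed(expression: str, value: str) -> str:
    """Apply a sed-style 's/pattern/replacement/' expression to value.

    Assumption: no escaped slashes and no flags; anything else raises ValueError.
    """
    _, pattern, replacement, _ = expression.split("/")
    # sed back-references are a single digit (\1..\9). Rewrite them as \g<1>
    # so that "\10" means "group 1 followed by a literal 0" in Python as well.
    replacement = re.sub(r"\\(\d)", r"\\g<\1>", replacement)
    return re.sub(pattern, replacement, value)


def apply_rule(rule: Rule, value: str) -> str:
    """Dispatch on the rule kind: callback, sed expression, or named constant."""
    if callable(rule):
        return rule(value)
    if rule.startswith("s/"):
        return apply_sed(rule, value)
    return PREDEFINED[rule](value)


print(apply_rule(r"s/([0-9]+\.[0-9]+\.[0-9]+\.)[0-9]+/\10/", "1.2.3.4"))  # 1.2.3.0
```

Note that the back-reference `\10` has to be translated explicitly: Python's `re` module would otherwise read it as a reference to group 10.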

Note that those API calls are no-ops in the API and are implemented by the SDK (as is the case for other API methods), so these annotations introduce (almost?) no additional overhead.

Below you will find details on how the `<SENSITIVITY_DETAILS>` are included in the definitions of attributes in the Semantic Conventions. By pre-loading the details for known attributes in the SDK configuration a call to `span.setAttribute("url.query", url.toString());` can apply the redaction internally. An end user may append/overwrite those details.
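The idea of applying pre-loaded details inside `setAttribute` could be sketched as follows. This is an illustration in Python under stated assumptions: `RedactingSpan` and `PRELOADED_DETAILS` are hypothetical names, rules are plain callables, and falling back to dropping the value when the active profile has no rule is an assumption that the proposal does not specify.

```python
import re


def drop(value):
    return "REDACTED"


# Hypothetical details pre-loaded from the semantic conventions,
# keyed by attribute name and then by redaction level.
PRELOADED_DETAILS = {
    "client.address": {
        "DEFAULT": lambda v: v,
        "STRICT": lambda v: re.sub(r"\.[0-9]+$", ".0", v),
    },
}


class RedactingSpan:
    """Toy stand-in for an SDK span, not the real OpenTelemetry Span class."""

    def __init__(self, profile="DEFAULT"):
        self.profile = profile
        self.attributes = {}

    def set_attribute(self, key, value):
        levels = PRELOADED_DETAILS.get(key)
        if levels is not None:
            # Assumption: a level without a rule falls back to dropping the value.
            value = levels.get(self.profile, drop)(value)
        self.attributes[key] = value


span = RedactingSpan(profile="STRICT")
span.set_attribute("client.address", "1.2.3.4")  # stored as "1.2.3.0"
span.set_attribute("http.request.method", "GET")  # no details, stored unchanged
```

The instrumentation code calls `set_attribute` exactly as before; only the SDK-side lookup decides whether and how the value is redacted.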

### Semantic Conventions

With the definitions above, attributes in the semantic conventions can be annotated with `<SENSITIVITY_DETAILS>`, with the exception that no callback functions can be supplied as redaction rules.

Example:

| Attribute | ... existing columns ... | sensitivity details |
|-----------|--------------------------|---------------------|
| `url.query`| | Rationale: Some verbatim wording why this is the way it is below<br>Type: `mixed`<br>`DEFAULT`: `REDACT_INSECURE_URL_PARAM_VALUES`<br>`STRICT`: `REDACT_ALL_URL_VALUES`<br>`STRICTER`: `DROP` |
| `client.address`| | Rationale: some reasons why dropping octets may be required<br>Type: `maybe_pii`<br>`DEFAULT`: `NONE`<br>`STRICT`: `'s/([0-9]+\.[0-9]+\.[0-9]+\.)[0-9]+/\10/'`<br>`STRICTER`: `'s/([0-9]+\.[0-9]+\.)[0-9]+\.[0-9]+/\10.0/'` |
| `enduser.creditCardNumber`**[1]** | | Rationale: ...<br>Type: `always_pii`<br>`DEFAULT`: `EXTRACT_IIN`<br>`STRICT`: `DROP`|

**[1]**: _This is a made-up example for demonstration purposes; it’s not part of the current semantic conventions. It gives a more nuanced example, e.g. that extracting the Issuer Identification Number (IIN) might be an option over dropping the number completely. It also demonstrates the value of “type”, which can enable data lineage use cases._
Member

s/IIN/Issuer Identification Number (IIN)/

Member Author

applied


The `DROP` keyword means that the value is replaced with `REDACTED` (or set to null, …).
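To illustrate the difference between a partial redaction such as `EXTRACT_IIN` and `DROP`, here is a small Python sketch. The function names are hypothetical, and the IIN length of 6 digits is an assumption exposed as a parameter (newer card numbering uses 8 digits).

```python
def extract_iin(card_number: str, length: int = 6) -> str:
    """Keep only the leading IIN digits of a card number (assumed length: 6)."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return digits[:length]


def drop(value: str) -> str:
    """Replace the whole value with a fixed marker."""
    return "REDACTED"


print(extract_iin("4111 1111 1111 1111"))  # 411111
print(drop("4111 1111 1111 1111"))         # REDACTED
```

The partial redaction keeps the issuer information usable for analysis, while `DROP` removes the value entirely.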

It is the responsibility of the OpenTelemetry SDK to implement the sensitivity details provided by the semantic conventions. This means that an instrumentation library does not need to add `<SENSITIVITY_DETAILS>` when calling a `setAttribute` method, nor does it need to call the additional `redactAttribute` method outlined above. An instrumentation library may choose to apply additional redactions (leveraging the OpenTelemetry APIs or doing it before calling `setAttribute` in their own business logic).

### SDK Configuration of sensitivity requirements

The annotations and APIs outlined above will allow SDK users to provide their sensitivity requirements as configuration (here: an environment variable, but this can be encoded in future config files as well), e.g.

```
env OTEL_SENSITIVE_DATA_PROFILE="STRICT" ./myInstrumentedApp
Member

Would this apply to the SDK globally or can be applied on specific pipelines/processors/exporters? e.g. if the goal is to send audit logs (which have EUII such as email address and IP address) to destination A without performing any redaction, and send normal logs to destination B with redaction.

Member Author

The suggestion here was globally, but I see the point you are making for having it split out by exporter.

Member Author

I still need to add this into the document.

```

This call will make sure that the redactions applied follow the `STRICT` profile. If the variable is not set, `DEFAULT` will be used. Additionally, there are two profiles that cannot be used as levels in `<SENSITIVITY_DETAILS>`, i.e. their behavior cannot be reconfigured:
Member

Is the "not" on purpose?
It seems like at least ALWAYS could be useful, for example as a temporary mitigation when a leakage is discovered, and before it can be properly fixed.

Suggested change
This call will make sure that redactions applied follow the `STRICT` profile. If not set the `DEFAULT` will be used. Additionally there are 2 levels that can not be used in sensitivity details:
This call will make sure that redactions applied follow the `STRICT` profile. If not set the `DEFAULT` will be used. Additionally there are 2 levels that can be used in sensitivity details:

Member Author

I think what I wanted to express here is that ALWAYS and NEVER can not be reconfigured through <SENSITIVITY_DETAILS> as outlined above in the document. Because if allowed it would break expectations

Member

Yeah that statement was a bit confusing to me as well, maybe it can be rephrased to express what you said above?

Member Author

I still need to add this into the document.


- `NEVER`: No redaction is applied, SDKs may choose to log a warning that this is a risky choice
- `ALWAYS`: All values with sensitivity details will be dropped
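Profile selection, including the two fixed profiles, could be sketched like this in Python. `active_profile` and `redact` are hypothetical helper names; dropping the value when the selected level has no rule is an assumption, since the proposal does not define that case.

```python
import os


def active_profile(environ=os.environ) -> str:
    """Read the profile from the environment, defaulting to DEFAULT."""
    return environ.get("OTEL_SENSITIVE_DATA_PROFILE", "DEFAULT")


def redact(value, details, profile):
    """details maps a level name to a callable rule; None means 'no details'."""
    if profile == "NEVER" or details is None:
        return value  # no redaction at all, or nothing known about this value
    if profile == "ALWAYS":
        return "REDACTED"  # every value that has sensitivity details is dropped
    rule = details.get(profile)
    # Assumption: a level without a rule is treated like DROP.
    return rule(value) if rule is not None else "REDACTED"


details = {"DEFAULT": lambda v: v, "STRICT": lambda v: "user@" + v.split("@")[1]}
print(redact("jane@example.com", details, active_profile({})))  # jane@example.com
print(redact("jane@example.com", details, "STRICT"))            # user@example.com
print(redact("jane@example.com", details, "ALWAYS"))            # REDACTED
```

Because `NEVER` and `ALWAYS` are handled before the details are consulted, an entry in the details cannot change what they do.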

Additionally, SDK end users can provide advanced configuration (through code or a configuration file, probably not an environment variable) to address specific needs:
Contributor

I like the direction this proposal is going toward, however, I'd like to challenge the notion that redaction is treated as an instrumentation issue, as opposed to a configuration issue. I wonder if it would be possible to treat redaction as a configuration issue, like we do sampling.

Reading through this document and arriving at this part, I was thinking about a solution that relies solely on SDK components (without any API changes needed), by providing SDK redactors and configuring them via "redaction profiles" (which consists of mappings from attribute names to redaction rules, similar to the example below). SDKs could provide some default redaction profiles (for example for redacting strict, stricter, or never), which are based on definitions in semantic conventions.

The main challenges of this approach would be considering multiple versions of semantic conventions, furthermore it wouldn't be that straightforward for authors of instrumentation libraries to add redaction rules.

However, it would give the benefit to have redaction rules defined at a central place. Given that the examples above (different regular expressions for different redaction levels), it might be error prone to implement those consistently across all usages of an attribute in different instrumentation libraries in different languages.

Member Author

> I wonder if it would be possible to treat redaction as a configuration issue, like we do sampling.

For 90% of the cases, I would say the answer to this is yes, but especially library authors need ways to express additional redaction requirements very locally, e.g. if they call a 3rd party service via a HTTP GET and they know that in this one specific case they need to apply additional redaction. Without a functionality for that they can either over-configure (e.g. apply it globally but potentially have other calls redacted without the need for that) or do their own logic on top of what we provide them. Both things are OK and we might decide against having the API for that (in a first version, or forever), but then we need to indicate that somehow in the specification.

Member Author

See the updated version, I moved towards leading with the configuration centric approach and mentioned API capabilities towards the end


```
{
<ATTRIBUTE_KEY> => <SENSITIVITY_DETAILS>
}
```

e.g.

```
{
"url.query" => { DEFAULT: REDACT_ALL_URL_VALUES },
"com.example.customer.name" => { DEFAULT: x => { /* do something in this callback */ } },
"com.example.customer.email" => { DEFAULT: 's/^[^@]*@/REDACTED@/' },
}
```
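How such end-user configuration might be layered over the defaults from the semantic conventions can be sketched as below. This Python example assumes a per-level merge (a user entry replaces only the levels it names); `merge_details`, `SEMCONV_DEFAULTS` and `USER_CONFIG` are hypothetical names, and the proposal does not say whether overrides apply per level or per attribute.

```python
import re

# Hypothetical defaults derived from the semantic conventions.
SEMCONV_DEFAULTS = {
    "url.query": {
        "DEFAULT": "REDACT_INSECURE_URL_PARAM_VALUES",
        "STRICT": "REDACT_ALL_URL_VALUES",
        "STRICTER": "DROP",
    },
}


def redact_email(value: str) -> str:
    # Python equivalent of the sed expression s/^[^@]*@/REDACTED@/
    return re.sub(r"^[^@]*@", "REDACTED@", value)


# End-user configuration: one overwritten entry and one appended entry.
USER_CONFIG = {
    "url.query": {"DEFAULT": "REDACT_ALL_URL_VALUES"},
    "com.example.customer.email": {"DEFAULT": redact_email},
}


def merge_details(defaults, overrides):
    """Return a new mapping; user levels win, untouched levels are kept."""
    merged = {key: dict(levels) for key, levels in defaults.items()}
    for key, levels in overrides.items():
        merged.setdefault(key, {}).update(levels)
    return merged


merged = merge_details(SEMCONV_DEFAULTS, USER_CONFIG)
print(merged["url.query"]["DEFAULT"])   # REDACT_ALL_URL_VALUES
print(merged["url.query"]["STRICTER"])  # DROP
print(merged["com.example.customer.email"]["DEFAULT"]("jane@example.com"))
```

The last line prints `REDACTED@example.com`; the defaults themselves are left unmodified by the merge.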

### Redactor

To accomplish redaction the SDK needs a component (similar to a sampler) that inspects attributes when they are set and applies the required redactions:

- `Redactor::setup(profile)` sets up the redactor with the given profile (this may be a constructor, depending on the language)
- `Redactor::redact(value, <SENSITIVITY_DETAILS>)` returns the redacted value using the provided profile

The redactor will also have all the methods to apply predefined redactions (`REDACT_ALL_URL_VALUES`, `REDACT_INSECURE_URL_PARAM_VALUES`, `DROP`, etc.). If a method is not implemented (either by the SDK or by the end-user choosing one that does not exist), it will default to applying `DROP` to avoid leakage of any sensitive data.
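A redactor along these lines might look like the following Python sketch. It is not an SDK implementation: the class layout is hypothetical, the constructor plays the role of `setup(profile)`, only two predefined methods are included, and treating a missing level like `DROP` is an assumption.

```python
from urllib.parse import parse_qsl, urlencode


class Redactor:
    """Toy redactor; the constructor takes the place of setup(profile)."""

    def __init__(self, profile: str = "DEFAULT"):
        self.profile = profile
        # Predefined redaction methods, looked up by constant name.
        self.methods = {
            "DROP": self.drop,
            "REDACT_ALL_URL_VALUES": self.redact_all_url_values,
        }

    @staticmethod
    def drop(value: str) -> str:
        return "REDACTED"

    @staticmethod
    def redact_all_url_values(query: str) -> str:
        """Keep the parameter names of a query string, redact every value."""
        pairs = parse_qsl(query, keep_blank_values=True)
        return urlencode([(name, "REDACTED") for name, _ in pairs])

    def redact(self, value: str, details: dict) -> str:
        # Assumption: a level that is missing from the details is treated as DROP.
        name = details.get(self.profile, "DROP")
        # Unknown or unimplemented method names fall back to DROP.
        method = self.methods.get(name, self.drop)
        return method(value)


redactor = Redactor("STRICT")
print(redactor.redact("a=1&token=abc", {"STRICT": "REDACT_ALL_URL_VALUES"}))
# a=REDACTED&token=REDACTED
print(redactor.redact("a=1&token=abc", {"STRICT": "NOT_IMPLEMENTED"}))
# REDACTED
```

Falling back to `drop` for unknown names means a typo in the configuration loses data rather than leaking it.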

## Internal details

**tbd**

## Additional context

Treating sensitive data properly is a very complex multi-dimensional topic. Below you will find some additional context on that subject.

### Types of sensitive data

There are different kinds of “sensitivity” that may apply to data. The ones most relevant in this proposal are “security” and “privacy”. They may overlap but we distinguish them as follows:

- security-relevant sensitive data: any information that when exposed [weakens the overall security of the device/system](https://en.wikipedia.org/wiki/Vulnerability_(computing)).
- privacy-relevant sensitive data: any information that when exposed [can adversely affect the privacy or welfare of an individual](https://en.wikipedia.org/wiki/Information_sensitivity).

Note that there are other kinds of sensitive data (business information, classified information), which are not covered extensively by this proposal.

### Level of Sensitivity

The level of sensitivity of a piece of information can also differ, and that sensitivity can be contextual, e.g.

- The password of a user without privileges is less sensitive than the password of an administrator
- The "client IP" in a server-to-server communication is less sensitive than the "client IP" in a client-to-server communication, where the client can be linked to a human.
- API tokens of a demo system are less sensitive than API tokens for a production system
- The license plate of an individual’s car is less sensitive than their social security number
- The full name of a user in a social network is less sensitive than the full name of a user in a medical research database

Depending on the sensitivity of the data, an end-user of an observability system may weigh up whether collecting this data is worth it.

### Regulatory and other requirements

Due to the negative effects that the exposure of sensitive data can have (see above in "Types of sensitive data"), different entities have created regulations for the collection of sensitive data, among them:

- GDPR
- CPRA
- PIPEDA
- HIPAA
- [more…](https://en.wikipedia.org/wiki/Information_privacy)

Additionally, the entities operating applications that leverage OpenTelemetry may have their own requirements for treating certain sensitive data.

Finally, end-users may want to apply recommendations for [Data Minimization](https://en.wikipedia.org/wiki/Data_minimization) to avoid "unnecessary risks for the data subject".

**Note 1**: It is not (and cannot be) the responsibility of the OpenTelemetry project to provide compliance with any of those regulations; this is the responsibility of the OTel end-user. OTel can only facilitate meeting parts of those requirements.

**Note 2**: Those requirements are subject to change and outside of the control of the OpenTelemetry community.

## Trade-offs and mitigations

### Performance Impact

Adding an extra layer of processing every time an attribute value is set might have an impact on performance. There might be ways to reduce that overhead, e.g. by only redacting values that are finalized and ready to be exported, so that values that are later overwritten and data that is dropped by sampling do not need to be handled.
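The deferred approach can be illustrated with a small Python sketch. It is not tied to any OpenTelemetry SDK: `DeferredSpan`, `mask`, the `sampled` flag and the `redaction_calls` counter are hypothetical names used only to show where the redaction work would happen.

```python
from typing import Callable, Dict, Optional


def mask(value: str) -> str:
    """Hypothetical redaction function: replaces any value with a placeholder."""
    return "REDACTED"


class DeferredSpan:
    """Stores raw attribute values and redacts them once, at export time."""

    def __init__(self, redact: Callable[[str], str], sampled: bool) -> None:
        self._redact = redact
        self._sampled = sampled
        self._raw: Dict[str, str] = {}
        self.redaction_calls = 0  # counts how often the redactor actually ran

    def set_attribute(self, key: str, value: str) -> None:
        # No redaction here: an overwritten value never reaches the redactor.
        self._raw[key] = value

    def export(self) -> Optional[Dict[str, str]]:
        # Data that is not sampled is never exported, so it is never redacted.
        if not self._sampled:
            return None
        redacted: Dict[str, str] = {}
        for key, value in self._raw.items():
            self.redaction_calls += 1
            redacted[key] = self._redact(value)
        return redacted


span = DeferredSpan(mask, sampled=True)
span.set_attribute("url.full", "https://example.com/?token=a")
span.set_attribute("url.full", "https://example.com/?token=b")  # overwrites
span.set_attribute("user.id", "42")
exported = span.export()
```

In this example three values are set but the redactor runs only twice, once per final attribute, and not at all for a span that is not sampled. The cost of this design is that raw values stay in process memory until export, which an implementation would need to weigh against the saved processing.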

## Prior art and alternatives

### OTEPS

- [OTEP 100 - Sensitive Data Handling](https://github.com/open-telemetry/oteps/pull/100)
- [OTEP 187 - Data Classification for resources and attributes](https://github.com/open-telemetry/oteps/pull/187)

Member

I think we'll eventually need to have something similar to this as well: semconv attributes would need a marker, about the common sensitivity of the attribute's value.

Member Author

I need to review those 2 PRs in detail and incorporate whatever we want to include here as well. I added a little bit of wording on classification (see my credit card example)


### Spec & SemConv Issues

This problem has been discussed multiple times. Here is a list of existing issues and pull requests in the OpenTelemetry repositories:

- [Http client and server span default collection behavior for url.full and url.query attributes](https://github.com/open-telemetry/semantic-conventions/issues/860)
- [URL query string values should be redacted by default](https://github.com/open-telemetry/semantic-conventions/pull/961)
- [Specific URL query string values should be redacted](https://github.com/open-telemetry/semantic-conventions/pull/971)
- [Allow url.path sanitization](https://github.com/open-telemetry/semantic-conventions/pull/676)
- [Guidance requested: static SQL queries may contain sensitive values](https://github.com/open-telemetry/semantic-conventions/issues/436)
- [Semantic conventions vs GDPR](https://github.com/open-telemetry/semantic-conventions/issues/128)
- [Guidelines for redacting sensitive information](https://github.com/open-telemetry/semantic-conventions/issues/877)
- [DB sanitization uniform format](https://github.com/open-telemetry/semantic-conventions/issues/717)
- [Add db.statement sanitization/masking examples](https://github.com/open-telemetry/semantic-conventions/issues/708)
- [TC Feedback Request: View attribute filter definition in Go](https://github.com/open-telemetry/opentelemetry-specification/issues/3664)

### SemConv Pages

The semantic conventions already contain notes around treating sensitive data (search for "sensitive" on the linked pages if not stated otherwise):

- [gRPC SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/grpc.md)
- [Sensitive Information in URL SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/url/url.md#sensitive-information)
- [GraphQL Spans SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/graphql/graphql-spans.md)
- [Container Resource SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/resource/container.md)
- [Database Spans SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/database-spans.md)
- [HTTP Spans SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md)
- [General Attributes SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/attributes.md)
- [LLM Spans SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/llm-spans.md)
- [ElasticSearch SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/elasticsearch.md)
- [Redis SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/database/redis.md)
- [Connect RPC SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/rpc/connect-rpc.md)
- [Device SemConv](https://github.com/open-telemetry/semantic-conventions/blob/main/docs/resource/device.md) (search for GDPR)

### Existing Solutions within OpenTelemetry

The following solutions for OpenTelemetry already exist:

- [MrAlias/redact](https://github.com/MrAlias/redact) for OpenTelemetry Go
- Collector processors, including:
  - [Redaction Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/redactionprocessor)
  - [Transform Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor)
  - [Filter Processor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor)

### Alternative 0: Do nothing

It’s always good to analyze this option. If we do nothing, end-users will still need to satisfy their requirements for treating sensitive data:

- Instrumentation library authors are required to manage redaction before using the OpenTelemetry API.
- Application developers will do the same or are forced to pair the instrumented application with a collector for redaction/filtering.
- Third-party solutions will be implemented.

### Alternative 1: OpenTelemetry Collector

As listed above, there are multiple processors available that can be used to redact or filter sensitive data with the OpenTelemetry Collector. The challenge is that the application (owner) cannot know whether data is processed in the collector as expected. Also, the data leaving the application might already be a risk (non-encrypted or compromised network) or may not be allowed (the collector is hosted in a different country, which may conflict with a regulation).

Ideally, a combination of redaction in the application and processing in the collector is used.

### Alternative 2: Backend

The backend consuming the OpenTelemetry data can provide processing for filtering and redaction as well. The same objections as for the collector apply.

### Existing Solutions outside OpenTelemetry

There are many solutions outside OpenTelemetry that help to filter or redact sensitive data based on security and privacy requirements:

- [sanitize_field_name in Elastic Java](https://www.elastic.co/guide/en/apm/agent/java/1.x/config-core.html#config-sanitize-field-names)
- [Filter sensitive data in AppDynamics Java Agent](https://docs.appdynamics.com/appd/24.x/24.4/en/application-monitoring/install-app-server-agents/java-agent/administer-the-java-agent/filter-sensitive-data)
- [GA4 data redaction](https://support.google.com/analytics/answer/13544947?sjid=3336918779004544977-EU)
- [Configure Privacy Settings in Matomo](https://matomo.org/faq/general/configure-privacy-settings-in-matomo/)
- [Datadog Sensitive Data Redaction](https://docs.datadoghq.com/observability_pipelines/sensitive_data_redaction/)

## Open questions

- **Question 1**: Should sensitivity details for an attribute in the semantic conventions be excluded from the stability guarantees? This would mean that updating them for a **stable** attribute is not a breaking change. The idea behind excluding them from the stability guarantees is that the requirements are subject to change due to changes in technology (see [#971](https://github.com/open-telemetry/semantic-conventions/pull/971), the list of query string values will evolve over time), changes in regulatory requirements, or both.

Member

I think yes.

Member

It feels like there should still be some stability definition.
I wouldn't want to see something that was redacted suddenly not be.

Member Author

That's a good point, maybe "adding redactions" may be outside, but "removing redactions" may require additional guard rails. It is probably unlikely that redactions will be removed, the only case I can think of is that something is deprecated for a very long time

Member Author

I will leave the question open, I can add some words on adding/removing redaction


## Future possibilities

Attributes are most likely to carry sensitive information, but, as stated in the overview section of the explanation, other user-set properties may carry sensitive information as well. In a later iteration, we might want to review them too.

The proposal puts the configuration of sensitivity requirements into the hands of the person operating an application. In a future iteration, we can look into enabling end-users of instrumented applications to give their _consent_ to which data related to them is tracked and how, see [Do Not Track](https://en.wikipedia.org/wiki/Do_Not_Track), [Global Privacy Control](https://privacycg.github.io/gpc-spec/) and the requirements of certain local regulations.