Description
I'd like to propose that ECS adds guidance for anonymization and pseudonymization. Some thoughts:
Definitions
- anonymization: Irreversible data obfuscation.
- pseudonymization: Reversible data obfuscation.
PII model
The NIST SP 800-122 publication on PII identifies impact levels for personally identifiable information:
- High (4): disclosure has severe/catastrophic effects
- Medium (3): disclosure has serious adverse effects
- Low (2): disclosure has limited adverse effects
- Public (1): not PII; covers non-personal data
Typically, if one is allowed to see PII level X, one can also see PII levels < X (Air Force One uses the same method: walk freely towards the rear, but never walk forward of your own seat). We could also imagine putting pii_<level> as a prefix or postfix in field names to easily manage Field Level Security, because it supports access based on wildcards (*).
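As a rough illustration of the rule above, the following Python sketch turns a clearance level into the wildcard patterns a role would be granted and then filters a list of field names with them. The `.pii_<level>` suffix convention, the function names, and the sample field names are assumptions made for this example; they are not part of ECS or of any Elasticsearch API.

```python
from fnmatch import fnmatchcase

# Levels from the PII model above: 1 = Public ... 4 = High.
PII_LEVELS = (1, 2, 3, 4)


def allowed_patterns(clearance: int) -> list[str]:
    """Return one wildcard pattern per PII level at or below the clearance.

    Assumes the hypothetical convention that field names end in `.pii_<level>`.
    """
    if clearance not in PII_LEVELS:
        raise ValueError(f"unknown PII level: {clearance}")
    return [f"*.pii_{level}" for level in PII_LEVELS if level <= clearance]


def visible_fields(fields: list[str], clearance: int) -> list[str]:
    """Keep only the fields that match one of the allowed patterns."""
    patterns = allowed_patterns(clearance)
    return [f for f in fields if any(fnmatchcase(f, p) for p in patterns)]


# Hypothetical field names, for illustration only.
fields = [
    "customer.postalcode.pii_4",
    "customer.postalcode.pii_2",
    "customer.postalcode.pii_1",
]
print(visible_fields(fields, 2))
```

A list of patterns like the one returned by `allowed_patterns` is the kind of value that could be used as the granted fields in a role definition, which is what makes a level-based naming convention attractive.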
Varying levels of obfuscation
We should also recognize that various versions of the same field can (and should) exist in harmony. Perhaps the Dutch postal code system is a good example:
postalcode: 1234AB
The system is set up so that each additional character to the right adds more precision to the location.
Perhaps in Elasticsearch this becomes:
customer.postalcode.raw: 1234AB
customer.postalcode.city: 12
customer.postalcode.obfuscated: E32DB25A9BAAA6AF655FE65A861C9BD35AF1868229E0E9D738236B4500626AFB
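A minimal sketch of how the three variants might be derived at ingest time follows. The proposal does not say how the obfuscated value was produced, so the keyed SHA-256 (HMAC) used here is an assumption, as are the function name `postalcode_variants` and the `key` parameter. The two-character prefix for the city field follows the example above.

```python
import hashlib
import hmac


def postalcode_variants(postalcode: str, key: bytes) -> dict[str, str]:
    """Derive raw, truncated, and hashed variants of a Dutch postal code.

    `key` is a secret used for the keyed hash (an assumption of this sketch).
    """
    # Normalize so that "1234 ab" and "1234AB" map to the same values.
    raw = postalcode.replace(" ", "").upper()
    digest = hmac.new(key, raw.encode("utf-8"), hashlib.sha256).hexdigest()
    return {
        "customer.postalcode.raw": raw,
        # Keep only the leading characters: less precision, less identifying.
        "customer.postalcode.city": raw[:2],
        # Same input and key always give the same digest, so values can still be bucketed.
        "customer.postalcode.obfuscated": digest.upper(),
    }


variants = postalcode_variants("1234 ab", key=b"example-secret")
print(variants["customer.postalcode.raw"], variants["customer.postalcode.city"])
```

A key is used because the set of possible postal codes is small enough that an unkeyed hash could be reversed by hashing every candidate. Whoever holds the key can still recompute the mapping.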
Or, implementing PII:
customer.postalcode.pii4: 1234AB <-- perhaps enough to identify the customer
customer.postalcode.pii2: 12 <-- not enough to identify the customer
customer.postalcode.pii1: E32DB25A9BAAA6AF655FE65A861C9BD35AF1868229E0E9D738236B4500626AFB <-- not enough to identify the customer, but based on PII 4 data, hence we can bucket customers of the same street without knowing which street it is.
The above would allow various users to access the postal code at a level appropriate for their usage. Business Analytics, for example, might be limited to data below PII level 3 because of laws on personal data such as GDPR.