
Privacy Leak via Data-Dependent Domain Inference #18

@tudorcebere

Description

In `dpmm/src/dpmm/models/base/mechanisms/mechanism.py`, the mechanism's domain is inferred automatically by taking the maximum value of each column of the raw input dataframe.
In a Differential Privacy (DP) context, every operation performed on the raw data must be accounted for in the privacy budget ($\epsilon, \delta$). Computing an exact `max()` is a non-private operation that leaks information about the tail of the dataset's distribution.

Location
File: dpmm/src/dpmm/models/base/mechanisms/mechanism.py
Line: 113

_domain = (df.astype(int).max(axis=0) + 1).to_dict()

By calling `df.max(axis=0)` directly on the private dataframe, the exact maximum of each sensitive attribute is revealed without any noise and without charging the privacy budget.
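To make the leak concrete, here is a minimal reproduction of the inference step (the dataframe and column name are illustrative, not from the repo). A single outlier record fully determines the inferred domain, so anyone who can observe the domain learns that record's value exactly:

```python
import pandas as pd

# Hypothetical sensitive column: one extreme record (999) stands out.
df = pd.DataFrame({"salary": [30, 42, 55, 999]})

# The line from mechanism.py (line 113), applied to the raw data:
domain = (df.astype(int).max(axis=0) + 1).to_dict()

# domain == {"salary": 1000} -- the outlier is exposed with no noise
# and no privacy cost, which violates the DP accounting model.
print(domain)
```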

Suggested Patch
The domain should be treated as a hyperparameter or computed as a private statistic. Consider one of the following approaches:

  1. User-Provided Bounds: Require the user to pass a domain or bounds argument derived from public knowledge or a data schema, so no statistic is computed on the raw data.
  2. Private Max Computation: Use a DP-compliant mechanism to find a noisy upper bound, and deduct its cost from the total privacy budget.
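As a starting point for option 2, here is a sketch of a DP upper-bound estimate using the exponential mechanism over candidate thresholds. All names (`dp_upper_bound`, the `lower`/`upper` candidate range) are hypothetical, not part of the dpmm API; the candidate range itself must come from public knowledge, since the mechanism is only well-defined over a data-independent set:

```python
import numpy as np

def dp_upper_bound(values, lower, upper, epsilon, rng=None):
    """Select a noisy upper bound for `values` with the exponential mechanism.

    Each integer candidate t in [lower, upper] gets utility
    u(t) = -|{x in values : x > t}|, i.e. thresholds exceeded by few
    points score higher. Adding or removing one record changes u by at
    most 1 (sensitivity 1), so sampling t with probability proportional
    to exp(epsilon * u(t) / 2) satisfies epsilon-DP; `epsilon` must then
    be deducted from the total budget.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    candidates = np.arange(lower, upper + 1)  # assumes integer-valued data
    utility = -np.array([(values > t).sum() for t in candidates])
    # Shift by the max utility for numerical stability before exponentiating.
    weights = np.exp(epsilon * (utility - utility.max()) / 2.0)
    probs = weights / weights.sum()
    return int(rng.choice(candidates, p=probs))

# Usage sketch: a noisy bound for a sensitive column, charged to the budget.
bound = dp_upper_bound([30, 42, 55, 999], lower=0, upper=2000, epsilon=1.0)
```

Note that all candidates above the true max share the top utility, so the estimate biases toward `upper`; a production version would use a proper DP quantile mechanism, but the accounting structure is the same. Option 1 needs no mechanism at all: the constructor would simply accept `domain` as a required argument and skip the `df.max()` call entirely.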
