
Privacy Leak via Data-Dependent Domain Inference #18

@tudorcebere

Description

In `dpmm/src/dpmm/models/base/mechanisms/mechanism.py`, the mechanism's domain is inferred automatically by taking the maximum value of each column of the raw input dataframe.
In a Differential Privacy (DP) context, every operation performed on the raw data must be accounted for in the privacy budget ($\epsilon, \delta$). Computing an exact `max()` is a non-private operation that leaks information about the tail of the dataset's distribution.

Location
File: dpmm/src/dpmm/models/base/mechanisms/mechanism.py
Line: 113

_domain = (df.astype(int).max(axis=0) + 1).to_dict()

By calling `df.max(axis=0)` directly on the private dataframe, the exact maximum of each sensitive attribute is revealed without any noise and without charging the privacy budget.
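To make the leak concrete, here is a minimal reproduction of the inference step (the dataframe and column name are illustrative, not from the repo). A single outlier record fully determines the inferred domain, so anyone who can observe the domain learns that record's value exactly:

```python
import pandas as pd

# Hypothetical sensitive column: one extreme record (999) stands out.
df = pd.DataFrame({"salary": [30, 42, 55, 999]})

# The line from mechanism.py (line 113), applied to the raw data:
domain = (df.astype(int).max(axis=0) + 1).to_dict()

# domain == {"salary": 1000} -- the outlier is exposed with no noise
# and no privacy cost, which violates the DP accounting model.
print(domain)
```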

Suggested Patch
The domain should be treated as a hyperparameter or computed as a private statistic. Consider one of the following approaches:

  1. User-Provided Bounds: Require the user to pass a domain or bounds argument derived from public knowledge or a data schema, so no statistic is computed on the raw data.
  2. Private Max Computation: Use a DP-compliant mechanism to find a noisy upper bound, and deduct its cost from the total privacy budget.
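As a starting point for option 2, here is a sketch of a DP upper-bound estimate using the exponential mechanism over candidate thresholds. All names (`dp_upper_bound`, the `lower`/`upper` candidate range) are hypothetical, not part of the dpmm API; the candidate range itself must come from public knowledge, since the mechanism is only well-defined over a data-independent set:

```python
import numpy as np

def dp_upper_bound(values, lower, upper, epsilon, rng=None):
    """Select a noisy upper bound for `values` with the exponential mechanism.

    Each integer candidate t in [lower, upper] gets utility
    u(t) = -|{x in values : x > t}|, i.e. thresholds exceeded by few
    points score higher. Adding or removing one record changes u by at
    most 1 (sensitivity 1), so sampling t with probability proportional
    to exp(epsilon * u(t) / 2) satisfies epsilon-DP; `epsilon` must then
    be deducted from the total budget.
    """
    rng = np.random.default_rng() if rng is None else rng
    values = np.asarray(values)
    candidates = np.arange(lower, upper + 1)  # assumes integer-valued data
    utility = -np.array([(values > t).sum() for t in candidates])
    # Shift by the max utility for numerical stability before exponentiating.
    weights = np.exp(epsilon * (utility - utility.max()) / 2.0)
    probs = weights / weights.sum()
    return int(rng.choice(candidates, p=probs))

# Usage sketch: a noisy bound for a sensitive column, charged to the budget.
bound = dp_upper_bound([30, 42, 55, 999], lower=0, upper=2000, epsilon=1.0)
```

Note that all candidates above the true max share the top utility, so the estimate biases toward `upper`; a production version would use a proper DP quantile mechanism, but the accounting structure is the same. Option 1 needs no mechanism at all: the constructor would simply accept `domain` as a required argument and skip the `df.max()` call entirely.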
