Add in concept of "concentration" for disclosure control #250

Open
tombisho opened this issue Nov 25, 2021 · 3 comments

@tombisho
Contributor

tombisho commented Nov 25, 2021

DataSHIELD doesn't currently appear to have the concept of concentration as one of the disclosure controls.

The idea is to limit the proportion of a statistic that can be contributed by a single value from the set of values being summarised. In simple terms, if we have the numbers 0.1, 0.2, 0.3, 0.5, 4e6, 0.6, 0.5, then we should block the mean of these because one value dominates and the result is disclosive. At the moment, this passes the standard nfilter.tab test.

The limit could be set so that no single value contributes more than 0.9 of the statistic.
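
A minimal sketch of such a check in plain R, using the numbers from the example above. The helper name `concentration_exceeded` and its `limit` argument are hypothetical, not existing DataSHIELD functions or settings. Taking absolute values, so that negative numbers cannot cancel the total, is also an assumption.

```r
# Hypothetical helper, not part of DataSHIELD: returns TRUE when a single
# value supplies more than `limit` of the total of `x`.
concentration_exceeded <- function(x, limit = 0.9) {
  x <- x[!is.na(x)]
  total <- sum(abs(x))   # assumption: use absolute values for the shares
  if (length(x) == 0 || total == 0) {
    return(FALSE)        # assumption: an empty or all-zero vector is not flagged
  }
  max(abs(x)) / total > limit
}

values <- c(0.1, 0.2, 0.3, 0.5, 4e6, 0.6, 0.5)
concentration_exceeded(values)            # TRUE: 4e6 dominates the total
concentration_exceeded(c(1, 2, 3, 4, 5))  # FALSE: largest share is 5/15
```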

The first functions where this will be implemented are ds.mean() and similar. One attack is to create a vector of all 0s except for a single 1, multiply it with the column of interest and take the mean. Knowing the length of the vector allows that one value to be recreated (the mean multiplied by the length). Moving the 1 allows all values to be recreated. This change will stop this attack.
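
The attack can be simulated locally in plain R. This is only an illustration on a made-up vector: `secret` stands in for the server-side column, and base `mean()` stands in for the aggregate a client would receive, so no DataSHIELD client calls are involved.

```r
# Made-up data standing in for a column held on the server.
secret <- c(12.5, 40, 7.25, 19)
n <- length(secret)

# For each position, build a 0/1 mask, take the mean of the masked column,
# and multiply by the known length to recover the original value.
recovered <- vapply(seq_len(n), function(i) {
  mask <- numeric(n)
  mask[i] <- 1
  mean(secret * mask) * n
}, numeric(1))

all.equal(recovered, secret)

# In every masked vector a single value supplies the whole total,
# so a 0.9 concentration limit would block each of these means.
masked <- secret * c(1, 0, 0, 0)
max(abs(masked)) / sum(abs(masked))
```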

This control will not help with other differencing attacks (as per Stefan's work).

@tombisho tombisho added this to the v6.3 milestone Nov 25, 2021
@tombisho tombisho self-assigned this Nov 25, 2021
@tombisho
Contributor Author

tombisho commented Dec 2, 2021

This solution will also not help with the trick of repeating a value several times. That is, perform the steps detailed above, but copy the vector 5 times and rbind the copies together. The concentration trap will no longer work because 5 values now contribute to the mean, each supplying only 1/5 of the total. The total is 5 times the target value, so dividing by 5 will yield the answer as before.
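
A local sketch of this bypass on made-up data. Here `rep()` on a plain vector stands in for rbind-ing copies of a data frame, and the 0.9 limit is the one suggested earlier in this issue.

```r
# Made-up data standing in for a column held on the server.
secret <- c(12.5, 40, 7.25, 19)
n <- length(secret)

# Mask out everything except the third value, then stack 5 copies.
mask <- numeric(n)
mask[3] <- 1
stacked <- rep(secret * mask, times = 5)

# Each copy of the target value now supplies only 1/5 of the total,
# so a 0.9 concentration limit does not trigger.
largest_share <- max(abs(stacked)) / sum(abs(stacked))
largest_share > 0.9

# The value is still recoverable: mean * length gives 5 times the value.
recovered <- mean(stacked) * length(stacked) / 5
all.equal(recovered, secret[3])
```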

A proposed solution for this will be opened in a separate issue.

@tombisho
Contributor Author

tombisho commented Dec 2, 2021

For meanSdGpDS() we need something like:

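# Share of each group's total contributed by each value in that group
ans <- lapply(X = split(X, group), FUN = function(x) { x / sum(x) })
# TRUE if any single value exceeds the 0.9 concentration limit
any(unlist(ans) > 0.9)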
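
As a rough illustration, the grouped check can be wrapped in a function and run on made-up data. The name `group_concentration_exceeded` is hypothetical. The sketch assumes non-missing, non-negative values and a non-zero total in every group, since a zero total would produce NaN shares.

```r
# Hypothetical wrapper around the grouped check; not an existing DataSHIELD function.
group_concentration_exceeded <- function(x, group, limit = 0.9) {
  # Share of each group's total contributed by each value in that group
  shares <- lapply(split(x, group), function(v) v / sum(v))
  any(unlist(shares) > limit)
}

group <- c("a", "a", "a", "b", "b", "b")

# In group "b" the value 100 supplies 100/103 of the total, above the limit.
group_concentration_exceeded(c(1, 2, 3, 100, 1, 2), group)  # TRUE

# No single value dominates either group here.
group_concentration_exceeded(c(1, 2, 3, 4, 5, 6), group)    # FALSE
```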

@tombisho
Contributor Author

tombisho commented Dec 6, 2021

To do list of functions:

  • meanDS
  • meanSdGpDS
  • kurtosisDS1
  • isValidDS
  • table1DDS
  • table2DDS
  • tableDS
  • varDS
