SmartNoise Eval Draft Release (#582)
* Cleanup

* updates

* Made some doc changes

* Some updates

* Updates

* More skeleton

* Updates

* Add Median test

* update metrics

* update basic metrics file

* update basic metrics file

* update init files

* update base and metric files

* update compare metrics

* update base file

* update metrics

* update basic metrics

* update basic metrics

* update base files

* update metrics

* add 'parameters' as a key to the out dict

* update metrics

* add Analyze

* update analyze and metrics

* update compare metrics and evaluate file

* update metrics

* update evaluate and analyze

* update analyze and evaluate

* update metrics

* no changes

* add pytest for MeanAbsErrorInCount and MeanPropErrorInCount

* add two synthetic datasets for testing

* add docstring for metrics

* save to csv

* update metric doc

* update basic metrics and the default computations

* pyproject

* Add code for packaging to PyPi and generating docs

* adding 2way computation by default and update BelowKCount metric

* adding 2way computation by default and update FabricatedCombinationCount

* add metric MeanError and update default 2-way computation

* fix a bug

* Changes

* Fixes

* Setup

* Update README.md

---------

Co-authored-by: Joshua <joshua-oss@users.noreply.github.com>
Co-authored-by: paxton-coder <xiaopeng.qu.li@hotmail.com>
3 people authored Nov 10, 2023
1 parent 0b92809 commit a1d5b47
Showing 87 changed files with 2,593 additions and 3,433 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -128,3 +128,6 @@ sdk/opendp/v1/
# datasets
PUMS_1000.csv
*.db

# parquet files
PUMS_large.parquet/
8 changes: 8 additions & 0 deletions docs/make_docs.sh
@@ -29,6 +29,14 @@ cp -R build/html /tmp/docs/en/stable/synth
make clean
cd ../..

cd eval
pip install -e .
cd docs
make html
cp -R build/html /tmp/docs/en/stable/eval
make clean
cd ../..

open /tmp/docs/index.html


2 changes: 2 additions & 0 deletions eval/.gitignore
@@ -0,0 +1,2 @@
poetry.lock
run.py
3 changes: 3 additions & 0 deletions eval/HISTORY.md
@@ -0,0 +1,3 @@
# SmartNoise Eval v0.3.0 Release Notes

* Initial Release
44 changes: 27 additions & 17 deletions eval/README.md
@@ -1,27 +1,37 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)
# SmartNoise Evaluator

<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>
The SmartNoise Evaluator is designed to help assess the privacy and accuracy of differentially private queries. It includes:

## SmartNoise Stochastic Evaluator
* Analyze: Analyze a dataset and provide information about cardinality, data types, independencies, and other information that is useful for creating a privacy pipeline
* Evaluate: Compares the privatized results to the true results and provides information about the accuracy and bias

Tests differential privacy algorithms for privacy, accuracy, and bias. Privacy tests are based on the method described in [section 5.3 of this paper](https://arxiv.org/pdf/1909.01917.pdf).
These tools currently require PySpark.

## Installation
## Analyze

```
pip install smartnoise-eval
```
Analyze provides metrics about a single dataset.

## Communication
* Percent of all dimension combinations that are unique, k < 5 and k < 10 (Count up to configurable “reporting length”)
* Report which columns are “most linkable”
* Marginal histograms up to n-way -- choose default with reasonable size (e.g. 10 per marginal, and up to 20 marginals -- allow override). Trim and encode labels.
* Number of rows
* Number of distinct rows
* Count, Mean, Variance, Min, Max, Median, Percentiles for each marginal
* Classification AUC
* Individual Cardinalities
* Dimensionality, Sparsity
* Independencies
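
As a rough illustration of the first Analyze metric above (the share of dimension combinations that fall below a threshold `k`), here is a minimal sketch in plain Python. It does not use the `sneval` API or PySpark; the function name `below_k_share`, its parameters, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def below_k_share(rows, columns, k, reporting_length=2):
    """Fraction of observed value combinations that occur in fewer than k rows.

    Hypothetical helper: considers every 1-way, 2-way, ... combination of
    `columns` up to `reporting_length` columns wide.
    """
    total = 0
    below = 0
    for width in range(1, reporting_length + 1):
        for cols in combinations(columns, width):
            # Count how many rows share each distinct value combination.
            counts = Counter(tuple(row[c] for c in cols) for row in rows)
            total += len(counts)
            below += sum(1 for n in counts.values() if n < k)
    return below / total if total else 0.0


# Made-up sample data, not taken from the repository.
rows = [
    {"age": "30s", "sex": "F"},
    {"age": "30s", "sex": "M"},
    {"age": "40s", "sex": "F"},
    {"age": "30s", "sex": "F"},
]

unique_share = below_k_share(rows, ["age", "sex"], k=2)   # combinations seen exactly once
below_5_share = below_k_share(rows, ["age", "sex"], k=5)  # combinations seen fewer than 5 times
```

With `k=2` the result is the share of combinations that are unique; `k=5` and `k=10` would correspond to the other thresholds named in the list. How the library itself normalizes this percentage is not stated here, so the denominator used above (all observed combinations) is an assumption.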

- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)
- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.
- For other requests, including security issues, please contact us at [smartnoise@opendp.org](mailto:smartnoise@opendp.org).

## Releases and Contributing
## Evaluate

Please let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).
Evaluate compares an original data file with one or more comparison files. It can compare any of the single-file metrics computed in `Analyze` as well as a number of metrics that involve two datasets. When more than one comparison dataset is provided, we can provide all of the two-way comparisons with the original, and allow the consumer to combine these measures (e.g. average over all datasets)

We appreciate all contributions. We welcome pull requests with bug-fixes without prior discussion.

If you plan to contribute new features, utility functions or extensions to this system, please first open an issue and discuss the feature with us.
* How many dimension combinations are suppressed
* How many dimension combinations are fabricated
* How many redacted rows (fully redacted vs. partly redacted)
* Mean error in the count across categories by 1-way, 2-way, etc.
* Mean absolute error by 1-way, 2-way, etc. up to reporting length
* Also do for user specified dimension combinations
* Report by bin size (e.g., < 1000, >= 1000)
* Mean proportional error by 1-way, 2-way, etc.
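
The two-dataset metrics listed above can be sketched in plain Python as follows. This is an assumption-laden illustration, not the `sneval` implementation: the function `compare_counts`, the returned keys, and the choice to average errors over combinations present in the original data (treating a missing combination as a count of 0) are all hypothetical.

```python
from collections import Counter


def compare_counts(original_rows, comparison_rows, cols):
    """Compare per-combination row counts between two datasets (hypothetical helper)."""
    orig = Counter(tuple(r[c] for c in cols) for r in original_rows)
    comp = Counter(tuple(r[c] for c in cols) for r in comparison_rows)

    # Combinations present in the original but missing from the comparison.
    suppressed = [key for key in orig if key not in comp]
    # Combinations present in the comparison but absent from the original.
    fabricated = [key for key in comp if key not in orig]

    # Signed count error for each combination that exists in the original;
    # Counter returns 0 for keys it has not seen.
    errors = [comp[key] - orig[key] for key in orig]
    n = len(errors)
    return {
        "suppressed": len(suppressed),
        "fabricated": len(fabricated),
        "mean_error": sum(errors) / n,
        "mean_abs_error": sum(abs(e) for e in errors) / n,
        "mean_prop_error": sum(abs(e) / orig[key] for e, key in zip(errors, orig)) / n,
    }


# Made-up sample data: counts A=3, B=2, C=1 versus A=4, B=1, D=2.
original = [{"cat": v} for v in "AAABBC"]
comparison = [{"cat": v} for v in "AAAABDD"]

result = compare_counts(original, comparison, ["cat"])
```

Passing a longer `cols` list gives the 2-way, 3-way, etc. variants; reporting by bin size would amount to grouping the original combinations by their count before averaging.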
2 changes: 1 addition & 1 deletion eval/VERSION
@@ -1 +1 @@
0.2.0
0.3.0
3 changes: 1 addition & 2 deletions eval/docs/.gitignore
@@ -1,2 +1 @@
build
source/api
build
3 changes: 1 addition & 2 deletions eval/docs/Makefile
@@ -21,10 +21,9 @@ help:
@echo " versions to make HTML files for all committed versions"

clean:
rm -rf $(BUILDDIR)/* source/api
rm -rf $(BUILDDIR)/*

html:
$(SPHINXAPIDOC) -f -F -e -H "SmartNoise Evaluator" -A "The OpenDP Project" -V $(VERSION) -o source/api ../sneval --templatedir source/_templates
$(SPHINXBUILD) $(SPHINXOPTS) -D version=$(VERSION) -D 'html_sidebars.**'=search-field.html,sidebar-nav-bs.html source $(BUILDDIR)/html
@echo
@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
22 changes: 3 additions & 19 deletions eval/docs/README.md
@@ -1,9 +1,9 @@
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
![CI](https://github.com/opendp/opendp-documentation/actions/workflows/main.yml/badge.svg)

# SmartNoise Documentation
# SmartNoise SQL Documentation

Note: The SmartNoise documentation, [docs.smartnoise.org](https://docs.opendp.org), is currently under development.
This folder contains the source for building the detailed documentation for SmartNoise Eval.

## Building the Docs

@@ -12,31 +12,15 @@ The steps below assume the use of [Homebrew] on a Mac.
[Homebrew]: https://brew.sh

```shell
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
make html
open build/html/index.html
```

To make html and run python doctests:

```shell
make doctest-python
```

## Deployment

Docs are deployed to http://docs.opendp.org using GitHub Actions.

Note that `make html` is replaced with `make versions` to build multiple versions (branches, tags) using the [sphinx-multiversion][] extension.
Be sure you have installed sphinx-multiversion from the fork in requirements.txt.
Otherwise, you will get an error that includes:

/docs/source/api/index.rst:4:toctree contains reference to nonexisting document 'api/python/index'

Docs are deployed to http://docs.smartnoise.org using GitHub Actions.

[sphinx-multiversion]: https://holzhaus.github.io/sphinx-multiversion/

## Join the Discussion

9 changes: 0 additions & 9 deletions eval/docs/redirect.html

This file was deleted.

Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
1 change: 1 addition & 0 deletions eval/docs/source/_static/images/smartnoise-logo.svg
15 changes: 9 additions & 6 deletions eval/docs/source/conf.py
@@ -5,7 +5,7 @@
from datetime import datetime

# We're inside source when this runs.
sys.path.append(os.path.abspath('../../python/src'))
sys.path.append(os.path.abspath('../..'))
# print("*****************************************")
# [print(p) for p in sys.path]
# print("*****************************************")
@@ -62,7 +62,7 @@

# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> Documentation".
html_title = 'OpenDP SmartNoise'
html_title = 'OpenDP SmartNoise Eval'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
@@ -74,7 +74,12 @@
html_last_updated_fmt = '%b %d, %Y'

# Custom sidebar templates, maps document names to template names.
html_theme = 'pydata_sphinx_theme'

html_theme_options = {
"logo": {
"link": "http://docs.smartnoise.org"
},
"icon_links": [
{
"name": "GitHub Discussions",
@@ -83,11 +88,9 @@
},
],
"twitter_url": "https://twitter.com/opendp_org",
"github_url": "https://github.com/opendp/smartnoise"
"github_url": "https://github.com/opendp/smartnoise-sdk"
}

html_theme = 'pydata_sphinx_theme'

# See https://pydata-sphinx-theme.readthedocs.io/en/v0.6.3/user_guide/configuring.html#configure-the-sidebar
# Note: Overridden in the Makefile for local builds. Be sure to update both places.
html_sidebars = {
@@ -127,7 +130,7 @@
#html_file_suffix = None
htmlhelp_basename = 'OpenDPdoc'

# html_logo = "_static/images/opendp-logo.png"
html_logo = "_static/images/smartnoise-logo.svg"

rst_prolog = """
.. |toctitle| replace:: Contents:
65 changes: 47 additions & 18 deletions eval/docs/source/index.rst
@@ -1,21 +1,50 @@
Welcome
=======

SmartNoise documentation is organized into the guides below.
Return home by clicking the OpenDP logo in the header.
Each section in the header bar corresponds to a top-level section below.
When you are in a top-level section, the left panel contains a table of contents for the section,
and the right panel contains a table of contents for the current document.
Documentation for past releases are available in the drop down on the left panel.
In addition to browsing, you can :ref:`search <search>`.

.. toctree::
:glob:
:titlesonly:
:maxdepth: 2

quickstart
API <api/index>
===============
SmartNoise Eval
===============

This library contains two primary components:

1. `Analyze`: Analyzes your source data to help you decide the best approach to producing synthetic data or private synopsis. Gives information on dimensionality, sparsity, and distribution of your data.

2. `Evaluate`: Evaluates the quality of your synthetic data or private synopsis. Compares the original data with the synthetic data or private synopsis to give you a sense of how well the synthetic data or private synopsis preserves the original data.

.. contents:: Table of Contents
:local:
:depth: 3

Getting Started
===============


API Reference
=============

Analyze
-------

.. autoclass:: sneval.Analyze
:members:
:undoc-members:
:show-inheritance:

Dataset
-------

.. autoclass:: sneval.Dataset
:members:
:undoc-members:
:show-inheritance:

Evaluate
--------

.. autoclass:: sneval.Evaluate
:members:
:undoc-members:
:show-inheritance:




This is version |version| of the guides, last built on |today|.

4 changes: 0 additions & 4 deletions eval/docs/source/quickstart.rst

This file was deleted.

13 changes: 6 additions & 7 deletions eval/pyproject.toml
@@ -1,7 +1,7 @@
[tool.poetry]
name = "smartnoise-eval"
version = "0.2.0"
description = "Differential Privacy Stochastic Evaluator"
version = "0.3.0"
description = "Evaluation of differentially private tabular data"
authors = ["SmartNoise Team <smartnoise@opendp.org>"]
license = "MIT"
packages = [{include="sneval"}]
@@ -10,13 +10,12 @@ repository = "https://github.com/opendp/smartnoise-sdk"
readme = "README.md"

[tool.poetry.dependencies]
python = ">=3.7.1,<=3.9"
opendp = "^0.3.0"
smartnoise-sql = "^0.2"
matplotlib = "^3.4.3"
python = ">=3.9,<3.13"
pyspark = "^3.5.0"
numpy = "^1.26.1"

[tool.poetry.dev-dependencies]

[build-system]
requires = ["setuptools", "poetry-core>=1.0.0"]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
20 changes: 6 additions & 14 deletions eval/setup.py
@@ -2,27 +2,19 @@
from setuptools import setup

packages = \
['sneval',
'sneval.benchmarking',
'sneval.evaluator',
'sneval.explorer',
'sneval.learner',
'sneval.metrics',
'sneval.params',
'sneval.privacyalgorithm',
'sneval.report']
['sneval', 'sneval.metrics', 'sneval.metrics.basic', 'sneval.metrics.compare']

package_data = \
{'': ['*']}

install_requires = \
['opendp>=0.3.0,<0.4.0', 'smartnoise-sql>=0.2,<0.3']
['numpy>=1.26.1,<2.0.0', 'pyspark>=3.5.0,<4.0.0']

setup_kwargs = {
'name': 'smartnoise-eval',
'version': '0.2.0',
'description': 'Differential Privacy Stochastic Evaluator',
'long_description': '[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)\n\n<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>\n\n## SmartNoise Stochastic Evaluator\n\nTests differential privacy algorithms for privacy, accuracy, and bias. Privacy tests are based on the method described in [section 5.3 of this paper](https://arxiv.org/pdf/1909.01917.pdf).\n\n## Installation\n\n```\npip install smartnoise-eval\n```\n\n## Communication\n\n- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)\n- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.\n- For other requests, including security issues, please contact us at [smartnoise@opendp.org](mailto:smartnoise@opendp.org).\n\n## Releases and Contributing\n\nPlease let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).\n\nWe appreciate all contributions. We welcome pull requests with bug-fixes without prior discussion.\n\nIf you plan to contribute new features, utility functions or extensions to this system, please first open an issue and discuss the feature with us.',
'version': '0.3.0',
'description': 'Evaluation of differentially private tabular data',
'long_description': '# SmartNoise Evaluator\n\nThe SmartNoise Evaluator is designed to help assess the privacy and accuracy of differentially private queries. It includes:\n\n* Analyze: Analyze a dataset and provide information about cardinality, data types, independencies, and other information that is useful for creating a privacy pipeline\n* Evaluator: Compares the privatized results to the true results and provides information about the accuracy and bias\n\nThese tools currently require PySpark.\n\n## Analyze\n\nAnalyze provides metrics about a single dataset.\n\n* Percent of all dimension combinations that are unique, k < 5 and k < 10 (Count up to configurable “reporting length”)\n* Report which columns are “most linkable”\n* Marginal histograms up to n-way -- choose default with reasonable size (e.g. 10 per marginal, and up to 20 marginals -- allow override). Trim and encode labels.\n* Number of rows\n* Number of distinct rows\n* Count, Mean, Variance, Min, Max, Median, Percentiles for each marginal\n* Classification AUC\n* Individual Cardinalities\n* Dimensionality, Sparsity\n* Independencies\n\n\n## Evaluate\n\nEvaluate compares an original data file with one or more comparison files. It can compare any of the single-file metrics computed in `Analyze` as well as a number of metrics that involve two datasets. When more than one comparison dataset is provided, we can provide all of the two-way comparisons with the original, and allow the consumer to combine these measures (e.g. average over all datasets)\n\n* How many dimension combinations are suppressed \n* How many dimension combinations are fabricated \n* How many redacted rows (fully redacted vs. partly redacted) \n* Mean absolute error by 1-way, 2-way, etc. up to reporting length\n* Also do for user specified dimension combinations \n* Report by bin size (e.g., < 1000, >= 1000) \n* Mean proportional error by 1-way, 2-way, etc. \n',
'author': 'SmartNoise Team',
'author_email': 'smartnoise@opendp.org',
'maintainer': None,
@@ -31,7 +23,7 @@
'packages': packages,
'package_data': package_data,
'install_requires': install_requires,
'python_requires': '>=3.7.1,<=3.9',
'python_requires': '>=3.9,<3.13',
}

