SmartNoise Eval Draft Release (#582)

* Cleanup
* updates
* Made some doc changes
* Some updates
* Updates
* More skeleton
* Updates
* Add Median test
* update metrics
* update basic metrics file
* update basic metrics file
* update init files
* update base and metric files
* update compare metrics
* update base file
* update metrics
* update basic metrics
* update basic metrics
* update base files
* update metrics
* add 'parameters' as a key to the out dict
* update metrics
* add Analyze
* update analyze and metrics
* update compare metrics and evaluate file
* update metrics
* update evaluate and analyze
* update analyze and evaluate
* update metrics
* no changes
* add pytest for MeanAbsErrorInCount and MeanPropErrorInCount
* add two synthetic datasets for testing
* add docstring for metrics
* save to csv
* update metric doc
* update basic metrics and the default computations
* pyproject
* Add code for packaging to PyPI and generating docs
* adding 2way computation by default and update BelowKCount metric
* adding 2way computation by default and update FabricatedCombinationCount
* add metric MeanError and update default 2-way computation
* fix a bug
* Changes
* Fixes
* Setup
* Update README.md

---------

Co-authored-by: Joshua <joshua-oss@users.noreply.github.com>
Co-authored-by: paxton-coder <xiaopeng.qu.li@hotmail.com>
opendp · Nov 10, 2023 · a1d5b47
1 parent 0b92809
commit a1d5b47
Showing 87 changed files with 2,593 additions and 3,433 deletions.
diff --git a/.gitignore b/.gitignore
@@ -128,3 +128,6 @@ sdk/opendp/v1/
 # datasets
 PUMS_1000.csv
 *.db
+
+# parquet files
+PUMS_large.parquet/
diff --git a/docs/make_docs.sh b/docs/make_docs.sh
@@ -29,6 +29,14 @@ cp -R build/html /tmp/docs/en/stable/synth
 make clean
 cd ../..
 
+cd eval
+pip install -e .
+cd docs
+make html
+cp -R build/html /tmp/docs/en/stable/eval
+make clean
+cd ../..
+
 open /tmp/docs/index.html
 
 
diff --git a/eval/.gitignore b/eval/.gitignore
@@ -0,0 +1,2 @@
+poetry.lock
+run.py
diff --git a/eval/HISTORY.md b/eval/HISTORY.md
@@ -0,0 +1,3 @@
+# SmartNoise Eval v0.3.0 Release Notes
+
+* Initial Release
diff --git a/eval/README.md b/eval/README.md
@@ -1,27 +1,37 @@
-[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)
+# SmartNoise Evaluator
 
-<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>
+The SmartNoise Evaluator is designed to help assess the privacy and accuracy of differentially private queries. It includes:
 
-## SmartNoise Stochastic Evaluator
+* Analyze: Analyze a dataset and provide information about cardinality, data types, independencies, and other information that is useful for creating a privacy pipeline
+* Evaluate: Compares the privatized results to the true results and provides information about the accuracy and bias
 
-Tests differential privacy algorithms for privacy, accuracy, and bias.  Privacy tests are based on the method described in [section 5.3 of this paper](https://arxiv.org/pdf/1909.01917.pdf).
+These tools currently require PySpark.
 
-## Installation
+## Analyze
 
-```
-pip install smartnoise-eval
-```
+Analyze provides metrics about a single dataset.
 
-## Communication
+* Percent of all dimension combinations that are unique, k < 5 and k < 10 (Count up to configurable “reporting length”)
+* Report which columns are “most linkable”
+* Marginal histograms up to n-way -- choose default with reasonable size (e.g. 10 per marginal, and up to 20 marginals -- allow override).  Trim and encode labels.
+* Number of rows
+* Number of distinct rows
+* Count, Mean, Variance, Min, Max, Median, Percentiles for each marginal
+* Classification AUC
+* Individual Cardinalities
+* Dimensionality, Sparsity
+* Independencies
 
-- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)
-- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.
-- For other requests, including security issues, please contact us at [smartnoise@opendp.org](mailto:smartnoise@opendp.org).
 
-## Releases and Contributing
+## Evaluate
 
-Please let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).
+Evaluate compares an original data file with one or more comparison files.  It can compare any of the single-file metrics computed in `Analyze` as well as a number of metrics that involve two datasets.  When more than one comparison dataset is provided, we can provide all of the two-way comparisons with the original, and allow the consumer to combine these measures (e.g. average over all datasets)
 
-We appreciate all contributions. We welcome pull requests with bug-fixes without prior discussion.
-
-If you plan to contribute new features, utility functions or extensions to this system, please first open an issue and discuss the feature with us.
+* How many dimension combinations are suppressed 
+* How many dimension combinations are fabricated 
+* How many redacted rows (fully redacted vs. partly redacted)
+* Mean error in the count across categories by 1-way, 2-way, etc.
+* Mean absolute error by 1-way, 2-way, etc. up to reporting length
+  * Also do for user specified dimension combinations 
+  * Report by bin size (e.g., < 1000, >= 1000) 
+* Mean proportional error by 1-way, 2-way, etc. 
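
The Evaluate bullets in the README above describe count-based comparisons between an original dataset and a comparison dataset. A minimal pure-Python sketch of those ideas follows. It does not use the `sneval` API (this diff does not show it) or PySpark; the names `marginal_counts` and `compare_counts` and the toy rows are hypothetical, and the definition of proportional error used here (absolute error divided by the original count, over combinations present in the original) is an assumption that may differ from the package's metrics.

```python
from collections import Counter
from itertools import combinations


def marginal_counts(rows, columns, k):
    """Count rows for every value combination of every k-way column subset."""
    counts = Counter()
    for cols in combinations(columns, k):
        for row in rows:
            counts[(cols, tuple(row[c] for c in cols))] += 1
    return counts


def compare_counts(original, comparison, columns, k):
    """Compare k-way marginal counts of two datasets given as lists of dicts."""
    orig = marginal_counts(original, columns, k)
    comp = marginal_counts(comparison, columns, k)
    errors, prop_errors = [], []
    for key, n in orig.items():
        # A combination missing from the comparison data counts as 0 there.
        err = comp.get(key, 0) - n
        errors.append(err)
        prop_errors.append(abs(err) / n)
    return {
        # Combinations in the original that vanished from the comparison.
        "suppressed": sum(1 for key in orig if key not in comp),
        # Combinations in the comparison that never occur in the original.
        "fabricated": sum(1 for key in comp if key not in orig),
        "mean_error": sum(errors) / len(errors),
        "mean_abs_error": sum(abs(e) for e in errors) / len(errors),
        "mean_prop_error": sum(prop_errors) / len(prop_errors),
    }


# Toy data, invented for illustration only.
original = [
    {"sex": "F", "edu": "HS"},
    {"sex": "F", "edu": "HS"},
    {"sex": "M", "edu": "HS"},
    {"sex": "M", "edu": "BA"},
]
comparison = [
    {"sex": "F", "edu": "HS"},
    {"sex": "M", "edu": "HS"},
    {"sex": "M", "edu": "HS"},
    {"sex": "F", "edu": "BA"},
]

one_way = compare_counts(original, comparison, ["sex", "edu"], 1)
two_way = compare_counts(original, comparison, ["sex", "edu"], 2)
print(one_way)
print(two_way)
```

In this toy case the 1-way marginals match exactly while the 2-way comparison shows one suppressed and one fabricated combination, which is why reporting by 1-way, 2-way, and so on is informative.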
diff --git a/eval/VERSION b/eval/VERSION
@@ -1 +1 @@
-0.2.0
+0.3.0
diff --git a/eval/docs/.gitignore b/eval/docs/.gitignore
@@ -1,2 +1 @@
-build
-source/api
+build
diff --git a/eval/docs/Makefile b/eval/docs/Makefile
@@ -21,10 +21,9 @@ help:
 	@echo "  versions   to make HTML files for all committed versions"
 
 clean:
-	rm -rf $(BUILDDIR)/* source/api
+	rm -rf $(BUILDDIR)/*
 
 html:
-	$(SPHINXAPIDOC) -f -F -e -H "SmartNoise Evaluator" -A "The OpenDP Project" -V $(VERSION) -o source/api ../sneval --templatedir source/_templates
 	$(SPHINXBUILD) $(SPHINXOPTS) -D version=$(VERSION) -D 'html_sidebars.**'=search-field.html,sidebar-nav-bs.html source $(BUILDDIR)/html
 	@echo
 	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

diff --git a/eval/docs/README.md b/eval/docs/README.md
@@ -1,9 +1,9 @@
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 ![CI](https://github.com/opendp/opendp-documentation/actions/workflows/main.yml/badge.svg)
 
-# SmartNoise Documentation
+# SmartNoise SQL Documentation
 
-Note: The SmartNoise documentation, [docs.smartnoise.org](https://docs.opendp.org), is currently under development.
+This folder contains the source for building the detailed documentation for SmartNoise Eval.
 
 ## Building the Docs
 
@@ -12,31 +12,15 @@ The steps below assume the use of [Homebrew] on a Mac.
 [Homebrew]: https://brew.sh
 
 ```shell
-python3 -m venv venv
-source venv/bin/activate
 pip install -r requirements.txt
 make html
 open build/html/index.html
 ```
 
-To make html and run python doctests:
-
-```shell
-make doctest-python
-```
-
 ## Deployment
 
-Docs are deployed to http://docs.opendp.org using GitHub Actions.
-
-Note that `make html` is replaced with `make versions` to build multiple versions (branches, tags) using the [sphinx-multiversion][] extension.
-Be sure you have installed sphinx-multiversion from the fork in requirements.txt. 
-Otherwise, you will get an error that includes: 
-
-    /docs/source/api/index.rst:4:toctree contains reference to nonexisting document 'api/python/index'
-
+Docs are deployed to http://docs.smartnoise.org using GitHub Actions.
 
-[sphinx-multiversion]: https://holzhaus.github.io/sphinx-multiversion/
 
 ## Join the Discussion
 

diff --git a/eval/docs/redirect.html b/eval/docs/redirect.html
diff --git a/eval/docs/source/_static/images/figs/example_education.png b/eval/docs/source/_static/images/figs/example_education.png
diff --git a/eval/docs/source/_static/images/figs/example_simulations.png b/eval/docs/source/_static/images/figs/example_simulations.png
diff --git a/eval/docs/source/_static/images/figs/example_size.png b/eval/docs/source/_static/images/figs/example_size.png
diff --git a/eval/docs/source/_static/images/figs/example_utility.png b/eval/docs/source/_static/images/figs/example_utility.png
diff --git a/eval/docs/source/_static/images/figs/plugin_mean_comparison.png b/eval/docs/source/_static/images/figs/plugin_mean_comparison.png
diff --git a/eval/docs/source/_static/images/opendp-logo copy.png b/eval/docs/source/_static/images/opendp-logo copy.png
diff --git a/eval/docs/source/_static/images/smartnoise-logo.svg b/eval/docs/source/_static/images/smartnoise-logo.svg
diff --git a/eval/docs/source/conf.py b/eval/docs/source/conf.py
@@ -5,7 +5,7 @@
 from datetime import datetime
 
 # We're inside source when this runs.
-sys.path.append(os.path.abspath('../../python/src'))
+sys.path.append(os.path.abspath('../..'))
 # print("*****************************************")
 # [print(p) for p in sys.path]
 # print("*****************************************")
@@ -62,7 +62,7 @@
 
 # The name for this set of Sphinx documents.  If None, it defaults to
 # "<project> v<release> Documentation".
-html_title = 'OpenDP SmartNoise'
+html_title = 'OpenDP SmartNoise Eval'
 
 # Add any paths that contain custom static files (such as style sheets) here,
 # relative to this directory. They are copied after the builtin static files,
@@ -74,7 +74,12 @@
 html_last_updated_fmt = '%b %d, %Y'
 
 # Custom sidebar templates, maps document names to template names.
+html_theme = 'pydata_sphinx_theme'
+
 html_theme_options = {
+    "logo": {
+        "link": "http://docs.smartnoise.org"
+    },
     "icon_links": [
         {
             "name": "GitHub Discussions",
@@ -83,11 +88,9 @@
         },
     ],
     "twitter_url": "https://twitter.com/opendp_org",
-    "github_url": "https://github.com/opendp/smartnoise"
+    "github_url": "https://github.com/opendp/smartnoise-sdk"
 }
 
-html_theme = 'pydata_sphinx_theme'
-
 # See https://pydata-sphinx-theme.readthedocs.io/en/v0.6.3/user_guide/configuring.html#configure-the-sidebar
 # Note: Overridden in the Makefile for local builds. Be sure to update both places.
 html_sidebars = {
@@ -127,7 +130,7 @@
 #html_file_suffix = None
 htmlhelp_basename = 'OpenDPdoc'
 
-# html_logo = "_static/images/opendp-logo.png"
+html_logo = "_static/images/smartnoise-logo.svg"
 
 rst_prolog = """
 .. |toctitle| replace:: Contents:
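
The `sys.path` change in `conf.py` above matters because the new `index.rst` uses `autoclass` on `sneval.*`, so the package must be importable during the docs build. The sketch below shows how the old and new relative paths would resolve, assuming `conf.py` is evaluated with `eval/docs/source` as the working directory (as the comment in the file says). The `/repo` checkout location and the `resolve` helper are hypothetical.

```python
import posixpath

# Hypothetical checkout location, used only for illustration.
conf_dir = "/repo/eval/docs/source"


def resolve(relative):
    """Resolve a relative path the way os.path.abspath would from conf_dir."""
    return posixpath.normpath(posixpath.join(conf_dir, relative))


old_path = resolve("../../python/src")
new_path = resolve("../..")

print(old_path)  # /repo/eval/python/src
print(new_path)  # /repo/eval
```

Under that assumption the new entry points at `eval/`, the directory that contains the `sneval` package listed in `pyproject.toml`.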

diff --git a/eval/docs/source/index.rst b/eval/docs/source/index.rst
@@ -1,21 +1,50 @@
-Welcome
-=======
-
-SmartNoise documentation is organized into the guides below.
-Return home by clicking the OpenDP logo in the header.
-Each section in the header bar corresponds to a top-level section below.
-When you are in a top-level section, the left panel contains a table of contents for the section,
-and the right panel contains a table of contents for the current document.
-Documentation for past releases are available in the drop down on the left panel.
-In addition to browsing, you can :ref:`search <search>`.
-
-.. toctree::
-  :glob:
-  :titlesonly:
-  :maxdepth: 2
-
-  quickstart
-  API <api/index>
+===============
+SmartNoise Eval
+===============
+
+This library contains two primary components:
+
+1. `Analyze`: Analyzes your source data to help you decide the best approach to producing synthetic data or private synopsis. Gives information on dimensionality, sparsity, and distribution of your data.
+
+2. `Evaluate`: Evaluates the quality of your synthetic data or private synopsis. Compares the original data with the synthetic data or private synopsis to give you a sense of how well the synthetic data or private synopsis preserves the original data.
+
+.. contents:: Table of Contents
+  :local:
+  :depth: 3
+
+Getting Started
+===============
+
+
+API Reference
+=============
+
+Analyze
+-------
+
+.. autoclass:: sneval.Analyze
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+Dataset
+-------
+
+.. autoclass:: sneval.Dataset
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+Evaluate
+--------
+
+.. autoclass:: sneval.Evaluate
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+
+
 
 This is version |version| of the guides, last built on |today|.
 

diff --git a/eval/docs/source/quickstart.rst b/eval/docs/source/quickstart.rst
diff --git a/eval/pyproject.toml b/eval/pyproject.toml
@@ -1,7 +1,7 @@
 [tool.poetry]
 name = "smartnoise-eval"
-version = "0.2.0"
-description = "Differential Privacy Stochastic Evaluator"
+version = "0.3.0"
+description = "Evaluation of differentially private tabular data"
 authors = ["SmartNoise Team <smartnoise@opendp.org>"]
 license = "MIT"
 packages = [{include="sneval"}]
@@ -10,13 +10,12 @@ repository = "https://github.com/opendp/smartnoise-sdk"
 readme = "README.md"
 
 [tool.poetry.dependencies]
-python = ">=3.7.1,<=3.9"
-opendp = "^0.3.0"
-smartnoise-sql = "^0.2"
-matplotlib = "^3.4.3"
+python = ">=3.9,<3.13"
+pyspark = "^3.5.0"
+numpy = "^1.26.1"
 
 [tool.poetry.dev-dependencies]
 
 [build-system]
-requires = ["setuptools", "poetry-core>=1.0.0"]
+requires = ["poetry-core>=1.0.0"]
 build-backend = "poetry.core.masonry.api"
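
The caret constraints added to `pyproject.toml` above (`^3.5.0`, `^1.26.1`) correspond to the explicit ranges that appear in the `setup.py` diff in this commit (`pyspark>=3.5.0,<4.0.0`, `numpy>=1.26.1,<2.0.0`). A small sketch of that expansion, under the assumption that only versions with a major component of 1 or higher need handling; `expand_caret` is a hypothetical helper, not part of Poetry.

```python
def expand_caret(constraint):
    """Expand a caret constraint such as '^3.5.0' into a lower/upper range.

    Only the major >= 1 case seen in this diff is handled; 0.x versions
    follow different caret rules and are rejected here.
    """
    if not constraint.startswith("^"):
        raise ValueError("not a caret constraint")
    parts = [int(p) for p in constraint[1:].split(".")]
    major = parts[0]
    if major == 0:
        raise ValueError("0.x caret rules are not covered by this sketch")
    # Pad short versions such as '^3.5' to three components.
    padded = parts + [0] * (3 - len(parts))
    lower = ".".join(str(p) for p in padded)
    return f">={lower},<{major + 1}.0.0"


print(expand_caret("^3.5.0"))   # >=3.5.0,<4.0.0
print(expand_caret("^1.26.1"))  # >=1.26.1,<2.0.0
```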
diff --git a/eval/setup.py b/eval/setup.py
@@ -2,27 +2,19 @@
 from setuptools import setup
 
 packages = \
-['sneval',
- 'sneval.benchmarking',
- 'sneval.evaluator',
- 'sneval.explorer',
- 'sneval.learner',
- 'sneval.metrics',
- 'sneval.params',
- 'sneval.privacyalgorithm',
- 'sneval.report']
+['sneval', 'sneval.metrics', 'sneval.metrics.basic', 'sneval.metrics.compare']
 
 package_data = \
 {'': ['*']}
 
 install_requires = \
-['opendp>=0.3.0,<0.4.0', 'smartnoise-sql>=0.2,<0.3']
+['numpy>=1.26.1,<2.0.0', 'pyspark>=3.5.0,<4.0.0']
 
 setup_kwargs = {
     'name': 'smartnoise-eval',
-    'version': '0.2.0',
-    'description': 'Differential Privacy Stochastic Evaluator',
-    'long_description': '[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python](https://img.shields.io/badge/python-3.7%20%7C%203.8-blue)](https://www.python.org/)\n\n<a href="https://smartnoise.org"><img src="https://github.com/opendp/smartnoise-sdk/raw/main/images/SmartNoise/SVG/Logo%20Mark_grey.svg" align="left" height="65" vspace="8" hspace="18"></a>\n\n## SmartNoise Stochastic Evaluator\n\nTests differential privacy algorithms for privacy, accuracy, and bias.  Privacy tests are based on the method described in [section 5.3 of this paper](https://arxiv.org/pdf/1909.01917.pdf).\n\n## Installation\n\n```\npip install smartnoise-eval\n```\n\n## Communication\n\n- You are encouraged to join us on [GitHub Discussions](https://github.com/opendp/opendp/discussions/categories/smartnoise)\n- Please use [GitHub Issues](https://github.com/opendp/smartnoise-sdk/issues) for bug reports and feature requests.\n- For other requests, including security issues, please contact us at [smartnoise@opendp.org](mailto:smartnoise@opendp.org).\n\n## Releases and Contributing\n\nPlease let us know if you encounter a bug by [creating an issue](https://github.com/opendp/smartnoise-sdk/issues).\n\nWe appreciate all contributions. We welcome pull requests with bug-fixes without prior discussion.\n\nIf you plan to contribute new features, utility functions or extensions to this system, please first open an issue and discuss the feature with us.',
+    'version': '0.3.0',
+    'description': 'Evaluation of differentially private tabular data',
+    'long_description': '# SmartNoise Evaluator\n\nThe SmartNoise Evaluator is designed to help assess the privacy and accuracy of differentially private queries. It includes:\n\n* Analyze: Analyze a dataset and provide information about cardinality, data types, independencies, and other information that is useful for creating a privacy pipeline\n* Evaluator: Compares the privatized results to the true results and provides information about the accuracy and bias\n\nThese tools currently require PySpark.\n\n## Analyze\n\nAnalyze provides metrics about a single dataset.\n\n* Percent of all dimension combinations that are unique, k < 5 and k < 10 (Count up to configurable “reporting length”)\n* Report which columns are “most linkable”\n* Marginal histograms up to n-way -- choose default with reasonable size (e.g. 10 per marginal, and up to 20 marginals -- allow override).  Trim and encode labels.\n* Number of rows\n* Number of distinct rows\n* Count, Mean, Variance, Min, Max, Median, Percentiles for each marginal\n* Classification AUC\n* Individual Cardinalities\n* Dimensionality, Sparsity\n* Independencies\n\n\n## Evaluate\n\nEvaluate compares an original data file with one or more comparison files.  It can compare any of the single-file metrics computed in `Analyze` as well as a number of metrics that involve two datasets.  When more than one comparison dataset is provided, we can provide all of the two-way comparisons with the original, and allow the consumer to combine these measures (e.g. average over all datasets)\n\n* How many dimension combinations are suppressed \n* How many dimension combinations are fabricated \n* How many redacted rows (fully redacted vs. partly redacted) \n* Mean absolute error by 1-way, 2-way, etc. up to reporting length\n* Also do for user specified dimension combinations \n* Report by bin size (e.g., < 1000, >= 1000) \n* Mean proportional error by 1-way, 2-way, etc. \n',
     'author': 'SmartNoise Team',
     'author_email': 'smartnoise@opendp.org',
     'maintainer': None,
@@ -31,7 +23,7 @@
     'packages': packages,
     'package_data': package_data,
     'install_requires': install_requires,
-    'python_requires': '>=3.7.1,<=3.9',
+    'python_requires': '>=3.9,<3.13',
 }