Colab review comments (#184)
dvadym authored Jan 26, 2022
1 parent 64e319d commit 67bb421
Showing 1 changed file with 10 additions and 27 deletions.
37 changes: 10 additions & 27 deletions examples/restaurant_visits.ipynb
@@ -96,7 +96,7 @@
},
"source": [
"## PipelineDP\n",
"PipelineDP is a Python end-to-end framework for generating differentially private statistics. PipelineDP provides high-level API for anonymizing data using Apache Beam It also has a core API that can be used to define a generic pipeline which can be then run on various engines. In the following section we demonstrate both APIs."
"PipelineDP is a Python end-to-end framework for generating differentially private statistics. PipelineDP provides a high-level API for anonymizing data using Apache Beam. It also has a core API that can be used to define a generic pipeline which can then be run on various engines. In the following section we demonstrate both APIs."
]
},
{
@@ -119,7 +119,6 @@
"source": [
"#@markdown Install dependencies and download data\n",
"\n",
"\n",
"import os\n",
"os.chdir('/content')\n",
"!git clone https://github.com/OpenMined/PipelineDP.git\n",
@@ -527,10 +526,10 @@
},
"source": [
"## Configure DP aggregations\n",
"In this section we will demonstrate how to configure the DP aggregations that will be run on the dataset. For this, we need to define `AggregationParams` object with the following properties:\n",
"In this section we will demonstrate how to configure the DP aggregations that will be run on the dataset. For this, we need to define an `AggregationParams` object with the following properties:\n",
"* `noise_kind` defines the distribution of the noise that is added to make the result differentially private.\n",
"* `methods` is a collection of aggregation methods that will be executed on the dataset. In our example we count visits and sum up visitor spending per day and hence we use `COUNT` and `SUM` aggregations.\n",
"* `max_partitions_contributed` specifies the upper bound on the number of partitions that can be contributed by a privacy ID. All contributions in excess of the limit will be discarded. The contributions to be discarded are chosen randomly.\n",
"* `max_partitions_contributed` specifies the upper bound on the number of partitions to which one privacy ID can contribute. All contributions in excess of the limit will be discarded. The contributions to be discarded are chosen randomly.\n",
"* `max_contributions_per_partition` is the maximum number of times a privacy ID can contribute to a partition. For instance, if in our example it’s set to 2, then for each visitor we will count at most 2 visits and the corresponding spending per day.\n",
"* `min_value` and `max_value` are the lower and upper bounds on the values contributed by a privacy ID. Values less than `min_value` are “clamped” to `min_value`, and values greater than `max_value` are clamped to `max_value`. This is necessary in order to limit the sensitivity.\n",
"* `public_partitions` is a collection of the partition keys that will appear in the result. In our example we’d like to see the statistics for each week day and hence pass `range(1, 8)` as public partitions. If we wanted to compute the result for the week-end only, we would pass `range(6, 8)`. If `public_partitions` is not specified, the pipeline will select partitions to release in a DP manner."
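The bounds described above can be sketched in plain Python. The function below is only an illustration of what contribution bounding and clamping mean, under assumed data shapes; it is not PipelineDP's implementation.

```python
import random

def bound_contributions(rows, max_partitions_contributed,
                        max_contributions_per_partition,
                        min_value, max_value):
    """Illustrative contribution bounding; rows are (privacy_id, partition_key, value)."""
    per_user = {}
    for pid, pk, value in rows:
        # Clamp each value into [min_value, max_value] to limit sensitivity.
        value = max(min_value, min(max_value, value))
        per_user.setdefault(pid, {}).setdefault(pk, []).append(value)

    bounded = []
    for pid, partitions in per_user.items():
        # Keep at most max_partitions_contributed partitions, chosen randomly.
        keys = list(partitions)
        random.shuffle(keys)
        for pk in keys[:max_partitions_contributed]:
            # Keep at most max_contributions_per_partition values per partition.
            for value in partitions[pk][:max_contributions_per_partition]:
                bounded.append((pid, pk, value))
    return bounded
```

For example, with `min_value=0` and `max_value=10`, a single spending of 100 would be clamped to 10 before any aggregation sees it.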
@@ -561,9 +560,9 @@
},
"source": [
"## Run the pipeline\n",
"Now, that all the parameters have been defined, call `aggregate` on `DPEngine`. This is a lazy operation, it builds the computational graph but doesn't trigger any data processing. Next, we must call `budget_accountant.compute_budgets()`so that it allocates privacy budget to the aggregations. Finally, we can trigger the pipeline computation and observe the result.\n",
"Now that all the parameters have been defined, call `aggregate` on the `DPEngine` instance. This is a lazy operation: it builds the computational graph but doesn't trigger any data processing. Next, we must call `budget_accountant.compute_budgets()` so that it allocates a privacy budget to the aggregations. Finally, we can trigger the pipeline computation and obtain the result.\n",
"\n",
"Due to the stateful nature of the `BudgetAccountant`, the code below can be executed only once. If you'd like to recompute the DP result, you'll need to create a new `BudgetAccountant` and `DPEngine`. This reminds us that each time we run the pipeline, we consume privacy budget."
"Due to the stateful nature of the `BudgetAccountant`, the code below can be executed only once. If you'd like to recompute the DP result, you'll need to create new `BudgetAccountant` and `DPEngine` instances. This reminds us that each time we run the pipeline, we consume privacy budget."
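A toy sketch of that one-shot behaviour (this is not the PipelineDP API, only an illustration of why the accountant is stateful):

```python
class OneShotAccountant:
    """Illustrative stand-in for a stateful budget accountant."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.finalized = False

    def compute_budgets(self):
        # The budget can be allocated only once; rerunning the pipeline
        # requires a fresh accountant, and hence fresh privacy budget.
        if self.finalized:
            raise RuntimeError("Budget already allocated; create a new accountant.")
        self.finalized = True
```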
]
},
{
@@ -677,7 +676,7 @@
"In our example we used “public partitions”. This means we explicitly defined the partition keys that appear in the result. It’s possible in this case because the week days are publicly known information. Defining public partitions isn’t always possible. If partition keys are based on user data rather than being public information, they are private information and need to be computed using DP. If you do not specify `public_partitions`, PipelineDP automatically computes partition keys with DP. As a consequence, the DP result will include only partitions that have sufficiently many contributing privacy IDs to ensure that a single privacy ID cannot impact the structure of the returned result. You can learn more about private partition selection in [this blog post](https://desfontain.es/privacy/almost-differential-privacy.html).\n",
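The idea behind DP partition selection can be sketched as noisy thresholding of per-partition user counts. The sketch below is a heavy simplification (the real mechanism calibrates noise and threshold far more carefully), but it shows why only well-populated partitions survive:

```python
import math
import random

def select_partitions(user_counts, threshold, noise_scale, seed=0):
    """user_counts maps partition_key -> number of distinct contributing privacy IDs."""
    rng = random.Random(seed)
    selected = []
    for pk, n_users in user_counts.items():
        # Sample Laplace noise via inverse-CDF sampling.
        u = rng.random() - 0.5
        noise = -noise_scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
        # Release the partition only if the noisy user count clears the threshold.
        if n_users + noise > threshold:
            selected.append(pk)
    return selected
```

A partition backed by many distinct users passes the threshold almost surely; a partition backed by a single user almost never does, so its presence cannot leak that user's data.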
"\n",
"## Public partitions\n",
"Public partitions have a couple of caveats. First, you really need to make sure that the provided partitions are based on public knowledge or derived using differential privacy. Second, public partitions with no contributions from users will appear in the DP statistics with noisy values. This ensures that an attacker cannot know which partitions users contributed to by looking at the structure of the results. This can be bad for utility, as empty partitions will be all noise and no signal."
"Public partitions have a couple of caveats. First, you really need to make sure that the provided partitions are based on public knowledge or derived using differential privacy. Second, public partitions with no contributions from users will appear in the DP statistics with noisy values. This ensures that an attacker cannot tell which partitions users contributed to by looking at the structure of the results. This can be detrimental to utility, as empty partitions will be all noise and no signal."
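A small sketch of the second caveat (illustrative only; the noise standard deviation `sigma` would in practice be calibrated to the privacy parameters): every public partition receives a noisy value, so a day nobody visited is released as pure noise.

```python
import random

def noisy_counts(true_counts, public_partitions, sigma, seed=0):
    """Adds Gaussian noise to the count of every listed public partition."""
    rng = random.Random(seed)
    released = {}
    for pk in public_partitions:
        # Partitions absent from the data have a true count of 0: all noise.
        released[pk] = true_counts.get(pk, 0) + rng.gauss(0.0, sigma)
    return released
```

Calling `noisy_counts({1: 40, 2: 35}, range(1, 8), sigma=1.0)` returns a value for all seven week days, including days 3–7 that received no visits at all.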
]
},
{
@@ -688,7 +687,7 @@
"source": [
"# Porting your pipeline on different frameworks\n",
"\n",
"It’s easy to port the pipeline that we’ve defined in the \"Core API\" section to a different framework such as Beam. To do this, we need to copy the input data into collection accepted by Beam (`PCollection`) and use `BeamBackend` instead of `LocalBackend`. Similarly, the pipeline could be run on Spark with equivalent changes.\n",
"It’s easy to port the pipeline that we’ve defined in the \"Core API\" section to a different framework such as Beam. To do this, we need to copy the input data into a collection accepted by Beam (`PCollection`) and use `BeamBackend` instead of `LocalBackend`. Similarly, the pipeline could be run on Spark with equivalent changes.\n",
"To demonstrate this with Beam below, we first move all framework-independent logic into a `run_pipeline` function and then call it with the framework-specific parameters."
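The pattern can be sketched in plain Python (the names `LocalBackend` and `run_pipeline` below are illustrative stand-ins, not the notebook's actual code): the pipeline logic is written once against a backend interface, and a Beam-based backend that handles `PCollection`s would be substituted without touching `run_pipeline`.

```python
class LocalBackend:
    """Toy stand-in for a pipeline backend that runs eagerly on Python lists."""

    def map(self, col, fn):
        return [fn(x) for x in col]

def run_pipeline(data, backend):
    # All framework-independent logic lives here; only the backend
    # (and the collection type it accepts) changes per framework.
    return backend.map(data, lambda x: 2 * x)
```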
]
},
@@ -704,14 +703,6 @@
"outputId": "42d3a0bf-6ba6-410c-a1f6-2ef2d98941e3"
},
"outputs": [
{
"output_type": "stream",
"name": "stderr",
"text": [
"WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.\n",
"WARNING:apache_beam.runners.interactive.interactive_environment:Dependencies required for Interactive Beam PCollection visualization are not available, please use: `pip install apache-beam[interactive]` to install necessary dependencies to enable all data visualization features.\n"
]
},
{
"output_type": "display_data",
"data": {
@@ -743,14 +734,6 @@
},
"metadata": {}
},
{
"output_type": "stream",
"name": "stderr",
"text": [
"WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py', '-f', '/root/.local/share/jupyter/runtime/kernel-922682aa-19d2-49e6-b846-4ade7b3edd22.json']\n",
"WARNING:root:Make sure that locally built Python SDK docker image has Python 3.7 interpreter.\n"
]
},
{
"output_type": "stream",
"name": "stdout",
@@ -925,13 +908,13 @@
},
"source": [
"## Configure privacy budget (BudgetAccountant)\n",
"`BudgetAccountant` defines the total amount of privacy budget that will be spent on DP aggregations within the program and automatically splits it among all DP aggregations. Different `BudgetAccountant`s will allocate budget in different ways. In this codelab we use NaiveBudgetAccountant, which implements basic composition. The budget accountant is created in the code-snippet below.\n",
"`BudgetAccountant` defines the total amount of privacy budget that will be spent on DP aggregations within the program and automatically splits it among all DP aggregations. Different `BudgetAccountant` implementations will allocate budget in different ways. In this codelab we use `NaiveBudgetAccountant`, which implements basic composition. The budget accountant is created in the code snippet below.\n",
"```\n",
"budget_accountant = pipeline_dp.NaiveBudgetAccountant(total_epsilon=1, total_delta=1e-8)\n",
"```\n",
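Basic composition can be sketched as an even split of the totals across the aggregations in the pipeline (a simplification of the accounting that `NaiveBudgetAccountant` actually performs):

```python
def naive_split(total_epsilon, total_delta, num_aggregations):
    # Basic composition: the epsilons (and deltas) of the individual
    # aggregations simply add up, so an even split stays within budget.
    return (total_epsilon / num_aggregations,
            total_delta / num_aggregations)
```

With `total_epsilon=1`, `total_delta=1e-8` and two aggregations (a COUNT and a SUM), each aggregation would run with epsilon 0.5 and delta 5e-9.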
"\n",
"## Create a PrivatePCollection\n",
"`PrivatePCollection` is a wrapper for the input data on which the DP operations can be run. The code-snippet below wraps the input data in a Beam pipeline and then turns it into a `PrivatePCollection`. As a part of a `PrivatePCollection` definition, we tell PipelineDP from what budget accountant the budget needs to be charged, and how to extract a privacy ID (e.g., user ID) from an input data row.\n",
"`PrivatePCollection` is a wrapper for the input data on which the DP operations can be run. The code-snippet below wraps the input data in a Beam pipeline and then turns it into a `PrivatePCollection` object. As a part of a `PrivatePCollection` definition, we tell PipelineDP from which budget accountant the budget needs to be charged, and how to extract a privacy ID (e.g., user ID) from an input data row.\n",
"```\n",
"beam_data = pipeline | beam.Create(rows)\n",
"# Creating PrivatePCollection\n",
@@ -949,7 +932,7 @@
"* `partition_extractor` defines how to extract the partition key from an element of the input data. In our example we compute daily statistics and hence a partition key is a day.\n",
"* `value_extractor` defines how to extract the value to be aggregated from an element of the input data. In our example we compute the sum of visitor spending per day, and hence the value to be aggregated is the amount of money spent during a visit.\n",
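The extractors can be sketched against a hypothetical row type (the field names here are assumptions for illustration, not necessarily the notebook's exact schema):

```python
from collections import namedtuple

# Hypothetical row shape for the restaurant-visits data.
Visit = namedtuple("Visit", ["visitor_id", "day", "spent_money"])

def privacy_id_extractor(visit):
    return visit.visitor_id   # whose privacy we protect

def partition_extractor(visit):
    return visit.day          # daily statistics, so the partition key is the day

def value_extractor(visit):
    return visit.spent_money  # the amount to be summed
```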
"\n",
"Next, we call the DP sum.\n",
"Next, we call DP sum.\n",
"```\n",
"sum_params = pipeline_dp.aggregate_params.SumParams(\n",
" noise_kind=pipeline_dp.NoiseKind.GAUSSIAN,\n",
