-
Notifications
You must be signed in to change notification settings - Fork 6.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add Deid samples and resource #1400
Conversation
Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please visit https://cla.developers.google.com/ to sign. Once you've signed (or fixed any issues), please reply here (e.g. What to do if you already signed the CLAIndividual signers
Corporate signers
|
dlp/deid.py
Outdated
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
"""Sample app that uses the Data Loss Prevention API for deidentifying |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you think of a wording that shortens this to one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
from __future__ import print_function | ||
|
||
import argparse | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can remove this blank line, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid.py
Outdated
masking_character: The character to mask matching sensitive data with. | ||
number_to_mask: The maximum number of sensitive characters to mask in | ||
a match. If omitted the request or set to 0, the API will mask any | ||
mathcing characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"If omitted or set to zero, the API will default to no maximum."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid.py
Outdated
import google.cloud.dlp | ||
|
||
# Instantiate a client | ||
google.cloud.dlp.DlpServiceClient.SERVICE_ADDRESS = 'autopush-dlp.sandbox.googleapis.com' # DO NOT SUBMIT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Go ahead and remove this for the commit. (Actually, I think we don't even need it anymore due to changes on the API side since we started).
Also apply this to the other functions, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid.py
Outdated
google.cloud.dlp.DlpServiceClient.SERVICE_ADDRESS = 'autopush-dlp.sandbox.googleapis.com' # DO NOT SUBMIT | ||
dlp = google.cloud.dlp.DlpServiceClient() | ||
|
||
# Add parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with the inspect_content.py samples, use this comment:
# Convert the project id into a full resource id.
Also apply this to the other functions, please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid_test.py
Outdated
import deid | ||
|
||
harmful_string = 'My SSN is 372819127' | ||
parent = os.environ['GCLOUD_PROJECT'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call this 'project' instead of 'parent' to avoid confusing it with the fully qualified resource name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid_test.py
Outdated
key_name = os.environ['DLP_DEID_KEY_NAME'] | ||
surrogate_type = 'SSN_TOKEN' | ||
csv_file = 'resources/dates.csv' | ||
output_csv_file = 'resources/temp.results.csv' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check out the inspect_config_test.py file to see how to create a temporary directory, instead of hardcoding a temp file here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid_test.py
Outdated
assert 'My SSN is *******27' in out | ||
|
||
|
||
def test_deidentify_with_mask_handles_masking_number_error(capsys): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can go ahead and remove this test; it's not a problem for the sample if the API rejects this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid_test.py
Outdated
out, _ = capsys.readouterr() | ||
|
||
assert 'Successful' in out | ||
# read in csv?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to read in the CSV at this time, but go ahead and remove this comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid_test.py
Outdated
date_fields=date_fields, | ||
context_field_id=csv_context_field, | ||
key_name=key_name) | ||
# [END Deidentify with date shift] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove all the START and END tags in the test file; they're only needed in the real sample file. They're used to mark regions that are auto-included in our documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Great work on this! |
wrapped_key = base64.b64decode(wrapped_key) | ||
|
||
# Construct Deidentify Config | ||
reidentify_config = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a bit wild. @andrewsg wdyt about using the real proto types here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, they're all really heavily nested. What's the advantage of using the real proto types vs. just breaking up the dictionaries into separate steps of construction and adding them together iteratively?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's less about breaking it up into steps and more about knowing the actual structure of the item in question. When using the message types directly you get a better sense for how things are constructed and where you can look to find out additional parameters. With just a big o' nested dictionary it's sort of a wild west - the user is left asking "well, how did they know how to construct this? How do I know how to construct this?".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with this diagnosis of the problem for sure, but I'm not sure about the solution. If we use the real protos here people will just run into the proto help() crash bug. Node doesn't use proto objects; does it have a better user experience for discovering this stuff?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Node is also a vastly different community with different expectations. Remember, Python is strongly moving towards strongly typed. Seeing giant dictionary with no details on the structure or type constraints isn't wonderful.
Proto help() crash bug has been fixed and should hopefully be released soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, okay. I'll buy that. I think that's a bigger change to our focus than we can reasonably do in this PR, though; can we accept this part as-is and do another pass this or next week to see what it looks like with protos instead?
dlp/deid.py
Outdated
item = {'value': string} | ||
|
||
# Call the API | ||
response = dlp.reidentify_content(parent, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please avoid hanging indents throughout, prefer to line break at the opening (
and then indent the continuation by one level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid.py
Outdated
# Read and parse the CSV file | ||
import csv | ||
from datetime import datetime | ||
file = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use file
, it's a built-in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
dlp/deid.py
Outdated
'-m', '--masking_character', | ||
help='The character to mask matching sensitive data with.') | ||
mask_parser.add_argument( | ||
'-p', '--project', default=os.environ['GCLOUD_PROJECT']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please just don't do project ID detection at all. Have them pass it in.
I signed it! |
CLAs look good, thanks! |
Sure.
…On Wed, Mar 14, 2018 at 9:50 AM Andrew Gorcester ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In dlp/deid.py
<#1400 (comment)>
:
> + # Import the client library
+ import google.cloud.dlp
+
+ # Instantiate a client
+ google.cloud.dlp.DlpServiceClient.SERVICE_ADDRESS = 'autopush-dlp.sandbox.googleapis.com'
+ dlp = google.cloud.dlp.DlpServiceClient()
+
+ # Add parent
+ parent = dlp.project_path(parent)
+
+ # Wrapped key can not be base64 encoded
+ import base64
+ wrapped_key = base64.b64decode(wrapped_key)
+
+ # Construct Deidentify Config
+ reidentify_config = {
Hmm, okay. I'll buy that. I think that's a bigger change to our focus than
we can reasonably do in this PR, though; can we accept this part as-is and
do another pass this or next week to see what it looks like with protos
instead?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1400 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAPUc7GVrk5D6Yrn-IKE9Ez7QFFWSbZgks5teUpggaJpZM4Spnkx>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - No changes since this was merged (and Jon and Andrew have reviewed). (And it really looks good)
Your comments improve things in almost all cases, the one place i'd quibble is "Print out the results" which appears to be redundant to the code. (ie if the line below and the comment are the same, then don't bother)
In general, unless the language style for comments differ, it's good to prefer terseness w/o sacrificing clarity. If you haven't already read the Google style guide comments it's worth looking at. (Note - it looks like you have)
I was going to suggest some additional vertical white space, but that appears to be counter the community's view. The one place that appears to be recommended is before the "Returns:"
No description provided.