Scrubbing #128

Closed
wants to merge 60 commits into from
Conversation

KrishPatel13
Collaborator

Presidio does not scrub credit card numbers or Social Security numbers (SSNs), which is unexpected, since it should handle both. See the test file and the attached test session text file for reference.
Apart from that, scrubbing works for the following fields:

  • Email
  • Phone Number
  • Date of Birth
  • Address
  • Driving Licence
  • Passport
  • National ID
  • Routing Number
  • Bank Account Number
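
One possible explanation for the credit card failure, offered as an assumption rather than a confirmed diagnosis: recognizers for card numbers commonly reject any candidate that fails the Luhn checksum, and the placeholder used in the tests, 1234-5678-9012-3456, does not pass it. The sketch below is a standalone Luhn check (the helper name `luhn_valid` is hypothetical and not part of Presidio or this PR) that can be used to pick test numbers a checksum-validating recognizer would accept.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("1234-5678-9012-3456"))  # False
print(luhn_valid("4111-1111-1111-1111"))  # True
```

If the checksum is indeed the cause, switching the tests to a Luhn-valid placeholder such as 4111-1111-1111-1111 would be one way to confirm it.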

@KrishPatel13 KrishPatel13 requested review from abrichr and apgorton May 9, 2023 12:40
@KrishPatel13 KrishPatel13 self-assigned this May 9, 2023
@KrishPatel13 KrishPatel13 linked an issue May 9, 2023 that may be closed by this pull request
@KrishPatel13
Collaborator Author

My Test Session Transcript:

pytest .\test_scrub.py 
====================================================== test session starts =======================================================
platform win32 -- Python 3.9.7, pytest-7.1.3, pluggy-1.0.0
rootdir: C:\Users\Krish Patel\OneDrive - University of Toronto\Desktop\MLDSAI Inc\PAT
plugins: anyio-3.6.2, hypothesis-6.62.0, typeguard-2.13.3
collected 14 items
​
test_scrub.py ....F..F.....F                                                                                                [100%]
​
============================================================ FAILURES ============================================================ 
_____________________________________________________ test_scrub_credit_card _____________________________________________________ 
​
    def test_scrub_credit_card():
        # Test scrubbing of credit card number
>       assert scrub("My credit card number is 1234-5678-9012-3456 and ") == "My credit card number is ****-****-****-**** and "   
E       AssertionError: assert 'My credit ca...012-3456 and ' == 'My credit ca...***-**** and '
E         - My credit card number is ****-****-****-**** and 
E         + My credit card number is 1234-5678-9012-3456 and
​
test_scrub.py:24: AssertionError
------------------------------------------------------- Captured log call -------------------------------------------------------- 
WARNING  presidio-analyzer:nlp_engine_provider.py:109 configuration file C:\Users\Krish Patel\AppData\Roaming\Python\Python39\site-packages\conf\default.yaml not found.  Using default config: {'nlp_engine_name': 'spacy', 'models': [{'lang_code': 'en', 'model_name': 'en_core_web_lg'}]}.
_________________________________________________________ test_scrub_ssn _________________________________________________________ 
​
    def test_scrub_ssn():
        # Test scrubbing of social security number
>       assert scrub("My social security number is 123-45-6789") == "My social security number is ***********"
E       AssertionError: assert 'My social se...s 123-45-6789' == 'My social se...s ***********'
E         - My social security number is ***********
E         + My social security number is 123-45-6789
​
test_scrub.py:36: AssertionError
------------------------------------------------------- Captured log call -------------------------------------------------------- 
WARNING  presidio-analyzer:nlp_engine_provider.py:109 configuration file C:\Users\Krish Patel\AppData\Roaming\Python\Python39\site-packages\conf\default.yaml not found.  Using default config: {'nlp_engine_name': 'spacy', 'models': [{'lang_code': 'en', 'model_name': 'en_core_web_lg'}]}.
____________________________________________________ test_scrub_all_together _____________________________________________________ 
​
    def test_scrub_all_together():
        # Text with all PII/PHI types
        text_with_pii_phi = "John Smith's email is johnsmith@example.com and his phone number is 555-123-4567. His credit card number is 1234-5678-9012-3456 and his social security number is 123-45-6789. He was born on 01/01/1980."
>       assert scrub(text_with_pii_phi) == "****'s email is ***@***.*** and his phone number is ***-***-****. His credit card number is ****-****-****-**** and his social security number is ***-**-****. He was born on **/**/****."
E       assert '************...n **********.' == "****'s email...n **/**/****."
E         - ****'s email is ***@***.*** and his phone number is ***-***-****. His credit card number is ****-****-****-**** and his social security number is ***-**-****. He was born on **/**/****.
E         + ************ email is ********************* and his phone number is ************. His credit card number is 1234-5678-9012-3456 and his social security number is 123-45-6789. He was born on **********.
​
test_scrub.py:61: AssertionError
------------------------------------------------------- Captured log call -------------------------------------------------------- 
WARNING  presidio-analyzer:nlp_engine_provider.py:109 configuration file C:\Users\Krish Patel\AppData\Roaming\Python\Python39\site-packages\conf\default.yaml not found.  Using default config: {'nlp_engine_name': 'spacy', 'models': [{'lang_code': 'en', 'model_name': 'en_core_web_lg'}]}.
==================================================== short test summary info ===================================================== 
FAILED test_scrub.py::test_scrub_credit_card - AssertionError: assert 'My credit ca...012-3456 and ' == 'My credit ca...***-****...
FAILED test_scrub.py::test_scrub_ssn - AssertionError: assert 'My social se...s 123-45-6789' == 'My social se...s ***********'     
FAILED test_scrub.py::test_scrub_all_together - assert '************...n **********.' == "****'s email...n **/**/****."
================================================= 3 failed, 11 passed in 19.99s ==================================================

@abrichr
Member

abrichr commented May 9, 2023

Excellent work @KrishPatel13 !

It looks like the scrubbing is working, and the test needs to be modified:

assert '************...n **********.' == "****'s email...n **/**/****."
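
The transcript suggests that each detected span is replaced by a run of `*` of the same length, punctuation included, so the expected strings in the test could be built the same way instead of being typed by hand. A minimal sketch, assuming that masking behaviour; `mask_spans` is a hypothetical test helper, not a function from this PR:

```python
def mask_spans(text: str, secrets: list[str], char: str = "*") -> str:
    """Replace each literal secret in `text` with a same-length run of `char`."""
    for secret in secrets:
        text = text.replace(secret, char * len(secret))
    return text


original = "His phone number is 555-123-4567. He was born on 01/01/1980."
expected = mask_spans(original, ["555-123-4567", "01/01/1980"])
print(expected)
# His phone number is ************. He was born on **********.
```

A test could then assert `scrub(original) == expected`, which keeps the expectation in step with the input text.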

@KrishPatel13
Collaborator Author

My next step will be implementing scrubbing for images.

@KrishPatel13 KrishPatel13 left a comment

Reviewed and Changed!

@KrishPatel13 KrishPatel13 requested a review from abrichr May 26, 2023 02:48
@KrishPatel13
Collaborator Author

@abrichr Re-review is requested!

Thank you for your patience!

@KrishPatel13 KrishPatel13 left a comment

Reviewed

@abrichr
Member

abrichr commented May 26, 2023

Thank you @KrishPatel13 ! Can you please merge the latest from main?

@@ -214,4 +215,4 @@ Please submit any issues to https://github.com/MLDSAI/openadapt/issues with the
 following information:

 - Problem description (please include any relevant console output and/or screenshots)
-- Steps to reproduce (please help others to help you!)
+- Steps to reproduce (please help others to help you!)
Member

Please remove unrelated changes 🙏

@KrishPatel13
Collaborator Author

KrishPatel13 commented May 31, 2023

microsoft/presidio#1079

I was not able to record, so I investigated further and found that adding even the simplest Presidio code, which only initializes the engines, produced the errors. I therefore raised the issue on their GitHub page.

Resolved:
Issue description:
If I pressed Ctrl + C right after the 3 "starting" messages, it gave errors. Even when I did some typing first and then pressed Ctrl + C, I got the error.

BUT,

when I waited idle for at least 10-15 seconds after the 3 "starting" messages and then pressed Ctrl + C, it recorded successfully.

@KrishPatel13
Collaborator Author

@abrichr We can close this WITHOUT merging once the below is merged! :)

#211
Ready for review! :)

@abrichr
Member

abrichr commented Jun 1, 2023

Looks great @KrishPatel13 ! Can you please look into modifying configure_logging to scrub data before it is logged? If it's a lot of work we can leave it as a TODO for now.
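
A rough sketch of what scrubbing before logging could look like, under two loud assumptions: it uses the standard library `logging` module (the project's actual `configure_logging` may be built on a different logging library), and `scrub` here is a hypothetical stand-in that masks only e-mail addresses, not the Presidio-based function from this PR.

```python
import io
import logging
import re

# Assumption: a deliberately simple e-mail pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def scrub(text: str) -> str:
    """Hypothetical stand-in for the PR's scrub(): masks e-mail addresses."""
    return EMAIL_RE.sub(lambda m: "*" * len(m.group()), text)


class ScrubFilter(logging.Filter):
    """Scrub the fully formatted message before any handler emits it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = scrub(record.getMessage())
        record.args = None  # message is already formatted
        return True


def configure_logging(stream) -> logging.Logger:
    logger = logging.getLogger("scrub_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.addFilter(ScrubFilter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = configure_logging(buffer)
log.info("contact %s for access", "johnsmith@example.com")
print(buffer.getvalue(), end="")
```

Attaching the filter to the handler means the message is scrubbed after argument interpolation, so values passed as `%s` arguments are covered as well as literal text.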

@abrichr
Member

abrichr commented Jun 1, 2023

Closing in favor of #211

Successfully merging this pull request may close these issues.

Implement Scrubbing
2 participants