
Ability to reuse the same dp.Profiler object for different data #461

Open
sergeypine opened this issue Apr 28, 2022 · 3 comments
Assignees
Labels
New Feature A feature addition not currently in the library

Comments


sergeypine commented Apr 28, 2022

Is your feature request related to a problem? Please describe.

We are evaluating DataProfiler as a possible way to label data in our pipeline.

We need to be able to profile many small samples of data at a high frequency. As things stand now, it appears that we need to create a new dp.Profiler object for each such sample. That creation takes several seconds (apparently due to TensorFlow loading) and is therefore not scalable.

At the same time, the update_profile method only adds to the data previously submitted to the Profiler. So if we reuse the same Profiler object via update_profile, the data inside it keeps growing.

What we would need is a replace_data functionality: basically, make the Profiler forget the data it was given previously and instead receive new data.
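A rough sketch of what the proposed API could look like (pseudocode; replace_data does not exist in DataProfiler today, and the method name is illustrative only):

```
# Hypothetical API -- replace_data() is the proposed feature, not an existing method
profiler = dp.Profiler(first_sample)   # pay the TensorFlow load once
report_1 = profiler.report()

profiler.replace_data(second_sample)   # forget old data, keep the loaded model
report_2 = profiler.report()           # stats reflect only second_sample
```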

Describe the outcome you'd like:
Ability to reuse the same dp.Profiler object on different data samples and avoid the costly initialization.

Additional context:

@sergeypine sergeypine added the New Feature A feature addition not currently in the library label Apr 28, 2022
@sergeypine sergeypine changed the title Ability to reuse. the same dp.Profiler object for different data Ability to reuse the same dp.Profiler object for different data Apr 29, 2022
Contributor

JGSweets commented May 2, 2022

@sergeypine I want to fully understand the problem.
Are you looking to only label data or are you also interested in the profiling as well?

If you are looking for just labeling, you can use the labelers directly:

data = dp.Data(...)
labeler = dp.DataLabeler(labeler_type=<...>)  # structured or unstructured for type
results = labeler.predict(data)

For structured data, this will give you a prediction per cell.

I think adding a profile reset function as a feature is a good value add.

If you want profiling + labeling in addition to the other components (albeit a little convoluted):

# you can create a labeler ahead of time and set it via options
labeler = dp.DataLabeler(labeler_type=<...>)  # structured or unstructured for type

# create the options (you may also want to turn off multiprocessing)
profiler_options = dp.ProfilerOptions()
profiler_options.set({'*.data_labeler.data_labeler_object': labeler})

profiler = dp.Profiler(..., options=profiler_options)
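Putting this together, a per-sample loop might look like the following sketch (based on the snippet above; the labeler, and hence TensorFlow, is loaded only once, so constructing a fresh Profiler per sample avoids the multi-second re-initialization):

```
labeler = dp.DataLabeler(labeler_type='unstructured')  # expensive: loads TF once

profiler_options = dp.ProfilerOptions()
profiler_options.set({'*.data_labeler.data_labeler_object': labeler})

for sample in samples:                      # many small samples, high frequency
    profiler = dp.Profiler(sample, options=profiler_options)
    report = profiler.report()              # per-sample stats, no accumulated state
```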

Happy to iterate on your thoughts.

Author

sergeypine commented May 3, 2022

Thank you for the timely reply, @JGSweets .

Our requirement is, given a sample of data, to count instances of sensitive data in it, by type (e.g. PERSON: 15, ADDRESS: 7). I believe that is what this library calls profiling, correct?

I tried the first snippet of the code you shared (with labeler_type="unstructured") and labeler.predict() returned:

{'pred': [array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,  1.,
        1.,  1., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
       15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15.,  1.,  1.,  1.,  1.,
       15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.])]}

I am not sure how to read that and whether it is possible to get kinds of sensitive data counts out of the above.

The second snippet of code did produce the desired results. It seems like the "trick" here is to create the Labeler object once and reuse it, as that is where TF gets initialized. Am I right?

(To summarize, our need is to be able to get counts of sensitive data by type from different samples without having to undergo time-intensive re-initialization. You have demonstrated how to meet it, though it would be great if there were a more straightforward API to profile different samples of data without re-initialization.)

Contributor

JGSweets commented May 3, 2022

@sergeypine
First, I'm going to assume that you are working with unstructured text rather than tabular data, given the output above.

Currently, the output shows one value per character. You can alter the output of the postprocessing by setting its params as well:

labeler.set_params(
    { 'postprocessor': { 'output_format': 'ner', 'use_word_level_argmax': True } }
)
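With a span-oriented output format, per-type counts follow from a simple tally over the returned spans. The exact shape of the 'ner' output should be checked against the DataProfiler docs; this sketch assumes a list of (start, end, label) tuples, with invented example values standing in for real predictions:

```python
from collections import Counter

# Invented example spans; real ones would come from labeler.predict(...)
# after setting output_format. Shape assumed to be (start, end, label).
spans = [(30, 45, "DATETIME"), (55, 68, "DATETIME"), (104, 116, "PERSON")]

# Tally the label of each span to get counts of sensitive data by type.
counts = Counter(label for _start, _end, label in spans)
print(dict(counts))  # {'DATETIME': 2, 'PERSON': 1}
```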

This will do two things. First, it applies a word-level aggregation to the character-level output. This doesn't guarantee a single vote per word, though, since some labels span multiple words; instead it uses a threshold to guess:
https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/labelers/data_processing.py#L787

Second, it converts the output into a more readable format, which may be what you want: contiguous spans with their associated label names rather than integers.
Ultimately, in your array above, those integers represent the classification for each character, given the label_mapping:

print(labeler.label_mapping)
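For character-level output like the array above, counts by type can be approximated by counting contiguous runs of the same entity label. A minimal sketch, assuming a hypothetical mapping (the real one comes from labeler.label_mapping):

```python
def count_entities(pred, id_to_label, background="UNKNOWN"):
    """Count contiguous runs of each non-background label in a prediction array."""
    counts = {}
    prev = None
    for p in pred:
        name = id_to_label.get(int(p), background)
        # A new run starts when the label changes to a non-background label.
        if name != background and name != prev:
            counts[name] = counts.get(name, 0) + 1
        prev = name
    return counts

# Hypothetical mapping for illustration; inspect labeler.label_mapping for yours.
id_to_label = {1: "UNKNOWN", 15: "DATETIME"}
pred = [1, 1, 15, 15, 15, 1, 1, 15, 15, 1]
print(count_entities(pred, id_to_label))  # {'DATETIME': 2}
```

Note this counts spans, not characters, so two adjacent entities of the same type separated by background characters are counted separately.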

> The second snippet of code did produce the desired results. It seems like the "trick" here is to create the Labeler object once and reuse it, as that is where TF gets initialized. Am I right?

Yes, essentially: if the profiler doesn't already have a labeler at init, it will create one for you. In this case, you are just reusing your own labeler by setting your preference in the options.

> though it would be great if there was a more straightforward API to profile different samples of data without re-initialization

Agreed, having to reinitialize for every profile does seem problematic. It's possible we could cache a labeler after the first profile and reuse it for subsequent inits, if desired.
