
Ability to reuse the same dp.Profiler object for different data #461

Open
sergeypine opened this issue Apr 28, 2022 · 3 comments
Assignees
Labels
New Feature A feature addition not currently in the library

Comments


sergeypine commented Apr 28, 2022

Is your feature request related to a problem? Please describe.

We are evaluating DataProfiler as a possible way to label data in our pipeline.

We need to be able to profile many small samples of data at a high frequency. As things stand now, it appears that we need to create a new dp.Profiler object for each such sample. That creation takes several seconds (apparently due to TensorFlow loading) and is therefore not scalable.

At the same time, the update_profile method only adds to the data previously submitted to the Profiler. So if we reuse the same Profiler object via update_profile, the data inside it keeps growing.

What we would need is a replace_data functionality: basically, make the Profiler forget the data it was given previously and instead receive new data.
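A rough sketch of what the proposed API could look like (pseudocode; replace_data does not exist in DataProfiler today, and the method name is illustrative only):

```
# Hypothetical API -- replace_data() is the proposed feature, not an existing method
profiler = dp.Profiler(first_sample)   # pay the TensorFlow load once
report_1 = profiler.report()

profiler.replace_data(second_sample)   # forget old data, keep the loaded model
report_2 = profiler.report()           # stats reflect only second_sample
```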

Describe the outcome you'd like:
Ability to reuse the same dp.Profiler object on different data samples and avoid the costly initialization.

Additional context:

@sergeypine sergeypine added the New Feature A feature addition not currently in the library label Apr 28, 2022
@sergeypine sergeypine changed the title Ability to reuse. the same dp.Profiler object for different data Ability to reuse the same dp.Profiler object for different data Apr 29, 2022
Contributor

JGSweets commented May 2, 2022

@sergeypine I want to fully understand the problem.
Are you looking to only label data or are you also interested in the profiling as well?

If you are looking for just labeling, you can use the labelers directly:

data = dp.Data(...)
labeler = dp.DataLabeler(labeler_type=<...>)  # structured or unstructured for type
results = labeler.predict(data)

For structured data, this will give you a prediction per cell.

I think adding a profile reset function as a feature is a good value add.

If you want profiling + labeling in addition to the other components (albeit a little convoluted):

# you can create a labeler ahead of time and set it via options
labeler = dp.DataLabeler(labeler_type=<...>)  # structured or unstructured for type

# create the options (you may also want to turn off multiprocessing)
profiler_options = dp.ProfilerOptions()
profiler_options.set({'*.data_labeler.data_labeler_object': labeler})

profiler = dp.Profiler(..., options=profiler_options)
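Putting this together, a per-sample loop might look like the following sketch (based on the snippet above; the labeler, and hence TensorFlow, is loaded only once, so constructing a fresh Profiler per sample avoids the multi-second re-initialization):

```
labeler = dp.DataLabeler(labeler_type='unstructured')  # expensive: loads TF once

profiler_options = dp.ProfilerOptions()
profiler_options.set({'*.data_labeler.data_labeler_object': labeler})

for sample in samples:                      # many small samples, high frequency
    profiler = dp.Profiler(sample, options=profiler_options)
    report = profiler.report()              # per-sample stats, no accumulated state
```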

Happy to iterate on your thoughts.

Author

sergeypine commented May 3, 2022

Thank you for the timely reply, @JGSweets .

Our requirement is, given a sample of data, to count instances of sensitive data in it, by type (e.g. PERSON: 15, ADDRESS: 7). I believe that is what this library calls profiling, correct?

I tried the first snippet of the code you shared (with labeler_type="unstructured") and labeler.predict() returned:

{'pred': [array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,  1.,
        1.,  1., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
       15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1., 15., 15., 15., 15., 15., 15., 15.,
       15., 15., 15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15., 15., 15., 15.,
       15., 15., 15., 15., 15., 15., 15., 15., 15., 15.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1., 15., 15.,  1.,  1.,  1.,  1.,
       15.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
        1.])]}

I am not sure how to read that and whether it is possible to get kinds of sensitive data counts out of the above.

The second snippet of code did produce the desired results. It seems like the "trick" here is to create the Labeler object once and reuse it, as that is where TF gets initialized. Am I right?

(To summarize, our need is to be able to get counts of sensitive data by type from different samples without having to undergo time-intensive re-initialization. You have demonstrated how to meet it, though it would be great if there were a more straightforward API to profile different samples of data without re-initialization.)

Contributor

JGSweets commented May 3, 2022

@sergeypine
First, I'm going to assume that you are working with unstructured text rather than tabular data, given the output above.

Currently, the output shows one value per character. You can alter the output of the postprocessing by setting its params as well:

labeler.set_params(
    { 'postprocessor': { 'output_format': 'ner', 'use_word_level_argmax': True } }
)
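With a span-oriented output format, per-type counts follow from a simple tally over the returned spans. The exact shape of the 'ner' output should be checked against the DataProfiler docs; this sketch assumes a list of (start, end, label) tuples, with invented example values standing in for real predictions:

```python
from collections import Counter

# Invented example spans; real ones would come from labeler.predict(...)
# after setting output_format. Shape assumed to be (start, end, label).
spans = [(30, 45, "DATETIME"), (55, 68, "DATETIME"), (104, 116, "PERSON")]

# Tally the label of each span to get counts of sensitive data by type.
counts = Counter(label for _start, _end, label in spans)
print(dict(counts))  # {'DATETIME': 2, 'PERSON': 1}
```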

This will do two things. First, it applies a word-level aggregation to the character-level output. This doesn't guarantee a single vote per word, though, since some labels span multiple words; instead it uses a threshold to guess:
https://github.com/capitalone/DataProfiler/blob/main/dataprofiler/labelers/data_processing.py#L787

Second, it converts the output into a more readable format, which may be what you want: contiguous spans with their associated label names rather than integers.
Ultimately, in your array above, those integers represent the classification for each character, given the label_mapping:

print(labeler.label_mapping)
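For character-level output like the array above, counts by type can be approximated by counting contiguous runs of the same entity label. A minimal sketch, assuming a hypothetical mapping (the real one comes from labeler.label_mapping):

```python
def count_entities(pred, id_to_label, background="UNKNOWN"):
    """Count contiguous runs of each non-background label in a prediction array."""
    counts = {}
    prev = None
    for p in pred:
        name = id_to_label.get(int(p), background)
        # A new run starts when the label changes to a non-background label.
        if name != background and name != prev:
            counts[name] = counts.get(name, 0) + 1
        prev = name
    return counts

# Hypothetical mapping for illustration; inspect labeler.label_mapping for yours.
id_to_label = {1: "UNKNOWN", 15: "DATETIME"}
pred = [1, 1, 15, 15, 15, 1, 1, 15, 15, 1]
print(count_entities(pred, id_to_label))  # {'DATETIME': 2}
```

Note this counts spans, not characters, so two adjacent entities of the same type separated by background characters are counted separately.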

> The second snippet of code did produce the desired results. It seems like the "trick" here is to create the Labeler object once and reuse it, as that is where TF gets initialized. Am I right?

Yes, essentially: if the profiler doesn't already have a labeler at init, it will create one for you. In this case, you are just reusing your own labeler by setting your preference in the options.

> though it would be great if there was a more straightforward API to profile different samples of data without re-initialization

Agreed, having to reinitialize for every profile does seem problematic. It's possible we could cache a labeler after the first profile and reuse it for subsequent inits, if desired.
