Ability to reuse the same `dp.Profiler` object for different data #461
@sergeypine I want to fully understand the problem. If you are looking for just labeling, you can use the labelers directly:

```python
data = dp.Data(...)
labeler = dp.DataLabeler(labeler_type=<...>)  # 'structured' or 'unstructured'
results = labeler.predict(data)
```

For structured data this will give you a prediction per cell. I think adding a profile reset function as a feature is a good value add.

If you want profiling + labeling in addition to the other components (albeit a little convoluted):

```python
# You can create a labeler ahead of time and set it via options.
labeler = dp.DataLabeler(labeler_type=<...>)  # 'structured' or 'unstructured'

# Create the options (you may also want to turn off multiprocessing).
profiler_options = dp.ProfilerOptions()
profiler_options.set({'*.data_labeler.data_labeler_object': labeler})

profiler = dp.Profiler(..., options=profiler_options)
```

Happy to iterate on your thoughts.
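The create-once / inject pattern above can be sketched generically. In this sketch the DataProfiler calls appear only in comments and the cached object is a stand-in, so the snippet runs without TensorFlow installed:

```python
import functools

init_count = 0  # tracks how many times the expensive init actually runs

@functools.lru_cache(maxsize=1)
def get_labeler():
    """Stand-in for dp.DataLabeler(...); the slow TF model load happens once."""
    global init_count
    init_count += 1
    return object()  # placeholder for the real labeler object

def profile_sample(sample):
    labeler = get_labeler()  # reused on every call after the first
    # In DataProfiler terms this would be roughly:
    #   options = dp.ProfilerOptions()
    #   options.set({'*.data_labeler.data_labeler_object': labeler})
    #   return dp.Profiler(sample, options=options)
    return labeler

first = profile_sample("sample-1")
second = profile_sample("sample-2")
assert first is second and init_count == 1  # init cost paid exactly once
```

The point of the pattern is that the per-sample work (building a profiler) is decoupled from the one-time work (loading the model), which is what the options-based injection achieves.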
Thank you for the timely reply, @JGSweets. Our requirement is, given a sample of data, to count instances of sensitive data in it, by type. I tried the first snippet of the code you shared, but I am not sure how to read its output, or whether it is possible to get counts of each kind of sensitive data out of it.

The second snippet of code did produce the desired results. It seems like the "trick" here is to create the labeler object once and reuse it, as that is where TF gets initialized. Am I right?

(_To summarize, our need is to be able to get counts of sensitive data by type from different samples without having to undergo time-intensive re-initialization. You have demonstrated how to meet it, though it would be great if there were a more straightforward API to profile different samples of data without re-initialization._)
@sergeypine Currently, the output is showing a value per character. You can alter the output of the postprocessor by setting its params:

```python
labeler.set_params(
    {'postprocessor': {'output_format': 'ner', 'use_word_level_argmax': True}}
)
```

This will do two things. First, `use_word_level_argmax` applies a word-level aggregation to the character-level output; this doesn't guarantee a single vote per word, though, since some labels span multiple words, so it instead uses a threshold to guess. Second, `output_format='ner'` converts the result to a more readable format, and maybe what you desire: contiguous labeled spans with the label name rather than the integer. To see the mapping from integers to label names:

```python
print(labeler.label_mapping)
```
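With the `ner` output format, counting sensitive data by type then reduces to a simple tally. The span shape below, `(start, end, label)` tuples per sample, is an assumption about the postprocessor's output and should be checked against the library's docs; the counting itself is plain Python:

```python
from collections import Counter

def count_by_label(ner_predictions):
    """Tally entity counts per label from NER-style spans.

    Assumes each element of ner_predictions is a list of
    (start, end, label) tuples for one input sample -- verify the
    exact output shape against the DataLabeler documentation.
    """
    counts = Counter()
    for spans in ner_predictions:
        for _start, _end, label in spans:
            counts[label] += 1
    return counts

# Hypothetical output for two samples:
preds = [
    [(0, 11, "EMAIL_ADDRESS"), (15, 27, "PHONE_NUMBER")],
    [(3, 14, "EMAIL_ADDRESS")],
]
totals = count_by_label(preds)
# totals == {'EMAIL_ADDRESS': 2, 'PHONE_NUMBER': 1}
```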
Yes, essentially if the profiler doesn't already have a labeler at init, it will create one for you. In this case, you are just reusing your own labeler by setting your preference in the options.

Agreed, having to reinitialize for every profile does seem problematic. It's possible we could cache a labeler at init after the first profile and reuse it if that is desired.
**Is your feature request related to a problem? Please describe.**
We are evaluating DataProfiler as a possible way to label data in our pipeline. We need to be able to profile many small samples of data at a high frequency. As things stand now, it appears that we need to create a new `dp.Profiler` object for each such sample. That creation takes several seconds (apparently due to TensorFlow loading) and is therefore not scalable.

At the same time, the `update_profile` method only adds data to the data previously submitted to the Profiler, so if we use the same Profiler object with the `update_profile` method, the data inside it keeps growing.

What we would need is a `replace_data` functionality: basically, make the Profiler forget the data it was given previously and instead receive new data.

**Describe the outcome you'd like:**
Ability to reuse the same `dp.Profiler` object on different data samples and avoid the time-costly initialization.

**Additional context:**
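A minimal sketch of the requested behaviour as a hypothetical wrapper (`ReusableProfiler` and `replace_data` are not existing DataProfiler APIs; the profiling body here is a placeholder so the sketch is self-contained):

```python
class ReusableProfiler:
    """Hypothetical wrapper: pay the labeler init once, profile many samples.

    Each call to profile() operates only on the new sample, so previously
    submitted data never accumulates (unlike update_profile).
    """

    def __init__(self, make_labeler):
        self._labeler = make_labeler()  # expensive init happens exactly once

    def profile(self, sample):
        # With DataProfiler this would roughly be:
        #   options = dp.ProfilerOptions()
        #   options.set({'*.data_labeler.data_labeler_object': self._labeler})
        #   return dp.Profiler(sample, options=options)
        return {"rows_profiled": len(sample)}  # placeholder result

inits = []
profiler = ReusableProfiler(lambda: inits.append(1) or "labeler")
r1 = profiler.profile(["a", "b"])
r2 = profiler.profile(["c"])
assert r1["rows_profiled"] == 2 and r2["rows_profiled"] == 1
assert len(inits) == 1  # only one expensive initialization
```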