RegexGenerator fails to generate values if there are too many possibilities #623

Closed
npatki opened this issue Mar 16, 2023 · 0 comments · Fixed by #628
Labels
bug Something isn't working feature:transformer Related to adding a new transformer

Environment Details

  • RDT version: 1.3.0 (latest)
  • Python version: Any
  • Operating System: Any

Error Description

The regex format passed to RegexGenerator can describe a very large number of possible strings. For example, the format "[a-z0-9]{32}" matches 36^32 possible strings. When the number of possibilities is this large, the computation of that count seems to overflow: RegexGenerator concludes that there are 0 possibilities and then raises an error when enforce_uniqueness=True.

Note that this problem used to exist in the SDV (see SDV issue 1127). However, as part of SDV 1.0, we moved the regex logic to RDT, so the issue is now in the RDT library.
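
The suspected overflow can be illustrated without RDT. The sketch below is an illustration only: it assumes the count of possible strings ends up in a fixed-width 64-bit integer somewhere in the computation, which this report does not confirm. The helper `wrap_uint64` is a hypothetical name, not part of RDT. Because 36^32 = 2^64 * 3^64, keeping only the low 64 bits leaves exactly 0, which would match the "0 possibilities" symptom.

```python
# Illustration only: emulate 64-bit wraparound in pure Python.
# Assumption: the count of possible strings is stored in a fixed-width
# 64-bit integer at some point; the issue does not confirm this.

ALPHABET_SIZE = 36  # 26 lowercase letters + 10 digits in [a-z0-9]
LENGTH = 32         # the {32} repetition

# Python integers have arbitrary precision, so this value is exact.
exact_count = ALPHABET_SIZE ** LENGTH


def wrap_uint64(value):
    """Keep only the low 64 bits, as fixed-width arithmetic would (hypothetical helper)."""
    return value % (2 ** 64)


wrapped_count = wrap_uint64(exact_count)

print(exact_count > 2 ** 64)  # True: the exact count does not fit in 64 bits
print(wrapped_count)          # 0: every low bit is zero after wrapping
```

If this is the mechanism, keeping the count as a plain Python integer (or capping it at a large sentinel) would avoid the wraparound, though the actual fix may differ.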

Steps to reproduce

from rdt import HyperTransformer
from rdt.transformers.text import RegexGenerator
import pandas as pd

data = pd.DataFrame(data={
    'id': ['a', 'b', 'c', 'd', 'e'],
    'column': [1, 2, 3, 2, 1]
})

ht = HyperTransformer()
ht.detect_initial_config(data)
ht.update_sdtypes({
    'id': 'text'
})

ht.update_transformers({
    'id': RegexGenerator(regex_format='[a-z0-9]{32}', enforce_uniqueness=True)
})

transformed = ht.fit_transform(data)
ht.reverse_transform(transformed)

Output:

TransformerProcessingError: The regex is not able to generate 5 unique values. Please use a different regex for column ('id').

Stack Trace

---------------------------------------------------------------------------
TransformerProcessingError                Traceback (most recent call last)
<ipython-input-19-b22571af494e> in <module>
----> 1 ht.reverse_transform(transformed)

4 frames
/usr/local/lib/python3.9/dist-packages/rdt/hyper_transformer.py in reverse_transform(self, data)
    794                 reversed data.
    795         """
--> 796         return self._reverse_transform(data, prevent_subset=True)

/usr/local/lib/python3.9/dist-packages/rdt/hyper_transformer.py in _reverse_transform(self, data, prevent_subset)
    758 
    759             for transformer in reversed(self._transformers_sequence):
--> 760                 data = transformer.reverse_transform(data)
    761 
    762         else:

/usr/local/lib/python3.9/dist-packages/rdt/transformers/base.py in wrapper(self, *args, **kwargs)
     50         method_name = function.__name__
     51         with set_random_states(self.random_states, method_name, self.set_random_state):
---> 52             return function(self, *args, **kwargs)
     53 
     54     return wrapper

/usr/local/lib/python3.9/dist-packages/rdt/transformers/base.py in reverse_transform(self, data)
    418         data = data.copy()
    419         columns_data = self._get_columns_data(data, self.output_columns)
--> 420         reversed_data = self._reverse_transform(columns_data)
    421         data = data.drop(self.output_columns, axis=1)
    422         data = self._add_columns_to_data(data, reversed_data, self.columns)

/usr/local/lib/python3.9/dist-packages/rdt/transformers/text.py in _reverse_transform(self, data)
     97         if sample_size > self.generator_size:
     98             if self.enforce_uniqueness:
---> 99                 raise TransformerProcessingError(
    100                     f'The regex is not able to generate {sample_size} unique values. '
    101                     f"Please use a different regex for column ('{self.get_input_column()}')."

TransformerProcessingError: The regex is not able to generate 5 unique values. Please use a different regex for column ('id').
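
The last frame of the traceback shows a size check: the error is raised when the requested sample size exceeds the generator size and uniqueness is enforced. The following is a minimal stand-alone sketch of that condition, not RDT's code. The names `check_capacity` and `UniquenessError` are hypothetical stand-ins. It shows that a generator size of 0 is enough to trigger the message for any non-empty sample, while the true size of 36^32 would pass.

```python
# Hypothetical stand-in for the check visible in the traceback; not RDT's code.


class UniquenessError(ValueError):
    """Stand-in for rdt's TransformerProcessingError (assumption for illustration)."""


def check_capacity(sample_size, generator_size, enforce_uniqueness, column):
    """Raise if more unique values are requested than the generator can produce."""
    if sample_size > generator_size and enforce_uniqueness:
        raise UniquenessError(
            f'The regex is not able to generate {sample_size} unique values. '
            f"Please use a different regex for column ('{column}')."
        )


# With the exact size, 5 unique values are easily available: no error.
check_capacity(5, 36 ** 32, True, 'id')

# With a size that was wrongly computed as 0, the same request fails.
message = None
try:
    check_capacity(5, 0, True, 'id')
except UniquenessError as error:
    message = str(error)

print(message)
```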