Cannot generate regex if there are too many possibilities #1127

Closed
npatki opened this issue Dec 2, 2022 · 2 comments
Labels
bug Something isn't working resolution:resolved The issue was fixed, the question was answered, etc.

Comments

@npatki
Contributor

npatki commented Dec 2, 2022

Environment Details

  • SDV version: 0.17.1 (+ many older ones)
  • Python version: 3.7
  • Operating System: Linux

Error Description

Sometimes, the regex string for my primary key column allows a very large number of possibilities. For example, the regex "[a-z0-9]{32}" matches 36^32 possible strings.

If there are too many possibilities, it seems that there is an overflow in the computation. Our regex generator thinks that there are 0 possibilities and then produces an error.
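To illustrate why an overflow could produce exactly 0 here (a sketch only; the thread does not confirm how the count is actually computed inside SDV): 36^32 = 2^64 * 3^64, so if the count were reduced modulo 2^64, as happens with wrapping 64-bit integers, the result would be 0. The helper `wrapped_power` below is hypothetical and only simulates that behavior in pure Python.

```python
# Hypothetical illustration, not SDV code: simulate wrapping 64-bit
# integer arithmetic to show how 36**32 can collapse to 0.
MASK_64 = (1 << 64) - 1  # keeps only the low 64 bits of a product


def wrapped_power(base: int, exponent: int) -> int:
    """Compute base**exponent, discarding everything above 64 bits at each step."""
    result = 1
    for _ in range(exponent):
        result = (result * base) & MASK_64
    return result


exact = 36 ** 32                 # Python ints are arbitrary precision
wrapped = wrapped_power(36, 32)  # what a wrapping 64-bit integer would hold

print(exact == 2 ** 64 * 3 ** 64)  # True: the exact count is a multiple of 2**64
print(wrapped)                     # 0
```

Under this assumption, the count silently becomes 0 rather than raising an error, which would match the "maximum number of unique values is 0" message reported below.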

Steps to reproduce

If you try this in the SDV library, this shows up as a ValueError:

import pandas as pd
from sdv.tabular import GaussianCopula

# create some dummy data
data = pd.DataFrame(data={
    'id': ['a', 'b', 'c', 'd', 'e'],
    'numerical': [0.4, 0.3, 0.34, 0.2, 0.11],
    'categorical': ['YES', 'NO', 'NO', 'NO', 'YES']
})


metadata = {
    'primary_key': 'id',
    'fields': {
        # provide a regex with a lot of possibilities
        'id': { 'type': 'id', 'subtype': 'string', 'regex': '[a-z0-9]{32}' }, 
        'numerical': { 'type': 'numerical', 'subtype': 'float' },
        'categorical': { 'type': 'categorical' }
    }
}

model = GaussianCopula(table_metadata=metadata)
model.fit(data)
model.sample(5)

Output

/usr/local/lib/python3.8/dist-packages/sdv/metadata/table.py in _make_ids(cls, field_metadata, length)
    674             generator, max_size = strings_from_regex(regex)
    675             if max_size < length:
--> 676                 raise ValueError((
    677                     'Unable to generate {} unique values for regex {}, the '
    678                     'maximum number of unique values is {}.'

ValueError: Unable to generate 5 unique values for regex [a-z0-9]{32}, the maximum number of unique values is 0.

Regex Utils: Our regex utility function reports 0 possibilities for this regex, which may be the root cause.

from sdv.metadata.utils import strings_from_regex

regex_string = '[a-z0-9]{32}'

_, size = strings_from_regex(regex_string, max_repeat=1)
print(size)

Output

0
@npatki
Contributor Author

npatki commented Mar 16, 2023

Note that starting from SDV 1.0, we have moved the regex generation code into the RDT library (see the RegexGenerator in RDT). The underlying bug would therefore need to be fixed in RDT issue 623.

We'll keep this SDV issue open too, as the SDV synthesizers are still unable to generate such regexes.

@npatki
Contributor Author

npatki commented Apr 20, 2023

This issue has been resolved in SDV v1.0.1. Note that since SDV 1.0, we have an updated API.

import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# create some dummy data
data = pd.DataFrame(data={
    'id': ['a', 'b', 'c', 'd', 'e'],
    'numerical': [0.4, 0.3, 0.34, 0.2, 0.11],
    'categorical': ['YES', 'NO', 'NO', 'NO', 'YES']
})


metadata_dict = {
    'primary_key': 'id',
    'columns': {
        # provide a regex with a lot of possibilities
        'id': { 'sdtype': 'id', 'regex_format': '[a-z0-9]{32}' }, 
        'numerical': { 'sdtype': 'numerical' },
        'categorical': { 'sdtype': 'categorical' }
    }
}

metadata = SingleTableMetadata.load_from_dict(metadata_dict)
model = GaussianCopulaSynthesizer(metadata)
model.fit(data)
model.sample(5)

@npatki npatki closed this as completed Apr 20, 2023
@npatki npatki added the resolution:resolved The issue was fixed, the question was answered, etc. label Apr 20, 2023