Use TQDM to track index compilation progress #915

Merged 1 commit on May 23, 2024

lapp0 (Contributor) commented May 23, 2024

Fixes #810

Changes

Add a tqdm progress bar to create_fsm_index_end_to_end() in regex.py, so that index compilation reports its progress instead of appearing to hang silently.

Update the pre-commit hook so that types-tqdm is installed for mypy type checking.
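
For readers unfamiliar with driving tqdm manually, here is a minimal sketch of the pattern the first change describes: a worklist loop that ticks a progress bar once per processed state. The function `walk_states` and the toy `transitions` dict are hypothetical illustrations, not the code in regex.py; only `tqdm(total=..., desc=...)`, `update()`, and `close()` are real tqdm calls. The import fallback is there only so the sketch still runs where tqdm is not installed.

```python
try:
    from tqdm import tqdm
except ImportError:  # assumption: run without a bar if tqdm is unavailable
    tqdm = None


def walk_states(transitions, start):
    """Visit every state reachable from `start`, ticking a progress bar per state.

    `transitions` maps a state to the states it can move to. This is a toy
    stand-in for an FSM, not the outlines data structure.
    """
    # The number of known states serves as the bar's total. It is an upper
    # bound: unreachable states are never visited, so the bar may not fill.
    all_states = set(transitions) | {start}
    for next_states in transitions.values():
        all_states.update(next_states)

    pbar = None
    if tqdm is not None:
        pbar = tqdm(
            total=len(all_states),
            desc="Compiling FSM index for all state transitions",
        )

    seen = set()
    pending = [start]
    while pending:
        state = pending.pop()
        if state in seen:
            continue
        seen.add(state)
        pending.extend(n for n in transitions.get(state, ()) if n not in seen)
        if pbar is not None:
            pbar.update(1)  # one tick per processed state

    if pbar is not None:
        pbar.close()
    return seen


toy_transitions = {0: [1, 2], 1: [2], 2: [0], 3: [0]}
reachable = walk_states(toy_transitions, start=0)
```

Because the loop is a `while` over a growing worklist rather than a `for` over a fixed iterable, the bar is created with an explicit `total` and advanced with `update(1)` instead of wrapping an iterable in `tqdm(...)`.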

Example

$ python3 810.py 
Compiling FSM index for all state transitions: 62%|██████████████████           | 146/235 [00:12<00:00, 18.96it/s]

Contents of 810.py:

import interegular
from transformers import AutoTokenizer

from outlines.models.transformers import TransformerTokenizer
from outlines.fsm.regex import (
    make_deterministic_fsm,
    make_byte_level_better_fsm,
    create_fsm_index_tokenizer
)


regex_str = "(?:(?:[0-9](?:(?:_)?[0-9])*(?:e|E)(?:(?:\\+|\\-))?[0-9](?:(?:_)?[0-9])*|(?:[0-9](?:(?:_)?[0-9])*\\.(?:[0-9](?:(?:_)?[0-9])*)?|\\.[0-9](?:(?:_)?[0-9])*)(?:(?:e|E)(?:(?:\\+|\\-))?[0-9](?:(?:_)?[0-9])*)?)|[0-9](?:(?:_)?[0-9])*)(?:J|j)|(?:[0-9](?:(?:_)?[0-9])*(?:e|E)(?:(?:\\+|\\-))?[0-9](?:(?:_)?[0-9])*|(?:[0-9](?:(?:_)?[0-9])*\\.(?:[0-9](?:(?:_)?[0-9])*)?|\\.[0-9](?:(?:_)?[0-9])*)(?:(?:e|E)(?:(?:\\+|\\-))?[0-9](?:(?:_)?[0-9])*)?)|0(?:x|X)(?:(?:_)?(?:[0-9]|[a-f]|[A-F]))+|0(?:b|B)(?:(?:_)?[0-1])+|0(?:o|O)(?:(?:_)?[0-7])+|(?:(?i:([ubf]?r?|r[ubf])('([^\\\\']|.)*?'))|(?i:([ubf]?r?|r[ubf])(\"([^\\\"]|.)*?\")))|(?:(?:\r?\n[\t ]*|#[^\n]*))+|[1-9](?:(?:_)?[0-9])*|\\\\[\t \x0c]*\r?\n|continue|nonlocal|assert|global|import|lambda|return|async|await|break|class|False|match|raise|while|yield|case|from|None|pass|True|with|def|del|for|not|try|if|[^\\W\\d]\\w*|#[^\n]*|[\t \x0c]+|\\.\\.\\.|@|\\{|\\(|\\[|\\-|\\+|\\*|\\~"

regex_pattern = interegular.parse_pattern(regex_str)
# Not reduced, so that there are many states
regex_fsm, _ = make_deterministic_fsm(regex_pattern.to_fsm())
bytes_fsm = make_byte_level_better_fsm(regex_fsm, keep_utf8=True)

num_fsm_states = len(regex_fsm.states)
assert num_fsm_states == 220

num_bytes_fsm_states = len(bytes_fsm.states)
assert num_bytes_fsm_states == 235

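# Wrap the Hugging Face tokenizer in the adapter that outlines expects
hf_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = TransformerTokenizer(hf_tokenizer)

# Building the index is the slow step; the progress bar is shown while it runs
states_to_token_subsets, empty_token_ids = create_fsm_index_tokenizer(
    bytes_fsm, tokenizer
)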

rlouf merged commit ffab2ac into dottxt-ai:main on May 23, 2024
5 checks passed

rlouf (Member) commented May 23, 2024

Looks great, thank you!

Linked issue: Add a progress bar for compilation