Detokenizer fixes #8039

Merged
merged 30 commits into from
Jul 5, 2024

Commits on Jun 20, 2024

  1. Add llama_detokenize()

    jaime-m-p committed Jun 20, 2024
    eea8dfa
  2. Using llama_tokenize() in tests

    jaime-m-p committed Jun 20, 2024
    d779bab
  3. Using llama_tokenize() in tests

    jaime-m-p committed Jun 20, 2024
    40a6660
  4. Fix tokenizer tests

    jaime-m-p committed Jun 20, 2024
    16a7503
  5. minor: confusing hexadecimal codepoint

    jaime-m-p committed Jun 20, 2024
    03dbcc8
  6. Clean old known problematic codepoints

    jaime-m-p committed Jun 20, 2024
    071bf42
  7. Update bruteforce random tests

    Add detokenizer checks
    New generator: ascii_lr_strip
    New generator: apostrophe
    Add more vocabs files
    jaime-m-p committed Jun 20, 2024
    064b35e
  8. Fix add_space_prefix, set false by default

    jaime-m-p committed Jun 20, 2024
    503b753
  9. Remove previous space

    jaime-m-p committed Jun 20, 2024
    0cc6593
  10. Remove previous space

    jaime-m-p committed Jun 20, 2024
    6d233bc

Commits on Jun 21, 2024

  1. Add tokenizer flag: clean_up_tokenization_spaces

    jaime-m-p committed Jun 21, 2024
    b452e82

Commits on Jun 23, 2024

  1. tests: unexpected vocab type as test fail instead of error

    Useful when automating tests:
     - If you don't know the vocab type in advance.
     - To differentiate it from other loading errors.
    jaime-m-p committed Jun 23, 2024
    9af762c
  2. tests: gracefully exit threads

    Using exit() is throwing random exceptions
    jaime-m-p committed Jun 23, 2024
    0cf2989
  3. tests: skip unicode surrogates and undefined

    jaime-m-p committed Jun 23, 2024
    38d54b3
  4. Fix detokenizer():

    UNKNOWN and CONTROL are 'special pieces'.
    Remove space after UNKNOWN and CONTROL.
    Refactor llama_token_to_piece().
    jaime-m-p committed Jun 23, 2024
    44c8648
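
The Jun 23 commit that skips Unicode surrogates and undefined codepoints in the tests can be illustrated with a small filter. This is a minimal Python sketch, not the PR's actual test code: the helper name `is_testable_codepoint` is hypothetical, and it assumes that "undefined" means the Unicode general category `Cn` (unassigned) as reported by the standard `unicodedata` module.

```python
import unicodedata


def is_testable_codepoint(cpt: int) -> bool:
    """Hypothetical helper: True unless cpt is a surrogate or unassigned.

    Assumption: "undefined" is taken to mean general category Cn.
    The result for Cn depends on the Unicode database shipped with Python.
    """
    category = unicodedata.category(chr(cpt))
    # Cs = surrogate (U+D800..U+DFFF), Cn = unassigned or noncharacter
    return category not in ("Cs", "Cn")


# Keep only codepoints that can safely be fed to a tokenizer test
sample = [0x41, 0xD800, 0x20AC, 0x10FFFF]
kept = [cpt for cpt in sample if is_testable_codepoint(cpt)]
```

Surrogates cannot be encoded as valid UTF-8, so a filter of this kind keeps such inputs from reaching the tokenizer in the first place.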

Commits on Jun 24, 2024

  1. Do not remove space when decoding special tokens

    jaime-m-p committed Jun 24, 2024
    9eb0fca
  2. style: remove trailing whitespace

    jaime-m-p committed Jun 24, 2024
    12e2c31
  3. 95a0df5
  4. Update brute force test:

    Detokenize special tokens.
    Replace errors with '\uFFFD' when detokenizing to 'utf-8'.
    More edge cases.
    Better detokenization results check.
    jaime-m-p committed Jun 24, 2024
    4a28063
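
Replacing decoding errors with '\uFFFD', as the brute force test update describes, can be sketched with Python's built-in `errors="replace"` handler. This is an illustration under that assumption rather than the PR's test code, and the helper name `decode_lossy` is hypothetical.

```python
def decode_lossy(raw: bytes) -> str:
    """Hypothetical helper: decode bytes as UTF-8 without raising.

    Invalid or truncated byte sequences become U+FFFD (the Unicode
    replacement character) instead of triggering UnicodeDecodeError.
    """
    return raw.decode("utf-8", errors="replace")


# A lone 0xFF byte is never valid in UTF-8, so it is replaced
text = decode_lossy(b"ab\xffcd")
```

A lossy decode like this lets a comparison between two detokenizers proceed even when a token sequence ends in the middle of a multi-byte character.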

Commits on Jun 25, 2024

  1. 9854a9c
  2. Better leading space removal

    jaime-m-p committed Jun 25, 2024
    107923c
  3. Update bruteforce test

    jaime-m-p committed Jun 25, 2024
    68220fe

Commits on Jul 4, 2024

  1. style : remove spaces

    jaime-m-p committed Jul 4, 2024
    98fc182
  2. Merge commit 'f8c4c073' into detokenizer

    jaime-m-p committed Jul 4, 2024
    8072089
  3. 'viking' detokenizer clean spaces

    jaime-m-p committed Jul 4, 2024
    8f5e1e0
  4. Better leading space removal

    jaime-m-p committed Jul 4, 2024
    2f15019
  5. Update bruteforce test: header files location

    jaime-m-p committed Jul 4, 2024
    11ac641
  6. Update bruteforce test: add more models

    jaime-m-p committed Jul 4, 2024
    4db8c0d
  7. style: spaces

    jaime-m-p committed Jul 4, 2024
    906476f

Commits on Jul 5, 2024

  1. style: spaces

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
    jaime-m-p and ggerganov authored Jul 5, 2024
    0137683