Insights: huggingface/tokenizers
Overview
9 Pull requests merged by 8 people
-
Upgrade onig, to get it compiling with GCC 15
#1771 merged
May 27, 2025 -
Use ApiBuilder::from_env() in from_pretrained function
#1737 merged
May 27, 2025 -
Update pyo3 and rust-numpy depends for no-gil/free-threading compat
#1774 merged
May 27, 2025 -
clippy
#1781 merged
May 27, 2025 -
Fix data path in test_continuing_prefix_trainer_mismatch
#1747 merged
May 27, 2025 -
Bump http-proxy-middleware from 2.0.6 to 2.0.9 in /tokenizers/examples/unstable_wasm/www
#1762 merged
May 27, 2025 -
Fix type notation of merges in BPE Python binding
#1766 merged
May 27, 2025 -
Fix typos in strings and comments
#1770 merged
May 27, 2025 -
Fix no-onig no-wasm builds
#1772 merged
May 27, 2025
4 Pull requests opened by 2 people
-
Update decode stream api
#1780 opened
May 27, 2025 -
Add benchmark for deserializing large added vocab + optimizations
#1782 opened
May 27, 2025 -
Add Truncate pre-tokenizer
#1783 opened
May 27, 2025 -
[docs] Whitespace
#1785 opened
May 27, 2025
17 Issues closed by 2 people
-
Behavior differences between non-fast and fast versions of the same tokenizer
#1786 closed
May 28, 2025 -
Tokenizer encode and decode get different token ids and text
#1757 closed
May 27, 2025 -
Bert: Slow vs fast decoding inconsistency
#1723 closed
May 27, 2025 -
Tokenizers fails to build in Python3.13t (NoGIL build)
#1749 closed
May 27, 2025 -
Memory leak is observed when using the `AutoTokenizer` and `AutoModel` with Python 3.10.*
#1706 closed
May 27, 2025 -
Support free-threaded Python and ship 3.13t wheels
#1767 closed
May 27, 2025 -
Reduce vocab size for BPE tokenizer
#1668 closed
May 27, 2025 -
batch_encode_plus doesn't work correctly
#1704 closed
May 27, 2025 -
out of memory when training a BBPE tokenizer on a large corpus
#1681 closed
May 27, 2025 -
Trying to load slow tokenizer in Rust
#1761 closed
May 27, 2025 -
Option to disable cache for FromPretrained and FromFile
#1680 closed
May 27, 2025 -
Return_offsets_mapping when decoding
#1769 closed
May 27, 2025 -
Tokenizers fail to build due to Rust dependency
#1776 closed
May 27, 2025 -
Property 'pre_tokenizer' cannot be set
#1773 closed
May 27, 2025 -
Building without `onig` feature fails
#1729 closed
May 27, 2025
2 Issues opened by 2 people
-
Add cp313t tests, mark extension modules as compatible, and ship free-threaded wheels
#1784 opened
May 27, 2025 -
Possible (minor) inconsistency in `unk_token` argument between WordPiece and UnigramLM
#1779 opened
May 25, 2025
21 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
-
Make unigram cache optional
#1763 commented on
May 27, 2025 • 1 new comment -
Implement `from_bytes` and `read_bytes` Methods in WordPiece Tokenizer for WebAssembly Compatibility
#1758 commented on
May 27, 2025 • 1 new comment -
Update __init__.pyi: fix 525: SyntaxWarning: invalid escape sequence '\w'
#1764 commented on
May 27, 2025 • 0 new comments -
Itertools upgrade
#1756 commented on
May 27, 2025 • 0 new comments -
Implement Append normalizer
#1755 commented on
May 27, 2025 • 0 new comments -
Draft backtrack
#1712 commented on
May 27, 2025 • 0 new comments -
Bpe clones
#1707 commented on
May 27, 2025 • 0 new comments -
[WIP] free speed/mem optimizations with ahash, dary_heap, and compact_str
#1618 commented on
May 27, 2025 • 0 new comments -
wikitext-103-raw-v1.zip is not available on the amazonaws anymore
#1683 commented on
May 27, 2025 • 0 new comments -
Adding many AddedTokens makes loading a tokenizer extremely slow.
#1635 commented on
May 27, 2025 • 0 new comments -
Thread safe?
#1726 commented on
May 27, 2025 • 0 new comments -
normalizers.Replace able to support regex group capture
#1760 commented on
May 27, 2025 • 0 new comments -
Proposal: Add Golang Bindings for tokenizers
#1751 commented on
May 27, 2025 • 0 new comments -
ERROR occurs when running "tokenizer._tokenizer.model.clear_cache()"
#1738 commented on
May 27, 2025 • 0 new comments -
Tokenizer Training Errors: pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: TryFromIntError(())
#1698 commented on
May 27, 2025 • 0 new comments -
Fails to build from source with GCC 15 due to mismatched function declarations
#1731 commented on
May 27, 2025 • 0 new comments -
Patching version 0.10.3 to fix invalid_reference_casting for legacy projects
#1765 commented on
May 27, 2025 • 0 new comments -
AttributeError: module 'decoders' has no attribute 'DecodeStream'
#1759 commented on
May 27, 2025 • 0 new comments -
Suggestion to Clarify WordPiece Documentation
#1777 commented on
May 27, 2025 • 0 new comments -
Cannot import name '__version__' from 'tokenizers.tokenizers'
#1741 commented on
May 27, 2025 • 0 new comments -
Truncation performs slowly. Tokenizer firstly encodes long sequence and then truncates it.
#1573 commented on
May 27, 2025 • 0 new comments