Python bindings for the Rust library unicode-segmentation.
Note that the *_indices functions return Python character offsets,
not byte offsets as in the original Rust library.
>>> import unicode_segmentation_py
>>> text = "Hello world!"
>>> unicode_segmentation_py.to_words(text)
['Hello', 'world']
>>> unicode_segmentation_py.to_word_indices(text)
[(0, 'Hello'), (6, 'world')]
>>> unicode_segmentation_py.split_word_bounds(text)
['Hello', ' ', 'world', '!']
>>> unicode_segmentation_py.split_word_bound_indices(text)
[(0, 'Hello'), (5, ' '), (6, 'world'), (11, '!')]

Other functions with similar signatures:
to_graphemes, to_grapheme_indices, to_sentences, to_sentence_indices,
split_sentence_bounds, split_sentence_bound_indices

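
Because the *_indices functions report Python character offsets, the numbers differ from the byte offsets the Rust library reports whenever the text contains non-ASCII characters. The sketch below shows one way to convert between the two, assuming UTF-8 encoding. The helpers char_to_byte_offset and to_byte_indices are hypothetical names and not part of this package, and the (offset, segment) pairs are written by hand in the shape shown above rather than produced by the bindings.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Convert a Python character offset into a UTF-8 byte offset."""
    # Encode only the prefix before the offset and count its bytes.
    return len(text[:char_offset].encode("utf-8"))


def to_byte_indices(text, indexed_segments):
    """Map (char_offset, segment) pairs to (byte_offset, segment) pairs."""
    return [
        (char_to_byte_offset(text, offset), segment)
        for offset, segment in indexed_segments
    ]


# "naïve café", spelled with escapes so the precomposed characters are explicit.
text = "na\u00efve caf\u00e9"

# Hand-written pairs in the (offset, segment) shape of the *_indices functions.
char_indices = [(0, text[0:5]), (6, text[6:10])]

# Each accented letter takes two bytes in UTF-8, so later offsets shift.
byte_indices = to_byte_indices(text, char_indices)
print(byte_indices)
```

For ASCII-only text such as "Hello world!" the two kinds of offsets coincide, which is why the difference is easy to miss in simple tests.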
The underlying Unicode version used by unicode-segmentation
can be inspected through the constant UNICODE_VERSION,
which takes the form of a tuple of three integers.
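
Since the version is a plain tuple of three integers, it can be formatted or compared with ordinary tuple operations. The following is a minimal sketch that compares such a tuple against the Unicode version of Python's own unicodedata module. The value (15, 0, 0) is a placeholder chosen for illustration, not the actual value of UNICODE_VERSION, and the helper names are hypothetical.

```python
import unicodedata


def version_string(version):
    """Format a (major, minor, patch) tuple as 'major.minor.patch'."""
    return ".".join(str(part) for part in version)


def parse_version(dotted):
    """Parse a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in dotted.split("."))


# Placeholder standing in for unicode_segmentation_py.UNICODE_VERSION.
segmentation_version = (15, 0, 0)

# The Unicode version of the database bundled with the Python interpreter.
python_version = parse_version(unicodedata.unidata_version)

# Compare major and minor parts; a mismatch may mean that newer characters
# are classified differently by the two libraries.
same_release = segmentation_version[:2] == python_version[:2]

print(version_string(segmentation_version), version_string(python_version))
print("same major.minor release:", same_release)
```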