unicode-segmentation-py

Python bindings for the Rust library unicode-segmentation.

Note that the *_indices functions return Python character offsets (indices into the str, counted in code points), not UTF-8 byte offsets as in the original Rust library.
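The difference only shows up with non-ASCII text. The following is a minimal sketch using only the standard library; the helper name `char_to_byte_offset` is hypothetical and not part of this package. It shows how a character offset could be converted to the UTF-8 byte offset that the Rust library would report, should you need to compare the two.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Convert a Python character offset into a UTF-8 byte offset.

    Hypothetical helper for illustration; not provided by this package.
    """
    # Encode everything before the offset and count the resulting bytes.
    return len(text[:char_offset].encode("utf-8"))


text = "héllo wörld"
char_offset = text.index("wörld")  # position counted in characters
byte_offset = char_to_byte_offset(text, char_offset)  # position counted in UTF-8 bytes

# "é" occupies two bytes in UTF-8, so the byte offset is one larger.
print(char_offset, byte_offset)
```

For pure ASCII input such as "Hello world!" the two offsets coincide, which is why the example in Usage looks the same either way.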

Usage

>>> import unicode_segmentation_py
>>> text = "Hello world!"
>>> unicode_segmentation_py.to_words(text)
['Hello', 'world']
>>> unicode_segmentation_py.to_word_indices(text)
[(0, 'Hello'), (6, 'world')]
>>> unicode_segmentation_py.split_word_bounds(text)
['Hello', ' ', 'world', '!']
>>> unicode_segmentation_py.split_word_bound_indices(text)
[(0, 'Hello'), (5, ' '), (6, 'world'), (11, '!')]

Other functions with similar signatures:

to_graphemes
to_grapheme_indices
to_sentences
to_sentence_indices
split_sentence_bounds
split_sentence_bound_indices

The underlying Unicode version used by unicode-segmentation can be inspected through the constant UNICODE_VERSION, which takes the form of a tuple of three integers.
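One possible use of such a tuple is to compare it with the Unicode version that Python's own `unicodedata` module was built against. The sketch below uses only the standard library. The value `example_version` is a hypothetical placeholder standing in for UNICODE_VERSION, and both helper functions are illustrative names, not part of this package.

```python
import unicodedata


def version_tuple_to_str(version: tuple) -> str:
    """Render a (major, minor, patch) tuple as 'major.minor.patch'."""
    return ".".join(str(part) for part in version)


def parse_version(version: str) -> tuple:
    """Parse a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical placeholder; in practice you would read
# unicode_segmentation_py.UNICODE_VERSION here instead.
example_version = (15, 0, 0)

# The Unicode version of the character database bundled with this Python build.
python_version = parse_version(unicodedata.unidata_version)

print("segmentation data:", version_tuple_to_str(example_version))
print("Python unicodedata:", version_tuple_to_str(python_version))
print("same version:", example_version == python_version)
```

Because tuples of integers compare element by element, expressions such as `example_version >= (15, 0, 0)` also work without any parsing.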

zoushun1997/unicode-segmentation-py