Robust and Fast tokenizations alignment library for Rust and Python

Demo: demo
Rust documentation: docs.rs
Python documentation: python/README.md
Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Usage (Python)

Installation:

$ pip install pytokenizations

`get_alignments`

def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...

Returns alignment mappings for two different tokenizations:

>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"] # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]

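a2b[i] is the list of indices of the tokens in tokens_b that align with tokens_a[i]; b2a[j] is the corresponding list for tokens_b[j]. In the example above, both tokens of tokens_a align with the single token at index 0 of tokens_b.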
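A common use of these mappings is carrying per-token annotations from one tokenization to the other. The sketch below is illustrative only: `project_labels` is a hypothetical helper, not part of the library, and the alignment values are copied by hand from the example above rather than computed.

```python
from typing import List, Optional, Sequence


def project_labels(
    labels_a: Sequence[str], b2a: Sequence[Sequence[int]]
) -> List[Optional[str]]:
    """Hypothetical helper: give each token of b the label of the first
    token of a that it aligns with, or None when it aligns with nothing."""
    projected: List[Optional[str]] = []
    for sources in b2a:
        projected.append(labels_a[sources[0]] if sources else None)
    return projected


# Alignment taken from the example above (tokens_a = ["å", "BC"], tokens_b = ["abc"]).
b2a = [[0, 1]]
labels_a = ["B-X", "I-X"]  # made-up labels for illustration
labels_b = project_labels(labels_a, b2a)
print(labels_b)
```

Taking the first aligned token is only one possible policy; a majority vote over all aligned tokens would be an equally reasonable choice.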

`get_original_spans`

def get_original_spans(tokens: Sequence[str], original_text: str) -> List[Optional[Tuple[int, int]]]: ...

Returns the span of each token in original_text as a (start, end) pair of character indices. This is useful, for example, when results computed on normalized text need to be mapped back to the original, unnormalized text.

>>> tokens = ["a", "bc"]
>>> original_text = "å  BC"
>>> get_original_spans(tokens, original_text)
[(0, 1), (3, 5)]
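The returned spans can be used to recover the original surface form of each token. The following is a minimal sketch, assuming the spans are Python string indices (which the example above is consistent with) and that a `None` entry means no span was found for that token. `slice_spans` is a hypothetical helper, and the spans are copied from the example rather than computed.

```python
from typing import List, Optional, Sequence, Tuple


def slice_spans(
    original_text: str, spans: Sequence[Optional[Tuple[int, int]]]
) -> List[Optional[str]]:
    """Hypothetical helper: return the substring of original_text for each
    span, keeping None where no span is available."""
    return [
        None if span is None else original_text[span[0]:span[1]]
        for span in spans
    ]


original_text = "\u00e5  BC"  # "å  BC", written with an escape for the precomposed å
spans = [(0, 1), (3, 5)]      # values from the example above
surface = slice_spans(original_text, spans)
print(surface)
```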

`get_charmap`

def get_charmap(a: str, b: str) -> Tuple[List[Optional[int]], List[Optional[int]]]: ...

Returns character mappings a2b (from a to b) and b2a (from b to a).

>>> a = "åBC"
>>> b = "abc"
>>> get_charmap(a, b)
([0, 1, 2], [0, 1, 2])
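A character mapping like this can translate a character range in one string into the matching range in the other. The sketch below assumes that a `None` entry marks a character with no counterpart, as the `Optional[int]` in the signature suggests. `map_span` is a hypothetical helper, and the mapping is copied from the example above.

```python
from typing import Optional, Sequence, Tuple


def map_span(
    start: int, end: int, a2b: Sequence[Optional[int]]
) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: map the half-open range [start, end) in a to the
    smallest half-open range in b covering all mapped characters."""
    targets = [j for j in a2b[start:end] if j is not None]
    if not targets:
        return None  # none of the characters has a counterpart in b
    return (min(targets), max(targets) + 1)


a2b = [0, 1, 2]  # from the example above: "åBC" -> "abc"
mapped = map_span(1, 3, a2b)
print(mapped)
```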

Algorithm

Algorithm overview
Blog post
seqdiff is used for the diff process.
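To give a feel for how a diff-based alignment can work, here is a rough sketch. It is not the library's implementation: it assumes tokens are normalized by NFKD decomposition, removal of combining marks, and lowercasing, and it substitutes Python's standard `difflib.SequenceMatcher` for seqdiff. All function names are hypothetical.

```python
import difflib
import unicodedata
from typing import List, Sequence, Tuple


def normalize(token: str) -> str:
    # Assumed normalization: decompose, drop combining marks, lowercase.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def char_to_token(tokens: Sequence[str]) -> List[int]:
    # For each character of the concatenated tokens, the index of its token.
    return [i for i, tok in enumerate(tokens) for _ in tok]


def sketch_alignments(
    a: Sequence[str], b: Sequence[str]
) -> Tuple[List[List[int]], List[List[int]]]:
    na = [normalize(t) for t in a]
    nb = [normalize(t) for t in b]
    ca, cb = char_to_token(na), char_to_token(nb)
    # difflib stands in for seqdiff here; it finds matching character runs.
    matcher = difflib.SequenceMatcher(None, "".join(na), "".join(nb), autojunk=False)
    a2b = [set() for _ in a]
    b2a = [set() for _ in b]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i, j = ca[block.a + k], cb[block.b + k]
            a2b[i].add(j)  # the tokens owning a matched character pair align
            b2a[j].add(i)
    return [sorted(s) for s in a2b], [sorted(s) for s in b2a]


result = sketch_alignments(["\u00e5", "BC"], ["abc"])
print(result)
```

On the tokens from the `get_alignments` example, this sketch reproduces the mappings shown there. It may differ from the library on other inputs, since the normalization and diff algorithm are assumptions.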
