
Robust and fast tokenization alignment library for Rust and Python


Demo: demo
Rust documentation: docs.rs
Python documentation: python/README.md
Blog post: How to calculate the alignment between BERT and spaCy tokens effectively and robustly

Usage (Python)

Installation:

$ pip install pytokenizations
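
The package is installed as pytokenizations but imported as tokenizations, as in the examples below. A quick smoke test:

>>> import tokenizations
>>> tokenizations.get_alignments(["a"], ["a"])
([[0]], [[0]])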

get_alignments

def get_alignments(a: Sequence[str], b: Sequence[str]) -> Tuple[List[List[int]], List[List[int]]]: ...

Returns alignment mappings for two different tokenizations:

>>> tokens_a = ["å", "BC"]
>>> tokens_b = ["abc"]  # the accent is dropped (å -> a) and the letters are lowercased (BC -> bc)
>>> a2b, b2a = tokenizations.get_alignments(tokens_a, tokens_b)
>>> print(a2b)
[[0], [0]]
>>> print(b2a)
[[0, 1]]

a2b[i] is a list of the indices of the tokens in tokens_b that align with tokens_a[i]; b2a is the mapping in the opposite direction.
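
A typical use case (as in the blog post above) is aligning word-level tokens with BERT-style WordPiece subwords. The sketch below uses made-up token sequences for illustration:

import tokenizations

# Word-level tokens vs. BERT-style WordPiece subwords (made-up example)
word_tokens = ["John", "Johanson", "'s", "house"]
subword_tokens = ["john", "johan", "##son", "'", "s", "house"]

a2b, b2a = tokenizations.get_alignments(word_tokens, subword_tokens)

# a2b[i] lists the indices of subword_tokens covered by word_tokens[i];
# for example, "Johanson" is expected to align to ["johan", "##son"].
for word, indices in zip(word_tokens, a2b):
    print(word, "->", [subword_tokens[j] for j in indices])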

get_original_spans

def get_original_spans(tokens: Sequence[str], original_text: str) -> List[Optional[Tuple[int, int]]]: ... 

Returns the span of each token in original_text. This is useful, for example, when mapping a processed result back onto the original text before normalization.

>>> tokens = ["a", "bc"]
>>> original_text = "å  BC"
>>> get_original_spans(tokens, original_text)
[(0, 1), (3, 5)]
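
The returned spans can be used to slice original_text and recover the original surface forms, for example to project labels predicted on normalized tokens back onto the raw text. A minimal sketch reusing the example above (a token that cannot be located yields None):

>>> tokens = ["a", "bc"]
>>> original_text = "å  BC"
>>> spans = tokenizations.get_original_spans(tokens, original_text)
>>> [original_text[start:end] for start, end in spans]
['å', 'BC']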

get_charmap

def get_charmap(a: str, b: str) -> Tuple[List[Optional[int]], List[Optional[int]]]: ...

Returns character mappings a2b (from a to b) and b2a (from b to a).

>>> a = "åBC"
>>> b = "abc"
>>> get_charmap(a, b)
([0, 1, 2], [0, 1, 2])
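
Since each entry is Optional[int], a character with no counterpart in the other string maps to None. A minimal sketch with made-up strings (the space in a is expected to have no counterpart in b):

import tokenizations

a = "New York"
b = "NewYork"  # whitespace removed

a2b, b2a = tokenizations.get_charmap(a, b)

# a2b[i] is the index in b of the character corresponding to a[i],
# or None when there is no counterpart.
for i, ch in enumerate(a):
    print(repr(ch), "->", a2b[i])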

Algorithm

For details of how the alignment is computed, see the blog post linked above.
