Add tokenizer exceptions for known work names #3

Open
4 tasks
thatbudakguy opened this issue Dec 9, 2021 · 0 comments
Labels
enhancement New feature or request

Comments

@thatbudakguy
Member

thatbudakguy commented Dec 9, 2021

The work names can be pulled from https://github.com/direct-phonology/ect-krp/blob/main/metadata.json to start with.

  • create a lookups file mapping each of these names to the WORK_OF_ART NER tag
  • create a tokenizer_exceptions.py file (see the German example in spaCy)
  • import the lookups file (using srsly?) and, for each entry listed, create one tokenizer exception mapping the key to [{ORTH: key}], so that the whole name is kept as a single token
  • store the exceptions by calling:
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, exceptions)
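
A rough sketch of the steps above, assuming the lookups file is a flat JSON object of work name to tag. The two sample names, the `LOOKUPS_JSON` string, and the `build_exceptions` helper are hypothetical and only there for illustration; the real names would come from metadata.json, whose layout is not shown here. `ORTH` is a plain-string stand-in so the snippet runs without spaCy installed; in the actual file it would be spaCy's `ORTH` symbol.

```python
import json

# Stand-in for spaCy's ORTH symbol (assumption, so this runs without spaCy).
# In tokenizer_exceptions.py this would instead be imported from spacy.symbols.
ORTH = "ORTH"

# Hypothetical lookups content: work name -> NER tag.
# With srsly, the file could be loaded via srsly.read_json(path) instead.
LOOKUPS_JSON = '{"論語": "WORK_OF_ART", "孟子": "WORK_OF_ART"}'


def build_exceptions(lookups):
    """Create one exception per work name, keeping the name as a single token."""
    return {name: [{ORTH: name}] for name in lookups}


lookups = json.loads(LOOKUPS_JSON)
exceptions = build_exceptions(lookups)

# Final step from the issue, left as a comment because it needs spaCy:
# TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, exceptions)
```

Each value is a list with a single dict whose ORTH equals the key; `update_exc` checks that the ORTH values of an entry join back to its key, so an entry shaped any other way would be rejected.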
@thatbudakguy thatbudakguy added the enhancement New feature or request label Dec 10, 2021