Added Haitian Creole (ht) Language Support to spaCy #13807

Merged
honnibal merged 3 commits into explosion:master from JephteyAdolphe:add-ht-support
May 28, 2025

JephteyAdolphe commented Apr 27, 2025

Contributor

Description

This PR adds official support for Haitian Creole (ht) to spaCy's spacy/lang module.
It includes:

  • All core language data files under spacy/lang/ht:

    • tokenizer_exceptions.py
    • punctuation.py
    • lex_attrs.py
    • syntax_iterators.py
    • lemmatizer.py
    • stop_words.py
    • tag_map.py
  • Unit tests for the tokenizer and noun chunking (test_tokenizer.py, test_noun_chunking.py, etc.). All 58 tests I created under spacy/tests/lang/ht pass when run with pytest.

  • Basic tokenizer rules adapted for Haitian Creole orthography and informal contractions.

  • Custom like_num attribute supporting Haitian number formats (e.g., "3yèm").

  • Support for common informal apostrophe usage (e.g., "m'ap", "n'ap", "di'm").

  • Ensured no breakages in other language modules.

  • Followed spaCy coding style (PEP8, Black).

This provides a foundation for Haitian Creole NLP development using spaCy.
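
To make the like_num idea above concrete, here is a minimal standalone sketch of how such a check might treat plain digits, simple fractions, number words, and ordinals like "3yèm". It is not the code from lex_attrs.py in this PR: the word list is a small illustrative subset, and the ordinal pattern is an assumption based only on the "3yèm" example.

```python
import re

# Illustrative subset of Haitian Creole number words.
# Assumption: this is NOT the list shipped in the PR's lex_attrs.py.
_NUM_WORDS = {"en", "de", "twa", "kat", "senk", "sis", "sèt", "uit", "nèf", "dis"}

# Assumed ordinal shape: digits followed by "yèm" (or unaccented "yem").
_ORDINAL_RE = re.compile(r"\d+y[eè]m")


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    text = text.lower()
    # Drop a single leading sign.
    if text.startswith(("+", "-")):
        text = text[1:]
    # Plain digits, optionally with thousands or decimal separators.
    if text.replace(",", "").replace(".", "").isdigit():
        return True
    # Simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Spelled-out number words.
    if text in _NUM_WORDS:
        return True
    # Ordinals written as digits plus suffix, e.g. "3yèm".
    return _ORDINAL_RE.fullmatch(text) is not None
```

In spaCy, a function of this shape is typically registered under the LIKE_NUM lexical attribute so that token.like_num reflects it; the exact wiring used in this PR is not shown here.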

Type of change

My PR covers the addition of a new language (new feature).

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

Additional Notes

  • Haitian Creole does not have an official XPOS tagset, so UPOS (Universal POS) tags are used.
  • The tokenizer was carefully adapted for informal orthographic contractions (m'ap, l'ap, etc.).
  • Minimal stop_words were compiled, based on common function words and expressions.
  • The contribution focuses on making ht available in the core library, and future models can be trained later based on this work.
  • A model trained on valid UD CoNLL-U data reached a final LAS of 0.52, using a train set of 2,670 sentences and a dev set of 333 sentences. I plan to grow the treebank over time and build on this foundational ht spaCy module, either myself or with collaborators who are fluent in Haitian Creole. Training settings: hidden width 96, max steps 10,000, dropout 0.25, gradient accumulation 1, and batch size 50.
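
The contraction handling mentioned in these notes can be pictured with a small standalone sketch. The table below loosely mirrors the shape of spaCy tokenizer exceptions (a surface string mapped to a list of token entries), but the specific splits and norms are assumptions for illustration, not the contents of tokenizer_exceptions.py in this PR, and split_contractions is a hypothetical helper rather than a spaCy API.

```python
# Hypothetical exception table: surface form -> list of token entries.
# Assumption: the split points and NORM values are illustrative only.
CONTRACTIONS = {
    "m'ap": [{"ORTH": "m'", "NORM": "mwen"}, {"ORTH": "ap", "NORM": "ap"}],
    "n'ap": [{"ORTH": "n'", "NORM": "nou"}, {"ORTH": "ap", "NORM": "ap"}],
    "l'ap": [{"ORTH": "l'", "NORM": "li"}, {"ORTH": "ap", "NORM": "ap"}],
    "di'm": [{"ORTH": "di", "NORM": "di"}, {"ORTH": "'m", "NORM": "mwen"}],
}


def split_contractions(text):
    """Whitespace-tokenize text and split known contractions.

    Returns a list of (surface, norm) pairs.
    """
    tokens = []
    for chunk in text.split():
        # Peel off trailing punctuation so "m'ap," still matches the table.
        core = chunk.rstrip(".,!?;:")
        trailing = chunk[len(core):]
        pieces = CONTRACTIONS.get(core.lower())
        if pieces is None:
            if core:
                tokens.append((core, core.lower()))
        else:
            # Slice the original chunk so the surface form keeps its casing.
            offset = 0
            for piece in pieces:
                end = offset + len(piece["ORTH"])
                tokens.append((core[offset:end], piece["NORM"]))
                offset = end
        # Each trailing punctuation mark becomes its own token.
        tokens.extend((ch, ch) for ch in trailing)
    return tokens
```

For example, split_contractions("M'ap vini, di'm non.") yields the surface tokens M', ap, vini, a comma, di, 'm, non, and a period, with the clitics normalized to "mwen". A real spaCy tokenizer additionally applies prefix, suffix, and infix rules, which this sketch leaves out.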

Thanks

I'm very excited to get the ball rolling for a low-resource language like Haitian Creole and contribute to an amazing library like spaCy!

Example Usage

import spacy

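# Blank pipeline: uses only the ht language data (tokenizer rules and
# lexical attributes), with no trained components.
nlp = spacy.blank("ht")

# Shorter alternative inputs; uncomment one to try it.
# text = "Map manje gato a pandan map gade televizyon lem lakay mwen."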
# text = "M'ap vini, eske wap la avek lajan'm? Si ou, di'l non pou fre'w."
# text = "M ap teste sa (pou kounye a)."
# text = "Si'm ka vini, m'ap pale ak li."
# text = "\"regre lanmò twò bonè\""
text = """Onè ap fèt pou ansyen lidè Pati Travayè Britanik

Moun atravè lemond ap voye onè pou ansyen lidè
Pati Travayè a, John Smith, ki mouri pi bonè jodi a apre li te fè yon gwo kriz kadyak a laj 55 an.

Nan Washington, Depatman Deta Etazini pibliye yon deklarasyon ki eksprime "regre lanmò twò bonè" avoka ak palmantè eskoze a.

"Misye Smith, pandan tout karyè li ki te make ak distenksyon"""

doc = nlp(text)

print("Tokens:")
print(len(doc))
for token in doc:
    print(f"{token.text} | {token.orth_} | {token.norm_} | {token.whitespace_}")

@JephteyAdolphe

Contributor Author

Bump @honnibal @syllog1sm @ines

@JephteyAdolphe

Contributor Author

Bump ☹️ @ines @honnibal @syllog1sm

@JephteyAdolphe

Contributor Author

Bump 🧎🏾‍♂️ @honnibal @ines @syllog1sm

@honnibal

Member

Sorry about the delay on this. I've been behind on other maintenance tasks while working to get the Python 3.13 support completed.

I can't review other languages in detail but I'm happy to merge this if the tests are passing and it's ready. Is there anything else you want to add? If not, or if I don't hear back in a couple of days, I'll go ahead and merge 🚢

@JephteyAdolphe

Contributor Author

No more additions and I made sure that all spaCy tests have passed! Ready to ship 🔥

honnibal merged commit 41e0777 into explosion:master on May 28, 2025