Skip to content

Latest commit

 

History

History
252 lines (238 loc) · 26.2 KB

README.md

File metadata and controls

252 lines (238 loc) · 26.2 KB

Github CI Crates.io docs.rs

About

Stop words are words that don't carry much meaning, and are typically removed as a preprocessing step before text analysis or natural language processing. This crate contains common stop words for a variety of languages. This crate uses stop word lists from Stopwords ISO and also from NLTK.

Usage

Using this crate is fairly straight-forward:

// Get the stop words
let words = stop_words::get(stop_words::LANGUAGE::English);

// Print them
for word in words {
    println!("{}", word);
}

The function get will take either a member of the LANGUAGE enum or a two-letter ISO language code as either a str or a String type.

You can find a complete example of how to read in a text file and remove stop words here.

ISO Language Availability

This crate supports all languages from Stopwords ISO and also from NLTK. Expand the table below to see a comprehensive description.

Language Coverage Table
ISO 639-1 Code Language Stopwords ISO NLTK
aa Afar
ab Abkhazian
af Afrikaans
ak Akan
sq Albanian
am Amharic
ar Arabic
an Aragonese
hy Armenian
as Assamese
av Avaric
ae Avestan
ay Aymara
az Azerbaijani
ba Bashkir
bm Bambara
eu Basque
be Belarusian
bn Bengali
bh Bihari languages
bi Bislama
bo Tibetan
bs Bosnian
br Breton
bg Bulgarian
my Burmese
ca Catalan; Valencian
cs Czech
ch Chamorro
ce Chechen
zh Chinese
cu Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Church Slavonic
cv Chuvash
kw Cornish
co Corsican
cr Cree
cy Welsh
da Danish
de German
dv Divehi; Dhivehi; Maldivian
nl Dutch; Flemish
dz Dzongkha
el Greek, Modern (1453-)
en English
eo Esperanto
et Estonian
ee Ewe
fo Faroese
fa Persian
fj Fijian
fi Finnish
fr French
fy Western Frisian
ff Fulah
ka Georgian
gd Gaelic; Scottish Gaelic
ga Irish
gl Galician
gv Manx
gn Guarani
gu Gujarati
ht Haitian; Haitian Creole
ha Hausa
he Hebrew
hz Herero
hi Hindi
ho Hiri Motu
hr Croatian
hu Hungarian
ig Igbo
is Icelandic
io Ido
ii Sichuan Yi; Nuosu
iu Inuktitut
ie Interlingue; Occidental
ia Interlingua (International Auxiliary Language Association)
id Indonesian
ik Inupiaq
it Italian
jv Javanese
ja Japanese
kl Kalaallisut; Greenlandic
kn Kannada
ks Kashmiri
kr Kanuri
kk Kazakh
km Central Khmer
ki Kikuyu; Gikuyu
rw Kinyarwanda
ky Kirghiz; Kyrgyz
kv Komi
kg Kongo
ko Korean
kj Kuanyama; Kwanyama
ku Kurdish
lo Lao
la Latin
lv Latvian
li Limburgan; Limburger; Limburgish
ln Lingala
lt Lithuanian
lb Luxembourgish; Letzeburgesch
lu Luba-Katanga
lg Ganda
mk Macedonian
mh Marshallese
ml Malayalam
mi Maori
mr Marathi
ms Malay
mg Malagasy
mt Maltese
mn Mongolian
na Nauru
nv Navajo; Navaho
nr Ndebele, South; South Ndebele
nd Ndebele, North; North Ndebele
ng Ndonga
ne Nepali
nn Norwegian Nynorsk; Nynorsk, Norwegian
nb Bokmål, Norwegian; Norwegian Bokmål
no Norwegian
ny Chichewa; Chewa; Nyanja
oc Occitan (post 1500)
oj Ojibwa
or Oriya
om Oromo
os Ossetian; Ossetic
pa Panjabi; Punjabi
pi Pali
pl Polish
pt Portuguese
ps Pushto; Pashto
qu Quechua
rm Romansh
ro Romanian; Moldavian; Moldovan
rn Rundi
ru Russian
sg Sango
sa Sanskrit
si Sinhala; Sinhalese
sk Slovak
sl Slovenian
se Northern Sami
sm Samoan
sn Shona
sd Sindhi
so Somali
st Sotho, Southern
es Spanish; Castilian
sc Sardinian
sr Serbian
ss Swati
su Sundanese
sw Swahili
sv Swedish
ty Tahitian
ta Tamil
tt Tatar
te Telugu
tg Tajik
tl Tagalog
th Thai
ti Tigrinya
to Tonga (Tonga Islands)
tn Tswana
ts Tsonga
tk Turkmen
tr Turkish
tw Twi
ug Uighur; Uyghur
uk Ukrainian
ur Urdu
uz Uzbek
ve Venda
vi Vietnamese
vo Volapük
wa Walloon
wo Wolof
xh Xhosa
yi Yiddish
yo Yoruba
za Zhuang; Chuang
zu Zulu

Constructed Language Availability

We also support some constructed (fictional/fantasy) languages! Expand the table below to see a comprehensive description. ChatGPT was used to generate these lists quickly, so they are incomplete and approximate. Help welcome! To use these languages, add the constructed feature.

Language Coverage Table
ISO 639-3 Code Language
qya Quenya
sjn Sindarin
tlh Klingon
mis (dot is used here) Dothraki
mis (dov is used here) Dovahzul
mis (nav is used here) Navi
mis (val is used here) High Valyrian

The following prompt was used with the Mar 14, 2023 version of ChatGPT:

Please give me one list of 20 stop words for each of the following languages: Sindarin, Quenya, DOthraki, Na'vi, 
Dovahzul, Klingon, and High Valyrian. I'd like the lists to be formatted as follows:
Sindarin
1. [word goes here]
2. [word goes here]
...
20. [word goes here]
Quenya
1. [word goes here]
...

And so on