A fast, multilingual text processing utility that filters stopwords from input text. Supports 33 languages with efficient O(1) lookup using Bash associative arrays.
For performance-critical applications or large documents (> 2,000 words), consider the Python implementation, which scales better on large inputs. The Bash version is optimal for:
- Quick command-line filtering
- Shell script integration
- Processing smaller texts (< 2,000 words)
- Low startup overhead requirements
Both implementations use the same NLTK stopwords data and provide compatible interfaces.
- Multilingual Support: Filter stopwords in 33 different languages
- Multiple Output Formats: Single-line, list, or word frequency counts
- Flexible Input: Accept text via command-line arguments or stdin
- Punctuation Control: Optionally preserve or remove punctuation marks
- Case-Insensitive: Matches stopwords regardless of case
- Fast Performance: O(1) stopword lookup using associative arrays
- Dual Usage: Use as a standalone script or source as a Bash function
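The case-insensitive O(1) lookup can be sketched in a few lines of Bash 4+. This is an illustration of the technique only: the word list below is hypothetical, and the real script loads its lists from the NLTK data files.

```bash
#!/usr/bin/env bash
# Build an associative array: membership tests are O(1) hash lookups.
declare -A stop=()
for w in the a an and over; do   # illustrative stopwords only
  stop[$w]=1
done

filter() {
  local word out=()
  for word in $1; do             # rely on word splitting to tokenize
    # ${word,,} lowercases (Bash 4+), making the match case-insensitive
    [[ -z ${stop[${word,,}]:-} ]] && out+=("${word,,}")
  done
  echo "${out[*]}"
}

filter 'The quick brown fox jumps over the lazy dog'
# Output: quick brown fox jumps lazy dog
```

Because the array is a hash table, filtering time grows with the input text, not with the size of the stopword list.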
- Bash 4.0+ (for associative array support)
```bash
curl -fsSL https://raw.githubusercontent.com/Open-Technology-Foundation/stopwords.bash/main/install.sh | sudo bash
```

This one-liner downloads and executes the installation script, installing stopwords system-wide.
System-wide installation (recommended, requires sudo):

```bash
git clone https://github.com/Open-Technology-Foundation/stopwords.bash
cd stopwords.bash
sudo make install
```

User-local installation (no sudo required):

```bash
git clone https://github.com/Open-Technology-Foundation/stopwords.bash
cd stopwords.bash
make PREFIX=$HOME/.local install
```

This installs:
- Script to `$PREFIX/bin/stopwords` (system: `/usr/local/bin/`, user: `~/.local/bin/`)
- Stopwords data to `/usr/share/stopwords/` (33 languages, ~170 KB). Note: if you have Python NLTK installed with stopwords, data installation is automatically skipped
- Documentation to `$PREFIX/share/doc/stopwords/`
If you prefer not to use Make:

```bash
# Using the install script directly
./install.sh install

# For user-local installation
PREFIX=$HOME/.local ./install.sh install

# Custom NLTK data location
NLTK_DATA=$HOME/nltk_data ./install.sh install
```

Check that everything is installed correctly:
```bash
make check
# or
./install.sh check
```

This verifies:
- Script is executable and in PATH
- Stopwords data files are present (33 languages)
- Basic functionality works
```bash
# System installation
sudo make uninstall

# User installation
make PREFIX=$HOME/.local uninstall

# Or using install.sh
./install.sh uninstall
```

The script uses the `NLTK_DATA` environment variable to locate stopwords data:
- Default: `/usr/share/nltk_data` (no configuration needed)
- Custom location: set the `NLTK_DATA` environment variable
```bash
# Add to ~/.bashrc or ~/.profile for a custom location
export NLTK_DATA=/path/to/your/nltk_data
```

If installing to a user directory, ensure the bin directory is in your PATH:
```bash
# Add to ~/.bashrc or ~/.profile
export PATH="$HOME/.local/bin:$PATH"
```

Custom prefix:

```bash
make PREFIX=/opt/local install
```

Package staging (for package maintainers):

```bash
make DESTDIR=/tmp/package PREFIX=/usr install
```

Installation script help:

```bash
./install.sh --help
```

Filter stopwords from English text (the default language):
```bash
./stopwords 'the quick brown fox jumps over the lazy dog'
# Output: quick brown fox jumps lazy dog

echo 'the quick brown fox' | ./stopwords
# Output: quick brown fox

cat document.txt | ./stopwords
```

Use the -l or --language option to specify a language:
```bash
# Spanish
./stopwords -l spanish 'el rápido zorro marrón salta sobre el perro perezoso'
# Output: rápido zorro marrón salta perro perezoso

# Indonesian
./stopwords -l indonesian 'Pohon mangga tumbuh di halaman rumah'
# Output: pohon mangga tumbuh halaman rumah

# French
./stopwords -l french 'le chat noir dort sur le canapé'
# Output: chat noir dort canapé
```

By default, punctuation is removed. Use -p or --keep-punctuation to preserve it:
```bash
./stopwords 'Hello, world! How are you?'
# Output: hello world

./stopwords -p 'Hello, world! How are you?'
# Output: hello, world!
```

Use -w or --list-words to output one word per line:
```bash
./stopwords -w 'the quick brown fox'
# Output:
# quick
# brown
# fox
```

Use -c or --count to count word frequencies:
```bash
./stopwords -c 'the quick brown fox jumps over the lazy dog and the fox runs'
# Output:
# 1 brown
# 1 dog
# 1 jumps
# 1 lazy
# 1 quick
# 1 runs
# 2 fox

# From a file
./stopwords -c < document.txt
```

The output format is `count word`, sorted numerically by count (ascending).
```bash
# Spanish text with punctuation preserved, output as a list
./stopwords -l spanish -p -w 'Hola, ¿cómo estás? Muy bien, gracias.'

# Word frequency from German text
./stopwords -l german -c 'Der Hund läuft und der Hund spielt'

# Show version
./stopwords -V
# Output: stopwords 1.0.0

# Show help message
./stopwords -h
```

The tool supports stopword filtering in 33 languages:
- albanian
- arabic
- azerbaijani
- basque
- belarusian
- bengali
- catalan
- chinese
- danish
- dutch
- english
- finnish
- french
- german
- greek
- hebrew
- hinglish
- hungarian
- indonesian
- italian
- kazakh
- nepali
- norwegian
- portuguese
- romanian
- russian
- slovene
- spanish
- swedish
- tajik
- tamil
- turkish
Filtered words separated by spaces:

```
quick brown fox jumps lazy dog
```

One word per line:

```
quick
brown
fox
jumps
lazy
dog
```
Word frequency as `count word` pairs, sorted by count (ascending):

```
1 brown
1 dog
1 fox
1 jumps
1 lazy
1 quick
```
Stopword lists are located in one of these directories (checked in priority order):

1. `$NLTK_DATA/corpora/stopwords/` (if the NLTK_DATA environment variable is set)
2. `/usr/share/nltk_data/corpora/stopwords/` (if Python NLTK is installed)
3. `/usr/share/stopwords/` (bundled installation fallback)
Each location contains:
- One file per language (e.g., `english`, `spanish`)
- One stopword per line
- Alphabetically sorted
- UTF-8 encoded
The script automatically detects and uses existing NLTK installations, avoiding data duplication.
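The detection amounts to probing each candidate directory in priority order and using the first that exists. A minimal sketch of that logic (the function name is illustrative, not the script's own):

```bash
# Return the first existing stopwords directory, honoring the same
# priority order as documented above.
find_stopwords_dir() {
  local d
  for d in "${NLTK_DATA:+$NLTK_DATA/corpora/stopwords}" \
           /usr/share/nltk_data/corpora/stopwords \
           /usr/share/stopwords; do
    # ${NLTK_DATA:+...} expands to empty when NLTK_DATA is unset,
    # so the -n guard skips that candidate
    [[ -n $d && -d $d ]] && { printf '%s\n' "$d"; return 0; }
  done
  return 1
}

find_stopwords_dir || echo "no stopwords data found" >&2
```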
The stopwords filter can also be sourced and used as a Bash function:

```bash
# Source the script
source stopwords

# Use the function
stopwords 'the quick brown fox'
# Output: quick brown fox

stopwords -l spanish 'el rápido zorro'
# Output: rápido zorro
```

```bash
# Extract keywords from a document
cat article.txt | ./stopwords -w | sort | uniq

# Find the most common words in a document
./stopwords -c < article.txt | tail -20

# Clean up search queries
echo "how to install python on ubuntu" | ./stopwords
# Output: install python ubuntu

# Analyze Spanish content
curl -s https://example.com/es/article | ./stopwords -l spanish -c

# Remove stopwords before feeding to an ML model
for file in corpus/*.txt; do
  ./stopwords < "$file" > "processed/$(basename "$file")"
done
```

| Option | Long Form | Description |
|---|---|---|
| `-l LANG` | `--language LANG` | Set the language for stopwords (default: english) |
| `-p` | `--keep-punctuation` | Keep punctuation marks (default: remove) |
| `-w` | `--list-words` | Output filtered words as a list (one per line) |
| `-c` | `--count` | Output word frequency counts (sorted by count) |
| `-V` | `--version` | Show version information |
| `-h` | `--help` | Show help message |
Short options can be combined: `-lw`, `-pc`, etc.
- `0`: Success
- `1`: Data directory or stopwords file not found
- `2`: Missing argument for option
- `22`: Invalid option
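A calling script can branch on these codes. A sketch of such a wrapper (the `explain_exit` helper and its messages are hypothetical; only the codes come from the documentation above):

```bash
# Map the documented exit codes to human-readable messages.
explain_exit() {
  case $1 in
    0)  echo "success" ;;
    1)  echo "stopwords data or language file not found" ;;
    2)  echo "missing argument for an option" ;;
    22) echo "invalid option" ;;
    *)  echo "unknown status: $1" ;;
  esac
}

# Typical use after running the filter:
#   ./stopwords -l spanish "$text"; explain_exit $? >&2
explain_exit 22
# Output: invalid option
```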
- Load Stopwords: Reads stopwords from `$NLTK_DATA/corpora/stopwords/{language}` into a Bash associative array for O(1) lookup
- Normalize Text: Converts input to lowercase for case-insensitive matching
- Tokenize: Splits text on whitespace (optionally removes punctuation first)
- Filter: Checks each word against stopwords dictionary
- Output: Formats results based on selected output mode
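The normalize and tokenize steps can be sketched as follows (the function and parameter names are illustrative, not the script's own):

```bash
# Lowercase, optionally strip punctuation, then split on whitespace.
normalize_and_tokenize() {
  local text=${1,,}               # normalize: lowercase (Bash 4+)
  local keep_punct=${2:-0}
  if (( ! keep_punct )); then     # drop punctuation unless kept
    text=${text//[[:punct:]]/ }
  fi
  printf '%s\n' $text             # unquoted: word-split on whitespace
}

normalize_and_tokenize 'Hello, World!'
# Output:
# hello
# world
```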
- O(1) stopword lookup using Bash associative arrays
- Efficient for processing moderate-sized texts (< 1MB)
- Recommendation: For documents > 2,000 words, Python is the better choice for performance-critical applications
- Bash excels at small inputs (< 2,000 words) due to lower startup overhead
For small texts, Bash is typically faster because of Python's startup overhead. The crossover point is around 2,000 words, beyond which Python's faster string processing begins to dominate.
If you get an error about missing stopwords data, try one of the following:

- Install this package:

  ```bash
  sudo make install
  ```

- OR use Python NLTK:

  ```bash
  pip install nltk
  python -m nltk.downloader stopwords
  ```

- OR set NLTK_DATA manually:

  ```bash
  export NLTK_DATA=/path/to/your/nltk_data
  ```
If you have Python NLTK installed with stopwords, the script will automatically detect and use it. No additional configuration needed! The installation process will skip installing duplicate data.
GPL-3. See the LICENSE file for details.
Contributions are welcome! Please feel free to submit issues or pull requests.
Stopword lists are sourced from the NLTK corpus, which provides curated stopword lists for multiple languages.