A Python command-line tool that generates word clouds from directory contents. It reads various file formats (text, PDF, DOCX, etc.), stores word frequencies in a SQLite database, filters out common stopwords based on language, and renders the result as an image or as ASCII art.
- Multi-format support: Reads text files, PDFs, DOCX, HTML, XML, JSON, Markdown, and source code files
- Language-aware filtering: Removes common stopwords (articles, prepositions, etc.) based on language
- Persistent storage: Uses SQLite to store word frequencies, enabling incremental processing
- Multiple output formats: PNG, SVG, JPG, PDF for graphical output, or ASCII art for terminal display
- Highly configurable: Customize colors, dimensions, word limits, and more
- Cross-platform: Works on Linux, macOS, and Windows
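To make the persistent-storage feature concrete, here is a minimal sketch of how word counts could accumulate in SQLite across runs. The table name `word_counts`, the column `freq`, and the helper `add_counts` are assumptions made for this example; they are not taken from the tool's actual schema.

```python
import sqlite3
from collections import Counter


def add_counts(conn: sqlite3.Connection, counts: Counter) -> None:
    """Add new counts to any totals already stored (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts ("
        "word TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    rows = list(counts.items())
    # Insert unseen words with a zero count, then add the new counts.
    # Two plain statements avoid relying on SQLite's newer upsert syntax.
    conn.executemany(
        "INSERT OR IGNORE INTO word_counts (word, freq) VALUES (?, 0)",
        [(word,) for word, _ in rows],
    )
    conn.executemany(
        "UPDATE word_counts SET freq = freq + ? WHERE word = ?",
        [(n, word) for word, n in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path to keep data between runs
add_counts(conn, Counter("the cat saw the other cat".split()))
add_counts(conn, Counter("another cat".split()))  # a later, incremental run
totals = dict(conn.execute("SELECT word, freq FROM word_counts"))
```

Because the second call only adds to existing totals, a later run over new files does not need to re-read the earlier ones.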
# Clone the repository
git clone https://github.com/example/py-word-cloud.git
cd py-word-cloud
# Install uv if not already installed
# curl -LsSf https://astral.sh/uv/install.sh | sh
# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install the package in development mode
uv pip install -e .
# Or install dependencies only
uv pip install -r requirements.txt
# Run directly with uv (no venv needed)
uv run python -m wordcloud_generator --help
# Or with dependencies synced
uv run --with wordcloud --with matplotlib --with pypdf --with python-docx --with chardet --with Pillow python -m wordcloud_generator
The required dependencies will be installed automatically:
- wordcloud: Word cloud generation
- matplotlib: Plotting and image generation
- pypdf: PDF file reading
- python-docx: DOCX file reading
- chardet: Character encoding detection
- Pillow: Image processing
# Generate a word cloud from the current directory
wordcloud-gen
# Process a specific directory
wordcloud-gen /path/to/documents
# Generate with specific language stopwords
wordcloud-gen -l es /path/to/spanish/docs
# Generate ASCII art (for terminal display)
wordcloud-gen --ascii
# Save to a specific file
wordcloud-gen -o my_cloud.png /path/to/docs
usage: wordcloud-gen [-h] [-l LANG] [--list-languages] [-o FILE]
                     [--format {png,svg,jpg,pdf}] [-a] [--simple] [--db FILE]
                     [--clear-db] [-r] [--no-recursive] [-e EXT [EXT ...]]
                     [--min-length N] [--max-length N] [--min-count N]
                     [--max-words N] [--ignore-words WORD [WORD ...]]
                     [-W N] [-H N] [--bgColor COLOR] [--color COLOR]
                     [--ignore-contrast-warning] [--colormap NAME] [--title TEXT]
                     [-v] [-q] [--stats] [--dry-run] [--version] [directory]
| Argument | Description |
|---|---|
| directory | Directory to process (default: current directory) |
| Option | Description |
|---|---|
| -l, --language LANG | Language code for stopwords (e.g., 'en', 'es'). Auto-detected from the OS locale by default |
| --list-languages | List available language codes and exit |
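The README does not say how the language is detected from the OS locale. One plausible approach, sketched below, reads the POSIX locale environment variables and keeps the leading language code. The function name `detect_language`, the `SUPPORTED` set, and the fallback to `en` are assumptions for illustration only.

```python
import os
from typing import Mapping, Optional

# Language codes listed in this README.
SUPPORTED = {"en", "es", "fr", "de", "pt", "it"}


def detect_language(
    env: Optional[Mapping[str, str]] = None, default: str = "en"
) -> str:
    """Guess a stopword language code from POSIX-style locale variables."""
    env = os.environ if env is None else env
    # Check the variables in POSIX precedence order; values that do not name
    # a supported language (such as "C" or "POSIX") are skipped.
    for name in ("LC_ALL", "LC_MESSAGES", "LANG"):
        value = env.get(name, "")
        # "es_ES.UTF-8" -> "es", "de@euro" -> "de"
        code = value.split(".")[0].split("_")[0].split("@")[0].lower()
        if code in SUPPORTED:
            return code
    return default
```

On systems where these variables are unset, which is common on Windows, this sketch simply returns the default, so an explicit `-l` remains the reliable choice.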
| Option | Description |
|---|---|
| -o, --output FILE | Output file path (default: wordcloud.png or .txt) |
| --format {png,svg,jpg,pdf} | Output format for graphical clouds (default: png) |
| -a, --ascii | Generate ASCII art output instead of image |
| --simple | Generate simple text frequency chart |
| Option | Description |
|---|---|
| --db FILE | SQLite database file for persistent storage (default: in-memory) |
| --clear-db | Clear existing database before processing |
| Option | Description |
|---|---|
| -r, --recursive | Recursively process subdirectories (default: True) |
| --no-recursive | Don't process subdirectories |
| -e, --extensions EXT | File extensions to process (e.g., txt pdf docx) |
| --min-length N | Minimum word length (default: 2) |
| --max-length N | Maximum word length (default: 50) |
| --min-count N | Minimum word count to include (default: 1) |
| --max-words N | Maximum words in cloud (default: 200) |
| --ignore-words WORD | Additional words to ignore |
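The length, count, and word-limit options above can be read as a single filtering pass over the collected counts. The sketch below shows one way that pass might work, using the documented defaults. The function `select_words` is hypothetical, and matching ignore words case-insensitively is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def select_words(
    counts: Counter,
    min_length: int = 2,
    max_length: int = 50,
    min_count: int = 1,
    max_words: int = 200,
    ignore_words: Iterable[str] = (),
) -> List[Tuple[str, int]]:
    """Return the most frequent words that pass every filter."""
    ignored = {w.lower() for w in ignore_words}
    kept = Counter({
        word: n
        for word, n in counts.items()
        if min_length <= len(word) <= max_length
        and n >= min_count
        and word.lower() not in ignored
    })
    # most_common() sorts by count, highest first, and caps the result.
    return kept.most_common(max_words)


counts = Counter({"a": 10, "todo": 5, "cloud": 4, "word": 4, "rare": 1})
selected = select_words(counts, min_count=2, ignore_words=["TODO"])
```

Here `a` is dropped for being too short, `todo` is ignored, and `rare` falls below the minimum count.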
| Option | Description |
|---|---|
| -W, --width N | Output width in pixels/chars (default: 800) |
| -H, --height N | Output height in pixels/lines (default: 400) |
| --bgColor COLOR, --bgcolor COLOR | Background color (default: white). Both spellings are accepted. |
| --color COLOR | Word color(s) for image output only (not ASCII/console). Supports RGB (rgb(255,0,0)), hex (#ff0000), or 10 common names: red, green, blue, yellow, orange, purple, pink, brown, black, gray. Separate multiple colors with commas; words are drawn in shades of the given color(s). |
| --ignore-contrast-warning | Skip the automatic color adjustment that enforces WCAG AA contrast; use the requested colors as-is (a warning is still printed). |
| --colormap NAME | Matplotlib colormap for word colors (e.g., 'viridis', 'plasma') |
| --title TEXT | Title for the word cloud |
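For reference, the WCAG contrast check mentioned for `--ignore-contrast-warning` is based on the relative luminance of the two colors. The sketch below implements the WCAG 2.x contrast-ratio formula. The function names are hypothetical, and the 4.5:1 threshold is the AA value for normal-size text; which threshold the tool applies is not stated in this README.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def relative_luminance(rgb: RGB) -> float:
    """Relative luminance of an sRGB color, as defined by WCAG 2.x."""
    def linearize(channel: int) -> float:
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: RGB, bg: RGB) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) to 21.0."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg: RGB, bg: RGB, threshold: float = 4.5) -> bool:
    """True if the pair meets the given AA threshold (assumed 4.5:1 here)."""
    return contrast_ratio(fg, bg) >= threshold
```

Under this formula pure red on white falls just short of 4.5:1, which is the kind of pairing an automatic adjustment would have to darken.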
| Option | Description |
|---|---|
| -v, --verbose | Enable verbose output |
| -q, --quiet | Suppress non-error output |
| --stats | Print statistics after processing |
| --dry-run | Process files but don't generate output |
| --version | Show version and exit |
# Process current directory with default settings
wordcloud-gen
# Process a specific directory
wordcloud-gen ~/Documents/research
# Specify output file
wordcloud-gen -o research_cloud.png ~/Documents/research
# Use Spanish stopwords
wordcloud-gen -l es ~/Documents/spanish_texts
# Use German stopwords
wordcloud-gen -l de ~/Documents/german_docs
# List available languages
wordcloud-gen --list-languages
The ASCII mode generates true variable-sized text art, where word size reflects frequency:
# Generate ASCII word cloud with sized text
wordcloud-gen --ascii
# Generate simple frequency chart
wordcloud-gen --simple
# Save ASCII output to file
wordcloud-gen --ascii -o cloud.txt
# Adjust dimensions for terminal
wordcloud-gen --ascii -W 120 -H 40
Example output:
╔════════════════════════════════════════════════════════════════════╗
║ ║
║ ###### # # ####### # # ####### # # ║
║ # # # # # # # # # ## # ║
║ ###### # # ####### # # # # # ← Largest word ║
║ # # # # # # # # # # ║
║ # # # # # ####### # # ║
║ ║
║ ██▄ ██▄ ▄█▄ ▄█▀ ▄█▀ ▄█▄ ██▄ ██▀ ← Medium words ║
║ █▀▀ █▀▄ █ █ █ █ █ █ █ █ █ █▀ ║
║ █ █ █ ▀█▀ ▀█▄ ▀█▄ ▀█▀ ██▀ ██▄ ║
║ ║
║ ◆DEVELOPER◆ ◆COMPUTER◆ ◆SOFTWARE◆ ← Small words ║
║ ║
║ algorithm function variable class method ← Smallest ║
╚════════════════════════════════════════════════════════════════════╝
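The example output shows four sizes, from largest to smallest. How the tool assigns a word to a size is not documented here; the sketch below shows one possible rule, bucketing counts on a logarithmic scale so that a few very frequent words do not push everything else into the smallest size. The names `TIERS` and `size_tier` are hypothetical.

```python
import math

# Ordered from least to most prominent, matching the four sizes shown above.
TIERS = ["smallest", "small", "medium", "largest"]


def size_tier(count: int, max_count: int) -> str:
    """Map a word count to a display tier using a log scale (assumed rule)."""
    if count < 1 or max_count < 1:
        raise ValueError("counts must be positive")
    if max_count == 1:
        return TIERS[-1]  # every word is equally frequent
    # 0.0 for a count of 1, 1.0 for the most frequent word.
    fraction = math.log(count) / math.log(max_count)
    index = min(int(fraction * len(TIERS)), len(TIERS) - 1)
    return TIERS[index]
```

With a maximum count of 100, counts of 1, 5, 20, and 100 land in the smallest, small, medium, and largest tiers respectively.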
# Use persistent database for large projects
wordcloud-gen --db project_words.db ~/large/project
# Continue incrementally with same database
wordcloud-gen --db project_words.db ~/large/project/new_files
# Clear database and reprocess
wordcloud-gen --db project_words.db --clear-db ~/project
# Only process specific file types
wordcloud-gen -e txt md rst ~/docs
# Customize appearance (color for words + bgColor; --bgcolor alias also works)
# Use --ignore-contrast-warning to force original colors (skips auto-adjust)
wordcloud-gen --width 1200 --height 600 --colormap viridis --bgColor black --color red,blue
# Add custom ignore words
wordcloud-gen --ignore-words TODO FIXME DEBUG ~/code
# Get statistics without generating output
wordcloud-gen --dry-run --stats ~/docs
# Generate scalable vector graphics
wordcloud-gen --format svg -o cloud.svg ~/docs
| Extension | Description |
|---|---|
| .txt | Plain text files |
| .md, .markdown | Markdown files |
| .pdf | PDF documents |
| .docx | Microsoft Word documents |
| .html, .htm | HTML files |
| .xml | XML files |
| .json | JSON files |
| .csv, .tsv | CSV/TSV files |
| .rst | reStructuredText files |
| .rtf | Rich Text Format files |
| .py, .js, .java, .c, .cpp, .go, .rs | Source code files |
| .yaml, .yml, .ini, .cfg, .conf | Configuration files |
| .log | Log files |
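File selection by extension, combined with the recursive and non-recursive modes, can be sketched with `pathlib`. This is an illustration only: `find_files` is a hypothetical helper, and treating extensions case-insensitively is an assumption rather than documented behavior.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def find_files(
    root: Path, extensions: Iterable[str], recursive: bool = True
) -> Iterator[Path]:
    """Yield files under root whose extension is in the given list."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    candidates = root.rglob("*") if recursive else root.glob("*")
    for path in sorted(candidates):
        if path.is_file() and path.suffix.lower().lstrip(".") in wanted:
            yield path


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("notes.txt", "image.bin", "sub/README.MD"):
        (root / name).write_text("sample")
    all_found = [p.name for p in find_files(root, ["txt", "md"])]
    top_level = [p.name for p in find_files(root, ["txt", "md"], recursive=False)]
```

In the demonstration, the recursive pass picks up the file in `sub/`, while the non-recursive pass stops at the top level.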
The tool includes stopword lists for multiple languages:
- en: English
- es: Spanish
- fr: French
- de: German
- pt: Portuguese
- it: Italian
Stopword files are located in the ignore/ directory. Each file is a CSV containing common words to be filtered out.
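A loader for these files could look like the sketch below. It follows the format described in this README (comma-separated words, with lines starting with # treated as comments); the function names `parse_stopwords` and `load_stopwords` are hypothetical, and lowercasing the words is an assumption.

```python
from pathlib import Path
from typing import Set


def parse_stopwords(text: str) -> Set[str]:
    """Parse comma-separated stopwords, skipping blank and comment lines."""
    words: Set[str] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words.update(w.strip().lower() for w in line.split(",") if w.strip())
    return words


def load_stopwords(language: str, directory: str = "ignore") -> Set[str]:
    """Read ignore/<language>.csv relative to the current directory."""
    path = Path(directory) / f"{language}.csv"
    return parse_stopwords(path.read_text(encoding="utf-8"))


sample = "# English stopwords\na,an,the,and,or,but\nis,are,was,were,be\n"
stopwords = parse_stopwords(sample)
```

Storing the result in a set makes duplicate entries harmless and keeps membership checks fast during filtering.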
You can add custom stopword files by creating a new file in the ignore/ directory:
# Create a custom stopword file for Dutch
echo "de,het,een,en,van,in,is,dat,op" > ignore/nl.csv
Stopword files should be in CSV format, with words separated by commas. Lines starting with # are treated as comments:
# English stopwords
a,an,the,and,or,but
is,are,was,were,be
# Add more words...
You can also use the library programmatically in your Python code:
from wordcloud_generator import WordDatabase, FileReader, WordProcessor, CloudGenerator
# Initialize components
db = WordDatabase("words.db")
reader = FileReader()
processor = WordProcessor(language="en")
# Process files
for file_path in reader.get_files("/path/to/docs"):
    content, file_hash = reader.read_file(file_path)
    word_counts = processor.process_text(content)
    db.add_words_batch(word_counts)
# Generate word cloud (color param uses specified color + shades for words;
# supports RGB/hex/names; auto-adjusts poor contrast unless ignore_contrast_warning=True)
generator = CloudGenerator(
    width=800, height=400, background_color="white", color="blue,red"
)
word_counts = db.get_all_words(min_count=2)
generator.generate(word_counts, "output.png")
py-word-cloud/
├── README.md # This file
├── requirements.txt # Python dependencies
├── setup.py # Package setup script
├── wordcloud_generator/ # Main package
│ ├── __init__.py # Package initialization
│ ├── cli.py # Command-line interface
│ ├── db.py # SQLite database operations
│ ├── file_reader.py # File reading utilities
│ ├── word_processor.py # Text processing and filtering
│ └── cloud_generator.py # Word cloud generation
└── ignore/ # Stopword files
    ├── en.csv # English stopwords
    ├── es.csv # Spanish stopwords
    ├── fr.csv # French stopwords
    ├── de.csv # German stopwords
    ├── pt.csv # Portuguese stopwords
    └── it.csv # Italian stopwords
- Python 3.8 or higher
- uv package manager (recommended) or pip
- For graphical output: a display server (X11, Wayland) or use --ascii for headless systems
MIT License - see LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.