py-word-cloud

A Python command-line tool that generates word clouds from the contents of a directory. It reads a range of file formats (plain text, PDF, DOCX, and others), stores word frequencies in a SQLite database, filters out common stopwords for the chosen language, and renders the result as a word cloud image or as ASCII art.

Features

Multi-format support: Reads text files, PDFs, DOCX, HTML, XML, JSON, Markdown, and source code files
Language-aware filtering: Removes common stopwords (articles, prepositions, etc.) based on language
Persistent storage: Uses SQLite to store word frequencies, enabling incremental processing
Multiple output formats: PNG, SVG, JPG, PDF for graphical output, or ASCII art for terminal display
Highly configurable: Customize colors, dimensions, word limits, and more
Cross-platform: Works on Linux, macOS, and Windows

Installation

From source (using uv)

# Clone the repository
git clone https://github.com/example/py-word-cloud.git
cd py-word-cloud

# Install uv if not already installed
# curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package in development mode
uv pip install -e .

# Or install dependencies only
uv pip install -r requirements.txt

Quick run without installation

# Run directly with uv (no venv needed)
uv run python -m wordcloud_generator --help

# Or supply the dependencies ad hoc with --with
uv run --with wordcloud --with matplotlib --with pypdf --with python-docx --with chardet --with Pillow python -m wordcloud_generator

Dependencies

The required dependencies will be installed automatically:

wordcloud - Word cloud generation
matplotlib - Plotting and image generation
pypdf - PDF file reading
python-docx - DOCX file reading
chardet - Character encoding detection
Pillow - Image processing

Quick Start

# Generate a word cloud from the current directory
wordcloud-gen

# Process a specific directory
wordcloud-gen /path/to/documents

# Generate with specific language stopwords
wordcloud-gen -l es /path/to/spanish/docs

# Generate ASCII art (for terminal display)
wordcloud-gen --ascii

# Save to a specific file
wordcloud-gen -o my_cloud.png /path/to/docs

Usage

usage: wordcloud-gen [-h] [-l LANG] [--list-languages] [-o FILE]
                     [--format {png,svg,jpg,pdf}] [-a] [--simple] [--db FILE]
                     [--clear-db] [-r] [--no-recursive] [-e EXT [EXT ...]]
                     [--min-length N] [--max-length N] [--min-count N]
                     [--max-words N] [--ignore-words WORD [WORD ...]]
                     [-W N] [-H N] [--bgColor COLOR] [--color COLOR]
                     [--ignore-contrast-warning] [--colormap NAME] [--title TEXT]
                     [-v] [-q] [--stats] [--dry-run] [--version] [directory]

Arguments

Argument	Description
`directory`	Directory to process (default: current directory)

Options

Language Options

Option	Description
`-l, --language LANG`	Language code for stopwords (e.g., 'en', 'es'). Auto-detects from OS locale by default
`--list-languages`	List available language codes and exit
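
The locale-based default can be pictured with a small sketch. This is not the tool's actual detection code: the function names `language_from_locale` and `detect_language`, the fallback to `en`, and the `SUPPORTED` set (copied from the Stopwords section) are assumptions for illustration.

```python
import locale

# Language codes listed in the Stopwords section of this README.
SUPPORTED = {"en", "es", "fr", "de", "pt", "it"}


def language_from_locale(locale_name, default="en"):
    """Map a locale name such as 'es_ES' or 'pt_BR.UTF-8' to a language code."""
    if not locale_name:
        return default
    # Keep only the part before the territory and encoding: 'pt_BR.UTF-8' -> 'pt'
    code = locale_name.replace("-", "_").split("_")[0].split(".")[0].lower()
    return code if code in SUPPORTED else default


def detect_language(default="en"):
    """Read the OS locale and fall back to the default when it is unusable."""
    try:
        # getlocale() returns (language_code, encoding); either may be None.
        name, _encoding = locale.getlocale()
    except ValueError:
        return default
    return language_from_locale(name, default)


if __name__ == "__main__":
    print(detect_language())
```

Passing `-l` explicitly avoids any dependence on how the operating system reports its locale.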

Output Options

Option	Description
`-o, --output FILE`	Output file path (default: wordcloud.png, or wordcloud.txt for text output)
`--format {png,svg,jpg,pdf}`	Output format for graphical clouds (default: png)
`-a, --ascii`	Generate ASCII art output instead of image
`--simple`	Generate simple text frequency chart

Database Options

Option	Description
`--db FILE`	SQLite database file for persistent storage (default: in-memory)
`--clear-db`	Clear existing database before processing
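
To illustrate how a persistent database allows incremental runs, here is a minimal sketch using the standard `sqlite3` module. The table name `words`, the column names, and the helper functions are hypothetical and do not describe the tool's real schema. The `ON CONFLICT` upsert requires SQLite 3.24 or newer.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a word-frequency database; ':memory:' is not persisted."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, total INTEGER NOT NULL)"
    )
    return conn


def add_words(conn, word_counts):
    """Insert new words and add to the totals of words already stored."""
    conn.executemany(
        "INSERT INTO words (word, total) VALUES (?, ?) "
        "ON CONFLICT(word) DO UPDATE SET total = total + excluded.total",
        list(word_counts.items()),
    )
    conn.commit()


def clear_db(conn):
    """Remove all stored counts, similar in spirit to --clear-db."""
    conn.execute("DELETE FROM words")
    conn.commit()
```

Because each run only adds to the stored totals, processing a new batch of files against the same database file extends the earlier counts instead of replacing them.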

Processing Options

Option	Description
`-r, --recursive`	Recursively process subdirectories (default: True)
`--no-recursive`	Don't process subdirectories
`-e, --extensions EXT`	File extensions to process (e.g., txt pdf docx)
`--min-length N`	Minimum word length (default: 2)
`--max-length N`	Maximum word length (default: 50)
`--min-count N`	Minimum word count to include (default: 1)
`--max-words N`	Maximum words in cloud (default: 200)
`--ignore-words WORD`	Additional words to ignore
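
The filtering options above can be read as a single selection step over a frequency table. The sketch below shows one plausible interpretation; the function name `select_words` is hypothetical, and the defaults simply mirror the values in the table.

```python
from collections import Counter


def select_words(counts, min_length=2, max_length=50, min_count=1,
                 max_words=200, ignore=()):
    """Apply length, count, and ignore-list filters, then keep the top words."""
    ignored = {word.lower() for word in ignore}
    kept = Counter({
        word: n
        for word, n in counts.items()
        if min_length <= len(word) <= max_length
        and n >= min_count
        and word.lower() not in ignored
    })
    # most_common() sorts by descending count and truncates to max_words.
    return kept.most_common(max_words)


if __name__ == "__main__":
    sample = Counter("a bb bb ccc ccc ccc todo".split())
    print(select_words(sample, ignore=["TODO"]))
```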

Appearance Options

Option	Description
`-W, --width N`	Output width in pixels/chars (default: 800)
`-H, --height N`	Output height in pixels/lines (default: 400)
`--bgColor COLOR`, `--bgcolor COLOR`	Background color (default: white). Both spellings are accepted.
`--color COLOR`	Word color(s), for image output only (not ASCII/console). Accepts RGB (`rgb(255,0,0)`), hex (`#ff0000`), or one of 10 common names: red, green, blue, yellow, orange, purple, pink, brown, black, gray. Separate multiple colors with commas; words are drawn in shades of the given color(s).
`--ignore-contrast-warning`	Skip the automatic color adjustment that enforces WCAG AA contrast; the requested colors are used as-is (with a warning).
`--colormap NAME`	Matplotlib colormap for word colors (e.g., 'viridis', 'plasma')
`--title TEXT`	Title for the word cloud
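
For readers unfamiliar with the WCAG contrast check mentioned for `--ignore-contrast-warning`, the sketch below computes the contrast ratio as defined in WCAG 2.x. It is not the tool's own code. The function names are hypothetical, and which AA threshold the tool applies (4.5:1 for normal text or 3:1 for large text) is not stated in this README, so both are exposed.

```python
def relative_luminance(rgb):
    """Relative luminance of an sRGB color given as (r, g, b) in 0..255."""
    def linearize(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground, background):
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground, background, large_text=False):
    """AA requires 4.5:1 for normal text and 3:1 for large text."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

Yellow words on a white background, for example, fall well below either threshold, which is the kind of combination an automatic adjustment would need to correct.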

General Options

Option	Description
`-v, --verbose`	Enable verbose output
`-q, --quiet`	Suppress non-error output
`--stats`	Print statistics after processing
`--dry-run`	Process files but don't generate output
`--version`	Show version and exit

Examples

Basic Usage

# Process current directory with default settings
wordcloud-gen

# Process a specific directory
wordcloud-gen ~/Documents/research

# Specify output file
wordcloud-gen -o research_cloud.png ~/Documents/research

Language-Specific Processing

# Use Spanish stopwords
wordcloud-gen -l es ~/Documents/spanish_texts

# Use German stopwords
wordcloud-gen -l de ~/Documents/german_docs

# List available languages
wordcloud-gen --list-languages

ASCII Output (for non-graphical systems)

The ASCII mode generates true variable-sized text art, where word size reflects frequency:

# Generate ASCII word cloud with sized text
wordcloud-gen --ascii

# Generate simple frequency chart
wordcloud-gen --simple

# Save ASCII output to file
wordcloud-gen --ascii -o cloud.txt

# Adjust dimensions for terminal
wordcloud-gen --ascii -W 120 -H 40

Example output:

╔════════════════════════════════════════════════════════════════════╗
║                                                                    ║
║  ######  #     # ####### #     # ####### #     #                   ║
║  #     #  #   #     #    #     # #     # ##    #                   ║
║  ######     #       #    ####### #     # #  #  #   ← Largest word  ║
║  #          #       #    #     # #     # #   # #                   ║
║  #          #       #    #     # ####### #     #                   ║
║                                                                    ║
║     ██▄ ██▄ ▄█▄ ▄█▀   ▄█▀ ▄█▄ ██▄ ██▀            ← Medium words   ║
║      █▀▀ █▀▄ █ █ █ █   █   █ █ █ █ █▀                              ║
║     █   █ █ ▀█▀ ▀█▄   ▀█▄ ▀█▀ ██▀ ██▄                              ║
║                                                                    ║
║           ◆DEVELOPER◆  ◆COMPUTER◆  ◆SOFTWARE◆     ← Small words   ║
║                                                                    ║
║     algorithm function variable class method       ← Smallest     ║
╚════════════════════════════════════════════════════════════════════╝
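
The example output shows four size tiers. One simple way to assign a word to a tier is to compare its count with the highest count, as sketched below. The linear bucketing and the function name `size_tier` are assumptions; the tool may use a different scale.

```python
def size_tier(count, max_count, tiers=4):
    """Return 0 for the largest tier, up to tiers - 1 for the smallest."""
    if max_count <= 0:
        raise ValueError("max_count must be positive")
    ratio = min(count / max_count, 1.0)
    # A ratio of 1.0 maps to tier 0; ratios near 0 map to the last tier.
    return min(tiers - 1, int((1 - ratio) * tiers))


if __name__ == "__main__":
    counts = {"python": 100, "code": 60, "developer": 30, "algorithm": 10}
    top = max(counts.values())
    for word, n in counts.items():
        print(word, size_tier(n, top))
```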

Advanced Options

# Use persistent database for large projects
wordcloud-gen --db project_words.db ~/large/project

# Continue incrementally with same database
wordcloud-gen --db project_words.db ~/large/project/new_files

# Clear database and reprocess
wordcloud-gen --db project_words.db --clear-db ~/project

# Only process specific file types
wordcloud-gen -e txt md rst ~/docs

# Customize appearance (color for words + bgColor; --bgcolor alias also works)
# Use --ignore-contrast-warning to force original colors (skips auto-adjust)
wordcloud-gen --width 1200 --height 600 --colormap viridis --bgColor black --color red,blue

# Add custom ignore words
wordcloud-gen --ignore-words TODO FIXME DEBUG ~/code

# Get statistics without generating output
wordcloud-gen --dry-run --stats ~/docs

SVG Output

# Generate scalable vector graphics
wordcloud-gen --format svg -o cloud.svg ~/docs

Supported File Formats

Extension	Description
`.txt`	Plain text files
`.md`, `.markdown`	Markdown files
`.pdf`	PDF documents
`.docx`	Microsoft Word documents
`.html`, `.htm`	HTML files
`.xml`	XML files
`.json`	JSON files
`.csv`, `.tsv`	CSV/TSV files
`.rst`	reStructuredText files
`.rtf`	Rich Text Format files
`.py`, `.js`, `.java`, `.c`, `.cpp`, `.go`, `.rs`	Source code files
`.yaml`, `.yml`, `.ini`, `.cfg`, `.conf`	Configuration files
`.log`	Log files
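
The interaction between `-e`, `-r`, and `--no-recursive` can be sketched with `pathlib`. This is an illustration only: `collect_files` is a hypothetical helper, and its default extension list is a small subset chosen for the example.

```python
from pathlib import Path


def collect_files(directory, extensions=("txt", "md", "pdf"), recursive=True):
    """Return files under directory whose extension is in the given list."""
    # Accept 'txt' as well as '.txt', and compare case-insensitively.
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    root = Path(directory)
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path for path in candidates
        if path.is_file() and path.suffix.lower() in wanted
    )
```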

Stopwords

The tool includes stopword lists for multiple languages:

en - English
es - Spanish
fr - French
de - German
pt - Portuguese
it - Italian

Stopword files are located in the ignore/ directory. Each file is a CSV containing common words to be filtered out.

Adding Custom Stopwords

You can add custom stopword files by creating a new file in the ignore/ directory:

# Create a custom stopword file for Dutch
echo "de,het,een,en,van,in,is,dat,op" > ignore/nl.csv

File Format

Stopword files should be CSV format with words separated by commas. Lines starting with # are treated as comments:

# English stopwords
a,an,the,and,or,but
is,are,was,were,be
# Add more words...
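
A file in this format can be parsed with the standard `csv` module. The sketch below is an assumption about how such a loader might look, not the tool's own reader; `parse_stopwords` is a hypothetical name, and lowercasing the entries is a choice made for the example.

```python
import csv


def parse_stopwords(text):
    """Parse comma-separated stopwords, skipping blank lines and # comments."""
    lines = (
        line for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )
    words = set()
    for row in csv.reader(lines):
        words.update(cell.strip().lower() for cell in row if cell.strip())
    return words


if __name__ == "__main__":
    sample = "# English stopwords\na,an,the,and,or,but\nis,are,was,were,be\n"
    stopwords = parse_stopwords(sample)
    tokens = ["the", "cloud", "is", "a", "word", "cloud"]
    print([token for token in tokens if token not in stopwords])
```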

Programmatic Usage

You can also use the library programmatically in your Python code:

from wordcloud_generator import WordDatabase, FileReader, WordProcessor, CloudGenerator

# Initialize components
db = WordDatabase("words.db")
reader = FileReader()
processor = WordProcessor(language="en")

# Process files
for file_path in reader.get_files("/path/to/docs"):
    content, file_hash = reader.read_file(file_path)
    word_counts = processor.process_text(content)
    db.add_words_batch(word_counts)

# Generate word cloud (color param uses specified color + shades for words;
# supports RGB/hex/names; auto-adjusts poor contrast unless ignore_contrast_warning=True)
generator = CloudGenerator(
    width=800, height=400, background_color="white", color="blue,red"
)
word_counts = db.get_all_words(min_count=2)
generator.generate(word_counts, "output.png")

Project Structure

py-word-cloud/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── setup.py                  # Package setup script
├── wordcloud_generator/      # Main package
│   ├── __init__.py          # Package initialization
│   ├── cli.py               # Command-line interface
│   ├── db.py                # SQLite database operations
│   ├── file_reader.py       # File reading utilities
│   ├── word_processor.py    # Text processing and filtering
│   └── cloud_generator.py   # Word cloud generation
└── ignore/                   # Stopword files
    ├── en.csv               # English stopwords
    ├── es.csv               # Spanish stopwords
    ├── fr.csv               # French stopwords
    ├── de.csv               # German stopwords
    ├── pt.csv               # Portuguese stopwords
    └── it.csv               # Italian stopwords

Requirements

Python 3.8 or higher
uv package manager (recommended) or pip
For graphical output: a display server (X11, Wayland); on headless systems, use --ascii instead

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
