A text analyzer and word cloud generator in Python. Outputs PNG or CLI (ASCII art), supports different languages and ignore words, and takes txt or pdf inputs.

xnt/py-word-cloud

py-word-cloud

A Python command-line tool that generates word clouds from directory contents. It reads various file formats (text, PDF, DOCX, etc.), stores word frequencies in a SQLite database, filters out common stopwords based on language, and generates beautiful word cloud visualizations.

Features

  • Multi-format support: Reads text files, PDFs, DOCX, HTML, XML, JSON, Markdown, and source code files
  • Language-aware filtering: Removes common stopwords (articles, prepositions, etc.) based on language
  • Persistent storage: Uses SQLite to store word frequencies, enabling incremental processing
  • Multiple output formats: PNG, SVG, JPG, PDF for graphical output, or ASCII art for terminal display
  • Highly configurable: Customize colors, dimensions, word limits, and more
  • Cross-platform: Works on Linux, macOS, and Windows

Installation

From source (using uv)

# Clone the repository
git clone https://github.com/xnt/py-word-cloud.git
cd py-word-cloud

# Install uv if not already installed
# curl -LsSf https://astral.sh/uv/install.sh | sh

# Create a virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install the package in development mode
uv pip install -e .

# Or install dependencies only
uv pip install -r requirements.txt

Quick run without installation

# Run directly with uv (no venv needed)
uv run python -m wordcloud_generator --help

# Or with dependencies synced
uv run --with wordcloud --with matplotlib --with pypdf --with python-docx --with chardet --with Pillow python -m wordcloud_generator

Dependencies

The required dependencies will be installed automatically:

  • wordcloud - Word cloud generation
  • matplotlib - Plotting and image generation
  • pypdf - PDF file reading
  • python-docx - DOCX file reading
  • chardet - Character encoding detection
  • Pillow - Image processing

Quick Start

# Generate a word cloud from the current directory
wordcloud-gen

# Process a specific directory
wordcloud-gen /path/to/documents

# Generate with specific language stopwords
wordcloud-gen -l es /path/to/spanish/docs

# Generate ASCII art (for terminal display)
wordcloud-gen --ascii

# Save to a specific file
wordcloud-gen -o my_cloud.png /path/to/docs

Usage

usage: wordcloud-gen [-h] [-l LANG] [--list-languages] [-o FILE]
                     [--format {png,svg,jpg,pdf}] [-a] [--simple] [--db FILE]
                     [--clear-db] [-r] [--no-recursive] [-e EXT [EXT ...]]
                     [--min-length N] [--max-length N] [--min-count N]
                     [--max-words N] [--ignore-words WORD [WORD ...]]
                     [-W N] [-H N] [--bgColor COLOR] [--color COLOR]
                     [--ignore-contrast-warning] [--colormap NAME] [--title TEXT]
                     [-v] [-q] [--stats] [--dry-run] [--version] [directory]

Arguments

Argument Description
directory Directory to process (default: current directory)

Options

Language Options

Option Description
-l, --language LANG Language code for stopwords (e.g., 'en', 'es'). Auto-detects from OS locale by default
--list-languages List available language codes and exit
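
The locale-based default can be pictured with a short sketch. This is not the tool's actual detection code: the function names and the fallback to 'en' are assumptions made for illustration, and only the six language codes come from this README.

```python
import locale

# Language codes listed in the Stopwords section of this README.
SUPPORTED = {"en", "es", "fr", "de", "pt", "it"}


def language_from_locale(locale_name, default="en"):
    """Map a locale string such as 'es_ES.UTF-8' to a two-letter code.

    Hypothetical helper: falls back to `default` when the locale is
    missing or names a language without a stopword list.
    """
    if not locale_name:
        return default
    # Drop the encoding ('.UTF-8') and modifier ('@euro'), then the territory.
    base = locale_name.split(".")[0].split("@")[0]
    code = base.replace("-", "_").split("_")[0].lower()
    return code if code in SUPPORTED else default


def detect_language(default="en"):
    """Best-effort lookup of the OS locale."""
    try:
        locale_name = locale.getlocale()[0]
    except ValueError:
        # Some platforms report locale names Python cannot parse.
        locale_name = None
    return language_from_locale(locale_name, default)
```

Passing -l explicitly sidesteps any ambiguity in what the operating system reports.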

Output Options

Option Description
-o, --output FILE Output file path (default: wordcloud.png or .txt)
--format {png,svg,jpg,pdf} Output format for graphical clouds (default: png)
-a, --ascii Generate ASCII art output instead of image
--simple Generate simple text frequency chart

Database Options

Option Description
--db FILE SQLite database file for persistent storage (default: in-memory)
--clear-db Clear existing database before processing
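
To illustrate how a SQLite file can accumulate counts across runs, here is a minimal sketch using the standard sqlite3 module. The table layout and the function names (open_db, add_counts, top_words) are assumptions for illustration, not the schema this tool uses.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a database with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "word TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    return conn


def add_counts(conn, word_counts):
    """Add one batch of counts to the running totals."""
    items = list(word_counts.items())
    with conn:  # commits on success, rolls back on error
        # Make sure every word has a row, then increment it.
        conn.executemany(
            "INSERT OR IGNORE INTO words (word, count) VALUES (?, 0)",
            [(word,) for word, _ in items],
        )
        conn.executemany(
            "UPDATE words SET count = count + ? WHERE word = ?",
            [(count, word) for word, count in items],
        )


def top_words(conn, min_count=1, limit=200):
    """Return (word, count) pairs, most frequent first."""
    rows = conn.execute(
        "SELECT word, count FROM words WHERE count >= ? "
        "ORDER BY count DESC, word LIMIT ?",
        (min_count, limit),
    )
    return rows.fetchall()
```

Because totals live in the file, a second run against the same path only has to add the counts for new input.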

Processing Options

Option Description
-r, --recursive Recursively process subdirectories (default: True)
--no-recursive Don't process subdirectories
-e, --extensions EXT File extensions to process (e.g., txt pdf docx)
--min-length N Minimum word length (default: 2)
--max-length N Maximum word length (default: 50)
--min-count N Minimum word count to include (default: 1)
--max-words N Maximum words in cloud (default: 200)
--ignore-words WORD Additional words to ignore
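
The length, count, and size limits above compose naturally into two filtering steps. The sketch below shows one way to apply them. The tokenization rule (runs of letters, lowercased) and the function names are assumptions, and only the default values are taken from the table.

```python
import re
from collections import Counter


def count_words(text, stopwords=frozenset(), min_length=2, max_length=50):
    """Count words, dropping stopwords and words outside the length range."""
    # Assumed tokenization: lowercase runs of (Unicode) letters.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(
        w for w in words
        if min_length <= len(w) <= max_length and w not in stopwords
    )


def select_words(counts, min_count=1, max_words=200):
    """Keep words seen at least `min_count` times, capped at `max_words`."""
    kept = [(w, c) for w, c in counts.most_common() if c >= min_count]
    return kept[:max_words]
```

For example, with stopwords {"the", "and"}, the text "The cloud and the word cloud: a cloud!" leaves only "cloud" (3) and "word" (1), since "a" is shorter than the minimum length.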

Appearance Options

Option Description
-W, --width N Output width in pixels/chars (default: 800)
-H, --height N Output height in pixels/lines (default: 400)
--bgColor COLOR, --bgcolor COLOR Background color (default: white). Both spellings are accepted.
--color COLOR Word color(s), for image output only (not ASCII/console). Accepts RGB (rgb(255,0,0)), hex (#ff0000), or one of 10 common names: red, green, blue, yellow, orange, purple, pink, brown, black, gray. Separate multiple colors with commas; words are drawn in shades of the given color(s).
--ignore-contrast-warning Skip the automatic color adjustment that makes colors pass WCAG AA contrast; the requested colors are used as-is and a warning is shown.
--colormap NAME Matplotlib colormap for word colors (e.g., 'viridis', 'plasma')
--title TEXT Title for the word cloud
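
For readers unfamiliar with the WCAG contrast check mentioned above, the following sketch computes the contrast ratio from the WCAG 2.x relative-luminance formula. It is not the tool's code. The function names are illustrative, only hex colors are handled, and the 4.5:1 threshold is the AA level for normal-size text, which is assumed here because the README does not say which threshold the tool applies.

```python
def hex_to_rgb(value):
    """Parse '#rrggbb' or '#rgb' into an (r, g, b) tuple of 0-255 ints."""
    value = value.lstrip("#")
    if len(value) == 3:
        value = "".join(ch * 2 for ch in value)
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def relative_luminance(rgb):
    """Relative luminance as defined by WCAG 2.x."""
    def channel(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg, bg):
    """Contrast ratio between two hex colors, from 1.0 up to 21.0."""
    l1 = relative_luminance(hex_to_rgb(fg))
    l2 = relative_luminance(hex_to_rgb(bg))
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg, bg, threshold=4.5):
    """True if the pair meets the assumed AA threshold."""
    return contrast_ratio(fg, bg) >= threshold
```

Under this formula, yellow on white fails the check while blue on white passes, which is the kind of case an automatic adjustment would target.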

General Options

Option Description
-v, --verbose Enable verbose output
-q, --quiet Suppress non-error output
--stats Print statistics after processing
--dry-run Process files but don't generate output
--version Show version and exit

Examples

Basic Usage

# Process current directory with default settings
wordcloud-gen

# Process a specific directory
wordcloud-gen ~/Documents/research

# Specify output file
wordcloud-gen -o research_cloud.png ~/Documents/research

Language-Specific Processing

# Use Spanish stopwords
wordcloud-gen -l es ~/Documents/spanish_texts

# Use German stopwords  
wordcloud-gen -l de ~/Documents/german_docs

# List available languages
wordcloud-gen --list-languages

ASCII Output (for non-graphical systems)

The ASCII mode generates true variable-sized text art, where word size reflects frequency:

# Generate ASCII word cloud with sized text
wordcloud-gen --ascii

# Generate simple frequency chart
wordcloud-gen --simple

# Save ASCII output to file
wordcloud-gen --ascii -o cloud.txt

# Adjust dimensions for terminal
wordcloud-gen --ascii -W 120 -H 40

Example output:

╔════════════════════════════════════════════════════════════════════╗
║                                                                    ║
║  ######  #     # ####### #     # ####### #     #                   ║
║  #     #  #   #     #    #     # #     # ##    #                   ║
║  ######     #       #    ####### #     # #  #  #   ← Largest word  ║
║  #          #       #    #     # #     # #   # #                   ║
║  #          #       #    #     # ####### #     #                   ║
║                                                                    ║
║     ██▄ ██▄ ▄█▄ ▄█▀   ▄█▀ ▄█▄ ██▄ ██▀            ← Medium words   ║
║      █▀▀ █▀▄ █ █ █ █   █   █ █ █ █ █▀                              ║
║     █   █ █ ▀█▀ ▀█▄   ▀█▄ ▀█▀ ██▀ ██▄                              ║
║                                                                    ║
║           ◆DEVELOPER◆  ◆COMPUTER◆  ◆SOFTWARE◆     ← Small words   ║
║                                                                    ║
║     algorithm function variable class method       ← Smallest     ║
╚════════════════════════════════════════════════════════════════════╝
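
The four size levels in the example output suggest that words are bucketed by frequency. The sketch below shows one possible bucketing rule, based on each word's count relative to the most frequent word. The rule, the function name, and the number of tiers are assumptions, not the tool's actual algorithm.

```python
def size_tiers(word_counts, tiers=4):
    """Assign each word a tier from 0 (largest) to tiers - 1 (smallest)."""
    if not word_counts:
        return {}
    top = max(word_counts.values())
    result = {}
    for word, count in word_counts.items():
        share = count / top  # 1.0 for the most frequent word
        # Lower share means a higher tier number, clamped to the last tier.
        result[word] = min(tiers - 1, int((1 - share) * tiers))
    return result
```

A renderer could then map tier 0 to a large banner font and the last tier to plain text.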

Advanced Options

# Use persistent database for large projects
wordcloud-gen --db project_words.db ~/large/project

# Continue incrementally with same database
wordcloud-gen --db project_words.db ~/large/project/new_files

# Clear database and reprocess
wordcloud-gen --db project_words.db --clear-db ~/project

# Only process specific file types
wordcloud-gen -e txt md rst ~/docs

# Customize appearance (color for words + bgColor; --bgcolor alias also works)
# Use --ignore-contrast-warning to force original colors (skips auto-adjust)
wordcloud-gen --width 1200 --height 600 --colormap viridis --bgColor black --color red,blue

# Add custom ignore words
wordcloud-gen --ignore-words TODO FIXME DEBUG ~/code

# Get statistics without generating output
wordcloud-gen --dry-run --stats ~/docs

SVG Output

# Generate scalable vector graphics
wordcloud-gen --format svg -o cloud.svg ~/docs

Supported File Formats

Extension Description
.txt Plain text files
.md, .markdown Markdown files
.pdf PDF documents
.docx Microsoft Word documents
.html, .htm HTML files
.xml XML files
.json JSON files
.csv, .tsv CSV/TSV files
.rst reStructuredText files
.rtf Rich Text Format files
.py, .js, .java, .c, .cpp, .go, .rs Source code files
.yaml, .yml, .ini, .cfg, .conf Configuration files
.log Log files
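
Selecting files by extension, with or without recursion, can be sketched with pathlib. The function name iter_files and its defaults are hypothetical. The package's own FileReader.get_files may behave differently.

```python
from pathlib import Path


def iter_files(root, extensions=("txt", "md", "pdf"), recursive=True):
    """Yield files under `root` whose extension is in `extensions`.

    Extensions are matched case-insensitively and may be given with or
    without a leading dot.
    """
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}
    root = Path(root)
    candidates = root.rglob("*") if recursive else root.glob("*")
    for path in sorted(candidates):
        if path.is_file() and path.suffix.lower() in wanted:
            yield path
```

Calling iter_files("docs", extensions=["txt", "md", "rst"]) mirrors the intent of -e txt md rst, and recursive=False mirrors --no-recursive.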

Stopwords

The tool includes stopword lists for multiple languages:

  • en - English
  • es - Spanish
  • fr - French
  • de - German
  • pt - Portuguese
  • it - Italian

Stopword files are located in the ignore/ directory. Each file is a CSV containing common words to be filtered out.

Adding Custom Stopwords

You can add custom stopword files by creating a new file in the ignore/ directory:

# Create a custom stopword file for Dutch
echo "de,het,een,en,van,in,is,dat,op" > ignore/nl.csv

File Format

Stopword files should be in CSV format, with words separated by commas. Lines starting with # are treated as comments:

# English stopwords
a,an,the,and,or,but
is,are,was,were,be
# Add more words...
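
A parser for the format just described fits in a few lines. This is a sketch rather than the package's loader: the function names are hypothetical, and lowercasing the words and reading files as UTF-8 are assumptions.

```python
from pathlib import Path


def parse_stopwords(text):
    """Parse comma-separated stopwords, skipping blank and '#' comment lines."""
    words = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        words.update(w.strip().lower() for w in line.split(",") if w.strip())
    return words


def load_stopwords(path):
    """Read a stopword file such as ignore/en.csv (UTF-8 assumed)."""
    return parse_stopwords(Path(path).read_text(encoding="utf-8"))
```

Returning a set removes any word that is listed twice and keeps membership checks fast during filtering.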

Programmatic Usage

You can also use the library programmatically in your Python code:

from wordcloud_generator import WordDatabase, FileReader, WordProcessor, CloudGenerator

# Initialize components
db = WordDatabase("words.db")
reader = FileReader()
processor = WordProcessor(language="en")

# Process files
for file_path in reader.get_files("/path/to/docs"):
    content, file_hash = reader.read_file(file_path)
    word_counts = processor.process_text(content)
    db.add_words_batch(word_counts)

# Generate word cloud (color param uses specified color + shades for words;
# supports RGB/hex/names; auto-adjusts poor contrast unless ignore_contrast_warning=True)
generator = CloudGenerator(
    width=800, height=400, background_color="white", color="blue,red"
)
word_counts = db.get_all_words(min_count=2)
generator.generate(word_counts, "output.png")
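
In the example above, read_file also returns a file_hash that the loop never uses. The README does not say how the hash is computed or what it is for. One plausible use is to skip files whose content has already been counted. The sketch below illustrates that idea with hashlib. The SeenFiles class and the choice of SHA-256 are assumptions.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the file content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


class SeenFiles:
    """Remembers content hashes so identical files are counted once."""

    def __init__(self):
        self._hashes = set()

    def is_new(self, data: bytes) -> bool:
        """Return True the first time this exact content is seen."""
        digest = content_hash(data)
        if digest in self._hashes:
            return False
        self._hashes.add(digest)
        return True
```

Storing the hashes alongside the word counts would let an incremental run skip unchanged files.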

Project Structure

py-word-cloud/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── setup.py                  # Package setup script
├── wordcloud_generator/      # Main package
│   ├── __init__.py          # Package initialization
│   ├── cli.py               # Command-line interface
│   ├── db.py                # SQLite database operations
│   ├── file_reader.py       # File reading utilities
│   ├── word_processor.py    # Text processing and filtering
│   └── cloud_generator.py   # Word cloud generation
└── ignore/                   # Stopword files
    ├── en.csv               # English stopwords
    ├── es.csv               # Spanish stopwords
    ├── fr.csv               # French stopwords
    ├── de.csv               # German stopwords
    ├── pt.csv               # Portuguese stopwords
    └── it.csv               # Italian stopwords

Requirements

  • Python 3.8 or higher
  • uv package manager (recommended) or pip
  • For graphical output: a display server (X11, Wayland); on headless systems, use --ascii instead

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
