Skip to content

visto/unicode-cleaner-plugin

Repository files navigation

Unicode Text Cleaner - IntelliJ Plugin

License: MIT IntelliJ Plugin

A powerful IntelliJ IDEA plugin that detects and cleans problematic Unicode characters that can trigger AI detection systems or cause compatibility issues. Helps normalize text to use standard ASCII characters for better cross-platform compatibility.

🎯 Why Use This Plugin?

AI detection tools often flag content based on subtle Unicode "fingerprints" that indicate machine generation:

  • Smart quotes (" ") instead of straight quotes (")
  • Em dashes () instead of hyphens (-)
  • Hidden characters like zero-width spaces, soft hyphens
  • Full-width characters from Asian character sets
  • Typographic punctuation like ellipsis () instead of three periods (...)

This plugin helps you create content that appears naturally human-typed while maintaining perfect readability.

✨ Features

🔍 Real-time Detection

  • Highlights problematic Unicode characters as you type
  • Shows Unicode code points and character descriptions
  • Color-coded warnings by severity level

Smart Quick Fixes

  • Individual fixes: Replace single characters with ASCII equivalents
  • Category fixes: Clean all quotes, dashes, or spaces at once
  • File-wide fixes: Clean entire documents in one click
  • Project-wide fixes: Batch process multiple files

⚙️ Configurable Categories

Enable/disable detection for specific character types:

  • Hidden/Control Characters - Zero-width spaces, soft hyphens, directional marks
  • Smart Quotes & Apostrophes - Typographic quotes (" " ' ')
  • Em & En Dashes - Long dashes ( ) → hyphens (-)
  • Ellipsis & Bullets - Special punctuation ( ) → ASCII equivalents
  • Full-Width Characters - Asian character variants (ABC) → standard ASCII
  • Non-Standard Spaces - Non-breaking, ideographic spaces → regular spaces
  • Variation Selectors - Unicode formatting modifiers

📁 File Type Support

Configurable file type filtering supports:

  • Text files: .txt, .md, .rst
  • Source code: .java, .js, .ts, .py, .cpp, .c, .h
  • Configuration: .xml, .json, .yaml, .yml, .properties
  • Web files: .html, .css
  • Custom extensions: Add your own file types

🎛️ Flexible Usage

  • Menu actions: Edit → Unicode Cleaner → [action]
  • Keyboard shortcuts:
    • Ctrl+Shift+U - Clean current file
    • Ctrl+Alt+U - Clean selected text
  • Context menus: Right-click for quick access
  • Project view: Clean selected files from project tree

🚀 Installation

Method 1: From Plugin Marketplace (Coming Soon)

  1. Open IntelliJ IDEA
  2. Go to File → Settings → Plugins
  3. Search for "Unicode Text Cleaner"
  4. Click Install and restart IntelliJ

Method 2: Manual Installation

  1. Download the latest release from GitHub Releases
  2. Open IntelliJ IDEA
  3. Go to File → Settings → Plugins
  4. Click ⚙️ gear icon → Install Plugin from Disk...
  5. Select the downloaded ZIP file
  6. Restart IntelliJ IDEA

📖 Usage Guide

Quick Start

  1. Open any text file with Unicode characters
  2. See highlighted warnings on problematic characters
  3. Click the lightbulb 💡 or press Alt+Enter for quick fixes
  4. Use menu actions for bulk operations

Example: Cleaning AI-Generated Text

Before cleaning:

This text has "smart quotes" and em—dashes.
It also contains ellipsis… and bullets •
Some hidden characters like soft­hyphens.
Full-width numbers: 12345

After cleaning:

This text has "smart quotes" and em-dashes.
It also contains ellipsis... and bullets *
Some hidden characters like soft-hyphens.
Full-width numbers: 12345

Detailed Usage

🔧 Settings Configuration

  1. Go to File → Settings → Unicode Cleaner
  2. Enable/disable character categories you want to detect
  3. Configure file extensions to process
  4. Set performance options for large files

🎯 Targeted Cleaning

  • Current file: Edit → Unicode Cleaner → Clean Unicode Characters in File
  • Selected text: Select text, then Edit → Unicode Cleaner → Clean Selected Text
  • Entire project: Edit → Unicode Cleaner → Clean Unicode Characters in Project
  • Selected files: Right-click files in Project view → "Clean Unicode Characters"

🔍 Real-time Detection

  • Warning highlights: Problematic characters are underlined
  • Hover tooltips: Show Unicode code points and descriptions
  • Problem severity: Different colors for different issue types
    • 🔴 Red: Hidden/control characters (serious issues)
    • 🟡 Yellow: Typographic characters (moderate issues)
    • 🟠 Orange: Space and punctuation issues (minor issues)

🌍 Language Support

Safe for International Text

The plugin is designed to preserve legitimate international characters:

  • German umlauts: ä, ö, ü, Ä, Ö, Ü, ß
  • French accents: é, è, à, ç, ê, ë, î, ï, ô, ù, û, ü, ÿ
  • Spanish characters: ñ, Ñ, á, é, í, ó, ú
  • Nordic characters: å, Å, æ, Æ, ø, Ø
  • Slavic characters: č, š, ž, ř, ň, etc.
  • Other Latin scripts: All legitimate Latin-1 and Latin Extended characters

⚠️ What Gets Cleaned

Only problematic characters that commonly indicate AI generation:

  • ❌ Smart quotes and fancy punctuation
  • ❌ Hidden/invisible characters
  • ❌ Full-width Asian variants of ASCII characters
  • ❌ Typographic dashes and spaces

📋 Character Reference

📝 Complete List of Detected Characters

Hidden/Control Characters

Character Unicode Description Action
­ U+00AD Soft hyphen Remove
U+200B Zero width space Remove
U+200C Zero width non-joiner Remove
U+200D Zero width joiner Remove
U+200E Left-to-right mark Remove
U+200F Right-to-left mark Remove
U+2060 Word joiner Remove

Smart Quotes & Apostrophes

Character Unicode Description Replacement
" U+201C Left double quote "
" U+201D Right double quote "
' U+2018 Left single quote '
' U+2019 Right single quote '
U+201E Double low-9 quote "
U+201A Single low-9 quote '

Dashes

Character Unicode Description Replacement
U+2014 Em dash -
U+2013 En dash -
U+2012 Figure dash -
U+2212 Minus sign -

Punctuation

Character Unicode Description Replacement
U+2026 Horizontal ellipsis ...
U+2022 Bullet *
· U+00B7 Middle dot *

Full-Width Characters

Character Unicode Description Replacement
U+FF01 Full-width exclamation !
U+FF1F Full-width question mark ?
123 U+FF11-FF19 Full-width digits 123
ABC U+FF21-FF3A Full-width letters ABC

⚙️ Configuration Options

Character Categories

Configure which types of characters to detect:

☑️ Hidden/Control Characters    (Recommended: ON)
☑️ Smart Quotes & Apostrophes  (Recommended: ON) 
☑️ Em & En Dashes              (Recommended: ON)
☑️ Ellipsis & Bullets          (Recommended: ON)
☑️ Full-Width Characters       (Recommended: ON)
☑️ Non-Standard Spaces         (Recommended: ON)
☑️ Variation Selectors         (Recommended: ON)

File Types

Specify which file extensions to process:

Default: txt,md,rst,java,js,ts,py,cpp,c,h,xml,json,yaml,yml,properties,html,css
Custom: Add your own comma-separated extensions

Performance Settings

  • Max file size: Set limit for real-time detection (default: 10MB)
  • Real-time detection: Enable/disable live highlighting
  • Batch processing: Configure timeouts for large operations

🎮 Keyboard Shortcuts

Action Default Shortcut Description
Clean Current File Ctrl+Shift+U Clean all Unicode issues in current file
Clean Selected Text Ctrl+Alt+U Clean Unicode issues in selected text
Open Settings - Open Unicode Cleaner settings panel

Shortcuts can be customized in IntelliJ's Keymap settings

🔧 Development & Contributing

Building from Source

git clone https://github.com/unicodecleaner/intellij-plugin.git
cd intellij-plugin
./gradlew buildPlugin

Running Tests

./gradlew test
./gradlew performanceTest

Development Mode

./gradlew runIde

📊 Performance

Benchmarks

  • Small files (< 1MB): < 10ms detection
  • Medium files (1-10MB): < 100ms detection
  • Large files (10MB+): < 1000ms detection
  • Quick fixes: < 50ms per operation
  • Batch processing: ~200 files/second

Memory Usage

  • Plugin overhead: < 50MB
  • Large projects: < 200MB additional memory
  • Automatic cleanup: Caches cleared after 1 hour

🐛 Troubleshooting

Common Issues

Q: Plugin not appearing in menus

  • A: Check Settings → Plugins → Installed → "Unicode Text Cleaner" is enabled
  • Restart IntelliJ if needed

Q: Characters not being highlighted

  • A: Check Settings → Unicode Cleaner → ensure character categories are enabled
  • Verify file extension is in the configured list

Q: Performance issues with large files

  • A: Increase max file size limit in settings
  • Disable real-time detection for very large files

Q: Plugin incompatible with IntelliJ version

  • A: Check compatibility - requires IntelliJ 2023.1 or later
  • Download latest plugin version

Getting Help

📋 FAQ

❓ Will this plugin delete my international characters?

No! The plugin is designed to preserve all legitimate international characters including:

  • German umlauts (ä, ö, ü)
  • French accents (é, è, à, ç)
  • Spanish ñ, Nordic å/æ/ø
  • All Latin-1 and Latin Extended characters

Only problematic Unicode characters that indicate AI generation are targeted.

❓ What's the difference between this and a simple find/replace?

This plugin:

  • Automatically detects 50+ problematic character types
  • Real-time highlighting shows issues as you type
  • Smart categorization lets you clean specific types
  • Bulk operations across multiple files
  • Preserves formatting and legitimate international text
  • Configurable rules for different use cases
❓ Will this make my text look robotic?

No! The plugin converts fancy typography to standard ASCII, which is how most people actually type. The result looks more natural and human-like, not robotic.

Before: This is "fancy" typography—with special characters… After: This is "normal" typography-with regular characters...

❓ Can I use this for code files?

Yes! The plugin is safe for code files and only targets problematic characters that shouldn't appear in source code anyway. It's particularly useful for:

  • Cleaning copy-pasted code from documentation
  • Fixing smart quotes in string literals
  • Removing hidden characters that cause syntax errors
❓ Does this work with all file types?

The plugin can work with any text-based file. By default, it processes common file types:

  • Text: .txt, .md, .rst
  • Code: .java, .js, .py, .cpp, etc.
  • Config: .json, .xml, .yaml, etc.

You can add custom file extensions in the settings.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Inspired by the growing need for AI detection avoidance
  • Thanks to the IntelliJ Platform team for excellent plugin APIs
  • Community feedback and suggestions from beta testers

Made with ❤️ for content creators who want their text to appear naturally human-typed.

For more tools and resources, visit stoitschev.de

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages