A fast, zero-dependency duplicate code detector and code clone finder. Analyzes source files across C, C++, Python, JavaScript, Java, Go, and other languages to identify similar code blocks for refactoring and code quality improvement.
- Multi-language support - Works with any bracket-based or indentation-based language
- Machine-readable output - JSON output mode for integration with AI agents and automated refactoring pipelines
- Zero dependencies - Single Python script, no external packages required
- HTML reports - Visual side-by-side diffs with syntax highlighting
- Smart filtering - Compares only similarly-named files by default, skips small blocks, and prunes the subtrees of blocks that already matched
- Scales to large codebases - Tested on Linux kernel, Bitcoin, Protobuf, and other major projects
- Refactoring - Identify duplicate code blocks to extract into common functions or templates
- Code review - Find copy-paste bugs where code was duplicated but only partially updated
- Technical debt - Measure and reduce code duplication across your codebase
- AI-assisted refactoring - Feed machine-readable output to LLM agents for automated code improvements
curl -O https://raw.githubusercontent.com/forhadahmed/refactor/main/refactor
chmod +x refactor
sudo mv refactor /usr/local/bin/

Or clone the repository:

git clone https://github.com/forhadahmed/refactor.git
cd refactor
chmod +x refactor
./refactor --help

refactor [options] [files]

Or pipe files via stdin:

find src/ -name "*.cpp" | refactor

By default, refactor prints a colorized side-by-side diff to the terminal. Use --html for a visual report.
refactor is designed for machine-readable output, making it ideal for AI-powered code refactoring pipelines:
# JSON output for programmatic consumption
refactor --json src/*.py
# Pipe to LLM agents for automated refactoring suggestions
refactor --json src/ | your-ai-agent --task "refactor duplicates"

The JSON output includes:
- File paths and line numbers for each duplicate pair
- Similarity scores between code blocks
- The actual code content for both blocks
This enables AI coding assistants and autonomous agents to:
- Discover code duplication automatically
- Analyze the similarity patterns
- Generate refactored code with common abstractions
- Apply changes programmatically
Perfect for integration with tools like Claude Code, Cursor, Aider, and other AI coding agents.
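As a rough illustration of consuming the JSON output described above, here is a minimal Python sketch. The exact schema produced by --json is not documented in this README, so the field names used here (`file_a`, `lines_a`, `file_b`, `lines_b`, `similarity`) and the embedded sample are hypothetical; adjust them to whatever the tool actually emits.

```python
import json

# Hypothetical sample: the real --json schema may differ, so every field
# name below ("file_a", "lines_a", "similarity", ...) is an assumption.
SAMPLE = """
[
  {"file_a": "src/a.py", "lines_a": [10, 60],
   "file_b": "src/b.py", "lines_b": [12, 63],
   "similarity": 0.93},
  {"file_a": "src/a.py", "lines_a": [100, 140],
   "file_b": "src/c.py", "lines_b": [5, 44],
   "similarity": 0.84}
]
"""


def summarize(report_text, min_similarity=0.9):
    """Return one readable line per duplicate pair at or above min_similarity."""
    pairs = json.loads(report_text)
    lines = []
    # Most similar pairs first, since they are usually the easiest to merge.
    for pair in sorted(pairs, key=lambda p: p["similarity"], reverse=True):
        if pair["similarity"] < min_similarity:
            continue
        lines.append(
            "{fa}:{a0}-{a1} ~ {fb}:{b0}-{b1} ({sim:.0%})".format(
                fa=pair["file_a"], a0=pair["lines_a"][0], a1=pair["lines_a"][1],
                fb=pair["file_b"], b0=pair["lines_b"][0], b1=pair["lines_b"][1],
                sim=pair["similarity"],
            )
        )
    return lines


if __name__ == "__main__":
    for line in summarize(SAMPLE):
        print(line)
```

In a real pipeline the sample string would be replaced by the captured output of `refactor --json`, and the summary lines (or the raw pairs) would be passed on to the agent.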
A block (or "scope") within a source file is code enclosed within brackets { ... } or within an indent level (Python).
Blocks are hierarchical:
- A block can have multiple child blocks/scopes
- A block is part of a parent block/scope (unless it is the top-level scope)
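The parent/child relationship can be sketched with a small brace-matching parser. This is an illustration only, not the tool's actual parser: the `Block` class and `parse_blocks` function are hypothetical names, only `{ ... }` blocks are handled (not indentation), and braces inside strings or comments are not skipped.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; avoids recursing through parent links
class Block:
    start: int                           # index of the opening "{"
    end: int = -1                        # index of the matching "}" (-1 until closed)
    parent: Optional["Block"] = None     # None only for the top-level scope
    children: List["Block"] = field(default_factory=list)


def parse_blocks(source: str) -> Block:
    """Build a block tree from brace nesting (simplified illustration)."""
    root = Block(start=0, end=len(source))  # top-level scope spans the whole file
    current = root
    for i, ch in enumerate(source):
        if ch == "{":
            # Opening brace: start a child of the current block and descend.
            child = Block(start=i, parent=current)
            current.children.append(child)
            current = child
        elif ch == "}" and current.parent is not None:
            # Closing brace: finish the current block and return to its parent.
            current.end = i
            current = current.parent
    return root


src = "void f() { if (x) { a(); } else { b(); } }"
root = parse_blocks(src)
outer = root.children[0]
```

Here `root` is the top-level scope, `outer` is the function body, and `outer.children` holds the two nested `if`/`else` blocks.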
The tool parses all blocks from the source files and compares them pairwise for similarity using Python's difflib.SequenceMatcher. Two blocks whose ratio() exceeds the threshold (default 0.8) are considered similar.
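The similarity check can be sketched in a few lines with the standard library. The helper name `is_similar` is hypothetical, and this sketch compares raw block text; whether the tool normalizes whitespace or identifiers before comparing is not stated here.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # default threshold stated above


def is_similar(block_a: str, block_b: str, threshold: float = THRESHOLD) -> bool:
    """Return True when the two blocks' ratio() is above the threshold."""
    # ratio() is 2*M/T: M = matched characters, T = total length of both inputs.
    ratio = SequenceMatcher(None, block_a, block_b).ratio()
    return ratio > threshold


rx_loop = "for (i = 0; i < n; i++) { total += rx[i]; }"
tx_loop = "for (i = 0; i < n; i++) { total += tx[i]; }"
unrelated = "return 0;"

if __name__ == "__main__":
    print(is_similar(rx_loop, tx_loop))    # blocks differ by a single character
    print(is_similar(rx_loop, unrelated))  # almost nothing in common
```

Raising the threshold toward 1.0 reports only near-identical blocks, while lowering it surfaces looser matches at the cost of more noise.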
Optimizations:
- Largest blocks are compared first
- If two blocks are similar, their child subtrees are pruned from further comparison
- Blocks from differently-named files are skipped by default
- Minimum block size filter eliminates noise from small code fragments
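The optimizations above might combine roughly as follows. This is a sketch under assumptions, not the tool's code: `Block`, `descendants`, and `find_similar` are hypothetical names, the file-name filter is omitted, and the size and length-difference limits mirror the documented defaults for --min-block-size and --max-block-diff.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


@dataclass(eq=False)  # identity-based equality and hashing, so blocks fit in a set
class Block:
    text: str
    children: List["Block"] = field(default_factory=list)


def descendants(block):
    """Yield every block nested anywhere below the given block."""
    for child in block.children:
        yield child
        yield from descendants(child)


def find_similar(blocks, min_size=1500, max_diff=500, threshold=0.8):
    # Drop small blocks, then order the rest so the largest are compared first.
    candidates = sorted(
        (b for b in blocks if len(b.text) >= min_size),
        key=lambda b: len(b.text),
        reverse=True,
    )
    pruned = set()   # blocks already covered by a matched ancestor
    matches = []
    for a, b in combinations(candidates, 2):
        if a in pruned or b in pruned:
            continue
        # Cheap length check before the expensive sequence comparison.
        if abs(len(a.text) - len(b.text)) > max_diff:
            continue
        if SequenceMatcher(None, a.text, b.text).ratio() > threshold:
            matches.append((a, b))
            # The children of a matched pair add no new information.
            for matched in (a, b):
                pruned.update(descendants(matched))
    return matches


inner1 = Block("x = compute(a, b); y = compute(c, d);")
outer1 = Block("int f() { " + inner1.text + " return x + y; }", [inner1])
inner2 = Block("x = compute(a, b); y = compute(c, e);")
outer2 = Block("int g() { " + inner2.text + " return x + y; }", [inner2])

# Tiny limits so the toy blocks pass the size filter.
matches = find_similar([inner1, outer1, inner2, outer2], min_size=10, max_diff=50)
```

In this toy input only the pair of outer blocks is reported; the inner blocks are pruned once their parents match, which is what keeps the pairwise comparison tractable on large codebases.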
| Option | Description | Default |
|---|---|---|
| --min-block-size | Minimum block size in characters | 1500 |
| --max-block-diff | Maximum length difference between compared blocks | 500 |
| --all-indents | Compare blocks across different indent levels | off |
| --all-files | Compare blocks across all files (not just similarly-named) | off |
| --json | Output results in JSON format for machine consumption | off |
| -o, --output | HTML output file | report-<pid>.html |
| Project | Repository | Results |
|---|---|---|
| Linux Kernel Ethernet Drivers | torvalds/linux | drivers.html (~400 similar blocks) |
| C++ JSON Library | nlohmann/json | json.html (~350 similar blocks) |
| Bitcoin + Dogecoin | bitcoin/bitcoin | crypto.html (~270 similar blocks) |
| Go BGP Implementation | osrg/gobgp | gobgp.html (~250 similar blocks) |
| Google Protobuf | protocolbuffers/protobuf | protobuf.html (~215 similar blocks) |
| Dear ImGui | ocornut/imgui | imgui.html (~30 similar blocks) |
Interesting find: Dogecoin and Bitcoin share massive code duplication since Dogecoin is a fork. A common library would clean things up significantly!
MIT License - see LICENSE for details.


