Code duplication detector with machine-readable output for AI agents - finds similar code blocks across C/C++/Python/Go/Java/JS for automated refactoring

forhadahmed/refactor

refactor

License: MIT | Python 3.6+

A fast, zero-dependency duplicate code detector and code clone finder. Analyzes source files across C, C++, Python, JavaScript, Java, Go, and other languages to identify similar code blocks for refactoring and code quality improvement.

Features

  • Multi-language support - Works with any bracket-based or indentation-based language
  • Machine-readable output - JSON output mode for integration with AI agents and automated refactoring pipelines
  • Zero dependencies - Single Python script, no external packages required
  • HTML reports - Visual side-by-side diffs with syntax highlighting
  • Smart filtering - Compares similarly-named files first, skips small blocks, prunes subtrees
  • Scales to large codebases - Tested on Linux kernel, Bitcoin, Protobuf, and other major projects

Use Cases

  • Refactoring - Identify duplicate code blocks to extract into common functions or templates
  • Code review - Find copy-paste bugs where code was duplicated but only partially updated
  • Technical debt - Measure and reduce code duplication across your codebase
  • AI-assisted refactoring - Feed machine-readable output to LLM agents for automated code improvements

[Image: The Impact of Duplicate Code]

Installation

Direct Download (Recommended)

curl -O https://raw.githubusercontent.com/forhadahmed/refactor/main/refactor
chmod +x refactor
sudo mv refactor /usr/local/bin/

From Source

git clone https://github.com/forhadahmed/refactor.git
cd refactor
chmod +x refactor
./refactor --help

Quick Start

refactor [options] [files]

Or pipe files via stdin:

find src/ -name "*.cpp" | refactor

Example

By default, refactor prints a colorized side-by-side diff to the terminal. Use --html to generate a visual HTML report instead.

Agentic Workflows & AI Integration

refactor is designed for machine-readable output, making it ideal for AI-powered code refactoring pipelines:

# JSON output for programmatic consumption
refactor --json src/*.py

# Pipe to LLM agents for automated refactoring suggestions
refactor --json src/ | your-ai-agent --task "refactor duplicates"

The JSON output includes:

  • File paths and line numbers for each duplicate pair
  • Similarity scores between code blocks
  • The actual code content for both blocks

This enables AI coding assistants and autonomous agents to:

  1. Discover code duplication automatically
  2. Analyze the similarity patterns
  3. Generate refactored code with common abstractions
  4. Apply changes programmatically

This makes refactor a good fit for integration with tools such as Claude Code, Cursor, Aider, and other AI coding agents.
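
As a rough illustration of the consuming side, the sketch below filters and summarizes duplicate pairs from a JSON document. The schema is an assumption: the key names (`similarity`, `a`, `b`, `file`, `start_line`, `code`) and the `SAMPLE` data are hypothetical and only mirror the three kinds of information listed above. Check the actual output of `refactor --json` before relying on any of them.

```python
import json

# Hypothetical sample. The real field names emitted by `refactor --json`
# may differ; these keys are assumptions made for illustration only.
SAMPLE = """
[
  {"similarity": 0.93,
   "a": {"file": "src/a.py", "start_line": 10, "code": "..."},
   "b": {"file": "src/b.py", "start_line": 42, "code": "..."}},
  {"similarity": 0.82,
   "a": {"file": "src/c.py", "start_line": 5, "code": "..."},
   "b": {"file": "src/d.py", "start_line": 7, "code": "..."}}
]
"""


def summarize(raw, min_similarity=0.9):
    """Return one summary line per pair at or above `min_similarity`."""
    pairs = json.loads(raw)
    selected = [p for p in pairs if p["similarity"] >= min_similarity]
    # Most similar pairs first: they are usually the easiest to merge.
    selected.sort(key=lambda p: p["similarity"], reverse=True)
    return [
        f"{p['a']['file']}:{p['a']['start_line']} ~ "
        f"{p['b']['file']}:{p['b']['start_line']} "
        f"({p['similarity']:.0%})"
        for p in selected
    ]


if __name__ == "__main__":
    for line in summarize(SAMPLE):
        print(line)
```

An agent could pass each selected pair, including both code bodies, to a model as one refactoring task instead of sending the whole report at once.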

How It Works

A block (or "scope") within a source file is code enclosed within brackets { ... } or within an indent level (Python).

Blocks are hierarchical:

  • A block can have multiple child blocks/scopes
  • A block is part of a parent block/scope (unless it is the top-level scope)
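
The hierarchy can be pictured with a small brace-matching parser. This is a minimal sketch and not the tool's actual parser: the `Block` class and `parse_blocks` function are hypothetical names, indentation-based languages are not handled, and braces inside strings or comments are counted like any other brace.

```python
class Block:
    """One brace-delimited scope: its span, its parent, and its children."""

    def __init__(self, start, parent=None):
        self.start = start      # index of the opening '{' (-1 for the root)
        self.end = -1           # index of the matching '}' once it is found
        self.parent = parent    # enclosing block, or None for the top level
        self.children = []      # blocks nested directly inside this one

    def text(self, source):
        """Return the block's source text, including both braces."""
        return source[self.start:self.end + 1]


def parse_blocks(source):
    """Build the block tree for `source`; the root stands for the whole file."""
    root = Block(start=-1)
    root.end = len(source)
    current = root
    for i, ch in enumerate(source):
        if ch == "{":
            # Entering a new scope: attach it to the current one and descend.
            child = Block(start=i, parent=current)
            current.children.append(child)
            current = child
        elif ch == "}" and current.parent is not None:
            # Leaving a scope: record where it ends and climb back up.
            current.end = i
            current = current.parent
    return root


if __name__ == "__main__":
    src = "int f() { if (x) { a(); } else { b(); } }"
    func = parse_blocks(src).children[0]
    print([child.text(src) for child in func.children])
```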

The tool parses all blocks from the source files and compares them pairwise for similarity using SequenceMatcher from Python's difflib module. Two blocks whose ratio() is above the threshold (default 0.8) are considered similar.
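
The similarity test itself can be reproduced with the standard library. The helper names `similarity` and `is_similar` below are illustrative and not taken from the tool; only `difflib.SequenceMatcher`, `ratio()`, and the 0.8 default come from the description above.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.8  # default threshold stated above


def similarity(a, b):
    """Return SequenceMatcher's ratio for two strings, between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def is_similar(a, b, threshold=THRESHOLD):
    """True when the ratio is above the threshold."""
    return similarity(a, b) > threshold


if __name__ == "__main__":
    block_a = "{ total = 0; for (i = 0; i < n; i++) total += a[i]; return total; }"
    block_b = "{ total = 0; for (j = 0; j < n; j++) total += b[j]; return total; }"
    print(round(similarity(block_a, block_b), 2), is_similar(block_a, block_b))
```

ratio() is defined as 2*M/T, where M is the number of matching characters and T is the combined length of both strings, so two identical blocks score 1.0 and two blocks with nothing in common score 0.0.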

Optimizations:

  • Largest blocks are compared first
  • If two blocks are similar, their child subtrees are pruned from further comparison
  • Blocks from differently-named files are skipped by default (use --all-files to compare across all files)
  • A minimum block size filter (--min-block-size) eliminates noise from small code fragments
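
The size filter, largest-first ordering, and subtree pruning can be combined roughly as follows. This is a sketch under assumptions and not the tool's implementation: `Node`, `descendants`, and `find_similar` are hypothetical names, the file-name filter is left out, and a block is never compared with a block nested inside it.

```python
from difflib import SequenceMatcher
from itertools import combinations


class Node:
    """Hypothetical block record: its text plus its nested child blocks."""

    def __init__(self, text, children=()):
        self.text = text
        self.children = list(children)


def descendants(node):
    """Yield every block nested under `node`, at any depth."""
    for child in node.children:
        yield child
        yield from descendants(child)


def find_similar(blocks, threshold=0.8, min_block_size=1500):
    # Size filter: drop fragments too small to be worth reporting.
    candidates = [b for b in blocks if len(b.text) >= min_block_size]
    # Largest first, so an enclosing block is handled before its contents.
    candidates.sort(key=lambda b: len(b.text), reverse=True)
    nested = {id(b): {id(n) for n in descendants(b)} for b in candidates}

    pruned = set()   # ids of blocks already covered by a similar ancestor
    matches = []
    for a, b in combinations(candidates, 2):
        if id(a) in pruned or id(b) in pruned:
            continue
        if id(b) in nested[id(a)]:
            continue  # b sits inside a, so the pair is not a duplicate
        if SequenceMatcher(None, a.text, b.text).ratio() > threshold:
            matches.append((a, b))
            # Subtree pruning: the children would only repeat this finding.
            pruned |= nested[id(a)] | nested[id(b)]
    return matches


if __name__ == "__main__":
    inner = "{ x = compute(a, b); log(x); return x; }"
    c1, c2 = Node(inner), Node(inner)
    p1 = Node("void f() { setup(); " + inner + " teardown(); }", [c1])
    p2 = Node("void g() { setup(); " + inner + " teardown(); }", [c2])
    # A small size limit is used here only because the demo blocks are tiny.
    print(len(find_similar([c1, p1, c2, p2], min_block_size=10)))
```

In the demo, the two outer functions are reported as one pair and the identical inner blocks are not reported again, because they were pruned once their parents matched.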

Options

| Option | Description | Default |
| --- | --- | --- |
| --min-block-size | Minimum block size in characters | 1500 |
| --max-block-diff | Maximum length difference between compared blocks | 500 |
| --all-indents | Compare blocks across different indent levels | off |
| --all-files | Compare blocks across all files (not just similarly-named) | off |
| --json | Output results in JSON format for machine consumption | off |
| -o, --output | HTML output file | report-<pid>.html |
Examples on Popular Repositories

| Project | Repository | Results |
| --- | --- | --- |
| Linux Kernel Ethernet Drivers | torvalds/linux | drivers.html (~400 similar blocks) |
| C++ JSON Library | nlohmann/json | json.html (~350 similar blocks) |
| Bitcoin + Dogecoin | bitcoin/bitcoin | crypto.html (~270 similar blocks) |
| Go BGP Implementation | osrg/gobgp | gobgp.html (~250 similar blocks) |
| Google Protobuf | protocolbuffers/protobuf | protobuf.html (~215 similar blocks) |
| Dear ImGui | ocornut/imgui | imgui.html (~30 similar blocks) |

Interesting find: Dogecoin and Bitcoin share massive code duplication since Dogecoin is a fork. A common library would clean things up significantly!

License

MIT License - see LICENSE for details.
