chnnick/scrambler

Scrambler - CSV/Excel Anonymization Tool

A Python tool for anonymizing sensitive data in CSV and Excel files while preserving data structure and relationships. Perfect for creating test datasets, protecting privacy, and preparing data for sharing.

Features

  • 🔒 Smart Data Detection: Automatically identifies email, phone, name, SSN, address, date, ID, and numeric data
  • 🎯 Consistent Mapping: Same input always produces same output (with seed)
  • 📊 Multiple Formats: Supports CSV and Excel files
  • 📋 Clipboard Support: Process data directly from Excel/Google Sheets
  • ⚙️ Custom Rules: Override automatic detection with JSON configuration
  • 🔄 Reproducible: Use seeds for consistent anonymization results

Installation

Requirements

  • Python 3.6+

Install Dependencies

This installs the pandas, faker, openpyxl, and colorama Python libraries:

pip install -r requirements.txt

Quick Start

Basic Usage

# Anonymize a CSV file
python main.py data.csv

# Anonymize an Excel file  
python main.py data.xlsx

# Specify output file
python main.py data.csv -o anonymized_output.csv

Important: Your input files should have column headers in row 1 (no empty rows above the headers). The program expects the first row to contain the column names.

Clipboard Processing

# Copy data from Excel/Sheets, then run:
python main.py --clipboard

Interactive Processing

# Interactive mode - manually choose anonymization for each column
python main.py data.csv -i
python main.py data.xlsx -i
python main.py --clipboard -i

Advanced Usage

Reproducible Results

# Use seed for consistent anonymization
python main.py data.csv -s 12345
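
As a rough illustration of why a seed makes runs repeatable, here is a minimal standard-library sketch. The helper `fake_id` is hypothetical and is not taken from the tool, which relies on Faker and may seed things differently.

```python
import random


def fake_id(seed: int, n_digits: int = 7) -> str:
    """Return a pseudo-random digit string determined entirely by `seed`."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return "".join(str(rng.randint(0, 9)) for _ in range(n_digits))


# Two calls with the same seed produce the same value.
print(fake_id(12345) == fake_id(12345))  # True
```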

Custom Anonymization Rules

Create a JSON file (rules.json) that maps each column name to the type used to anonymize it:

{
  "email_address": "email",
  "phone_number": "phone",
  "customer_name": "name",
  "ssn": "ssn",
  "address": "address",
  "birth_date": "date",
  "user_id": "id",
  "salary": "decimal",
  "internal_notes": "skip"
}

Then use it:

python main.py data.csv -r rules.json
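
Before running with a rules file, it can help to check it for typos. The sketch below is one possible check, not the tool's own loader: `load_rules` is a hypothetical helper, and the set of type names is copied from the Supported Data Types table, so the tool itself may accept other names or aliases.

```python
import json

# Type names as listed in the "Supported Data Types" table of this README.
SUPPORTED_TYPES = {
    "email", "phone", "name", "ssn", "address", "date",
    "id", "integer", "decimal", "skip", "generic",
}


def load_rules(text: str) -> dict:
    """Parse a rules JSON string and reject type names not in the table."""
    rules = json.loads(text)
    unknown = {col: t for col, t in rules.items() if t not in SUPPORTED_TYPES}
    if unknown:
        raise ValueError(f"Unsupported types in rules: {unknown}")
    return rules


rules = load_rules('{"email_address": "email", "internal_notes": "skip"}')
```

To check a file on disk, pass its contents, for example `load_rules(open("rules.json").read())`.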

Supported Data Types

Type      Description                      Example (input → output)
email     Email addresses                  john.doe@example.com → sarah.wilson@fake.com
phone     Phone numbers                    (555) 123-4567 → (555) 987-6543
name      Names (consistent mapping)       John Smith → Sarah Wilson
ssn       Social Security Numbers          123-45-6789 → 987-65-4321
address   Addresses                        123 Main St → 456 Oak Ave
date      Dates (±30 day offset)           2023-01-15 → 2023-02-10
id        IDs (randomized)                 user123 → random7digit
integer   Integers (digit randomization)   1234 → 5678
decimal   Floats (±10% noise)              1000.50 → 1050.25
skip      Don't anonymize                  Original value preserved
generic   Hash the data                    any text → a1b2c3d4e5f6
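
To make the date and decimal rows concrete, here is a small sketch of a ±30 day shift and ±10% noise. The function names, the uniform distribution, and the rounding to two decimals are assumptions for illustration; the tool's actual behavior may differ.

```python
import random
from datetime import date, timedelta


def shift_date(d: date, rng: random.Random, max_days: int = 30) -> date:
    """Move a date by a random whole number of days in [-max_days, +max_days]."""
    return d + timedelta(days=rng.randint(-max_days, max_days))


def add_noise(x: float, rng: random.Random, pct: float = 0.10) -> float:
    """Scale a number by a random factor in [1 - pct, 1 + pct]."""
    return round(x * (1 + rng.uniform(-pct, pct)), 2)


rng = random.Random(12345)  # seeded so the sketch is repeatable
shifted = shift_date(date(2023, 1, 15), rng)
noisy = add_noise(1000.50, rng)
print(shifted, noisy)
```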

Command Line Options

python main.py [input_file] [options]

Options:
  -o, --output FILE     Output file (default: anonymized_<input>)
  -c, --clipboard       Process clipboard data
  -s, --seed INT        Random seed for reproducible results
  -r, --rules FILE      JSON file with column anonymization rules
  -i, --interactive     Manually process each column for anonymization
  -h, --help            Show help message
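
For readers who want to see how options like these fit together, the following is a hypothetical `argparse` definition that mirrors the list above. It is a sketch reconstructed from the help text, not the contents of main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (-h is added automatically)."""
    p = argparse.ArgumentParser(prog="main.py", description="Anonymize CSV/Excel data")
    p.add_argument("input_file", nargs="?", help="CSV or Excel file (omit with --clipboard)")
    p.add_argument("-o", "--output", metavar="FILE",
                   help="Output file (default: anonymized_<input>)")
    p.add_argument("-c", "--clipboard", action="store_true", help="Process clipboard data")
    p.add_argument("-s", "--seed", type=int, help="Random seed for reproducible results")
    p.add_argument("-r", "--rules", metavar="FILE",
                   help="JSON file with column anonymization rules")
    p.add_argument("-i", "--interactive", action="store_true",
                   help="Manually process each column for anonymization")
    return p


args = build_parser().parse_args(["data.csv", "-s", "12345", "-o", "out.csv"])
print(args.input_file, args.seed, args.output)  # data.csv 12345 out.csv
```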

Usage Examples

Example 1: Basic File Anonymization

# Input: customer_data.csv
# Output: anonymized_customer_data.csv
python main.py customer_data.csv

Example 2: Excel with Custom Output

python main.py sales_data.xlsx -o anonymized_sales.xlsx

Example 3: Clipboard Processing

  1. Copy data from Excel/Google Sheets
  2. Run: python main.py --clipboard
  3. Anonymized data is copied back to clipboard
  4. File is also saved as anonymized_data_YYYYMMDD_HHMMSS.csv
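
The timestamped file name in step 4 follows a standard `strftime` pattern. A minimal sketch, assuming the name is built from the local time at the moment of the run (`clipboard_output_name` is a hypothetical helper):

```python
from datetime import datetime


def clipboard_output_name(now: datetime) -> str:
    """Format a file name as anonymized_data_YYYYMMDD_HHMMSS.csv."""
    return f"anonymized_data_{now:%Y%m%d_%H%M%S}.csv"


print(clipboard_output_name(datetime(2024, 3, 5, 14, 7, 9)))
# anonymized_data_20240305_140709.csv
```

pandas provides `pandas.read_clipboard()` and `DataFrame.to_clipboard()` for steps 1 and 3; whether the tool uses them is an assumption.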

Example 4: Interactive Usage

Interactive mode lets you manually choose how to anonymize each column:

# Using clipboard input:
python main.py --clipboard -i

# Using file input:
python main.py customer_data.csv -i
python main.py customer_data.xlsx -i

In interactive mode, the program will:

  1. Show you each column and its detected data type
  2. Ask if you want to change the anonymization method
  3. Let you choose from available anonymization types
  4. Apply your choices and process the file
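
The four steps above can be sketched as a simple prompt loop. This is an illustration only: `choose_types` and its prompt wording are hypothetical, and the list of types is copied from the Supported Data Types table.

```python
from typing import Callable, Dict

TYPES = ["email", "phone", "name", "ssn", "address", "date",
         "id", "integer", "decimal", "skip", "generic"]


def choose_types(detected: Dict[str, str],
                 ask: Callable[[str], str] = input) -> Dict[str, str]:
    """Show each column with its detected type and let the user override it."""
    chosen = {}
    for column, guess in detected.items():
        prompt = f"{column} (detected: {guess}) - new type, or Enter to keep: "
        answer = ask(prompt).strip().lower()
        # Keep the detected type on empty or unrecognized input.
        chosen[column] = answer if answer in TYPES else guess
    return chosen


# Scripted answers stand in for a person typing at the prompt.
answers = iter(["", "skip"])
result = choose_types({"email": "email", "notes": "generic"}, ask=lambda _: next(answers))
print(result)  # {'email': 'email', 'notes': 'skip'}
```

Passing `ask` as a parameter keeps the loop testable without a terminal.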

Example 5: Custom Rules

Create custom_rules.json:

{
  "customer_email": "email",
  "phone": "phone",
  "full_name": "name",
  "salary": "decimal",
  "employee_id": "skip"
}

Run with custom rules:

python main.py employee_data.csv -r custom_rules.json -o safe_employee_data.csv

How It Works

  1. Data Type Detection: The script analyzes column names and sample data to automatically detect data types
  2. Anonymization: Applies appropriate anonymization based on detected or specified type
  3. Consistency: Uses a mapping cache to ensure the same input always produces the same output
  4. Preservation: Maintains data structure and statistical properties where possible
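
Steps 1 and 3 can be sketched as follows. Both `detect_type` and `MappingCache` are hypothetical names, the email pattern is deliberately loose, and only the email case is shown; the tool's real detection rules are not documented here.

```python
import re
from typing import Callable, Dict, List

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def detect_type(column: str, samples: List[str]) -> str:
    """Guess a type from the column name first, then from sample values."""
    if "email" in column.lower():
        return "email"
    if samples and all(EMAIL_RE.match(s) for s in samples):
        return "email"
    return "generic"


class MappingCache:
    """Remember the replacement chosen for each original value."""

    def __init__(self, make_fake: Callable[[str], str]) -> None:
        self._make_fake = make_fake
        self._seen: Dict[str, str] = {}

    def get(self, value: str) -> str:
        if value not in self._seen:
            self._seen[value] = self._make_fake(value)
        return self._seen[value]


counter = iter(range(1, 1000))
cache = MappingCache(lambda _value: f"user{next(counter)}@fake.com")
first = cache.get("john.doe@example.com")
second = cache.get("jane@example.com")
again = cache.get("john.doe@example.com")
print(first == again, first != second)  # True True
```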

Privacy Features

  • Deterministic Hashing: IDs and generic data use SHA-256 hashing
  • Realistic Fake Data: Uses Faker library for believable replacements
  • Consistent Mapping: Same real name always maps to same fake name
  • Statistical Preservation: Numeric data gets noise instead of complete replacement
  • Date Relationships: Preserves relative timing with random offsets
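
A minimal sketch of deterministic SHA-256 hashing using Python's `hashlib`. The 12-character truncation matches the length of the generic example in the Supported Data Types table, but it and the optional `salt` parameter are assumptions, not confirmed details of the tool.

```python
import hashlib


def hash_value(value: str, salt: str = "", length: int = 12) -> str:
    """Return the first `length` hex characters of SHA-256(salt + value)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:length]


# The same input always yields the same token; a different salt changes it.
print(hash_value("any text") == hash_value("any text"))  # True
print(hash_value("any text") != hash_value("any text", salt="s1"))  # True
```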

Troubleshooting

Common Issues

"Unsupported file type"

  • Ensure file has .csv, .xlsx, or .xls extension
  • Check file is not corrupted

"Error reading clipboard"

  • Make sure you've copied data from Excel/Sheets first
  • Try copying a smaller dataset

Missing dependencies

pip install pandas faker openpyxl colorama

Performance Tips

  • For large files (>100k rows), consider processing in chunks
  • Use skip type for columns that don't need anonymization
  • Set a seed for reproducible results during testing
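
One way to process a large CSV in chunks while keeping replacements consistent is to share a single mapping across all chunks. The sketch below uses only the standard `csv` module; `anonymize_in_chunks` and its parameters are hypothetical and not part of the tool. With pandas, `read_csv(..., chunksize=N)` yields chunks in a similar way.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, size):
    """Yield lists of up to `size` rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, size))
        if not chunk:
            return
        yield chunk


def anonymize_in_chunks(src, dst, column, fake, size=2):
    """Replace values in `column`, sharing one mapping across all chunks."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    cache = {}  # lives outside the chunk loop so mappings stay consistent
    for chunk in iter_chunks(reader, size):
        for row in chunk:
            value = row[column]
            if value not in cache:
                cache[value] = fake(value)
            row[column] = cache[value]
        writer.writerows(chunk)


src = io.StringIO("name,city\nAnn,Oslo\nBob,Rome\nAnn,Lima\n")
dst = io.StringIO()
names = iter(["P1", "P2"])
anonymize_in_chunks(src, dst, "name", lambda _value: next(names))
print(dst.getvalue().splitlines())
# ['name,city', 'P1,Oslo', 'P2,Rome', 'P1,Lima']
```

"Ann" appears in two different chunks and still maps to the same replacement because the cache is shared.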

Try It Out

python main.py test_data.csv -i

About

so i can put my sheet data into chatgpt
