A Python tool for anonymizing sensitive data in CSV and Excel files while preserving data structure and relationships. Perfect for creating test datasets, protecting privacy, and preparing data for sharing.
- 🔒 Smart Data Detection: Automatically identifies email, phone, name, SSN, address, date, ID, and numeric data
- 🎯 Consistent Mapping: Same input always produces same output (with seed)
- 📊 Multiple Formats: Supports CSV and Excel files
- 📋 Clipboard Support: Process data directly from Excel/Google Sheets
- ⚙️ Custom Rules: Override automatic detection with JSON configuration
- 🔄 Reproducible: Use seeds for consistent anonymization results
- Python 3.6+
Installing from requirements.txt pulls in the pandas, faker, openpyxl, and colorama Python libraries:
pip install -r requirements.txt

# Anonymize a CSV file
python main.py data.csv
# Anonymize an Excel file  
python main.py data.xlsx
# Specify output file
python main.py data.csv -o anonymized_output.csv

Important: Your input files must have column headers in row 1, with no empty rows above them. The program treats the first row as the column names.

# Copy data from Excel/Sheets, then run:
python main.py --clipboard

# Interactive mode - manually choose anonymization for each column
python main.py data.csv -i
python main.py data.xlsx -i
python main.py --clipboard -i

# Use seed for consistent anonymization
python main.py data.csv -s 12345

Create a JSON file (rules.json) to specify how each column should be anonymized:
{
  "email_address": "email",
  "phone_number": "phone",
  "customer_name": "name", 
  "ssn": "ssn",
  "address": "address",
  "birth_date": "date",
  "user_id": "id",
  "salary": "float",
  "internal_notes": "skip"
}

Then use it:
python main.py data.csv -r rules.json

| Type | Description | Example Output |
|---|---|---|
| email | Email addresses | john.doe@example.com → sarah.wilson@fake.com |
| phone | Phone numbers | (555) 123-4567 → (555) 987-6543 |
| name | Names (consistent mapping) | John Smith → Sarah Wilson |
| ssn | Social Security Numbers | 123-45-6789 → 987-65-4321 |
| address | Addresses | 123 Main St → 456 Oak Ave |
| date | Dates (±30 day offset) | 2023-01-15 → 2023-02-10 |
| id | IDs (randomized) | user123 → random7digit |
| integer | Integers (digit randomization) | 1234 → 5678 |
| decimal | Floats (±10% noise) | 1000.50 → 1050.25 |
| skip | Don't anonymize | Original value preserved |
| generic | Hash the data | any text → a1b2c3d4e5f6 |
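
To make the `date` and `decimal` rows of the table concrete, here is a minimal sketch of a ±30 day offset and ±10% noise. The function names `shift_date` and `add_noise` are hypothetical and are not taken from the tool's source; the actual implementation may differ in rounding and in how it draws random values.

```python
import random
from datetime import date, timedelta


def shift_date(value: date, rng: random.Random, max_days: int = 30) -> date:
    """Shift a date by a random whole number of days within +/- max_days."""
    return value + timedelta(days=rng.randint(-max_days, max_days))


def add_noise(value: float, rng: random.Random, fraction: float = 0.10) -> float:
    """Scale a number by a random factor within +/- fraction, rounded to 2 places."""
    return round(value * (1 + rng.uniform(-fraction, fraction)), 2)


# A seeded generator makes the results repeatable between runs.
rng = random.Random(12345)
shifted = shift_date(date(2023, 1, 15), rng)
noisy = add_noise(1000.50, rng)
print(shifted, noisy)
```

Passing a seeded `random.Random` instance instead of using the module-level functions keeps the randomness local, which is one way to get the reproducibility that the `-s` option describes.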
python main.py [input_file] [options]
Options:
  -o, --output FILE     Output file (default: anonymized_<input>)
  -c, --clipboard       Process clipboard data
  -s, --seed INT        Random seed for reproducible results
  -r, --rules FILE      JSON file with column anonymization rules
  -i, --interactive     Manually process each column for anonymization
  -h, --help            Show help message

# Input: customer_data.csv
# Output: anonymized_customer_data.csv
python main.py customer_data.csv

python main.py sales_data.xlsx -o anonymized_sales.xlsx

- Copy data from Excel/Google Sheets
- Run: python main.py --clipboard
- Anonymized data is copied back to clipboard
- File is also saved as anonymized_data_YYYYMMDD_HHMMSS.csv
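
The timestamped file name in the last step follows the pattern `anonymized_data_YYYYMMDD_HHMMSS.csv`. A small sketch of how such a name can be built with the standard library is shown below; `clipboard_output_name` is a hypothetical helper, not a function from the tool.

```python
from datetime import datetime


def clipboard_output_name(now: datetime) -> str:
    """Build a file name like anonymized_data_YYYYMMDD_HHMMSS.csv."""
    return f"anonymized_data_{now.strftime('%Y%m%d_%H%M%S')}.csv"


# A fixed timestamp is used here so the result is predictable.
name = clipboard_output_name(datetime(2024, 3, 5, 14, 7, 9))
print(name)
```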
Interactive mode lets you manually choose how to anonymize each column:
# Using clipboard input:
python main.py --clipboard -i
# Using file input:
python main.py customer_data.csv -i
python main.py customer_data.xlsx -i

In interactive mode, the program will:
- Show you each column and its detected data type
- Ask if you want to change the anonymization method
- Let you choose from available anonymization types
- Apply your choices and process the file
Create custom_rules.json:
{
  "customer_email": "email",
  "phone": "phone",
  "full_name": "name",
  "salary": "numeric",
  "employee_id": "skip"
}

Run with custom rules:
python main.py employee_data.csv -r custom_rules.json -o safe_employee_data.csv

- Data Type Detection: The script analyzes column names and sample data to automatically detect data types
- Anonymization: Applies appropriate anonymization based on detected or specified type
- Consistency: Uses mapping cache to ensure same input always produces same output
- Preservation: Maintains data structure and statistical properties where possible
- Deterministic Hashing: IDs and generic data use SHA-256 hashing
- Realistic Fake Data: Uses Faker library for believable replacements
- Consistent Mapping: Same real name always maps to same fake name
- Statistical Preservation: Numeric data gets noise instead of complete replacement
- Date Relationships: Preserves relative timing with random offsets
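
The mapping cache and SHA-256 hashing described above can be illustrated with a short sketch. The class `ConsistentMapper`, the way the seed is mixed into the hash, and the 12-character truncation are all assumptions made for this example; the tool's own code may be organized differently, and for names it reportedly uses Faker rather than a hash.

```python
import hashlib


class ConsistentMapper:
    """Hypothetical helper: maps each input value to one stable replacement."""

    def __init__(self, seed: int = 0):
        self.seed = seed
        self.cache = {}  # original value -> anonymized value

    def hash_value(self, value: str, length: int = 12) -> str:
        # Mixing the seed into the input makes output depend on the seed (assumed design).
        digest = hashlib.sha256(f"{self.seed}:{value}".encode("utf-8")).hexdigest()
        return digest[:length]

    def anonymize(self, value: str) -> str:
        # Reuse the cached replacement so repeated values stay linked across rows.
        if value not in self.cache:
            self.cache[value] = self.hash_value(value)
        return self.cache[value]


mapper = ConsistentMapper(seed=12345)
first = mapper.anonymize("John Smith")
second = mapper.anonymize("John Smith")
other = mapper.anonymize("Jane Doe")
print(first == second, first != other)
```

Because the replacement is looked up by the original value, every row that mentions the same customer ends up with the same substitute, which is what keeps relationships between columns and files intact.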
"Unsupported file type"
- Ensure the file has a .csv, .xlsx, or .xls extension
- Check file is not corrupted
"Error reading clipboard"
- Make sure you've copied data from Excel/Sheets first
- Try copying a smaller dataset
Missing dependencies
pip install pandas faker openpyxl

- For large files (>100k rows), consider processing in chunks
- Use the skip type for columns that don't need anonymization
- Set a seed for reproducible results during testing
python main.py test_data.csv -i
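
The tip about processing large files in chunks can be sketched as follows. The tool does not document a chunking option, so this is a standalone illustration using only the standard library; `iter_chunks`, `anonymize_in_chunks`, and the `transform` callback are hypothetical names. With pandas, `read_csv(..., chunksize=N)` offers a similar pattern.

```python
import csv
import io
from itertools import islice


def iter_chunks(reader, chunk_size):
    """Yield lists of up to chunk_size rows until the reader is exhausted."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def anonymize_in_chunks(src, dst, transform, chunk_size=50_000):
    """Read CSV rows from src, apply transform to each row, write them to dst."""
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    total = 0
    for chunk in iter_chunks(reader, chunk_size):
        writer.writerows(transform(row) for row in chunk)
        total += len(chunk)
    return total


# In-memory demo; real use would pass open file handles instead.
src = io.StringIO("name,salary\nAnn,10\nBob,20\nCy,30\n")
dst = io.StringIO()
count = anonymize_in_chunks(src, dst, lambda row: {**row, "name": "REDACTED"}, chunk_size=2)
print(count)
```

Only one chunk of rows is held in memory at a time, so peak memory depends on `chunk_size` rather than on the size of the file.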