Entity Neutering

A methodology for pre-processing text data to prevent lookahead bias in Large Language Models (LLMs), particularly in financial text analysis.

Overview

Entity Neutering is a text preprocessing technique that anonymizes financial documents and news articles so that LLMs cannot draw on prior knowledge of specific companies, industries, or time periods when making predictions. Predictions then rest solely on the textual content provided, which removes the lookahead bias that can artificially inflate measured performance on text that is likely to be in the LLM's training sample.

Authors

Joseph Engelberg, Asaf Manela, William Mullins, Luka Vulicevic

Academic Paper

The full methodology and research findings are detailed in the academic paper available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5182756

If you use this methodology in your research, please cite the original paper:

@article{engelberg2025entity,
  title={Entity Neutering},
  author={Engelberg, Joseph and Manela, Asaf and Mullins, William and Vulicevic, Luka},
  journal={Available at SSRN},
  year={2025}
}

Key Features

Text is masked and/or paraphrased iteratively until the same LLM cannot infer the entity.

LLM-Based Masking (Optional): Anonymization of all identifying features of the text
- Company names → Company_1, Company_2, etc.
- Industry terms → industry_x, sector_x
- Dates and numbers → time_x, number_x
- Products and services → product_x, service_x
LLM-Based Paraphrasing (Optional): Structural and lexical transformation to preserve anonymity
- Maintains core information while making text unrecognizable
- Changes sentence structure and vocabulary
- Replaces identifying phrases and quotes
Flexible Processing Modes: Complete control over the neutering pipeline
- Enable both masking and paraphrasing (default)
- Use masking only
- Use paraphrasing only
- Skip neutering entirely and perform extraction on raw or pre-neutered text
Iterative Refinement: Up to 8 rounds (configurable) of additional masking and/or paraphrasing for texts that remain identifiable
Built-in Identification Testing: Automatic verification that neutering was successful
- Tests if the same LLM can identify the entity after neutering
- Tracks which round and step successfully neutered each text
- Supports multiple LLM providers (OpenAI GPT models, Ollama local models)
Robustness Testing: Additional script for testing identification rates across different models
- Measure identification success across various model sizes
- Support for open-source models (Llama, Gemma) via Ollama
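
The iterative refinement described above can be sketched roughly as follows. This is a minimal illustration and not the code in EntityNeutering.py: mask_fn, paraphrase_fn, and identify_fn are hypothetical stand-ins for the LLM calls, the toy functions at the bottom exist only so the sketch runs without a model, and treating max_rounds as the number of rounds after an initial round 0 is an assumption. The step labels follow the format listed under Key Output Columns.

```python
from typing import Callable, Tuple


def neuter_until_unidentified(
    text: str,
    mask_fn: Callable[[str], str],
    paraphrase_fn: Callable[[str], str],
    identify_fn: Callable[[str], bool],
    max_rounds: int = 8,
    masking: bool = True,
    paraphrase: bool = True,
) -> Tuple[str, int, str]:
    """Return (neutered_text, neutered_at_round, neutered_at_step).

    Hypothetical sketch: apply the enabled steps in order and stop as soon
    as identify_fn reports that the entity can no longer be inferred.
    """
    steps = []
    if masking:
        steps.append(("masking", mask_fn))
    if paraphrase:
        steps.append(("paraphrasing", paraphrase_fn))
    if not steps:
        # Assumption: the round value reported when nothing is applied.
        return text, -1, "no_neutering"

    # Round 0 is the initial pass, followed by up to max_rounds more rounds.
    for round_idx in range(max_rounds + 1):
        for step_name, step_fn in steps:
            text = step_fn(text)
            if not identify_fn(text):
                return text, round_idx, f"{step_name}_step_{round_idx}"
    return text, -1, "never_neutered"


# Toy stand-ins so the sketch runs without an LLM (made-up example text).
def toy_mask(text: str) -> str:
    return text.replace("Acme Corp", "Company_1")


def toy_paraphrase(text: str) -> str:
    return text.replace("rocket skates", "product_x")


def toy_identify(text: str) -> bool:
    return "Acme Corp" in text or "rocket skates" in text


result = neuter_until_unidentified(
    "Acme Corp recalled its rocket skates.", toy_mask, toy_paraphrase, toy_identify
)
print(result)
```

In the toy run, masking alone leaves the product name in place, so the text is only reported as neutered after the paraphrasing step of round 0.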

Project Structure

Entity_Neutering/
├── README.md                 # This file
├── LICENSE                   # Project license
├── EntityNeutering.py        # Main processing pipeline
├── OllamaRobustness.py       # Robustness testing with Ollama models
├── CommonTools/              # Shared utilities and modules
│   ├── CommonTools.py        # Core processing functions
│   ├── GPT_agent.py          # OpenAI GPT agent wrapper
│   ├── prompts.py            # Template prompts for anonymization
│   └── credentials.py        # API credentials (not included)

Neutering Parameters

The NEUTER_ARGS dictionary contains all configurable parameters for the entity neutering process. Here's a detailed breakdown of each parameter:

Required Parameters

output_dir (str): Directory where all output files will be saved
gpt_api_key (str): Your OpenAI API key. Required only when llm_provider is 'openai'; not needed for Ollama
mask_template (template): Initial masking prompt template
para_template (template): Initial paraphrasing prompt template
mask_template_iter (template): Iterative masking prompt template for additional rounds
para_template_iter (template): Iterative paraphrasing prompt template for additional rounds
de_neuter_name (template): Template for testing text identification
sentiment_template (template): Template for sentiment analysis

Processing Control Parameters

max_rounds (int, default: 8): Maximum number of iterative neutering rounds
model_name (str, default: 'gemma3:4b'): Model to use for all operations (e.g., 'gpt-4o-mini' for OpenAI, 'llama3.2' or 'gemma3:4b' for Ollama)
llm_provider (str, default: 'ollama'): LLM provider to use - either 'openai' or 'ollama'
masking (bool, default: True): Enable masking step in the neutering pipeline
paraphrase (bool, default: True): Enable paraphrasing step in the neutering pipeline
extraction_name (str, default: 'LLM_RAW'): Column name for raw text sentiment extraction
identification_column (str, default: 'Body Text Neutered'): Column name to use for identification testing when all processing is disabled

Output Control Parameters

save_intermediate (bool, default: True): Save intermediate processing steps
save_round_files (bool, default: True): Save separate files for each iterative round
use_random_filename (bool, default: True): Generate random filenames to prevent collisions
custom_filename_prefix (str, default: "instance"): Prefix for output filenames

Sentiment Analysis Parameters

perform_sentiment (bool, default: True): Enable sentiment analysis
sentiment_on_raw (bool, default: True): Perform sentiment analysis on original text
sentiment_on_neutered (bool, default: True): Perform sentiment analysis on neutered text

Logging and Debugging Parameters

verbose (bool, default: True): Print detailed progress messages
log_errors (bool, default: True): Log errors to file
error_log_path (str, default: None): Custom path for error log (uses output_dir if None)
return_dataframe (bool, default: False): Return the processed DataFrame in addition to saving it to disk
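
Several of these parameters interact, so it can help to sanity-check a configuration before starting a long run. The helper below is a hypothetical sketch, not part of the package: check_neuter_args is an invented name, and the rules it enforces are only those stated in the parameter descriptions above. The package may perform its own validation.

```python
from typing import Any, Dict, List


def check_neuter_args(args: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    provider = args.get("llm_provider", "ollama")
    if provider not in ("openai", "ollama"):
        problems.append(f"llm_provider must be 'openai' or 'ollama', got {provider!r}")

    # The API key only matters for the OpenAI provider.
    if provider == "openai" and not args.get("gpt_api_key"):
        problems.append("gpt_api_key is required when llm_provider is 'openai'")

    if not args.get("output_dir"):
        problems.append("output_dir is required")

    max_rounds = args.get("max_rounds", 8)
    if not isinstance(max_rounds, int) or max_rounds < 0:
        problems.append("max_rounds must be a non-negative integer")

    # Sentiment analysis needs at least one target text.
    if args.get("perform_sentiment", True) and not (
        args.get("sentiment_on_raw", True) or args.get("sentiment_on_neutered", True)
    ):
        problems.append("perform_sentiment is on but both sentiment targets are off")

    return problems


print(check_neuter_args({"output_dir": "out/", "llm_provider": "openai"}))
```

A typical use would be to call check_neuter_args on a copy of NEUTER_ARGS after updating it and to stop if the returned list is not empty.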

Usage

Complete Entity Neutering Process (Recommended)

For the full Entity Neutering process with default settings:

import pandas as pd
from EntityNeutering import neuter_data, NEUTER_ARGS

# Load your financial text data
df = pd.read_csv('your_data.csv')

# Configure output directory
NEUTER_ARGS['output_dir'] = 'output_directory/'

# Run the complete entity neutering process with default settings
neuter_data(df, **NEUTER_ARGS)

Note: For processing large datasets, run EntityNeutering.py directly after configuring paths in the script. The main script automatically uses multiprocessing to utilize all available CPU cores, significantly speeding up processing time.
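
The main script's multiprocessing code is not reproduced here. As a rough sketch of the general pattern (split the rows into one chunk per core, process the chunks in parallel, then recombine), under the assumption that each chunk can be processed independently: split_into_chunks, process_chunk, and process_in_parallel are hypothetical names, and process_chunk is a placeholder where a real worker would call neuter_data on its slice.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import List, Sequence


def split_into_chunks(rows: Sequence, n_chunks: int) -> List[list]:
    """Split rows into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(rows)))
    size, extra = divmod(len(rows), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(rows[start:end]))
        start = end
    return chunks


def process_chunk(chunk: list) -> list:
    """Placeholder worker; a real one would run the neutering pipeline."""
    return [text.upper() for text in chunk]


def process_in_parallel(rows: Sequence, workers: int = 0) -> list:
    """Process rows in parallel and return results in the original order."""
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(rows, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    return [item for chunk in results for item in chunk]
```

On platforms that start worker processes by spawning a fresh interpreter, process_in_parallel should be called from under an `if __name__ == "__main__":` guard.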

Customized Processing Variations

1. Quick Processing (No Intermediate Files)

from EntityNeutering import neuter_data, NEUTER_ARGS

# Minimal file saving for faster processing
custom_args = NEUTER_ARGS.copy()
custom_args.update({
    'save_intermediate': False,
    'save_round_files': False,
    'verbose': False,
})

neuter_data(df, **custom_args)

2. Masking Only (No Paraphrasing)

from EntityNeutering import neuter_data, NEUTER_ARGS

# Apply only masking without paraphrasing
masking_only_args = NEUTER_ARGS.copy()
masking_only_args.update({
    'masking': True,
    'paraphrase': False,
    'max_rounds': 8,
})

neuter_data(df, **masking_only_args)

3. Paraphrasing Only (No Masking)

from EntityNeutering import neuter_data, NEUTER_ARGS

# Apply only paraphrasing without masking
paraphrase_only_args = NEUTER_ARGS.copy()
paraphrase_only_args.update({
    'masking': False,
    'paraphrase': True,
    'max_rounds': 8,
})

neuter_data(df, **paraphrase_only_args)

4. No Neutering (Extraction on Raw Text Only)

from EntityNeutering import neuter_data, NEUTER_ARGS

# Skip neutering entirely, only perform extraction (e.g., sentiment) on raw text
raw_text_only_args = NEUTER_ARGS.copy()
raw_text_only_args.update({
    'masking': False,
    'paraphrase': False,
    'perform_sentiment': True,
    'sentiment_on_raw': True,
    'sentiment_on_neutered': False,  # No neutering performed
})

neuter_data(df, **raw_text_only_args)

5. Extraction on Pre-Neutered Text

from EntityNeutering import neuter_data, NEUTER_ARGS

# If you already have neutered text in your data and want to extract sentiment from it
# Note: Specify 'identification_column' if your neutered text column has a different name
# Default is 'Body Text Neutered'
pre_neutered_args = NEUTER_ARGS.copy()
pre_neutered_args.update({
    'masking': False,
    'paraphrase': False,
    'perform_sentiment': True,
    'sentiment_on_raw': False,
    'sentiment_on_neutered': True,
    'identification_column': 'Body Text Neutered',  # Specify your neutered text column name
})

neuter_data(df, **pre_neutered_args)

6. Increasing Maximum Iterations

from EntityNeutering import neuter_data, NEUTER_ARGS

# Increase maximum rounds for difficult-to-neuter texts
increased_rounds_args = NEUTER_ARGS.copy()
increased_rounds_args.update({
    'max_rounds': 20,  # Increase from default 8 to 20 rounds
})

neuter_data(df, **increased_rounds_args)

7. Using OpenAI GPT Models

from EntityNeutering import neuter_data, NEUTER_ARGS

# Use OpenAI GPT models instead of Ollama
openai_args = NEUTER_ARGS.copy()
openai_args.update({
    'llm_provider': 'openai',
    'model_name': 'gpt-4o-mini',  # or 'gpt-4-turbo', 'gpt-4', etc.
    'gpt_api_key': 'your-openai-api-key',
    'max_rounds': 6,
})

neuter_data(df, **openai_args)

8. Using Different Ollama Models

from EntityNeutering import neuter_data, NEUTER_ARGS

# Use different Ollama models
ollama_args = NEUTER_ARGS.copy()
ollama_args.update({
    'llm_provider': 'ollama',
    'model_name': 'llama3.2',  # or 'gemma3:27b', 'mistral', etc.
    'max_rounds': 8,
})

neuter_data(df, **ollama_args)

Installation & Requirements

Python Environment

Python Version: 3.11.11
Package Manager: conda / venv

Dependencies

# Core packages
pandas
numpy
matplotlib
seaborn
plotly

# NLP and ML
nltk
gensim
scikit-learn
scipy
statsmodels

# LLM Integration
openai
ollama
requests

# Text Processing
fuzzywuzzy
lxml

# Utilities
click
openpyxl
pillow

Data Requirements

To use the built-in prompts, your input DataFrame must contain the following columns:

Body Text: The text content to be neutered
COMNAM: Company name
Companies: Company ticker symbol
SIC_Industry: Industry classification
Date: Date information (will be converted to datetime)
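
A quick pre-flight check for these columns can be sketched as below. missing_columns is a hypothetical helper, not a function exported by the package; it accepts any iterable of column names, so it works with df.columns from pandas.

```python
from typing import Iterable, List

# Column names required by the built-in prompts, as listed above.
REQUIRED_COLUMNS = ["Body Text", "COMNAM", "Companies", "SIC_Industry", "Date"]


def missing_columns(columns: Iterable[str]) -> List[str]:
    """Return the required column names that are absent, in listed order."""
    present = set(columns)
    return [name for name in REQUIRED_COLUMNS if name not in present]


print(missing_columns(["Body Text", "COMNAM", "Date"]))
```

With a DataFrame loaded as in the Usage section, `missing_columns(df.columns)` returning an empty list means the input has every column the built-in prompts expect.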

Output Files

The neutering process generates several output files depending on your configuration:

Per-Instance Files

*_step1_masked.csv: After initial masking (if save_intermediate=True and masking=True)
*_step2_paraphrased.csv: After initial paraphrasing (if save_intermediate=True and paraphrase=True)
*_masking_step_X_neutered.csv: Successfully neutered texts from masking in round X (if save_round_files=True)
*_paraphrasing_step_X_neutered.csv: Successfully neutered texts from paraphrasing in round X (if save_round_files=True)
*_round_X_masked.csv: Intermediate masked text in round X (if save_intermediate=True and save_round_files=True)
*_round_X_paraphrased.csv: Intermediate paraphrased text in round X (if save_intermediate=True and save_round_files=True)
*_NEUTERED.csv: All neutered texts before sentiment analysis
*_sent_neut.csv: After sentiment analysis on neutered text (if save_intermediate=True and sentiment_on_neutered=True)
*_FINAL_FINAL_FINAL.csv: Final output with all processing complete

Combined Output

Finished_Data.csv: Combined results from all parallel processes (generated by main script)

Key Output Columns

Body Text Neutered: The final neutered text
neutered_at_round: Which iteration round successfully neutered the text (0 for initial round, -1 for never neutered)
neutered_at_step: Which step successfully neutered the text (e.g., 'masking_step_0', 'paraphrasing_step_2', 'never_neutered', 'no_neutering')
iort_guess_correct_neutered: Binary indicator (0/1) of whether the LLM correctly identified the entity after neutering (0 = not identified, so neutering succeeded; 1 = identified, so neutering failed)
Direction / Direction_neut: Sentiment direction on raw/neutered text (if sentiment analysis enabled)
Magnitude / Magnitude_neut: Sentiment magnitude on raw/neutered text (if sentiment analysis enabled)
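
These columns make it straightforward to summarize how a run went. The sketch below is an illustration under stated assumptions: summarize_neutering is a hypothetical helper, the rows are dictionaries such as those produced by csv.DictReader over an output file, and the sample rows are made up for the example rather than taken from real output.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_neutering(rows: Iterable[Dict[str, str]]) -> Tuple[float, Counter]:
    """Return (share of texts still identified, counts per neutered_at_step)."""
    steps: Counter = Counter()
    identified = 0
    total = 0
    for row in rows:
        total += 1
        # Values may arrive as "1", "0", or "1.0" depending on how they were saved.
        identified += int(float(row["iort_guess_correct_neutered"]))
        steps[row["neutered_at_step"]] += 1
    share = identified / total if total else 0.0
    return share, steps


# Made-up rows, for illustration only.
sample_rows = [
    {"iort_guess_correct_neutered": "0", "neutered_at_step": "masking_step_0"},
    {"iort_guess_correct_neutered": "0", "neutered_at_step": "masking_step_0"},
    {"iort_guess_correct_neutered": "0", "neutered_at_step": "paraphrasing_step_2"},
    {"iort_guess_correct_neutered": "1", "neutered_at_step": "never_neutered"},
]
share, step_counts = summarize_neutering(sample_rows)
print(share, step_counts)
```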

Research Applications

Entity Neutering is particularly valuable for:

Financial NLP Research: Eliminating lookahead bias in financial sentiment analysis and text-based predictions
Bias-Free Model Evaluation: Testing LLM capabilities without the confounding effect of memorized information
Privacy Protection: Anonymizing sensitive financial documents and news articles
Sentiment Analysis: Studying firm-level news sentiment independent of company identity
Robustness Testing: Evaluating whether text-based signals are truly content-driven or identity-driven

Robustness Testing Across Models

The OllamaRobustness.py script lets you test whether your neutered texts remain unidentifiable to other LLMs. This helps verify that neutering is robust across model architectures and sizes.

Usage

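import pandas as pd
from EntityNeutering import NEUTER_ARGS
from OllamaRobustness import process_llm_responses

# Identification prompt template; assumed here to be the one stored in NEUTER_ARGS
de_neuter_name = NEUTER_ARGS['de_neuter_name']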

# Load neutered data
df = pd.read_csv('neutered_data.csv')

# Test with Llama 3.2
df = process_llm_responses(
    df,
    'LLM_DENEUTER_GUESS_LLAMA_3B',
    'Body Text Neutered',
    de_neuter_name,
    model_to_use='llama3.2'
)

# Test with Gemma
df = process_llm_responses(
    df,
    'LLM_DENEUTER_GUESS_GEMMA_4B',
    'Body Text Neutered',
    de_neuter_name,
    model_to_use='gemma3:4b'
)

This allows you to measure identification rates across different models to ensure your neutering approach generalizes well.
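
One way to turn the guess columns into an identification rate is sketched below. Several things here are assumptions: that each guess column holds a plain company-name string, that the true name is in COMNAM, and that a fuzzy string match is an acceptable scoring rule. The helpers is_match and identification_rate are hypothetical, and difflib from the standard library is used as a stand-in for whatever matching the project itself applies.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable


def is_match(guess: str, truth: str, threshold: float = 0.8) -> bool:
    """Case-insensitive fuzzy comparison of a guessed and a true name."""
    a, b = guess.strip().lower(), truth.strip().lower()
    return bool(a) and SequenceMatcher(None, a, b).ratio() >= threshold


def identification_rate(
    rows: Iterable[Dict[str, str]],
    guess_column: str,
    truth_column: str = "COMNAM",
    threshold: float = 0.8,
) -> float:
    """Share of rows where the model's guess matches the true company name."""
    rows = list(rows)
    if not rows:
        return 0.0
    hits = sum(is_match(r[guess_column], r[truth_column], threshold) for r in rows)
    return hits / len(rows)


# Made-up rows, for illustration only.
rows = [
    {"COMNAM": "APPLE INC", "LLM_DENEUTER_GUESS_LLAMA_3B": "Apple Inc"},
    {"COMNAM": "MICROSOFT CORP", "LLM_DENEUTER_GUESS_LLAMA_3B": "Unknown"},
]
rate = identification_rate(rows, "LLM_DENEUTER_GUESS_LLAMA_3B")
print(rate)
```

With a DataFrame, the same function can be fed `df.to_dict("records")`, once per guess column, to compare models side by side.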

Performance Considerations

Multiprocessing: The main script automatically utilizes all CPU cores for parallel processing of large datasets
Flexible LLM Providers:
- Ollama (default): Free local models with no API costs, ideal for large-scale processing
- OpenAI: Cloud-based models with API costs but potentially higher quality
API Rate Limiting: Built-in error handling for API rate limits and timeout issues
Incremental Saving: Regular saving of intermediate results to prevent data loss
Memory Management: Efficient processing of large datasets through chunking and parallel execution

Contributing

This codebase supports ongoing research into lookahead bias prevention in financial NLP. Contributions are welcome, particularly in the following areas:

Additional anonymization strategies beyond masking and paraphrasing
New evaluation metrics for measuring neutering effectiveness
Integration with additional LLM providers (e.g., Anthropic Claude, Google Gemini API)
Performance optimizations for processing speed
Enhanced prompt templates for specific use cases
Support for additional languages beyond English

License

See LICENSE file for project licensing terms.
