A methodology to pre-process text data for preventing lookahead bias in Large Language Models (LLMs), particularly in financial text analysis.
Entity Neutering is a text preprocessing technique that anonymizes financial documents and news articles so that LLMs cannot draw on prior knowledge about specific companies, industries, or time periods when making predictions. This ensures that predictions are based solely on the textual content provided, eliminating the lookahead bias that can artificially inflate model performance on text likely to appear in the LLM's training sample.
The full methodology and research findings are detailed in the academic paper available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5182756
If you use this methodology in your research, please cite the original paper:
```bibtex
@article{engelberg2025entity,
  title={Entity Neutering},
  author={Engelberg, Joseph and Manela, Asaf and Mullins, William and Vulicevic, Luka},
  journal={Available at SSRN},
  year={2025}
}
```
- LLM-Based Masking (Optional): Anonymization of all identifying features of the text
  - Company names → `Company_1`, `Company_2`, etc.
  - Industry terms → `industry_x`, `sector_x`
  - Dates and numbers → `time_x`, `number_x`
  - Products and services → `product_x`, `service_x`
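As a hypothetical before/after illustration of the placeholder scheme (the actual substitutions are produced by the LLM masking prompt, not by this snippet):

```python
# Invented example text; shows the kind of output the masking step aims for.
raw = "Apple's iPhone revenue rose 8% in Q3 2023, lifting the tech sector."
masked = ("Company_1's product_1 revenue rose number_1% in time_1, "
          "lifting sector_1.")

# A successfully masked text contains none of the identifying tokens.
for token in ["Apple", "iPhone", "2023", "tech"]:
    assert token not in masked
```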
- LLM-Based Paraphrasing (Optional): Structural and lexical transformation to preserve anonymity
  - Maintains core information while making the text unrecognizable
  - Changes sentence structure and vocabulary
  - Replaces identifying phrases and quotes
- Flexible Processing Modes: Complete control over the neutering pipeline
  - Enable both masking and paraphrasing (default)
  - Use masking only
  - Use paraphrasing only
  - Skip neutering entirely and perform extraction on raw or pre-neutered text
- Iterative Refinement: Up to 8 rounds (configurable) of additional masking and/or paraphrasing for texts that remain identifiable
- Built-in Identification Testing: Automatic verification that neutering was successful
  - Tests whether the same LLM can identify the entity after neutering
  - Tracks which round and step successfully neutered each text
  - Supports multiple LLM providers (OpenAI GPT models, Ollama local models)
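The success check can be approximated as follows. This is a sketch, not the repo's scoring code (the actual prompt and matching live in `CommonTools/`); the requirements list fuzzywuzzy for fuzzy matching, but stdlib `difflib` is used here to keep the sketch dependency-free.

```python
from difflib import SequenceMatcher

def identified(guess: str, company_name: str, threshold: float = 0.8) -> bool:
    """Fuzzy-match the LLM's guessed firm name against the true COMNAM.

    A match above the threshold counts as an identification, i.e. the
    neutering failed for this text and another round is needed.
    """
    ratio = SequenceMatcher(None, guess.lower().strip(),
                            company_name.lower().strip()).ratio()
    return ratio >= threshold

print(identified("Apple Inc", "APPLE INC"))          # near-identical names match
print(identified("Something Else Entirely", "APPLE INC"))
```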
- Robustness Testing: Additional script for testing identification rates across different models
  - Measures identification success across various model sizes
  - Supports open-source models (Llama, Gemma) via Ollama
```
Entity_Neutering/
├── README.md               # This file
├── LICENSE                 # Project license
├── EntityNeutering.py      # Main processing pipeline
├── OllamaRobustness.py     # Robustness testing with Ollama models
└── CommonTools/            # Shared utilities and modules
    ├── CommonTools.py      # Core processing functions
    ├── GPT_agent.py        # OpenAI GPT agent wrapper
    ├── prompts.py          # Template prompts for anonymization
    └── credentials.py      # API credentials (not included)
```
The `NEUTER_ARGS` dictionary contains all configurable parameters for the entity neutering process. Here's a detailed breakdown of each parameter:

- `output_dir` (str): Directory where all output files will be saved
- `gpt_api_key` (str): Your API key (OpenAI API key if using OpenAI; not needed for Ollama)
- `mask_template` (template): Initial masking prompt template
- `para_template` (template): Initial paraphrasing prompt template
- `mask_template_iter` (template): Iterative masking prompt template for additional rounds
- `para_template_iter` (template): Iterative paraphrasing prompt template for additional rounds
- `de_neuter_name` (template): Template for testing text identification
- `sentiment_template` (template): Template for sentiment analysis

- `max_rounds` (int, default: 8): Maximum number of iterative neutering rounds
- `model_name` (str, default: 'gemma3:4b'): Model to use for all operations (e.g., 'gpt-4o-mini' for OpenAI; 'llama3.2' or 'gemma3:4b' for Ollama)
- `llm_provider` (str, default: 'ollama'): LLM provider to use, either 'openai' or 'ollama'
- `masking` (bool, default: True): Enable the masking step in the neutering pipeline
- `paraphrase` (bool, default: True): Enable the paraphrasing step in the neutering pipeline
- `extraction_name` (str, default: 'LLM_RAW'): Column name for raw-text sentiment extraction
- `identification_column` (str, default: 'Body Text Neutered'): Column name to use for identification testing when all processing is disabled

- `save_intermediate` (bool, default: True): Save intermediate processing steps
- `save_round_files` (bool, default: True): Save separate files for each iterative round
- `use_random_filename` (bool, default: True): Generate random filenames to prevent collisions
- `custom_filename_prefix` (str, default: "instance"): Prefix for output filenames

- `perform_sentiment` (bool, default: True): Enable sentiment analysis
- `sentiment_on_raw` (bool, default: True): Perform sentiment analysis on the original text
- `sentiment_on_neutered` (bool, default: True): Perform sentiment analysis on the neutered text

- `verbose` (bool, default: True): Print detailed progress messages
- `log_errors` (bool, default: True): Log errors to file
- `error_log_path` (str, default: None): Custom path for the error log (uses `output_dir` if None)
- `return_dataframe` (bool, default: False): Return the processed DataFrame instead of saving only
For the full Entity Neutering process with default settings:

```python
import pandas as pd
from EntityNeutering import neuter_data, NEUTER_ARGS

# Load your financial text data
df = pd.read_csv('your_data.csv')

# Configure output directory
NEUTER_ARGS['output_dir'] = 'output_directory/'

# Run the complete entity neutering process with default settings
neuter_data(df, **NEUTER_ARGS)
```

Note: For processing large datasets, run `EntityNeutering.py` directly after configuring the paths in the script. The main script automatically uses multiprocessing to utilize all available CPU cores, significantly speeding up processing.
```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Minimal file saving for faster processing
custom_args = NEUTER_ARGS.copy()
custom_args.update({
    'save_intermediate': False,
    'save_round_files': False,
    'verbose': False,
})
neuter_data(df, **custom_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Apply only masking without paraphrasing
masking_only_args = NEUTER_ARGS.copy()
masking_only_args.update({
    'masking': True,
    'paraphrase': False,
    'max_rounds': 8,
})
neuter_data(df, **masking_only_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Apply only paraphrasing without masking
paraphrase_only_args = NEUTER_ARGS.copy()
paraphrase_only_args.update({
    'masking': False,
    'paraphrase': True,
    'max_rounds': 8,
})
neuter_data(df, **paraphrase_only_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Skip neutering entirely; only perform extraction (e.g., sentiment) on raw text
raw_text_only_args = NEUTER_ARGS.copy()
raw_text_only_args.update({
    'masking': False,
    'paraphrase': False,
    'perform_sentiment': True,
    'sentiment_on_raw': True,
    'sentiment_on_neutered': False,  # No neutering performed
})
neuter_data(df, **raw_text_only_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# If you already have neutered text in your data and want to extract sentiment from it.
# Note: Specify 'identification_column' if your neutered text column has a
# different name; the default is 'Body Text Neutered'.
pre_neutered_args = NEUTER_ARGS.copy()
pre_neutered_args.update({
    'masking': False,
    'paraphrase': False,
    'perform_sentiment': True,
    'sentiment_on_raw': False,
    'sentiment_on_neutered': True,
    'identification_column': 'Body Text Neutered',  # Your neutered text column name
})
neuter_data(df, **pre_neutered_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Increase maximum rounds for difficult-to-neuter texts
increased_rounds_args = NEUTER_ARGS.copy()
increased_rounds_args.update({
    'max_rounds': 20,  # Increase from the default 8 to 20 rounds
})
neuter_data(df, **increased_rounds_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Use OpenAI GPT models instead of Ollama
openai_args = NEUTER_ARGS.copy()
openai_args.update({
    'llm_provider': 'openai',
    'model_name': 'gpt-4o-mini',  # or 'gpt-4-turbo', 'gpt-4', etc.
    'gpt_api_key': 'your-openai-api-key',
    'max_rounds': 6,
})
neuter_data(df, **openai_args)
```

```python
from EntityNeutering import neuter_data, NEUTER_ARGS

# Use different Ollama models
ollama_args = NEUTER_ARGS.copy()
ollama_args.update({
    'llm_provider': 'ollama',
    'model_name': 'llama3.2',  # or 'gemma3:27b', 'mistral', etc.
    'max_rounds': 8,
})
neuter_data(df, **ollama_args)
```

- Python Version: 3.11.11
- Package Manager: conda / venv
```
# Core packages
pandas
numpy
matplotlib
seaborn
plotly

# NLP and ML
nltk
gensim
scikit-learn
scipy
statsmodels

# LLM Integration
openai
ollama
requests

# Text Processing
fuzzywuzzy
lxml

# Utilities
click
openpyxl
pillow
```

Your input DataFrame must contain these required columns to use the built-in prompts:

- `Body Text`: The text content to be neutered
- `COMNAM`: Company name
- `Companies`: Company ticker symbol
- `SIC_Industry`: Industry classification
- `Date`: Date information (will be converted to datetime)
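A minimal input frame satisfying these requirements (column names as above; the values are placeholders, not real data):

```python
import pandas as pd

# Toy one-row input with the required columns; real data would come
# from your news or filings feed.
df = pd.DataFrame({
    "Body Text": ["Shares of the firm fell after weak quarterly guidance."],
    "COMNAM": ["EXAMPLE CORP"],        # company name
    "Companies": ["EXMP"],             # ticker symbol
    "SIC_Industry": ["Electronics"],   # industry classification
    "Date": ["2024-03-31"],            # parsed to datetime by the pipeline
})
df["Date"] = pd.to_datetime(df["Date"])
print(df.dtypes)
```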
The neutering process generates several output files depending on your configuration:
- `*_step1_masked.csv`: After initial masking (if `save_intermediate=True` and `masking=True`)
- `*_step2_paraphrased.csv`: After initial paraphrasing (if `save_intermediate=True` and `paraphrase=True`)
- `*_masking_step_X_neutered.csv`: Successfully neutered texts from masking in round X (if `save_round_files=True`)
- `*_paraphrasing_step_X_neutered.csv`: Successfully neutered texts from paraphrasing in round X (if `save_round_files=True`)
- `*_round_X_masked.csv`: Intermediate masked text in round X (if `save_intermediate=True` and `save_round_files=True`)
- `*_round_X_paraphrased.csv`: Intermediate paraphrased text in round X (if `save_intermediate=True` and `save_round_files=True`)
- `*_NEUTERED.csv`: All neutered texts before sentiment analysis
- `*_sent_neut.csv`: After sentiment analysis on neutered text (if `save_intermediate=True` and `sentiment_on_neutered=True`)
- `*_FINAL_FINAL_FINAL.csv`: Final output with all processing complete
- `Finished_Data.csv`: Combined results from all parallel processes (generated by the main script)
The final output includes these key columns:

- `Body Text Neutered`: The final neutered text
- `neutered_at_round`: Which iteration round successfully neutered the text (0 for the initial round, -1 if never neutered)
- `neutered_at_step`: Which step successfully neutered the text (e.g., 'masking_step_0', 'paraphrasing_step_2', 'never_neutered', 'no_neutering')
- `iort_guess_correct_neutered`: Binary indicator of whether the LLM identified the entity after neutering (0 = not identified, i.e., neutering succeeded; 1 = identified)
- `Direction` / `Direction_neut`: Sentiment direction on raw/neutered text (if sentiment analysis is enabled)
- `Magnitude` / `Magnitude_neut`: Sentiment magnitude on raw/neutered text (if sentiment analysis is enabled)
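For downstream analysis you typically keep only texts that were successfully neutered. A sketch using a toy frame with the output columns described above (the column semantics follow the descriptions here; verify against your own output):

```python
import pandas as pd

# Toy output frame mimicking the pipeline's result columns.
out = pd.DataFrame({
    "Body Text Neutered": ["text a", "text b", "text c"],
    "neutered_at_round": [0, 2, -1],           # -1 = never neutered
    "iort_guess_correct_neutered": [0, 0, 1],  # 1 = LLM still identified it
})

# Keep rows where neutering succeeded and the LLM failed to identify the entity.
clean = out[(out["neutered_at_round"] != -1)
            & (out["iort_guess_correct_neutered"] == 0)]
print(len(clean))
```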
Entity Neutering is particularly valuable for:
- Financial NLP Research: Eliminating lookahead bias in financial sentiment analysis and text-based predictions
- Bias-Free Model Evaluation: Testing LLM capabilities without the confounding effect of memorized information
- Privacy Protection: Anonymizing sensitive financial documents and news articles
- Sentiment Analysis: Studying firm-level news sentiment independent of company identity
- Robustness Testing: Evaluating whether text-based signals are truly content-driven or identity-driven
The OllamaRobustness.py script allows you to test how well your neutered texts perform across different LLM models. This helps verify that neutering is robust across various model architectures and sizes.
```python
import pandas as pd
from OllamaRobustness import process_llm_responses
# de_neuter_name is the identification-test prompt template
# (defined in CommonTools/prompts.py)
from CommonTools.prompts import de_neuter_name

# Load neutered data
df = pd.read_csv('neutered_data.csv')

# Test with Llama 3.2
df = process_llm_responses(
    df,
    'LLM_DENEUTER_GUESS_LLAMA_3B',
    'Body Text Neutered',
    de_neuter_name,
    model_to_use='llama3.2'
)

# Test with Gemma
df = process_llm_responses(
    df,
    'LLM_DENEUTER_GUESS_GEMMA_4B',
    'Body Text Neutered',
    de_neuter_name,
    model_to_use='gemma3:4b'
)
```

This allows you to measure identification rates across different models to ensure your neutering approach generalizes well.
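Once the guesses are scored, per-model identification rates can be compared directly. A sketch on toy data, assuming hypothetical 0/1 correctness columns (one per model, scored as in the main pipeline):

```python
import pandas as pd

# Toy scored results: 1 = that model identified the entity despite neutering.
# Column names here are illustrative, not produced by the repo.
scored = pd.DataFrame({
    "guess_correct_llama": [0, 0, 1, 0],
    "guess_correct_gemma": [0, 1, 1, 0],
})

# Lower identification rate = neutering is more robust against that model.
rates = scored[["guess_correct_llama", "guess_correct_gemma"]].mean()
print(rates)
```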
- Multiprocessing: The main script automatically utilizes all CPU cores for parallel processing of large datasets
- Flexible LLM Providers:
- Ollama (default): Free local models with no API costs, ideal for large-scale processing
- OpenAI: Cloud-based models with API costs but potentially higher quality
- API Rate Limiting: Built-in error handling for API rate limits and timeout issues
- Incremental Saving: Regular saving of intermediate results to prevent data loss
- Memory Management: Efficient processing of large datasets through chunking and parallel execution
This codebase supports ongoing research into lookahead bias prevention in financial NLP. Contributions are welcome, particularly in the following areas:
- Additional anonymization strategies beyond masking and paraphrasing
- New evaluation metrics for measuring neutering effectiveness
- Integration with additional LLM providers (e.g., Anthropic Claude, Google Gemini API)
- Performance optimizations for processing speed
- Enhanced prompt templates for specific use cases
- Support for additional languages beyond English
See LICENSE file for project licensing terms.