Repository: https://github.com/chirindaopensource/extracting_structure_press_releases_predicting_earnings_announcement_returns
Owner: 2025 Craig Chirinda (Open Source Projects)
This repository contains an independent, professional-grade Python implementation of the research methodology from the 2025 paper entitled "Extracting the Structure of Press Releases for Predicting Earnings Announcement Returns" by:
- Yuntao Wu
- Ege Mert Akin
- Charles Martineau
- Vincent Grégoire
- Andreas Veneris
The project provides a complete, end-to-end computational framework for replicating the paper's findings. It delivers a modular, auditable, and extensible pipeline that executes the entire research workflow: from rigorous data validation and multi-stage text cleaning to multi-modal feature engineering, rolling-window predictive modeling, advanced interpretability analysis, and empirical simulations.
- Introduction
- Theoretical Background
- Features
- Methodology Implemented
- Core Components (Notebook Structure)
- Key Callable: run_full_analysis_pipeline
- Workflow Diagram
- Prerequisites
- Installation
- Input Data Structure
- Usage
- Output Structure
- Project Structure
- Customization
- Contributing
- Recommended Extensions
- License
- Citation
- Acknowledgments
This project provides a Python implementation of the methodologies presented in the 2025 paper "Extracting the Structure of Press Releases for Predicting Earnings Announcement Returns." The core of this repository is the iPython Notebook extracting_structure_press_releases_predicting_earnings_announcement_returns_draft.ipynb, which contains a comprehensive suite of functions to replicate the paper's findings, from initial data validation to the final generation of all analytical tables and figures.
The paper investigates the predictive power of textual "soft information" in corporate earnings press releases relative to numerical "hard information" (earnings surprise). This codebase operationalizes the paper's framework, allowing users to:
- Rigorously validate and manage the entire experimental configuration.
- Process and clean raw HTML press releases from SEC filings.
- Generate five distinct sets of textual features using classical and deep learning NLP models.
- Train predictive models in a rolling-window framework to prevent look-ahead bias.
- Perform a full suite of econometric, interpretability, and simulation analyses to replicate the paper's key tables and figures.
The implemented methods are grounded in asset pricing, econometrics, and natural language processing.
1. Hard vs. Soft Information: The study tests the relative importance of two information types:
- Hard Information (Earnings Surprise): The quantitative surprise, defined as the deviation of reported earnings from analyst expectations, scaled by price. $$ \text{Surprise}_{c,t} = \frac{\text{EPS}_{c,\tau} - E_{\tau-1}[\text{EPS}_{c,t}]}{P_{c,\tau-5}} $$
- Soft Information (Textual Content): The qualitative narrative content of the press release, captured by various NLP models.
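The surprise definition above can be sketched in a few lines of pandas. This is only an illustration: the column names `eps_actual`, `eps_consensus`, and `price_lag5` are hypothetical placeholders and are not taken from the notebook's schema.

```python
import pandas as pd


def earnings_surprise(df: pd.DataFrame) -> pd.Series:
    """Price-scaled earnings surprise: (actual EPS - consensus EPS) / lagged price.

    Column names are hypothetical placeholders for illustration only.
    """
    # Treat non-positive prices as missing so they cannot produce a sign flip or division by zero.
    price = df["price_lag5"].where(df["price_lag5"] > 0)
    return (df["eps_actual"] - df["eps_consensus"]) / price


events = pd.DataFrame(
    {
        "eps_actual": [1.10, 0.45],
        "eps_consensus": [1.00, 0.50],
        "price_lag5": [50.0, 25.0],
    }
)
events["surprise"] = earnings_surprise(events)
```

In this toy frame the first event beats consensus by 10 cents on a 50 dollar stock and the second misses by 5 cents on a 25 dollar stock, so the two surprises are equal in magnitude and opposite in sign.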
2. Rolling-Window Lasso Regression:
To convert high-dimensional text features into a single predictive "soft score" without look-ahead bias, a rolling-window estimation is used. For each year t, a Lasso regression is trained on year t's data to learn a mapping from text features to returns. This model is then used to predict out-of-sample scores for year t+1.
$$
\hat{\mathbf{w}}_t = \arg\min_{\mathbf{w}} \left\{ \frac{1}{2N_{t}} \|\mathbf{X}_{t}\mathbf{w} - \mathbf{y}_{t}\|_2^2 + \lambda \|\mathbf{w}\|_1 \right\} \quad \implies \quad \text{SoftScore}_{t+1} = \mathbf{X}_{t+1} \cdot \hat{\mathbf{w}}_t
$$
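A minimal sketch of this train-on-year-t, score-year-t+1 loop, assuming scikit-learn's `Lasso` (whose objective has the same 1/(2N) scaling as the formula above). The function name `rolling_soft_scores`, the value of `alpha`, and the synthetic data are illustrative assumptions, not the notebook's code or the paper's settings.

```python
import numpy as np
from sklearn.linear_model import Lasso


def rolling_soft_scores(X, y, years, alpha=0.01):
    """Fit a Lasso on year t only, then score year t+1 out of sample.

    `alpha` plays the role of lambda; 0.01 is an arbitrary placeholder.
    """
    scores = np.full(len(y), np.nan)  # the first year has no prior model, so it stays NaN
    for t in np.unique(years):
        train, test = years == t, years == t + 1
        if not test.any():
            continue
        # fit_intercept=False so that predictions equal X_{t+1} @ w_t, as in the formula.
        model = Lasso(alpha=alpha, fit_intercept=False).fit(X[train], y[train])
        scores[test] = model.predict(X[test])
    return scores


# Synthetic example: 3 years, 200 events per year, 20 text features, 3 of them informative.
rng = np.random.default_rng(0)
years = np.repeat([2005, 2006, 2007], 200)
X = rng.normal(size=(600, 20))
w_true = np.zeros(20)
w_true[:3] = [0.5, -0.3, 0.2]
y = X @ w_true + rng.normal(scale=0.1, size=600)
soft = rolling_soft_scores(X, y, years)
```

Because each score for year t+1 uses weights estimated only from year t, no information from the scored year enters the model.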
3. Panel Data Regression with Clustered Standard Errors:
To assess the explanatory power of the signals, a cross-sectional regression is estimated with standard errors clustered by both firm (permno) and time (ann_trade_date). Two-way clustering is standard practice in financial econometrics because residuals can be correlated both within a firm over time and across firms on the same date.
$$
\text{Ret}_{c,\tau} = \alpha + \beta_0 \text{Surprise}_{c,t} + \beta_1 \text{Soft}_{c,t} + \epsilon_{c,\tau}
$$
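To make the two-way clustering concrete, here is a NumPy-only sketch of the regression above. It uses the Cameron-Gelbach-Miller construction (firm-clustered plus time-clustered minus intersection-clustered covariance) and omits small-sample corrections. The function names and the synthetic data are assumptions for illustration; the notebook itself may rely on `linearmodels` or `statsmodels` instead.

```python
import numpy as np


def _codes(labels):
    """Map arbitrary cluster labels to dense integer codes 0..G-1."""
    return np.unique(np.asarray(labels), return_inverse=True)[1].ravel()


def _cluster_meat(X, resid, codes):
    """Sum over clusters g of (X_g' u_g)(X_g' u_g)'."""
    summed = np.zeros((codes.max() + 1, X.shape[1]))
    np.add.at(summed, codes, X * resid[:, None])  # accumulate score rows per cluster
    return summed.T @ summed


def ols_twoway_clustered(y, X, firm, date):
    """OLS with intercept; returns (coefficients, two-way clustered standard errors)."""
    X = np.column_stack([np.ones(len(y)), X])
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    f, d = _codes(firm), _codes(date)
    pair = _codes(f * (d.max() + 1) + d)  # firm-date intersection clusters
    meat = (_cluster_meat(X, resid, f) + _cluster_meat(X, resid, d)
            - _cluster_meat(X, resid, pair))
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))


# Synthetic panel: 50 firms x 20 announcement dates with firm and date shocks in the residual.
rng = np.random.default_rng(0)
n_firms, n_dates = 50, 20
firm = np.repeat(np.arange(n_firms), n_dates)
date = np.tile(np.arange(n_dates), n_firms)
surprise = rng.normal(size=firm.size)
soft = rng.normal(size=firm.size)
noise = (rng.normal(scale=0.3, size=n_firms)[firm]
         + rng.normal(scale=0.3, size=n_dates)[date]
         + rng.normal(scale=0.5, size=firm.size))
ret = 0.1 + 2.0 * surprise + 0.5 * soft + noise
beta, se = ols_twoway_clustered(ret, np.column_stack([surprise, soft]), firm, date)
```

Here `beta[1]` and `beta[2]` estimate the coefficients on the hard and soft signals. The two-way covariance is not guaranteed to be positive semi-definite in small samples, which is one reason production code usually defers to a tested library.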
The provided iPython Notebook (extracting_structure_press_releases_predicting_earnings_announcement_returns_draft.ipynb) implements the full research pipeline, including:
- Modular, Multi-Task Architecture: The entire pipeline is broken down into 23 distinct, modular tasks, each with its own orchestrator function.
- Configuration-Driven Design: All study parameters are managed in an external config.yaml file, allowing for easy customization and replication.
- Multi-Modal NLP Feature Engineering: Complete pipeline for generating features from five different models: BKMX, Online LDA (oLDA), BERT, FinBERT, and MPNET.
- Econometrically Sound Modeling: Implements rolling-window estimation to prevent look-ahead bias and uses two-way clustered standard errors for robust inference.
- Advanced Interpretability Suite: Includes SHAP analysis for feature importance, and a full LLM-based pipeline (using GPT-4o) for topic labeling, taxonomy creation, and token-level attribution.
- Realistic Trading Simulations: Implements a market efficiency test and a "hacking scenario" analysis with careful handling of transaction costs and market microstructure details.
- Automated Validation: Concludes with a comprehensive validation step that programmatically compares all generated results against the key numerical findings reported in the source paper.
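As background for the SHAP-based feature importance listed above: for a linear model such as Lasso, and under the assumption of independent features, SHAP values have a closed form, namely the coefficient times the feature's deviation from its background mean. The sketch below shows that identity in NumPy; it illustrates the idea and is not the notebook's `shap` call, and both function names are hypothetical.

```python
import numpy as np


def linear_shap_values(X, coef, background):
    """SHAP values of a linear model under feature independence.

    phi[i, j] = coef[j] * (X[i, j] - mean of feature j over the background sample)
    """
    return (X - background.mean(axis=0)) * coef


def mean_abs_importance(shap_values):
    """Global importance per feature: mean absolute SHAP value across events."""
    return np.abs(shap_values).mean(axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
coef = np.array([0.8, 0.0, -0.4, 0.1])  # illustrative weights; the zero mimics a Lasso-dropped feature
intercept = 0.05
phi = linear_shap_values(X, coef, background=X)
importance = mean_abs_importance(phi)
base_value = X.mean(axis=0) @ coef + intercept  # expected prediction over the background
```

Each row of `phi` sums, together with `base_value`, to that event's model prediction, which is the additivity property that makes SHAP attributions comparable across features.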
The core analytical steps directly implement the methodology from the paper:
- Validation & Filtering (Tasks 1-2): Ingests and validates the config.yaml and raw data, then applies the paper's sample selection criteria.
- Text Cleaning (Task 3): Processes raw HTML into two standardized text formats for different model types.
- Vectorization (Tasks 4-6): Generates all five sets of textual features (BKMX, oLDA, BERT, FinBERT, MPNET).
- Predictive Modeling (Tasks 7-11): Runs the rolling Lasso estimation to generate soft scores, prepares final signals, runs baseline and combined regressions, and computes SHAP importance.
- Interpretability (Tasks 12-17): Executes the full LLM-based pipeline to understand the thematic content of the oLDA and BERT-family models.
- Simulations (Tasks 18-21): Builds the tools for and executes the market efficiency and hacking scenario simulations.
- Diagnostics & Final Validation (Tasks 22-23): Computes summary statistics and visualizations, and runs a final, automated check of all results against the paper's benchmarks.
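The text-cleaning step (Task 3) can be pictured with the sketch below. It uses only the standard library to stay self-contained, whereas the project lists beautifulsoup4 and lxml among its dependencies. The two output formats shown here (whitespace-normalized cased text for Transformer encoders, lowercase alphabetic tokens for bag-of-words models) are an assumption about what "two standardized text formats" means, and `clean_press_release` is a hypothetical name.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_press_release(html):
    """Return (sentence_text, bow_text) for one raw HTML press release."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Cased text with collapsed whitespace (non-breaking spaces included).
    sentence_text = re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
    # Lowercase alphabetic tokens only: numbers and punctuation are dropped.
    bow_text = " ".join(re.findall(r"[a-z]+", sentence_text.lower()))
    return sentence_text, bow_text


html = ("<html><style>p{color:red}</style><body>"
        "<p>Revenue rose 12%&nbsp;to $3.1 billion.</p></body></html>")
sentence_text, bow_text = clean_press_release(html)
```

For this input, `sentence_text` keeps the numbers and casing while `bow_text` reduces to the four alphabetic tokens.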
The extracting_structure_press_releases_predicting_earnings_announcement_returns_draft.ipynb notebook is structured as a logical pipeline with modular orchestrator functions for each of the 23 major tasks. All functions are self-contained, fully documented with type hints and docstrings, and designed for professional-grade execution.
The project is designed around a single, top-level user-facing interface function:
run_full_analysis_pipeline: This master orchestrator function, located in the final section of the notebook, runs the entire automated research pipeline from end-to-end. A single call to this function reproduces the entire computational portion of the project, from data validation to the final report.
The following diagram illustrates the high-level workflow orchestrated by the run_full_analysis_pipeline function.
graph TD
A[Start] --> B(Task 1: Validate Inputs);
B --> C(Task 2: Filter Sample);
C --> D(Task 3: Clean Text);
subgraph SG_FE [Feature Engineering]
D --> E(Task 4: Vectorize BKMX);
D --> F(Task 5: Vectorize oLDA);
D --> G(Task 6: Vectorize Transformers);
end
subgraph SG_ME [Modeling & Evaluation]
G --> H(Task 7: Estimate Soft Scores);
H --> I(Task 8: Prepare Final Signals);
I --> J(Task 9: Baseline Regressions);
J --> K(Task 10: Combined Regressions);
J --> L(Task 11: SHAP Importance);
end
subgraph SG_I [Interpretability]
F --> M(Task 12: Label Topics);
M --> N(Task 13: Create Metatopics);
G --> O(Task 14: Extract Influential Tokens);
O & N --> P(Task 15: Classify Tokens);
H & N --> Q(Task 16: oLDA Analytics);
P --> R(Task 17: Token Polarity Analysis);
end
subgraph SG_S [Simulations]
I --> S(Task 18 & 19: Market Efficiency Sim);
I --> T(Task 20 & 21: Hacking Scenario Sim);
end
J & L & S & T --> U(Task 23: Final Validation);
C & G & I --> V(Task 22: Diagnostics);
V --> U;
U --> W[End: Final Report];
%% Style Definitions for Subgraphs
style SG_FE fill:#e6f2ff,stroke:#333,stroke-width:2px
style SG_ME fill:#d9ead3,stroke:#333,stroke-width:2px
style SG_I fill:#fff2cc,stroke:#333,stroke-width:2px
style SG_S fill:#f4cccc,stroke:#333,stroke-width:2px
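The dependency structure in the diagram can also be expressed as data. The sketch below transcribes the diagram's edges into a dictionary and uses the standard-library `graphlib` (Python 3.9+) to derive one valid execution order. The task labels (`T1`, `T18-19`, and so on) are shorthand introduced here; this is not how `run_full_analysis_pipeline` is implemented.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on, transcribed from the diagram.
DEPENDS_ON = {
    "T2": {"T1"},
    "T3": {"T2"},
    "T4": {"T3"}, "T5": {"T3"}, "T6": {"T3"},            # feature engineering
    "T7": {"T6"}, "T8": {"T7"}, "T9": {"T8"},            # modeling and evaluation
    "T10": {"T9"}, "T11": {"T9"},
    "T12": {"T5"}, "T13": {"T12"}, "T14": {"T6"},        # interpretability
    "T15": {"T14", "T13"}, "T16": {"T7", "T13"}, "T17": {"T15"},
    "T18-19": {"T8"}, "T20-21": {"T8"},                  # simulations
    "T22": {"T2", "T6", "T8"},                           # diagnostics
    "T23": {"T9", "T11", "T18-19", "T20-21", "T22"},     # final validation
}

# static_order() yields every task after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
```

Any order produced this way runs Task 1 first and never schedules a task before the tasks it consumes results from. Tasks with no mutual dependency, such as the three vectorization tasks, may appear in any relative order.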
- Python 3.9+
- An OpenAI API key set as an environment variable (OPENAI_API_KEY).
- Core dependencies: pandas, numpy, scikit-learn, statsmodels, torch, transformers, beautifulsoup4, lxml, pyyaml, linearmodels, shap, joblib, nltk, openai.
1. Clone the repository:
   git clone https://github.com/chirindaopensource/extracting_structure_press_releases_predicting_earnings_announcement_returns.git
   cd extracting_structure_press_releases_predicting_earnings_announcement_returns
2. Create and activate a virtual environment (recommended):
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
3. Install Python dependencies:
   pip install -r requirements.txt
4. Set up API keys: Follow the instructions in the notebook or documentation to set your OPENAI_API_KEY as an environment variable.
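Before launching the LLM-dependent tasks, it can help to fail fast if the key is missing. The helper below is a hypothetical convenience, not part of the notebook; it only reads the `OPENAI_API_KEY` variable named above.

```python
import os


def require_openai_key(env=os.environ):
    """Return the API key, or raise a clear error if it is unset or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the LLM tasks."
        )
    return key
```

Calling `require_openai_key()` at the top of a session surfaces a configuration problem immediately rather than partway through a long pipeline run.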
The pipeline requires a pandas.DataFrame with a specific, comprehensive schema containing over 70 columns of event, market, and text data. The exact schema is validated by the _validate_dataframe_schema function in the notebook. All other parameters are controlled by the config.yaml file.
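For a quick pre-flight check before handing data to the pipeline, a schema test along the following lines may be useful. It is a simplified stand-in, not the notebook's `_validate_dataframe_schema`. Only `permno` and `ann_trade_date` appear in this README; the dtype expectations attached to them here are assumptions, and the full schema has many more columns.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical two-column subset of the required schema, mapped to dtype predicates.
REQUIRED = {
    "permno": ptypes.is_integer_dtype,
    "ann_trade_date": ptypes.is_datetime64_any_dtype,
}


def schema_problems(df, required=REQUIRED):
    """Return a list of human-readable schema problems (empty if none found)."""
    problems = []
    for col, dtype_ok in required.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif not dtype_ok(df[col]):
            problems.append(f"unexpected dtype for {col}: {df[col].dtype}")
    return problems


good = pd.DataFrame(
    {"permno": [10001], "ann_trade_date": pd.to_datetime(["2015-01-29"])}
)
bad = pd.DataFrame({"ann_trade_date": ["2015-01-29"]})  # no permno, date stored as text
```

Collecting all problems in a list, rather than raising on the first one, lets a user fix every schema issue in a single pass.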
The extracting_structure_press_releases_predicting_earnings_announcement_returns_draft.ipynb notebook provides a complete, step-by-step guide. The primary workflow is to execute the final cell of the notebook, which calls the top-level run_full_analysis_pipeline orchestrator:
# Final cell of the notebook
from pathlib import Path

import pandas as pd
import yaml

# Define paths to the input data, the configuration file, and the artifact directory
DATA_PATH = Path('path/to/your/data.parquet')
CONFIG_PATH = Path('config.yaml')
ARTIFACTS_PATH = Path('./study_artifacts')

# Load data and configuration; keep an untouched copy of the raw events
events_df = pd.read_parquet(DATA_PATH)
raw_events_df = events_df.copy()
with open(CONFIG_PATH, 'r') as f:
    study_params = yaml.safe_load(f)

# Run the entire study
final_results = run_full_analysis_pipeline(
    events_df=events_df,
    raw_events_df=raw_events_df,
    study_params=study_params,
    artifacts_path=ARTIFACTS_PATH
)
# The `final_results` dictionary will contain all key outputs.

The run_full_analysis_pipeline function returns a comprehensive dictionary containing all major results. It also creates the directory given by artifacts_path, with the following structure for persisted models and embeddings:
study_artifacts/
│
├── token_embeddings/
│ ├── bert/
│ │ ├── event_id_1.pt
│ │ └── ...
│ ├── finbert/
│ └── mpnet/
│
└── lasso_models/
├── bkmx/
│ ├── 2005.pkl
│ └── ...
├── olda/
├── bert/
├── finbert/
└── mpnet/
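Given the layout above, persisted Lasso models can be located by feature set and year. The helpers below are hypothetical conveniences built only on the directory naming shown in the tree. They do not load the models, since the serialization details live in the notebook, and the demo writes empty placeholder files to a temporary directory.

```python
import tempfile
from pathlib import Path


def lasso_model_path(artifacts, feature_set, year):
    """Path of the persisted model for one feature set and training year."""
    return Path(artifacts) / "lasso_models" / feature_set / f"{year}.pkl"


def available_years(artifacts, feature_set):
    """Sorted training years for which a model file exists."""
    folder = Path(artifacts) / "lasso_models" / feature_set
    return sorted(int(p.stem) for p in folder.glob("*.pkl") if p.stem.isdigit())


# Demo against a throwaway directory containing empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for year in (2006, 2005):
        path = lasso_model_path(tmp, "bkmx", year)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    years_found = available_years(tmp, "bkmx")
```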
extracting_structure_press_releases_predicting_earnings_announcement_returns/
│
├── extracting_structure_press_releases_predicting_earnings_announcement_returns_draft.ipynb # Main implementation notebook
├── config.yaml # Master configuration file
├── requirements.txt # Python package dependencies
├── LICENSE # MIT license file
└── README.md # This documentation file
The pipeline is highly customizable via the config.yaml file. Users can easily modify all study parameters, including date ranges, model hyperparameters, LLM prompts, and simulation settings, without altering the core Python code.
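When running several variants of the study from one base configuration, a small override helper avoids editing config.yaml by hand each time. The sketch below deep-merges a dictionary of overrides into an already loaded configuration. The keys shown (`lasso`, `alpha`, `sample`, `start_year`) are hypothetical and may not match the actual config.yaml.

```python
import copy


def with_overrides(base, overrides):
    """Return a copy of `base` with `overrides` merged in, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical configuration keys, for illustration only.
study_params = {
    "lasso": {"alpha": 0.01, "max_iter": 1000},
    "sample": {"start_year": 2005},
}
variant = with_overrides(study_params, {"lasso": {"alpha": 0.05}})
```

The base dictionary is left unchanged, so the same loaded configuration can seed any number of variants.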
Contributions are welcome. Please fork the repository, create a feature branch, and submit a pull request with a clear description of your changes. Adherence to PEP 8, type hinting, and comprehensive docstrings is required.
Future extensions could include:
- Alternative Textual Features: Integrating other feature extraction methods like TF-IDF or different Transformer architectures.
- Non-Linear Models: Replacing the Lasso regression with more complex models like Gradient Boosting Machines or Neural Networks to capture non-linear relationships between text and returns.
- Expanded Interpretability: Using more advanced SHAP explainers (e.g., KernelExplainer) for non-linear models or other techniques like LIME.
- Dynamic Strategy Simulation: Extending the trading simulation to allow for dynamic position sizing or holding periods based on signal strength.
This project is licensed under the MIT License.
If you use this code or the methodology in your research, please cite the original paper:
@inproceedings{wu2025extracting,
author = {Wu, Yuntao and Akin, Ege Mert and Martineau, Charles and Gr\'{e}goire, Vincent and Veneris, Andreas},
title = {Extracting the Structure of Press Releases for Predicting Earnings Announcement Returns},
booktitle = {Proceedings of the 6th ACM International Conference on AI in Finance},
series = {ICAIF '25},
year = {2025},
publisher = {ACM},
note = {arXiv:2509.24254}
}
For the implementation itself, you may cite this repository:
Chirinda, C. (2025). A Professional-Grade Implementation of the "Extracting Structure of Press Releases" Framework.
GitHub repository: https://github.com/chirindaopensource/extracting_structure_press_releases_predicting_earnings_announcement_returns
- Credit to Yuntao Wu, Ege Mert Akin, Charles Martineau, Vincent Grégoire, and Andreas Veneris for the foundational research that forms the entire basis for this computational replication.
- This project is built upon the exceptional tools provided by the open-source community. Sincere thanks to the developers of the scientific Python ecosystem, including Pandas, NumPy, Scikit-learn, PyTorch, Hugging Face, Statsmodels, Linearmodels, SHAP, and Jupyter.
--
This README was generated based on the structure and content of the extracting_structure_press_releases_predicting_earnings_announcement_returns_draft.ipynb notebook and follows best practices for research software documentation.