This repository contains the code and data accompanying our IMC ’25 paper, An In-Depth Investigation of Data Collection in LLM App Ecosystems. These artifacts support GPT metadata crawling, data-collection analysis, and privacy-policy analysis for GPTs, including those with Actions.
To set up the environment, we suggest using Conda. See the Miniconda installation guide.
conda create -n langchain python=3.9
conda activate langchain
git clone https://github.com/llm-platform-security/gpt-data-exposure.git
cd gpt-data-exposure
pip install -r requirements.txt
Setup:
Open data_categorization/data_categorization.py and insert your OpenAI API key:
os.environ['OPENAI_API_KEY'] = '' # Add your API key here
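As an alternative to hard-coding the key into the script, you can export OPENAI_API_KEY in your shell and read it from the environment. This is a minimal sketch, not part of the shipped scripts; the helper name is ours:

```python
import os

def get_api_key() -> str:
    """Read the OpenAI API key from the environment instead of hard-coding it.

    Hypothetical helper: the repository's scripts set the key inline, but
    reading it from the environment keeps secrets out of the source tree.
    """
    key = os.environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running.")
    return key
```

With this in place, `export OPENAI_API_KEY=sk-...` before running any of the scripts below would suffice.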
Running:
cd data_categorization
python data_categorization.py
Output:
extracted_data_types.json in the same directory, containing each entry's assigned data type.
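To get a quick overview of the results, you can tally the assigned data types. This sketch assumes (the README does not specify the schema) that extracted_data_types.json maps each entry to its assigned data-type string:

```python
import json
from collections import Counter
from pathlib import Path

def summarize_data_types(path: str) -> Counter:
    """Tally assigned data types in extracted_data_types.json.

    Assumes the file is a flat JSON object mapping entry -> data-type string;
    check the actual output before relying on this.
    """
    entries = json.loads(Path(path).read_text())
    return Counter(entries.values())
```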
This script processes entries labeled Other by the standard categorizer and suggests sub-types.
Setup:
Open data_categorization/addressing_non_classifier_data_description.py and insert your OpenAI API key:
os.environ['OPENAI_API_KEY'] = '' # Add your API key here
Running:
cd data_categorization
python addressing_non_classifier_data_description.py
Output:
addressing_non_classifier_results.json in the same directory, containing expanded taxonomy decisions.
Setup:
Open privacy_policy_analysis/privacy_policy_analysis.py and insert your OpenAI API key:
os.environ['OPENAI_API_KEY'] = '' # Add your API key here
Running:
cd privacy_policy_analysis
python privacy_policy_analysis.py
Output:
- Structured JSON results in final_results/.
- Any failures logged to error_files/.
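To post-process the analysis, you can walk the result directory and parse each file. This is a sketch only; the per-file schema is not documented here, so the code just yields parsed JSON as-is:

```python
import json
from pathlib import Path

def load_policy_results(results_dir: str = "final_results"):
    """Yield (filename, parsed JSON) for each result file in final_results/.

    The structure of each result file is an assumption; this only walks the
    directory the script writes to and parses whatever JSON it finds.
    """
    for path in sorted(Path(results_dir).glob("*.json")):
        yield path.name, json.loads(path.read_text())
```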
This package provides both individual scraper modules and a metascraper to gather GPT URLs and metadata.
Before running any scraper, adjust credentials and settings in gpt_crawlers/config.py. Set values such as OPENAI_BEARER_TOKEN, SMTP/email settings, and logfile parameters.
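An illustrative sketch of what such a config might look like. Only OPENAI_BEARER_TOKEN is named in this README; every other field name below is a hypothetical placeholder, so check the shipped gpt_crawlers/config.py for the real ones:

```python
# gpt_crawlers/config.py -- illustrative sketch only, not the actual file.
import os

# Named in the README; reading it from the environment is our convention here.
OPENAI_BEARER_TOKEN = os.environ.get("OPENAI_BEARER_TOKEN", "")

# Hypothetical SMTP/email settings for crawler notifications.
SMTP_HOST = "smtp.example.com"
SMTP_PORT = 587
NOTIFY_EMAIL = "you@example.com"

# Hypothetical logfile parameters.
LOG_FILE = "metascraper.log"
LOG_LEVEL = "INFO"
```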
cd gpt_crawlers
python metascraper.py
To use scrapers defined in config.json:
python metascraper.py --use-json
- fallback_urls.json: A dump of all collected OpenAI chat URLs.
- gizmos_noref.json / gizmos_ref.json: Detailed GPT metadata without/with source references.
- replay_file.json: A map of failed URLs and their associated error reasons.
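For example, to see which URLs failed and why, you can load the replay file. This assumes, per the description above, that replay_file.json is a flat JSON object mapping URL to error reason:

```python
import json
from pathlib import Path

def failed_urls(replay_path: str = "replay_file.json") -> dict:
    """Return the URL -> error-reason map from replay_file.json.

    Assumes a flat JSON object, as the README's description ("map of failed
    URLs and associated error reasons") suggests.
    """
    return json.loads(Path(replay_path).read_text())
```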
Note: The GPT metadata collected as of April 26, 2024 is stored in ./gpt_crawlers/GPTs_4_26.7z.
We welcome contributions via pull requests. For issues or feature requests, open a GitHub issue. Feel free to reach out if you have questions or need guidance.
Yuhao Wu (Washington University in St. Louis)
Evin Jaff (Washington University in St. Louis)
Ke Yang (Washington University in St. Louis)
Ning Zhang (Washington University in St. Louis)
Umar Iqbal (Washington University in St. Louis)
@inproceedings{wu2025llm-data-collection,
author = {Yuhao Wu and Evin Jaff and Ke Yang and Ning Zhang and Umar Iqbal},
title = {An In-Depth Investigation of Data Collection in {LLM} App Ecosystems},
booktitle = {Proceedings of the 2025 ACM Internet Measurement Conference (IMC '25)},
year = {2025}
}