
Data Collection in LLM App Ecosystems

This repository contains the code and data accompanying our IMC ’25 paper, An In-Depth Investigation of Data Collection in LLM App Ecosystems. These artifacts support GPT metadata crawling, data-collection analysis, and privacy-policy analysis for GPTs, including those with Actions.

Table of Contents

  • Installation
  • Data Categorization
  • Privacy Policy Analysis
  • GPT Crawlers
  • Contribution and Support
  • Research Team
  • Citation

Installation

To set up the environment, we suggest using Conda. See the Miniconda installation guide.

conda create -n langchain python=3.9
conda activate langchain
git clone https://github.com/llm-platform-security/gpt-data-exposure.git
cd gpt-data-exposure
pip install -r requirements.txt

Data Categorization

Standard Data Categorization

Setup:
Open data_categorization/data_categorization.py and insert your OpenAI API key:

os.environ['OPENAI_API_KEY'] = ''  # Add your API key here

Running:

cd data_categorization
python data_categorization.py

Output:

  • extracted_data_types.json in the same directory, containing each entry’s assigned data type.
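To get a quick feel for the results, you can tally how many entries fall under each assigned data type. The snippet below is a minimal sketch that assumes extracted_data_types.json is either a mapping from entry to label or a list of records with a "data_type" field; adjust the key names to the structure the script actually produces.

import json
from collections import Counter

# Load the categorization output (schema assumed; see note above).
with open('extracted_data_types.json') as f:
    results = json.load(f)

# Handle either a flat {entry: "Data Type"} mapping or a list of records
# with a "data_type" key (both layouts are assumptions).
if isinstance(results, dict):
    labels = list(results.values())
else:
    labels = [r.get('data_type', 'Unknown') for r in results]

# Print a count per data type, most common first.
for label, count in Counter(map(str, labels)).most_common():
    print(f'{label}: {count}')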

Non-Classifier Data Handling

This script processes entries labeled Other by the standard categorizer and suggests sub-types.

Setup:
Open data_categorization/addressing_non_classifier_data_description.py and insert your OpenAI API key:

os.environ['OPENAI_API_KEY'] = ''  # Add your API key here

Running:

cd data_categorization
python addressing_non_classifier_data_description.py

Output:

  • addressing_non_classifier_results.json in the same directory, containing expanded taxonomy decisions.
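If you want to fold the suggested sub-types back into the standard categorization, a sketch like the one below can help. It assumes both files are mappings from entry to label, which is an assumption about the output schema; adapt the keys to whatever the two scripts actually emit.

import json

# Combine the two outputs: replace entries labeled "Other" in the standard
# categorization with the sub-type suggested by this script.
# Both schemas ({entry: label} mappings) are assumptions.
with open('extracted_data_types.json') as f:
    base = json.load(f)
with open('addressing_non_classifier_results.json') as f:
    subtypes = json.load(f)

merged = {
    entry: subtypes.get(entry, label) if label == 'Other' else label
    for entry, label in base.items()
}

# Write the merged taxonomy to a new file (name chosen for illustration).
with open('merged_data_types.json', 'w') as f:
    json.dump(merged, f, indent=2)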

Privacy Policy Analysis

Setup:
Open privacy_policy_analysis/privacy_policy_analysis.py and insert your OpenAI API key:

os.environ['OPENAI_API_KEY'] = ''  # Add your API key here

Running:

cd privacy_policy_analysis
python privacy_policy_analysis.py

Output:

  • Structured JSON results in final_results/.
  • Any failures logged to error_files/.
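To walk the results and see which policies need a re-run, something along these lines works; it assumes one JSON file per analyzed policy in final_results/, which is an assumption about how the script names and lays out its output.

import json
from pathlib import Path

# Iterate over the structured results (assumed: one JSON file per policy).
for path in sorted(Path('final_results').glob('*.json')):
    with open(path) as f:
        result = json.load(f)
    print(path.name, '->', len(result) if hasattr(result, '__len__') else result)

# List the failures logged to error_files/ so they can be re-processed.
failures = sorted(Path('error_files').glob('*'))
print(f'{len(failures)} failed file(s):', [p.name for p in failures])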

GPT Crawlers

This package provides both individual scraper modules and a metascraper to gather GPT URLs and metadata.

Configuration

Before running any scraper, adjust credentials and settings in:

gpt_crawlers/config.py

Set values such as OPENAI_BEARER_TOKEN, SMTP/email settings, and logfile parameters.
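The authoritative field names are the ones in the shipped config.py; as a rough orientation, the settings look something like the sketch below. Only OPENAI_BEARER_TOKEN is named in this README, so the remaining variable names are placeholders; edit the actual fields in the file rather than copying these.

# gpt_crawlers/config.py -- illustrative values only; variable names other
# than OPENAI_BEARER_TOKEN are placeholders, not the repository's real fields.
OPENAI_BEARER_TOKEN = 'sk-...'       # bearer token used for OpenAI requests

# Email notification settings (assumed SMTP fields)
SMTP_SERVER = 'smtp.example.com'
SMTP_PORT = 587
EMAIL_SENDER = 'crawler@example.com'
EMAIL_RECIPIENT = 'you@example.com'

# Logging parameters (assumed names)
LOG_FILE = 'metascraper.log'
LOG_LEVEL = 'INFO'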

Running the GPT Scrapers

cd gpt_crawlers
python metascraper.py

To use scrapers defined in config.json:

python metascraper.py --use-json

Output Files

  • fallback_urls.json: A dump of all collected OpenAI chat URLs.
  • gizmos_noref.json / gizmos_ref.json: Detailed GPT metadata, without and with source references respectively.
  • replay_file.json: Map of failed URLs and associated error reasons.

Note: The GPT metadata collected as of April 26, 2024 is stored in ./gpt_crawlers/GPTs_4_26.7z. It can be unpacked with any 7-Zip-compatible tool, e.g. 7z x GPTs_4_26.7z.
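To inspect the crawl output, or to retry URLs recorded in replay_file.json, a sketch like the following can help. The file layouts (a JSON array of metadata records, and a mapping from failed URL to error reason) are assumptions, so adapt the code to the actual contents.

import json

# Peek at the collected GPT metadata (assumed: a JSON array of records).
with open('gizmos_noref.json') as f:
    gizmos = json.load(f)
print(f'{len(gizmos)} GPT metadata record(s) collected')

# Gather the failed URLs from the replay map (assumed: {url: error_reason})
# so they can be fed into another crawl pass.
with open('replay_file.json') as f:
    replay = json.load(f)
failed_urls = list(replay) if isinstance(replay, dict) else replay
print(f'{len(failed_urls)} URL(s) to retry')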

Contribution and Support

We welcome contributions via pull requests. For issues or feature requests, open a GitHub issue. Feel free to reach out if you have questions or need guidance.

Research Team

Yuhao Wu (Washington University in St. Louis)
Evin Jaff (Washington University in St. Louis)
Ke Yang (Washington University in St. Louis)
Ning Zhang (Washington University in St. Louis)
Umar Iqbal (Washington University in St. Louis)

Citation

@inproceedings{wu2025llm-data-collection,
  author    = {Yuhao Wu and Evin Jaff and Ke Yang and Ning Zhang and Umar Iqbal},
  title     = {An In-Depth Investigation of Data Collection in {LLM} App Ecosystems},
  booktitle = {Proceedings of the 2025 ACM Internet Measurement Conference (IMC '25)},
  year      = {2025}
}
