
My Bachelor Thesis about using LLMs in steganography and assessing their security claims with additional experiments


xsenyaaax/SteganographyWithLLM


Steganography in text generated by autoregressive models

About

This repository includes the source code and additional files for my Bachelor Thesis at CTU FIT.

My thesis extends the experiments of known steganographic approaches that use LLMs to hide messages. Specifically, it investigates how different samplers, temperatures, and other factors affect the security of those algorithms.

This is why I created a pluggable app with a unified interface for steganographic algorithms, LLMs, and samplers. I can easily swap out different LLMs, samplers, and steganographic algorithms, and then, for example, generate datasets to assess security using machine learning models.

This app, specifically the steganographic algorithms, builds upon the source code of Arithmetic Coding, Meteor, and Discop. Each algorithm uses GPT-2, publicly available through the Hugging Face library.

Also, during my experiments I created several datasets of randomly sampled text or stegotext (generated sequences with hidden messages). These are located in the experiments folder; each subfolder contains the datasets used in that particular experiment. For example, experiment1: Distinguishing Text by Sampling Methods contains randomly sampled sequences for different samplers (Top P, Top K, etc.). Further information about each experiment can be found in the Measuring the Security section of my thesis.

Datasets

Each dataset name follows the same naming convention: {llm_model_name}-{omitted token indices}-{sampler}-{sampler parameter}-{steganographic algorithm}-temp-{temperature parameter}

There are two types of files with that name: a CSV file and a TXT file.

  • The CSV file includes the columns: algorithm name and individual tokens (in this case token1...token50).
  • The TXT file includes the decoded tokens (generated text) for each row of the CSV file.
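Since each TXT line corresponds to a CSV row, the two files can be zipped together; below is a minimal sketch using only the standard library (the miniature file contents are hypothetical, and real files have 50 token columns):

```python
import csv
import io

# Hypothetical miniature versions of a CSV/TXT dataset pair.
csv_text = "algorithm,token1,token2,token3\nmeteor,464,318,262\n"
txt_text = "The quick fox\n"

# Row i of the CSV corresponds to line i of the TXT file.
rows = list(csv.DictReader(io.StringIO(csv_text)))
texts = txt_text.splitlines()

for row, text in zip(rows, texts):
    print(row["algorithm"], "->", text)
```

In the real datasets, `io.StringIO(...)` would simply be replaced by `open(...)` on the matching `.csv` and `.txt` files.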

Additional notes:

  • If no steganographic algorithm is used, that part of the name is omitted entirely.
  • If no tokens are omitted, the placeholder [] is used to represent an empty list.
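As an illustration, a name following this convention can be split back into its parts. The example name below is hypothetical, and the parsing accounts for the optional steganographic-algorithm part:

```python
# Hypothetical dataset base name following the convention
# {llm}-{omitted tokens}-{sampler}-{sampler param}-{algorithm}-temp-{temperature}
name = "gpt2-[]-top_p-0.9-meteor-temp-1.0"

# The first four fields never contain hyphens, so a bounded split suffices.
llm, omitted, sampler, sampler_param, rest = name.split("-", 4)

if "-temp-" in rest:
    algorithm, temperature = rest.split("-temp-")
else:
    # No steganographic algorithm in the name (random sampling dataset).
    algorithm, temperature = None, rest.removeprefix("temp-")

print(llm, omitted, sampler, sampler_param, algorithm, temperature)
```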

Start app

You need to be in the steganography_llm directory. Create a Python virtual environment: python -m venv .venv. Activate it: source .venv/bin/activate. Then:

  1. pip install -r requirements.txt
  2. Modify the necessary parameters at the end of the relevant script, then run:
    • python scripts/generate_random_sampling_datasets_script.py to generate random sampling datasets (no hidden messages) with different parameters
    • python scripts/generate_steganography_datasets_script.py to generate stego datasets (with random hidden messages) with different parameters
    • python scripts/train_models_script.py to train classification models on pairs of random sampling and stego datasets
    • python app.py to try the app itself with different messages and see the encoded result

Source Code Structure

.
├── abstract_classes  # abstract class definitions for a unified interface
├── plugins # Pluggable components
│   ├── models # different LLMs (e.g. GPT-2, Llama)
│   ├── samplers # different samplers (Top P, Top K etc.)
│   └── steganography_algorithms # currently only 3 (Arithmetic, Meteor, Discop)
├── app.py # interface for encoding and decoding with different parameters
├── config.py # Configuration settings
├── scripts
│   ├── generate_random_sampling_datasets_script.py # Script to create different random sampling datasets with 1 command
│   ├── generate_steganography_datasets_script.py # Script to create different steganographic datasets with 1 command
│   └── train_models_script.py # Script to train Random Forest, GBDT on those created datasets
└── utils.py # Helper functions
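The pluggable design implies that each component type implements a shared abstract interface from abstract_classes. The following is only a hypothetical sketch of what such interfaces might look like; the names and signatures are mine, not the repository's:

```python
from abc import ABC, abstractmethod


class Sampler(ABC):
    """Hypothetical unified sampler interface (e.g. Top-P, Top-K)."""

    @abstractmethod
    def filter_logits(self, logits):
        """Return logits truncated/reweighted according to the sampling rule."""


class SteganographyAlgorithm(ABC):
    """Hypothetical unified interface for Arithmetic, Meteor, Discop."""

    @abstractmethod
    def encode(self, message: bytes, model, sampler: Sampler) -> str:
        """Hide `message` in text generated by `model` under `sampler`."""

    @abstractmethod
    def decode(self, stegotext: str, model, sampler: Sampler) -> bytes:
        """Recover the hidden message from `stegotext`."""
```

With interfaces like these, swapping a component means registering a new concrete subclass in plugins, which is the design the About section describes.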
