STAR-Website-Fingerprinting

English | 中文


The code and dataset for the paper STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting, accepted at the IEEE International Conference on Computer Communications (INFOCOM) 2026.

⚠️ For research purposes only. ⚠️

If you find this repository useful, please cite our paper:

@article{cheng2025star,
  title={STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting},
  author={Yifei Cheng and Yujia Zhu and Baiyang Li and Xinhao Deng and Yitong Cai and Yaochen Ren and Qingyun Liu},
  journal={arXiv preprint arXiv:2512.17667},
  year={2025}
}

The citation will be updated with the official IEEE INFOCOM version once it is published.

The processed dataset and pretrained checkpoints are publicly available via Zenodo (the link is given in the Reproducibility section below).


🚀 Key Idea and Findings

Problem

Modern HTTPS mechanisms (e.g., ECH and encrypted DNS) hide traditional identifiers such as SNI and DNS queries. However, existing website fingerprinting (WF) methods still rely on site-specific labeled traffic, which makes them:

  • expensive to deploy,
  • brittle to website evolution,
  • and incapable of recognizing previously unseen websites.

Key question:

Can we identify unseen websites from encrypted traffic without collecting any traffic from them?

Key Observation

We find that encrypted traffic is not arbitrary.

Even under full encryption, modern web protocols introduce structural semantic leakage that creates consistent alignment anchors between:

  • website-level semantic logic (e.g., URI length, resource size, protocol usage), and
  • encrypted traffic behavior (e.g., packet lengths, burst patterns, transport ratios).

We identify three intrinsic alignment anchors:

  • Request-side anchor:
    Request packet lengths correlate with Huffman-encoded URI lengths due to HTTP/2 and HTTP/3 header compression.

  • Response-side anchor:
    Aggregated response packet sizes reflect the total size of returned web resources.

  • Protocol anchor:
    HTTP/3 adoption is observable via UDP traffic ratios at the transport layer.
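
To make these anchors concrete, here is a minimal sketch of how the three features might be read off a parsed trace. The (direction, size, is_udp) tuple format and the function name are illustrative assumptions, not the repository's actual preprocessing interface.

def anchor_features(trace):
    """trace: list of (direction, size, is_udp) tuples, where direction is
    +1 for client-to-server packets and -1 for server-to-client packets."""
    # Request-side anchor: outgoing packet lengths track compressed URI lengths.
    request_lengths = [size for d, size, _ in trace if d > 0]
    # Response-side anchor: aggregated incoming bytes track total resource size.
    response_bytes = sum(size for d, size, _ in trace if d < 0)
    # Protocol anchor: the UDP share of packets reflects HTTP/3 (QUIC) adoption.
    udp_ratio = sum(1 for _, _, is_udp in trace if is_udp) / max(len(trace), 1)
    return request_lengths, response_bytes, udp_ratio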

Approach: STAR

Based on these anchors, we reformulate website fingerprinting as a zero-shot cross-modal retrieval problem.

STAR learns a shared embedding space between:

  • Logic modality: crawl-time semantic website profiles (resource-level structure), and
  • Traffic modality: encrypted packet-level traces.

A dual-encoder architecture aligns the two modalities using contrastive learning, enabling encrypted traffic traces to retrieve their most semantically aligned website profiles — without requiring any traffic from target websites during training.
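
For intuition, the alignment objective can be sketched as a symmetric InfoNCE (CLIP-style) contrastive loss over paired logic/traffic embeddings, as below. The encoder interfaces and the temperature value are assumptions made for illustration; pretrain.py contains the actual training objective.

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(logic_emb, traffic_emb, temperature=0.07):
    """logic_emb, traffic_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor corresponds to the same website."""
    logic_emb = F.normalize(logic_emb, dim=-1)
    traffic_emb = F.normalize(traffic_emb, dim=-1)
    logits = logic_emb @ traffic_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together and push mismatched pairs apart, in both
    # retrieval directions (logic-to-traffic and traffic-to-logic).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2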

Main Results

  • Zero-shot closed-world classification: 87.9% Top-1 accuracy over 1,600 unseen websites
  • Open-world detection: AUC = 0.963, outperforming supervised and few-shot baselines
  • Few-shot adaptation: with only 4 labeled traces per site, Top-5 accuracy reaches 98.8%

These results demonstrate that semantic leakage, rather than header visibility, is now the dominant privacy risk in encrypted HTTPS traffic.


👉 Reproducibility

This section provides step-by-step instructions to reproduce the main experimental results reported in the paper.

1. Environment Setup

All experiments are implemented in Python.
Please first install the required dependencies listed in requirements.txt.

pip install -r requirements.txt

We recommend using a dedicated virtual environment (e.g., venv or conda) to avoid dependency conflicts.

2. Dataset and Pretrained Model

We provide the processed dataset and pretrained model checkpoints required for reproduction via a publicly accessible Zenodo repository.

Required Files and Directory Structure

Please organize the downloaded files as follows:

STAR/
├── STAR_dataset/
│   ├── (processed dataset files)
│   └── .gitkeep
├── STAR_model_pt/
│   ├── best_STAR_model.pt
│   └── .gitkeep

Pretrained Model

  • Download best_STAR_model.pt
  • Place it at:
    STAR_model_pt/best_STAR_model.pt

🔗 Zenodo link: https://doi.org/10.5281/zenodo.17060855

Notes on Data Availability

The dataset released in this repository is preprocessed according to the input format required by STAR, as described in the paper.

The raw data used in this work includes:

  • over 170,000 website visits,
  • more than 100 GB of raw traffic traces (PCAP format),
  • and the corresponding logic-side crawl logs.

It is not publicly hosted due to storage and distribution constraints. If access to the raw data is required for research purposes, please contact:

📧 chengyifei@iie.ac.cn

3. Running Experiments

All experiment scripts are located in the project root directory:

STAR/
├── cw_zero_shot.py
├── cw_linear_probe.py
├── cw_tip_adapter.py
├── ow_zero_shot.py
├── pretrain.py
├── logic_encoder_8d.py
├── traffic_encoder_3d.py

We categorize experiments by filename prefixes.

3.1 Closed-World Experiments (cw_*.py)

Scripts with the prefix cw_ correspond to closed-world evaluation, including:

  • Zero-shot classification

    python cw_zero_shot.py
  • Few-shot adaptation

    • Linear probing

      python cw_linear_probe.py
    • Tip-Adapter-style adaptation

      python cw_tip_adapter.py

These scripts reproduce the closed-world results reported in the paper.
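
For reference, few-shot linear probing amounts to fitting a light classifier on frozen traffic-encoder embeddings. The sketch below uses scikit-learn with random stand-in data (10 sites, 4 shots each, matching the paper's 4-shot setting) purely for illustration; cw_linear_probe.py is the authoritative implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for frozen traffic-encoder embeddings:
# 10 sites with 4 labeled traces each, plus a held-out test set.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 128)), np.repeat(np.arange(10), 4)
X_test, y_test = rng.normal(size=(100, 128)), rng.integers(0, 10, size=100)

# Linear probe: a simple classifier on top of frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("Top-1 accuracy:", probe.score(X_test, y_test))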

3.2 Open-World Experiments (ow_*.py)

Scripts with the prefix ow_ correspond to open-world evaluation, including rejection of unmonitored websites.

python ow_zero_shot.py
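
Open-world rejection reduces to thresholding a retrieval score: a trace whose best similarity to every monitored-site profile is low is rejected as unmonitored. The sketch below, with random stand-in embeddings and hypothetical variable names, shows how such scores and the resulting AUC could be computed; ow_zero_shot.py implements the actual evaluation.

import numpy as np
from sklearn.metrics import roc_auc_score

def open_world_scores(traffic_emb, profile_emb):
    # Score each trace by its best cosine similarity to any monitored-site
    # profile; low scores suggest an unmonitored website.
    return (traffic_emb @ profile_emb.T).max(axis=1)

# Illustrative stand-ins: L2-normalized embeddings and ground-truth labels
# (1 = monitored, 0 = unmonitored).
rng = np.random.default_rng(0)
normalize = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
traffic_emb = normalize(rng.normal(size=(200, 128)))
profile_emb = normalize(rng.normal(size=(50, 128)))
labels = rng.integers(0, 2, size=200)
print("AUC:", roc_auc_score(labels, open_world_scores(traffic_emb, profile_emb)))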

4. Model Pretraining (Optional)

Users may also choose to pretrain the STAR model from scratch using the provided training script:

python pretrain.py

Training Configuration

  • Training follows the data scale and optimization strategy described in the paper.

  • Default setting:

    • 200 epochs

    • Approximately 4 hours using 5 NVIDIA A100 GPUs with data parallelism.

⚠️ Pretraining is computationally expensive and not required for reproducing the main results, as pretrained checkpoints are provided.
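
For orientation, single-node data parallelism in PyTorch can be as simple as the generic sketch below. The placeholder module and device handling are illustrative only; pretrain.py defines the actual multi-GPU setup.

import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # placeholder standing in for the STAR dual encoder

# Replicate the model across all visible GPUs and split each batch among them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")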

5. Additional Notes

  • All random seeds are fixed by default for reproducibility (see the seed-fixing sketch below).

  • GPU acceleration is recommended for both pretraining and evaluation.
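
For completeness, fixing the common sources of randomness typically looks like the sketch below; the released scripts already set their own seeds, so this is illustrative only.

import random
import numpy as np
import torch

def set_seed(seed=42):
    # Fix every common source of randomness for reproducible runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)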

If you encounter any issues during reproduction, feel free to open an issue or contact the authors.


📌 License

This project is released under the Apache License 2.0.
