STAR-Website-Fingerprinting

English | 中文


The code and dataset for the paper STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting, accepted at the IEEE International Conference on Computer Communications (INFOCOM) 2026.

⚠️ For research purposes only. ⚠️

If you find this repository useful, please cite our paper:

@article{cheng2025star,
  title={STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting},
  author={Yifei Cheng and Yujia Zhu and Baiyang Li and Xinhao Deng and Yitong Cai and Yaochen Ren and Qingyun Liu},
  journal={arXiv preprint arXiv:2512.17667},
  year={2025}
}

The citation will be updated with the official IEEE INFOCOM version once it is published.

The processed dataset and pretrained checkpoints are publicly available via Zenodo (the link is given in the Reproducibility section below).


🚀 Key Idea and Findings

Problem

Modern HTTPS mechanisms (e.g., ECH and encrypted DNS) hide traditional identifiers such as SNI and DNS queries. However, existing website fingerprinting (WF) methods still rely on site-specific labeled traffic, which makes them:

  • expensive to deploy,
  • brittle to website evolution,
  • and incapable of recognizing previously unseen websites.

Key question:

Can we identify unseen websites from encrypted traffic without collecting any traffic from them?

Key Observation

We find that encrypted traffic is not arbitrary.

Even under full encryption, modern web protocols introduce structural semantic leakage that creates consistent alignment anchors between:

  • website-level semantic logic (e.g., URI length, resource size, protocol usage), and
  • encrypted traffic behavior (e.g., packet lengths, burst patterns, transport ratios).

We identify three intrinsic alignment anchors:

  • Request-side anchor:
    Request packet lengths correlate with Huffman-encoded URI lengths due to HTTP/2 and HTTP/3 header compression.

  • Response-side anchor:
    Aggregated response packet sizes reflect the total size of returned web resources.

  • Protocol anchor:
    HTTP/3 adoption is observable via UDP traffic ratios at the transport layer.
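
To make these anchors concrete, here is a minimal sketch of how the three features might be read off a parsed trace. The (direction, size, is_udp) tuple format and the function name are illustrative assumptions, not the repository's actual preprocessing interface.

def anchor_features(trace):
    """trace: list of (direction, size, is_udp) tuples, where direction is
    +1 for client-to-server packets and -1 for server-to-client packets."""
    # Request-side anchor: outgoing packet lengths track compressed URI lengths.
    request_lengths = [size for d, size, _ in trace if d > 0]
    # Response-side anchor: aggregated incoming bytes track total resource size.
    response_bytes = sum(size for d, size, _ in trace if d < 0)
    # Protocol anchor: the UDP share of packets reflects HTTP/3 (QUIC) adoption.
    udp_ratio = sum(1 for _, _, is_udp in trace if is_udp) / max(len(trace), 1)
    return request_lengths, response_bytes, udp_ratio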

Approach: STAR

Based on these anchors, we reformulate website fingerprinting as a zero-shot cross-modal retrieval problem.

STAR learns a shared embedding space between:

  • Logic modality: crawl-time semantic website profiles (resource-level structure), and
  • Traffic modality: encrypted packet-level traces.

A dual-encoder architecture aligns the two modalities using contrastive learning, enabling encrypted traffic traces to retrieve their most semantically aligned website profiles — without requiring any traffic from target websites during training.
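
For intuition, the alignment objective can be sketched as a symmetric InfoNCE (CLIP-style) contrastive loss over paired logic/traffic embeddings, as below. The encoder interfaces and the temperature value are assumptions made for illustration; pretrain.py contains the actual training objective.

import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(logic_emb, traffic_emb, temperature=0.07):
    """logic_emb, traffic_emb: (batch, dim) outputs of the two encoders,
    where row i of each tensor corresponds to the same website."""
    logic_emb = F.normalize(logic_emb, dim=-1)
    traffic_emb = F.normalize(traffic_emb, dim=-1)
    logits = logic_emb @ traffic_emb.t() / temperature  # pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Pull matched pairs together and push mismatched pairs apart, in both
    # retrieval directions (logic-to-traffic and traffic-to-logic).
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2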

Main Results

  • Zero-shot closed-world classification: 87.9% Top-1 accuracy over 1,600 unseen websites
  • Open-world detection: AUC = 0.963, outperforming supervised and few-shot baselines
  • Few-shot adaptation: with only 4 labeled traces per site, Top-5 accuracy reaches 98.8%

These results demonstrate that semantic leakage, rather than header visibility, is now the dominant privacy risk in encrypted HTTPS traffic.


👉 Reproducibility

This section provides step-by-step instructions to reproduce the main experimental results reported in the paper.

1. Environment Setup

All experiments are implemented in Python.
Please first install the required dependencies listed in requirements.txt.

pip install -r requirements.txt

We recommend using a dedicated virtual environment (e.g., venv or conda) to avoid dependency conflicts.

2. Dataset and Pretrained Model

We provide the processed dataset and pretrained model checkpoints required for reproduction via a publicly accessible Zenodo repository.

Required Files and Directory Structure

Please organize the downloaded files as follows:

STAR/
├── STAR_dataset/
│   ├── (processed dataset files)
│   └── .gitkeep
├── STAR_model_pt/
│   ├── best_STAR_model.pt
│   └── .gitkeep

Pretrained Model

  • Download best_STAR_model.pt
  • Place it at:
    STAR_model_pt/best_STAR_model.pt

🔗 Zenodo link: https://doi.org/10.5281/zenodo.17060855

Notes on Data Availability

The dataset released in this repository is preprocessed according to the input format required by STAR, as described in the paper.

The raw data used in this work includes:

  • over 170,000 website visits,
  • more than 100 GB of raw traffic traces (PCAP format),
  • and the corresponding logic-side crawl logs.

It is not publicly hosted due to storage and distribution constraints. If access to the raw data is required for research purposes, please contact:

📧 chengyifei@iie.ac.cn

3. Running Experiments

All experiment scripts are located in the project root directory:

STAR/
├── cw_zero_shot.py
├── cw_linear_probe.py
├── cw_tip_adapter.py
├── ow_zero_shot.py
├── pretrain.py
├── logic_encoder_8d.py
├── traffic_encoder_3d.py

We categorize experiments by filename prefixes.

3.1 Closed-World Experiments (cw_*.py)

Scripts with the prefix cw_ correspond to closed-world evaluation, including:

  • Zero-shot classification

    python cw_zero_shot.py
  • Few-shot adaptation

    • Linear probing

      python cw_linear_probe.py
    • Tip-Adapter-style adaptation

      python cw_tip_adapter.py

These scripts reproduce the closed-world results reported in the paper.
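
For reference, few-shot linear probing amounts to fitting a light classifier on frozen traffic-encoder embeddings. The sketch below uses scikit-learn with random stand-in data (10 sites, 4 shots each, matching the paper's 4-shot setting) purely for illustration; cw_linear_probe.py is the authoritative implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for frozen traffic-encoder embeddings:
# 10 sites with 4 labeled traces each, plus a held-out test set.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40, 128)), np.repeat(np.arange(10), 4)
X_test, y_test = rng.normal(size=(100, 128)), rng.integers(0, 10, size=100)

# Linear probe: a simple classifier on top of frozen features.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("Top-1 accuracy:", probe.score(X_test, y_test))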

3.2 Open-World Experiments (ow_*.py)

Scripts with the prefix ow_ correspond to open-world evaluation, including rejection of unmonitored websites.

python ow_zero_shot.py
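
Open-world rejection reduces to thresholding a retrieval score: a trace whose best similarity to every monitored-site profile is low is rejected as unmonitored. The sketch below, with random stand-in embeddings and hypothetical variable names, shows how such scores and the resulting AUC could be computed; ow_zero_shot.py implements the actual evaluation.

import numpy as np
from sklearn.metrics import roc_auc_score

def open_world_scores(traffic_emb, profile_emb):
    # Score each trace by its best cosine similarity to any monitored-site
    # profile; low scores suggest an unmonitored website.
    return (traffic_emb @ profile_emb.T).max(axis=1)

# Illustrative stand-ins: L2-normalized embeddings and ground-truth labels
# (1 = monitored, 0 = unmonitored).
rng = np.random.default_rng(0)
normalize = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
traffic_emb = normalize(rng.normal(size=(200, 128)))
profile_emb = normalize(rng.normal(size=(50, 128)))
labels = rng.integers(0, 2, size=200)
print("AUC:", roc_auc_score(labels, open_world_scores(traffic_emb, profile_emb)))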

4. Model Pretraining (Optional)

Users may also choose to pretrain the STAR model from scratch using the provided training script:

python pretrain.py

Training Configuration

  • Training follows the data scale and optimization strategy described in the paper.

  • Default setting:

    • 200 epochs

    • Approximately 4 hours using 5 NVIDIA A100 GPUs with data parallelism.

⚠️ Pretraining is computationally expensive and not required for reproducing the main results, as pretrained checkpoints are provided.
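
For orientation, single-node data parallelism in PyTorch can be as simple as the generic sketch below. The placeholder module and device handling are illustrative only; pretrain.py defines the actual multi-GPU setup.

import torch
import torch.nn as nn

model = nn.Linear(128, 128)  # placeholder standing in for the STAR dual encoder

# Replicate the model across all visible GPUs and split each batch among them.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")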

5. Additional Notes

  • All random seeds are fixed by default for reproducibility (see the seed-fixing sketch below).

  • GPU acceleration is recommended for both pretraining and evaluation.
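
For completeness, fixing the common sources of randomness typically looks like the sketch below; the released scripts already set their own seeds, so this is illustrative only.

import random
import numpy as np
import torch

def set_seed(seed=42):
    # Fix every common source of randomness for reproducible runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)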

If you encounter any issues during reproduction, feel free to open an issue or contact the authors.


📌 License

This project is released under the Apache License 2.0.
