The code and dataset for the paper *STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting*, accepted at the IEEE International Conference on Computer Communications (INFOCOM) 2026.
If you find this repository useful, please cite our paper:
```bibtex
@article{cheng2025star,
  title={STAR: Semantic-Traffic Alignment and Retrieval for Zero-Shot HTTPS Website Fingerprinting},
  author={Yifei Cheng and Yujia Zhu and Baiyang Li and Xinhao Deng and Yitong Cai and Yaochen Ren and Qingyun Liu},
  journal={arXiv preprint arXiv:2512.17667},
  year={2025}
}
```

The official IEEE INFOCOM version will be updated once published.
The processed dataset and pretrained checkpoints are publicly available via Zenodo.
Modern HTTPS mechanisms (e.g., ECH and encrypted DNS) hide traditional identifiers such as SNI and DNS queries. However, existing website fingerprinting (WF) methods still rely on site-specific labeled traffic, which makes them:
- expensive to deploy,
- brittle to website evolution,
- and incapable of recognizing previously unseen websites.
Key question:
Can we identify unseen websites from encrypted traffic without collecting any traffic from them?
We find that encrypted traffic is not arbitrary.
Even under full encryption, modern web protocols introduce structural semantic leakage that creates consistent alignment anchors between:
- website-level semantic logic (e.g., URI length, resource size, protocol usage), and
- encrypted traffic behavior (e.g., packet lengths, burst patterns, transport ratios).
We identify three intrinsic alignment anchors:

- **Request-side anchor:** request packet lengths correlate with Huffman-encoded URI lengths due to HTTP/2 and HTTP/3 header compression.
- **Response-side anchor:** aggregated response packet sizes reflect the total size of returned web resources.
- **Protocol anchor:** HTTP/3 adoption is observable via UDP traffic ratios at the transport layer.
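To make the anchors concrete, the sketch below extracts the three statistics from a toy packet trace. The `Packet` representation and field names are assumptions for illustration; they are not the repository's actual preprocessing format.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    outgoing: bool   # True = client -> server (request direction)
    size: int        # packet payload length in bytes
    is_udp: bool     # True for UDP/QUIC (HTTP/3) traffic

def anchor_features(trace):
    """Extract the three alignment-anchor statistics from a packet trace.

    Illustrative sketch only: the trace format is an assumption, not the
    repository's preprocessing code.
    """
    # Request-side anchor: outgoing packet lengths (tracks compressed URI lengths)
    req_lengths = [p.size for p in trace if p.outgoing]
    # Response-side anchor: aggregated incoming bytes (tracks total resource size)
    resp_total = sum(p.size for p in trace if not p.outgoing)
    # Protocol anchor: share of bytes carried over UDP (tracks HTTP/3 adoption)
    total = sum(p.size for p in trace)
    udp_ratio = sum(p.size for p in trace if p.is_udp) / max(total, 1)
    return req_lengths, resp_total, udp_ratio

trace = [Packet(True, 120, True), Packet(False, 1400, True), Packet(False, 900, False)]
print(anchor_features(trace))  # prints the three anchor statistics for this toy trace
```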
Based on these anchors, we reformulate website fingerprinting as a zero-shot cross-modal retrieval problem.
STAR learns a shared embedding space between:
- Logic modality: crawl-time semantic website profiles (resource-level structure), and
- Traffic modality: encrypted packet-level traces.
A dual-encoder architecture aligns the two modalities using contrastive learning, enabling encrypted traffic traces to retrieve their most semantically aligned website profiles — without requiring any traffic from target websites during training.
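As a sketch of what retrieval over the shared space looks like, the toy example below scores traffic embeddings against website-profile embeddings by cosine similarity and returns the top-matching profiles. The arrays and function names are illustrative assumptions, not the repository's API; in practice the embeddings would come from the trained traffic and logic encoders.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize rows to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def retrieve(traffic_emb, profile_embs, top_k=1):
    """Return indices of the top-k website profiles for each traffic trace,
    ranked by cosine similarity in the shared embedding space."""
    sims = l2_normalize(traffic_emb) @ l2_normalize(profile_embs).T
    return np.argsort(-sims, axis=1)[:, :top_k]

profiles = np.array([[1.0, 0.0], [0.0, 1.0]])   # two website-profile embeddings
traces = np.array([[0.9, 0.1], [0.2, 0.8]])     # two encrypted-trace embeddings
print(retrieve(traces, profiles))               # each trace retrieves its nearest profile
```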
- **Zero-shot closed-world classification:** 87.9% Top-1 accuracy over 1,600 unseen websites
- **Open-world detection:** AUC = 0.963, outperforming supervised and few-shot baselines
- **Few-shot adaptation:** with only 4 labeled traces per site, Top-5 accuracy reaches 98.8%
These results demonstrate that semantic leakage, rather than header visibility, is now the dominant privacy risk in encrypted HTTPS traffic.
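The open-world setting can be illustrated in the same retrieval framing: a trace is rejected as unmonitored when even its best profile match scores below a similarity threshold. The function and threshold below are hypothetical placeholders, not values or code from the paper.

```python
import numpy as np

def open_world_reject(traffic_emb, profile_embs, threshold=0.5):
    """Flag traces as 'unmonitored' when their best cosine similarity to
    any known website profile falls below a threshold.

    Illustrative sketch: the threshold and embeddings are placeholders."""
    a = traffic_emb / np.linalg.norm(traffic_emb, axis=1, keepdims=True)
    b = profile_embs / np.linalg.norm(profile_embs, axis=1, keepdims=True)
    best = (a @ b.T).max(axis=1)   # best-matching profile score per trace
    return best < threshold        # True = reject as unmonitored
```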
This section provides step-by-step instructions to reproduce the main experimental results reported in the paper.
All experiments are implemented in Python.
Please first install the required dependencies listed in `requirements.txt`:

```bash
pip install -r requirements.txt
```

We recommend using a dedicated virtual environment (e.g., `venv` or `conda`) to avoid dependency conflicts.
We provide the processed dataset and pretrained model checkpoints required for reproduction via a publicly accessible Zenodo repository.
Please organize the downloaded files as follows:
```
STAR/
├── STAR_dataset/
│   ├── (processed dataset files)
│   └── .gitkeep
├── STAR_model_pt/
│   ├── best_STAR_model.pt
│   └── .gitkeep
```
- Download `best_STAR_model.pt`
- Place it at: `/STAR_model_pt/best_STAR_model.pt`
🔗 Zenodo link: https://doi.org/10.5281/zenodo.17060855
The dataset released in this repository is preprocessed according to the input format required by STAR, as described in the paper.
The raw data used in this work includes:

- over 170,000 website visits,
- more than 100 GB of raw traffic traces (PCAP format),
- and corresponding logic-side crawl logs.

This raw data is not publicly hosted due to storage and distribution constraints. If access to the raw data is required for research purposes, please contact:
All experiment scripts are located in the project root directory:

```
STAR/
├── cw_zero_shot.py
├── cw_linear_probe.py
├── cw_tip_adapter.py
├── ow_zero_shot.py
├── pretrain.py
├── logic_encoder_8d.py
├── traffic_encoder_3d.py
```
We categorize experiments by filename prefix.

Scripts with the prefix `cw_` correspond to closed-world evaluation, including:

- **Zero-shot classification**

  ```bash
  python cw_zero_shot.py
  ```

- **Few-shot adaptation**

  - Linear probing

    ```bash
    python cw_linear_probe.py
    ```

  - Tip-Adapter-style adaptation

    ```bash
    python cw_tip_adapter.py
    ```

These scripts reproduce the closed-world results reported in the paper.
Scripts with the prefix ow_ correspond to open-world evaluation, including rejection of unmonitored websites.
```bash
python ow_zero_shot.py
```

Users may also choose to pretrain the STAR model from scratch using the provided training script:

```bash
python pretrain.py
```
- Training follows the data scale and optimization strategy described in the paper.
- Default setting:
  - 200 epochs
  - approximately 4 hours on 5 NVIDIA A100 GPUs with data parallelism
- ⚠️ Pretraining is computationally expensive and not required for reproducing the main results, as pretrained checkpoints are provided.
- All random seeds are fixed by default for reproducibility.
- GPU acceleration is recommended for both pretraining and evaluation.
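For readers pretraining from scratch: contrastive dual-encoder alignment is commonly trained with a CLIP-style symmetric InfoNCE objective over paired (website-profile, traffic-trace) embeddings. The sketch below illustrates that family of objectives; it is an assumption for exposition, not the repository's actual loss implementation, and the temperature value is a placeholder.

```python
import numpy as np

def symmetric_info_nce(logic_emb, traffic_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of paired
    (website-profile, traffic-trace) embeddings.

    Illustrative sketch only; matching pairs sit on the diagonal."""
    a = logic_emb / np.linalg.norm(logic_emb, axis=1, keepdims=True)
    b = traffic_emb / np.linalg.norm(traffic_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Cross-entropy with the matching pair as the positive, in both
    # retrieval directions (logic -> traffic and traffic -> logic).
    log_softmax_rows = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_softmax_cols = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(np.diag(log_softmax_rows).mean() + np.diag(log_softmax_cols).mean()) / 2
```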
If you encounter any issues during reproduction, feel free to open an issue or contact the authors.
This project is released under the Apache License 2.0.