Legal dataset

Welcome to our datasets!

Those datasets can be used for website association and other related tasks. For more information, please refer to the paper: ""WebAA: Website Association Analysis via Multi-Resource Similarity""

There are two datasets here: Legal dataset and Illegal dataset.

Legal dataset

All the websites in the legal dataset are normal websites.

The Legal dataset folder includes two txt files, namely "chinaz.txt" and "available_webistes.txt".

chinaz.txt: All websites crawled from the chinaz website ranking system, each column is a website.
available_webistes.txt: Available websites retained after filtering all websites in chinaz.txt. Each column represents each website and its organizational information and resource information. The format is as follows: website \t registration number , {external resources} , HTMLText's embedding vector Note: The spaces in the above format are only used to separate each part, and there is no space in the actual file.

Illegal dataset

All the websites in the illegal dataset are black and gray industry websites, for example gambling websites, porn websites.

The illegal dataset folder includes two txt files, namely "IDtracker.txt" and "available_webistes.txt".

IDtracker.txt: This is a dataset of black and gray industry websites that we collected from the paper-IDtracker. The author has performed organization-level aggregation based on the analysis ID, and each column represents an organization and its websites. The format is as follows: analysis ID , {the websites corresponding to this ID}
available_webistes.txt: Available websites retained after filtering all websites in IDtracker.txt. Each column represents each website and its organizational information and resource information. The format is as follows: website \t registration number , {external resources} , HTMLText's embedding vector

We will update and maintain the dataset in real time, hoping that it can help community researchers in their research in related fields!

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Illegal Dataset		Illegal Dataset
Legal Dataset		Legal Dataset
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Legal dataset

Illegal dataset

About

Uh oh!

Releases

Packages

Uh oh!

SevenZhang123/WebAA-Datasets

Folders and files

Latest commit

History

Repository files navigation

Legal dataset

Illegal dataset

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Packages