
Preparing DMOZ dataset for my n-Gram LM-based URL classification research


aniket3167/dmoz-urlclassifier

 
 


DMOZ URL Classifier

DMOZ is the largest, most comprehensive human-edited directory of the Web, historically known as the Open Directory Project (ODP). It contains a categorized list of Web URLs. Its listings are updated on a monthly basis and published in RDF files.

In my research project, I classify web pages based on their URLs only, so the DMOZ dataset is one of the datasets I use.

If you are going to download the DMOZ RDF files, you can find two scripts here that may be useful to you.

  • dmoz2csv.py: This script converts the DMOZ RDF data into a CSV file. Each line of the CSV file contains a unique ID, a URL, and the category of that URL as seen in DMOZ.

  • csv2traintest.py: This script takes the CSV file produced by dmoz2csv.py and converts it into training and test datasets as explained by Baykan et al.
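The RDF-to-CSV step can be illustrated with a minimal sketch. This is not the code of dmoz2csv.py; it assumes the dump stores each listing as an `ExternalPage` element with an `about` attribute (the URL) and a `topic` child (the category), and the names `rdf_to_rows`, `write_csv`, and `SAMPLE` are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace such as '{http://dmoz.org/rdf/}' from a tag."""
    return tag.rsplit("}", 1)[-1]


def rdf_to_rows(rdf_file):
    """Yield (id, url, category) for each ExternalPage element in rdf_file.

    Assumption: the URL is in the 'about' attribute and the category is in
    a 'topic' child element. Pages missing either one are skipped.
    """
    next_id = 1
    for _event, elem in ET.iterparse(rdf_file, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        url = elem.get("about")
        topic = next(
            (child.text for child in elem if local_name(child.tag) == "topic"),
            None,
        )
        if url and topic:
            yield next_id, url, topic.strip()
            next_id += 1
        elem.clear()  # drop processed children to limit memory use


def write_csv(rows, out_file):
    """Write one 'id,url,category' line per row."""
    csv.writer(out_file).writerows(rows)


# Tiny hand-written sample standing in for the real RDF dump.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <topic>Top/Arts/Music</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/no-topic">
    <d:Title>No topic here</d:Title>
  </ExternalPage>
  <ExternalPage about="http://example.net/physics">
    <topic>Top/Science/Physics</topic>
  </ExternalPage>
</RDF>"""

rows = list(rdf_to_rows(io.BytesIO(SAMPLE)))
out = io.StringIO()
write_csv(rows, out)
```

`iterparse` is used here so that a large dump can be streamed rather than loaded into memory in one piece.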

Running "csv2traintest.py" on "dmoz0409.csv" produces 15 pairs of training and test files.
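As a rough illustration of how a CSV of (ID, URL, category) rows could be turned into several training/test pairs, here is a hedged sketch. It does not reproduce csv2traintest.py or the exact procedure from the paper: it assumes one pair per top-level category and a fixed-ratio random split, and `top_level`, `split_by_category`, `test_fraction`, and `seed` are hypothetical names.

```python
import random
from collections import defaultdict


def top_level(category):
    """Return 'Arts' for 'Top/Arts/Music' (assumes a 'Top/'-prefixed path)."""
    parts = category.split("/")
    if len(parts) > 1 and parts[0] == "Top":
        return parts[1]
    return parts[0]


def split_by_category(rows, test_fraction=0.2, seed=0):
    """Group URLs by top-level category and split each group into
    (train_urls, test_urls). The split is an assumption for illustration."""
    groups = defaultdict(list)
    for _id, url, category in rows:
        groups[top_level(category)].append(url)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    pairs = {}
    for name, urls in groups.items():
        urls = sorted(urls)
        rng.shuffle(urls)
        n_test = int(len(urls) * test_fraction)
        pairs[name] = (urls[n_test:], urls[:n_test])
    return pairs


# Made-up rows in the (id, url, category) layout described above.
example_rows = [
    (i, f"http://arts.example/{i}", "Top/Arts/Music") for i in range(10)
] + [
    (10 + i, f"http://sci.example/{i}", "Top/Science/Physics") for i in range(5)
]

pairs = split_by_category(example_rows)
for name, (train, test) in sorted(pairs.items()):
    print(name, len(train), len(test))
```

With this layout, each category yields one training list and one test list that could then be written to a pair of files.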

Contacts
