Webpage URL Classification

Note

This repository is not yet complete; more experiments and code will be added in the coming days.

This repository contains code for URL classification, designed to classify URLs into predefined categories using machine learning techniques. The project uses a linear classifier trained with Stochastic Gradient Descent (SGD), alongside a TfidfVectorizer for feature extraction.
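As a rough illustration of the approach described above, the following sketch chains scikit-learn's `TfidfVectorizer` and `SGDClassifier` in a pipeline. The toy URLs, the labels, and the character n-gram settings are assumptions made for this example; the actual configuration in `src/` may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy data for illustration only; not taken from the repository's dataset.
urls = [
    "https://news.example.com/politics/election-results",
    "https://news.example.com/world/breaking-story",
    "https://shop.example.com/cart/checkout",
    "https://shop.example.com/products/shoes",
]
labels = ["news", "news", "shopping", "shopping"]

# Assumption: character n-grams within word boundaries; the project's
# vectorizer settings may be different.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SGDClassifier(random_state=0),  # default hinge loss gives a linear classifier
)
model.fit(urls, labels)

# Predict the category of an unseen URL.
predictions = model.predict(["https://shop.example.com/products/hats"])
print(predictions)
```

Wrapping both steps in one pipeline means the same vectorizer vocabulary is applied at training and prediction time.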

Key Components

data/: Contains the dataset used for training and evaluating the model.
model_results/: Stores the evaluation metrics and results from model training.
models/: Contains the serialized form of the trained model.
src/: Source scripts including model training and URL prediction functionalities.

Setup

Installation

Clone the repository:

```
git clone https://github.com/padas-lab-de/url-classification.git
cd url-classification
```

Set up a virtual environment:
```
python -m venv venv
```
Activate the virtual environment:
- Windows:
```
.\venv\Scripts\activate
```
- macOS/Linux:
```
source venv/bin/activate
```
Install the required dependencies:
```
pip install -r requirements.txt
```

Usage

Training the Model

To train the model, navigate to the src/ directory and run the model_training.py script. You will be prompted to enter the path to the dataset file:

```
cd src
python model_training.py
```

Follow the prompts to enter the dataset name (e.g., OWS_URL_DS.csv). The script will train the model and save it along with the evaluation metrics.

Predicting URL Labels

To classify new URLs, use the predict_urls.py script. You will need to provide a path to a file containing URLs in either .csv or .txt format:

```
python predict_urls.py
```

The predictions will be saved to predictions.csv in the root directory.
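For readers preparing an input file, here is a minimal sketch of how URLs might be read from either format. The helper name `load_urls` is hypothetical, and the assumption that a `.csv` file holds the URL in its first column with no header row is ours; check `predict_urls.py` for the layout it actually expects.

```python
import csv
from pathlib import Path


def load_urls(path):
    """Return the non-empty URL strings found in a .csv or .txt file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        # One URL per line; blank lines are skipped.
        with path.open(encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]
    if suffix == ".csv":
        # Assumption: the URL is in the first column and there is no header row.
        with path.open(newline="", encoding="utf-8") as handle:
            return [
                row[0].strip()
                for row in csv.reader(handle)
                if row and row[0].strip()
            ]
    raise ValueError(f"Unsupported file type: {suffix}")
```

Calling `load_urls("urls.txt")` returns a plain list of strings that can be passed directly to a trained model's `predict` method.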

To Do

Use Different ML Models and Compare Them: Implement and evaluate other machine learning models, such as the following, and compare their performance against the current SGD classifier:
- SVC
- Random Forest
- Logistic Regression
- Neural Networks
Measure the Prediction Latency: Measure the time it takes to predict labels for new URLs.
Include URL Augmentation for the Training Phase: Investigate and integrate URL augmentation techniques to enhance the diversity and volume of the training data, which could improve model robustness and accuracy.
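One possible way to approach the latency item above is sketched here. The function `measure_latency` and the stand-in `dummy_predict` are hypothetical names for illustration; in practice the trained model's `predict` method would be passed instead.

```python
import statistics
import time


def measure_latency(predict, urls, repeats=5):
    """Call predict(urls) several times and return the median seconds per URL."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        predict(urls)
        timings.append(time.perf_counter() - start)
    # The median is less sensitive to one-off slow runs than the mean.
    return statistics.median(timings) / len(urls)


# Stand-in predictor for illustration; replace with the trained model's predict.
def dummy_predict(urls):
    return ["unknown" for _ in urls]


sample_urls = ["https://example.com/a", "https://example.com/b"]
per_url_seconds = measure_latency(dummy_predict, sample_urls)
print(f"median latency: {per_url_seconds * 1e6:.2f} microseconds per URL")
```

Timing whole batches and dividing by the batch size keeps the timer overhead small relative to the work being measured.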

Done ✓

Utilize Class Weight: Explore using the class_weight parameter in the model training process to handle class imbalance.
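To make the idea concrete, the sketch below computes per-class weights with the same formula scikit-learn applies for `class_weight="balanced"`, namely `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration; whether the project uses `"balanced"` or a hand-tuned dictionary is not stated here.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in the label list."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Toy labels for illustration: three "news" examples, one "shopping" example.
weights = balanced_class_weights(["news", "news", "news", "shopping"])
print(weights)
```

A dictionary like this can be passed as the `class_weight` argument of `SGDClassifier`, so that errors on rare classes count for more during training.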

Contributing

Contributions to this project are welcome! Please feel free to fork the repository, make your changes, and submit a pull request.
