Note
This repository is not yet complete; more experiments and code will be added in the coming days.
This repository contains code for URL classification, designed to classify URLs into predefined categories using machine learning techniques. The project uses a linear classifier with a Stochastic Gradient Descent (SGD) optimizer, alongside a TfidfVectorizer for feature extraction.
- data/: Contains the dataset used for training and evaluating the model.
- model_results/: Stores the evaluation metrics and results from model training.
- models/: Contains the serialized form of the trained model.
- src/: Source scripts including model training and URL prediction functionalities.
- Clone the repository:
  git clone https://github.com/padas-lab-de/url-classification.git
  cd url-classification
- Set up a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  - Windows:
    .\venv\Scripts\activate
  - macOS/Linux:
    source venv/bin/activate
- Install the required dependencies:
  pip install -r requirements.txt
To train the model, navigate to the src/ directory and run the model_training.py script. You will be prompted to enter the path to the dataset file:
cd src
python model_training.py
Follow the prompts to enter the dataset name (e.g., OWS_URL_DS.csv). The script will train the model and save it along with the evaluation metrics.
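The combination described above (TfidfVectorizer features feeding a linear classifier trained with SGD) can be sketched roughly as follows. This is a minimal illustration, not the contents of model_training.py: the toy URLs, the labels, and the character n-gram settings are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Toy data standing in for the real dataset (hypothetical URLs and labels).
urls = [
    "https://news.example.com/politics/election-results",
    "https://news.example.com/world/breaking-story",
    "https://shop.example.com/cart/checkout",
    "https://shop.example.com/products/shoes",
]
labels = ["news", "news", "shopping", "shopping"]

# Character n-grams are an assumption here; they are a common choice for
# URLs, which have no natural word boundaries.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    SGDClassifier(loss="hinge", random_state=0),  # linear model fitted with SGD
)
model.fit(urls, labels)

# Classify a URL the model has not seen during training.
predicted = model.predict(["https://shop.example.com/products/boots"])
print(predicted[0])
```

Wrapping both steps in one pipeline keeps the vectorizer and the classifier together, so the same fitted object can be serialized and reused at prediction time.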
To classify new URLs, use the predict_urls.py script. You will need to provide a path to a file containing URLs in either .csv or .txt format:
python predict_urls.py
The predictions will be saved to predictions.csv in the root directory.
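As a rough sketch of the input and output handling described above, the helpers below read URLs from a .txt or .csv file and write a predictions file. The function names `load_urls` and `save_predictions` are hypothetical and not taken from predict_urls.py. The sketch also assumes that a .txt file holds one URL per line, that a .csv file holds URLs in its first column with no header row, and that the output columns are named `url` and `label`.

```python
import csv
from pathlib import Path


def load_urls(path):
    """Read URLs from a .txt file (one per line) or the first column of a .csv."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        lines = path.read_text(encoding="utf-8").splitlines()
        return [line.strip() for line in lines if line.strip()]
    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as handle:
            # Skip empty rows and rows whose first cell is blank.
            return [row[0].strip() for row in csv.reader(handle)
                    if row and row[0].strip()]
    raise ValueError(f"Unsupported file type: {suffix}")


def save_predictions(urls, labels, out_path="predictions.csv"):
    """Write one (url, label) pair per row, preceded by a header row."""
    with Path(out_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["url", "label"])
        writer.writerows(zip(urls, labels))
```

A typical call would load the URLs, pass them to a trained model's `predict` method, and hand both lists to `save_predictions`.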
- Use Different ML Models and Compare Them: Implement and evaluate other machine learning models, and compare their performance against the current SGD classifier. Candidates include:
  - SVC
  - Random Forest
  - Logistic Regression
  - Neural Networks
- Measure the Prediction Latency: Measure the time it takes to predict labels for new URLs.
- Include URL Augmentation for the Training Phase: Investigate and integrate URL augmentation techniques to enhance the diversity and volume of the training data, which could improve model robustness and accuracy.
- Utilize Class Weight: Explore using the class_weight parameter in the model training process to handle class imbalance.
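Several of the items above (comparing models, measuring prediction latency, and using class_weight) could be explored together along the following lines. This is only a sketch under stated assumptions: the imbalanced toy data, the choice of macro F1 as the metric, the TF-IDF settings, and the hyperparameters are all illustrative, and none of it reflects experiments actually run in this repository.

```python
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Imbalanced toy data (hypothetical): six "news" URLs, two "shopping" URLs.
train_urls = [
    "https://news.example.com/politics/election-results",
    "https://news.example.com/world/breaking-story",
    "https://news.example.com/sports/match-report",
    "https://news.example.com/tech/new-gadget-review",
    "https://news.example.com/business/market-update",
    "https://news.example.com/science/space-mission",
    "https://shop.example.com/cart/checkout",
    "https://shop.example.com/products/shoes",
]
train_labels = ["news"] * 6 + ["shopping"] * 2
test_urls = [
    "https://news.example.com/world/summit-talks",
    "https://news.example.com/politics/debate",
    "https://shop.example.com/products/boots",
    "https://shop.example.com/cart/payment",
]
test_labels = ["news", "news", "shopping", "shopping"]

# class_weight="balanced" reweights classes inversely to their frequency.
candidates = {
    "sgd": SGDClassifier(class_weight="balanced", random_state=0),
    "svc": SVC(kernel="linear", class_weight="balanced"),
    "random_forest": RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=0
    ),
    "logistic_regression": LogisticRegression(
        class_weight="balanced", max_iter=1000
    ),
}

results = {}
for name, classifier in candidates.items():
    model = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)), classifier
    )
    model.fit(train_urls, train_labels)

    # Time only the prediction step, then average over the batch.
    start = time.perf_counter()
    predicted = model.predict(test_urls)
    elapsed = time.perf_counter() - start

    results[name] = {
        "macro_f1": f1_score(
            test_labels, predicted, average="macro", zero_division=0
        ),
        "ms_per_url": 1000 * elapsed / len(test_urls),
    }

for name, scores in results.items():
    print(name, scores)
```

Timings from a batch this small are noisy, so a real measurement would repeat the prediction many times on a larger set of URLs and report an average.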
Contributions to this project are welcome! Please feel free to fork the repository, make your changes, and submit a pull request.