This tool extracts features from URLs to build a data set for machine learning. The goal is to train a machine learning model that predicts phishing URLs targeting the Brazilian population.
This repo includes the implementation of our paper:
Lucas Dantas Gama Ayres, Italo Valcy S Brito and Rodrigo Rocha Gomes e Souza. Using Machine Learning to Automatically Detect Malicious URLs in Brazil. In Simpósio Brasileiro de Redes de Computadores e Sistemas Distribuídos (SBRC 2019) - 2019, Gramado - RS - Brazil.
The paper is available here: https://sol.sbc.org.br/index.php/sbrc/article/view/7416
DOI: https://doi.org/10.5753/sbrc.2019.7416
$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install virtualenv python3 python3-dev python-dev gcc libpq-dev libssl-dev libffi-dev build-essential
$ virtualenv -p /usr/bin/python3 .env
$ source .env/bin/activate
$ pip install -r requirements.txt
Before running the software, add the API keys for Google Safe Browsing, PhishTank, and MyWOT to the config.ini file.
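A quick check that all three API keys are present in config.ini can save a failed run. The sketch below is only an illustration: the section name `API_KEYS` and the option names are hypothetical, since the actual layout of the project's config.ini is not shown here, so adapt them to the real file.

```python
import configparser

# Hypothetical layout: the section and option names are assumptions,
# not taken from the project's actual config.ini.
SAMPLE_CONFIG = """
[API_KEYS]
google_safe_browsing = YOUR_GOOGLE_KEY
phishtank = YOUR_PHISHTANK_KEY
mywot = YOUR_MYWOT_KEY
"""

REQUIRED_KEYS = ("google_safe_browsing", "phishtank", "mywot")


def load_api_keys(config_text):
    """Parse INI text and return the options of the API_KEYS section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return dict(parser["API_KEYS"])


def missing_keys(keys, required=REQUIRED_KEYS):
    """Return the names of required keys that are absent or empty."""
    return [name for name in required if not keys.get(name, "").strip()]


keys = load_api_keys(SAMPLE_CONFIG)
print(missing_keys(keys))
```

To check a file on disk instead of a string, `parser.read("config.ini")` could replace `read_string`.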
Now, run:
$ python run.py <input-urls> <output-dataset>

| LEXICAL | |||
|---|---|---|---|
| Count (.) in URL | Count (-) in URL | Count (_) in URL | Count (/) in URL |
| Count (?) in URL | Count (=) in URL | Count (@) in URL | Count (&) in URL |
| Count (!) in URL | Count ( ) in URL | Count (~) in URL | Count (,) in URL |
| Count (+) in URL | Count (*) in URL | Count (#) in URL | Count ($) in URL |
| Count (%) in URL | URL Length | Number of TLDs in URL | Count (.) in Domain |
| Count (-) in Domain | Count (_) in Domain | Count (/) in Domain | Count (?) in Domain |
| Count (=) in Domain | Count (@) in Domain | Count (&) in Domain | Count (!) in Domain |
| Count ( ) in Domain | Count (~) in Domain | Count (,) in Domain | Count (+) in Domain |
| Count (*) in Domain | Count (#) in Domain | Count ($) in Domain | Count (%) in Domain |
| Domain Length | Number of vowels in Domain | URL domain in IP address format | Domain contains the keywords "server" or "client" |
| Count (.) in Directory | Count (-) in Directory | Count (_) in Directory | Count (/) in Directory |
| Count (?) in Directory | Count (=) in Directory | Count (@) in Directory | Count (&) in Directory |
| Count (!) in Directory | Count ( ) in Directory | Count (~) in Directory | Count (,) in Directory |
| Count (+) in Directory | Count (*) in Directory | Count (#) in Directory | Count ($) in Directory |
| Count (%) in Directory | Directory Length | Count (.) in file | Count (-) in file |
| Count (_) in file | Count (/) in file | Count (?) in file | Count (=) in file |
| Count (@) in file | Count (&) in file | Count (!) in file | Count ( ) in file |
| Count (~) in file | Count (,) in file | Count (+) in file | Count (*) in file |
| Count (#) in file | Count ($) in file | Count (%) in file | File length |
| Count (.) in parameters | Count (-) in parameters | Count (_) in parameters | Count (/) in parameters |
| Count (?) in parameters | Count (=) in parameters | Count (@) in parameters | Count (&) in parameters |
| Count (!) in parameters | Count ( ) in parameters | Count (~) in parameters | Count (,) in parameters |
| Count (+) in parameters | Count (*) in parameters | Count (#) in parameters | Count ($) in parameters |
| Count (%) in parameters | Length of parameters | TLD presence in arguments | Number of parameters |
| Email present in URL | File extension | ||
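Most of the lexical features above are lengths and character counts over five parts of the URL (whole URL, domain, directory, file, parameters). A minimal sketch of how they could be computed with the Python standard library follows. The function names, the feature key names, and the way the path is split into directory and file are assumptions made for illustration, not the project's actual implementation.

```python
import ipaddress
import posixpath
from urllib.parse import parse_qsl, urlsplit

# Characters counted in each URL part, taken from the LEXICAL table.
SPECIAL_CHARS = ".-_/?=@&! ~,+*#$%"


def split_url(url):
    """Split a URL into the five parts the table refers to (assumed split)."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "url": url,
        "domain": parts.hostname or "",
        "directory": directory,
        "file": filename,
        "parameters": parts.query,
    }


def is_ip_address(host):
    """Return True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def lexical_features(url):
    """Compute per-part lengths and character counts (hypothetical key names)."""
    parts = split_url(url)
    features = {}
    for name, text in parts.items():
        features[f"{name}_length"] = len(text)
        for ch in SPECIAL_CHARS:
            features[f"{name}_count[{ch}]"] = text.count(ch)
    domain = parts["domain"]
    features["domain_vowels"] = sum(c in "aeiou" for c in domain)
    features["domain_is_ip"] = is_ip_address(domain)
    features["num_parameters"] = len(
        parse_qsl(parts["parameters"], keep_blank_values=True)
    )
    return features


features = lexical_features("http://example.com/a/b/login.php?user=1&id=2")
print(features["domain_length"], features["parameters_count[=]"])  # 11 2
```

Keeping the character list in one constant means every URL part is measured with the same set of symbols, which mirrors the repeated structure of the table.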
| BLACKLIST | |||
|---|---|---|---|
| Presence of the URL in blacklists | Presence of the IP address in blacklists | Presence of the domain in blacklists | |

| HOST | |||
|---|---|---|---|
| Presence of the domain in RBL (Real-time Blackhole List) | Domain lookup response time | Domain has SPF? | Geographical location of IP |
| AS Number (or ASN) | PTR of IP | Time (in days) of domain activation | Time (in days) of domain expiration |
| Number of resolved IPs | Number of resolved name servers (NameServers - NS) | Number of MX Servers | Time-to-live (TTL) value associated with hostname |
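The two domain-age features in the HOST table can be read as "days since the domain was activated" and "days until the domain expires". The sketch below assumes that reading and assumes the creation and expiration dates have already been obtained (for example from a WHOIS lookup, which is not shown here). The function name and the key names are hypothetical.

```python
from datetime import date


def domain_age_features(created, expires, today):
    """Return day counts relative to `today` (assumed interpretation).

    created, expires, today: datetime.date objects supplied by the caller.
    """
    return {
        "days_since_activation": (today - created).days,
        "days_until_expiration": (expires - today).days,
    }


ages = domain_age_features(
    created=date(2019, 1, 1),
    expires=date(2020, 1, 1),
    today=date(2019, 1, 31),
)
print(ages)
```

Passing `today` explicitly rather than calling `date.today()` inside the function keeps the result reproducible when a data set is rebuilt later.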
| OTHERS | |||
|---|---|---|---|
| Valid TLS / SSL Certificate | Number of redirects | Check if URL is indexed on Google | Check if domain is indexed on Google |
| Uses URL shortener service | |||
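One simple way to approximate the "Uses URL shortener service" feature is to compare the URL's host against a list of known shortener domains. This is a sketch under that assumption: the list below is illustrative and far from exhaustive, and neither the list nor the function name comes from the project itself.

```python
from urllib.parse import urlsplit

# Illustrative, non-exhaustive list of shortener domains (an assumption).
KNOWN_SHORTENERS = frozenset({"bit.ly", "tinyurl.com", "goo.gl", "t.co", "ow.ly"})


def uses_shortener(url, shorteners=KNOWN_SHORTENERS):
    """Return True if the URL's host is one of the given shortener domains."""
    host = (urlsplit(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host in shorteners


print(uses_shortener("https://bit.ly/abc"))
```

Only the host is compared, so a shortener name that appears in the path, as in `https://example.com/bit.ly`, does not count as a match.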
Any contribution is appreciated.
- Clone the project:
$ git clone https://github.com/lucasayres/url-feature-extractor.git
- Make your changes in a new git branch:
$ git checkout -b my-branch master
- Add your changes.
- Push your branch to GitHub.
- Create a PR to master.