Problem stated by ITU AI For Good Global Summit and presented by ITU & ULAK
Dataset available at Zenodo
Software Defined Networks (SDNs) have revolutionised the way modern networks are managed and orchestrated. This sophisticated infrastructure provides numerous benefits but also introduces several security challenges. A centralised controller is responsible for managing the network traffic, which makes it an attractive target for attackers. Intrusion detection systems (IDS) play a crucial role in identifying and addressing security threats within the SDN. In this paper, we develop an SDN-IDS that uses machine learning techniques for anomaly detection to identify deviations in network behaviour. This is particularly challenging because only a few samples are available for several of the attack classes, i.e. the minority classes. Five machine learning algorithms were used to train the SDN-IDS, and the most appropriate one was ultimately chosen. Moreover, we applied SMOTE and Tomek links resampling to the dataset, as well as a cost-sensitive learning technique, to improve the classification performance on the minority attacks. The Decision Tree (DT) model, trained on a feature-reduced and resampled dataset using cost-sensitive learning, achieved an overall accuracy of 99.87% and an F1-score of 99.87%. Additionally, it achieved a classification accuracy above 99% on 11 of the 15 possible traffic classes.
Our SDN-IDS uses a four-step approach. First, the data are pre-processed (cleaned, encoded, normalised). Second, the dimensionality of the dataset is reduced using Random Forest (RF) feature selection. Third, the data are resampled when necessary. Finally, different ML models are trained and evaluated to select the best one.
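The pre-processing step can be illustrated with a minimal, standard-library sketch. The specific choices here are assumptions made for illustration, not details confirmed by the source: cleaning is shown as duplicate removal, encoding as integer label encoding, and normalisation as min-max scaling. The toy records and all function names are hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted order."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def min_max_normalise(column):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical toy records: (protocol, packet_count, label)
rows = [
    ("tcp", 10, "benign"),
    ("udp", 250, "ddos"),
    ("tcp", 130, "benign"),
    ("tcp", 10, "benign"),  # exact duplicate of the first record
]

# Cleaning (assumed here to mean dropping duplicates, keeping first occurrences)
cleaned = list(dict.fromkeys(rows))

# Encoding of categorical columns and normalisation of the numeric column
protocol, proto_map = label_encode([r[0] for r in cleaned])
packets = min_max_normalise([r[1] for r in cleaned])
labels, label_map = label_encode([r[2] for r in cleaned])
```

In a real pipeline the scaling bounds would be computed on the training set only and reused for the test set, so that no test-set information leaks into training.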
The dataset was provided by ULAK. After the data cleaning phase, the training and test sets are as described in the table below.
Five ML models were used in total. Decision Trees (DT), Random Forest (RF) and K-Nearest Neighbours (K-NN) were selected because they are widely used in IDS research, are easy to implement and support multi-class classification. A Bagging classifier and a Boosting classifier (XGBoost) were also used.
Network traffic datasets used for IDS are usually imbalanced, and imbalanced data tend to bias a model towards the majority class. To tackle this problem, the resampling techniques SMOTE and Tomek links were used to reduce the imbalance between classes.
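The two resampling ideas can be sketched with the standard library alone. This is an illustration of the general techniques, not the code used for the reported results: SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, and a Tomek link is a pair of mutual nearest neighbours with different labels. The function names, the parameters `k`, `n_new` and `seed`, and the toy points are all assumptions. In practice, the imbalanced-learn library provides `SMOTE`, `TomekLinks` and `SMOTETomek` implementations.

```python
import math
import random


def smote_sample(minority, k=2, n_new=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours (the SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different labels."""
    def nearest(i):
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and nearest(j) == i and labels[i] != labels[j]:
            links.append((i, j))
    return links


# Hypothetical 2-D toy data: three majority and three minority samples
majority = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0)]
minority = [(1.1, 1.0), (3.0, 3.0), (3.2, 3.1)]
points = majority + minority
labels = ["benign"] * 3 + ["attack"] * 3

new_points = smote_sample(minority)
links = tomek_links(points, labels)
```

Whether the majority member of each link or both members are removed is a design choice that the source does not specify, so the sketch only detects the links.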
Model | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|
DT | 0.9983 | 0.9983 | 0.9983 | 0.9983 |
RF | 0.9981 | 0.9980 | 0.9980 | 0.9980 |
K-NN | 0.9971 | 0.9971 | 0.9971 | 0.9971 |
Bagging | 0.9984 | 0.9984 | 0.9984 | 0.9984 |
XGBoost | 0.9986 | 0.9986 | 0.9986 | 0.9986 |
SDN-IDS Weighted Average performance evaluation for 5-Fold Cross-validation using the final dataset.
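As a reference for how a weighted average of this kind is typically computed, the sketch below averages per-class precision, recall and F1 with weights equal to each class's share of the true labels. It assumes the standard support-weighted definition; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Per-class precision/recall/F1, averaged with weights equal to each
    class's share of the true labels (its support)."""
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / n
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = n / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels: one of the two attack samples is missed
y_true = ["benign"] * 8 + ["attack"] * 2
y_pred = ["benign"] * 8 + ["attack", "benign"]
scores = weighted_scores(y_true, y_pred)
```

Under this definition the weighted recall always equals the overall accuracy, because each class's recall is weighted by exactly its number of true samples. This is consistent with the Recall and Accuracy columns matching in every row of the tables.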
Model | Precision | Recall | F1-Score | Accuracy |
---|---|---|---|---|
DT | 0.9988 | 0.9987 | 0.9987 | 0.9987 |
RF | 0.9988 | 0.9987 | 0.9987 | 0.9987 |
K-NN | 0.9957 | 0.9954 | 0.9955 | 0.9954 |
Bagging | 0.9986 | 0.9986 | 0.9986 | 0.9986 |
XGBoost | 0.9989 | 0.9989 | 0.9989 | 0.9989 |
SDN-IDS Weighted Average performance evaluation on the test set when the models were trained with the final dataset.
Performance Evaluation Breakdown for every traffic class for the XGBoost model trained on the feature-reduced dataset.
Performance Evaluation Breakdown for every traffic class for the DT model trained on the feature-reduced and resampled dataset with cost-sensitive learning.
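Cost-sensitive learning can be illustrated by giving each class a weight inversely proportional to its frequency, so that errors on minority attacks cost more than errors on majority traffic. The exact weighting scheme used for the DT model is not stated in the source. The sketch below assumes the common "balanced" heuristic, w_c = N / (K * n_c), which is also what scikit-learn applies for `class_weight="balanced"`. Function names and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c): rarer classes get larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Misclassification cost in which each error is charged the weight of
    its true class, normalised by the total weight of all samples."""
    cost = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return cost / sum(weights[t] for t in y_true)


# Hypothetical toy labels: 90 benign flows and 10 attack flows
y_true = ["benign"] * 90 + ["attack"] * 10
weights = balanced_class_weights(y_true)

# A trivial model that always predicts the majority class
y_pred = ["benign"] * 100
plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
cost_error = weighted_error(y_true, y_pred, weights)
```

On this toy data the always-benign model has a plain error of 0.1 but a weighted error of about 0.5, which shows why a cost-sensitive objective discourages ignoring the minority class.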
Sotiris Chatzimiltis, Mohammad Shojafar, Mahdi Boloursaz Mashhadi, and Rahim Tafazolli
5GIC & 6GIC, Institute for Communication Systems (ICS), University of Surrey, Guildford, UK
sc02449@surrey.ac.uk, m.shojafar@surrey.ac.uk, m.boloursazmashhadi@surrey.ac.uk, r.tafazolli@surrey.ac.uk