|
1 | 1 | # Python-Project
|
2 |
| -Part of CS241 and CS244 Courses in my Engineering 2nd Year at IIT Guwahati |
| 2 | +Part of CS241 and CS244 Courses in my Engineering 2nd Year at IIT Guwahati. |
| 3 | +Description of Australian Defence Force Academy-Linux Dataset (ADFA-LD) : |
| 5 | +1) The dataset was generated on a local Linux server running Ubuntu 11.04, offering a variety
| 6 | +of functions such as file sharing, database, remote access and web server.
| 7 | +2) Six types of attacks occur in ADFA-LD including two brute force password guessing |
| 8 | +attempts on the open ports enabled by FTP and SSH respectively, an unauthorised attempt to |
| 9 | +create a new user with root privileges through encoding a malicious payload into a normal |
| 10 | +executable, the uploads of Java and Linux executable Meterpreter payloads for the remote |
| 11 | +compromise of a target host, and the compromise and privilege escalation using C100 |
| 12 | +webshell. These types are termed as Hydra-FTP, Hydra-SSH, Adduser, Java-Meterpreter, |
| 13 | +Meterpreter and Webshell respectively. You can find these attacks inside the folder |
| 14 | +“Attack_Data_Master”.
| 15 | +3) 833 and 4373 normal traces are generated for training and validation respectively, over a |
| 16 | +period during which no attacks occur against the host and legitimate application activities |
| 17 | +ranging from web browsing to document writing are operated as usual. These training and |
| 18 | +validation traces can be found in the “Training_Data_Master” and “Validation_Data_Master”
| 19 | +folders, respectively. |
| 20 | +Assignment Task: |
| 21 | +1) Split the Attack data of each category (Hydra-FTP, Hydra-SSH, Adduser, Java-Meterpreter, |
| 22 | +Meterpreter and Webshell) into 70% training data and 30% test data. For instance, there
| 23 | +are 10 folders in the “Adduser” attack. Therefore, 7 of these folders are to be used for training
| 24 | +and 3 folders are to be used for testing. |
| 25 | +2) For the Normal data, files in “Training_Data_Master” folder are to be used as training data |
| 26 | +and files in “Validation_Data_Master” folder are to be used as test data. |
| 27 | +3) Write a Python script to find the frequency of occurrence of all unique 3-gram, 5-gram
| 28 | +and 7-gram system call sequences in the training data for both Attack data (across all
| 29 | +categories of attack) and Normal data. For example, consider the following trace file
| 30 | +corresponding to the Adduser attack:
| 31 | +265 168 168 265 168 168 168 265 168 265 168 168 . . . |
| 32 | +Your script to list all 3-grams should produce the following output: |
| 33 | +265 168 168 -->3 |
| 34 | +168 168 265 -->2 |
| 35 | +168 265 168 -->3 |
| 36 | +168 168 168 -->1 |
| 37 | +265 168 265 -->1 |
| 38 | +NOTE: To save time, you can concatenate all the training files for a particular class of
| 39 | +attack and then run your script on the concatenated file instead of running it individually on
| 40 | +each file.
| 41 | +4) Perform the same task on the files in the “Training_Data_Master” folder to obtain all the
| 42 | +unique 3-grams, 5-grams and 7-grams.
| 43 | +5) Once you have obtained the frequencies of all the unique n-gram terms in the training data,
| 44 | +use the top 30% of n-gram terms with the highest frequency to create a data set. For instance,
| 45 | +consider the following results for Adduser data (1st File):
| 46 | +('240', '102', '221') 7 |
| 47 | +('204', '203', '5') 2 |
| 48 | +('195', '199', '60') 1 |
| 49 | +('5', '197', '45') 1 |
| 50 | +('5', '195', '5') 12 |
| 51 | +('6', '220', '4') 1 |
| 52 | +('191', '5', '133') 9 |
| 53 | +('13', '45', '5') 2 |
| 54 | +('60', '5', '197') 4 |
| 55 | +('3', '142', '7') 2 |
| 56 | +Hydra-FTP data (2nd File): |
| 57 | +('3', '142', '7') 11 |
| 58 | +('219', '311', '240') 4 |
| 59 | +('240', '13', '240') 1 |
| 60 | +('33', '168', '146') 2 |
| 61 | +('6', '168', '102') 3 |
| 62 | +('5', '197', '45') 1 |
| 63 | +('5', '195', '5') 2 |
| 64 | +('3', '91', '5') 8 |
| 65 | +('42', '120', '197') 1 |
| 66 | +('174', '54', '5') 2 |
| 67 | +('6', '63', '6') 18 |
| 68 | +Normal training data (3rd File): |
| 69 | +('195', '10', '41') 1 |
| 70 | +('3', '142', '7') 3 |
| 71 | +('91', '240', '196') 2 |
| 72 | +('5', '195', '5') 2 |
| 73 | +('3', '102', '7') 17 |
| 74 | +('3', '195', '195') 14 |
| 75 | +('4', '78', '240') 1 |
| 76 | +('33', '195', '192') 2 |
| 77 | +('5', '197', '45') 15 |
| 78 | +('199', '45', '192') 1 |
| 79 | +The top 30% of 3-gram terms with the highest frequencies in the Adduser, Hydra-FTP and Normal
| 80 | +data are [('5', '195', '5'), ('191', '5', '133'), ('240', '102', '221')], [('6', '63', '6'), ('3', '142', '7'), ('3', '91',
| 81 | +'5')] and [('3', '102', '7'), ('5', '197', '45'), ('3', '195', '195')], respectively. Designate ('5', '195', '5') as
| 82 | +feature 1 (F1), ('191', '5', '133') as feature 2 (F2), and so on up to ('3', '195', '195') as F9. The generated
| 83 | +dataset should then have 9 features and one class label (Adduser, Hydra-FTP or Normal), with each feature
| 84 | +value being the frequency of occurrence of the corresponding 3-gram in that file. For instance, for the
| 85 | +1st and 3rd files, the generated data should be:
| 86 | +Freq of F1, Freq of F2, ..., Freq of F9 -----> 12, 9, 7, 0, 2, 0, 0, 1, 0, Adduser
| 87 | +Freq of F1, Freq of F2, ..., Freq of F9 -----> 2, 0, 0, 0, 3, 0, 17, 15, 14, Normal
| 88 | +This will be the final training data which will be used to train various classifiers. |
| 89 | +6) Apply the same procedure to generate the test dataset from the test files of the attack data |
| 90 | +(for all attack types) and the normal files in the “Validation_Data_Master” folder, using the top
| 91 | +30% of 3-gram terms with the highest frequencies obtained during the training phase. The
| 92 | +classifier model developed during the training phase will finally be validated on the test
| 93 | +dataset. |
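The n-gram counting, top-30% selection and feature-row steps in tasks 3 to 6 could be sketched roughly as follows. This is a minimal illustration, not the assignment's reference solution: the function names (`count_ngrams`, `top_ngrams`, `merge_features`, `feature_row`) are hypothetical, the 30% cut-off is assumed to round down, and dropping n-grams that appear in more than one class's top list is also an assumption.

```python
from collections import Counter


def count_ngrams(trace, n):
    """Count every contiguous n-gram in a whitespace-separated system call trace."""
    calls = trace.split()
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))


def top_ngrams(counts, percent=30):
    """Return the most frequent n-grams, keeping percent% of the unique ones.

    Assumption: the cut-off is rounded down (10 unique n-grams -> 3 kept).
    """
    keep = len(counts) * percent // 100
    return [gram for gram, _ in Counter(counts).most_common(keep)]


def merge_features(*top_lists):
    """Concatenate per-class top lists into F1..Fk, dropping repeated n-grams."""
    return list(dict.fromkeys(gram for top in top_lists for gram in top))


def feature_row(counts, features, label):
    """Build one dataset row: frequency of each feature n-gram, then the label."""
    return [counts.get(gram, 0) for gram in features] + [label]


if __name__ == "__main__":
    # The sample Adduser trace from task 3.
    trace = "265 168 168 265 168 168 168 265 168 265 168 168"
    for gram, freq in count_ngrams(trace, 3).items():
        print(" ".join(gram), "-->", freq)
```

Run on the sample trace from task 3, `count_ngrams(trace, 3)` yields the same five counts listed there. For the test set, the same `features` list built from the training data would be passed to `feature_row` unchanged, so that training and test rows share the same columns.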
| 94 | +NOTE: For more on the ADFA-LD dataset, you can refer to the paper available at
| 95 | +http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6743952
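For the 70/30 split in task 1, one possible approach is sketched below. The task does not say how the folders should be chosen, so the seeded random shuffle, the round-down cut-off and the helper name `split_folders` are all assumptions, and the folder names in the usage note are placeholders.

```python
import random


def split_folders(folders, train_percent=70, seed=0):
    """Shuffle folder names reproducibly and split them into (train, test) lists."""
    shuffled = sorted(folders)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # assumption: round down
    return shuffled[:cut], shuffled[cut:]
```

With ten placeholder folder names, for example `[f"Adduser_{i}" for i in range(1, 11)]`, this returns 7 training folders and 3 test folders, matching the Adduser example in task 1. Fixing the seed keeps the split the same across runs.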