Data Science 1

Category: Data Science, 150 points

Description

Welcome to the Data Science for Cyber Defense Challenge!

In this challenge you are presented with network logs of a mid-size organization. An anonymous source reported that the network has been compromised by not just one, but 4 different types of malware. Your objective is to analyze the logs and discover the infected endpoints.

Each correct identification of endpoints that have been attacked by a given malware type will grant you a flag.

All together there are 4 flags, corresponding to the 4 malware types.

The network logs are given in the challenge.csv file which contains the following columns:

timestamp

src_ip

dst_ip

src_port

dst_port

protocol

payload

Submission Guidelines

Submission of suspected endpoints should be in the example submission_example.csv format below. Each submission should contain the endpoints infected by a single malware type. Note that a single endpoint may be infected by at most one malware type.

Scoring

The submission score is the maximal F1 score https://en.wikipedia.org/wiki/F-score corresponding to any of the four malware types, and is calculated as follows: drawing

A minimal score of 0.8 is required in order to gain the flag which corresponds to the best-detected malware type.

Can you gain all four flags?

Solution:

Introduction

In this challenge we need to find anomalies in the attached network traffic dump which indicate infected hosts.

Here's a sample from the network dump:

root@kali:/media/sf_CTFs/shabak/DataScience# cat challenge.csv | head
timestamp,src_ip,dst_ip,src_port,dst_port,protocol,payload
2020-06-21 00:00:02.892702,120.18.164.170,120.18.53.84,2459,53,UDP,218
2020-06-21 00:00:03.771702,120.18.53.84,120.18.164.170,53,2459,UDP,285
2020-06-21 00:00:03.989702,120.18.164.170,148.26.83.117,2495,443,TCP,142
2020-06-21 00:00:04.547702,148.26.83.117,120.18.164.170,443,2495,TCP,130
2020-06-21 00:00:05.170953,120.18.187.161,120.18.63.42,3487,1993,TCP,69
2020-06-21 00:00:05.691953,120.18.63.42,120.18.187.161,1993,3487,TCP,243
2020-06-21 00:00:06.124702,120.18.164.170,148.26.83.117,2495,443,TCP,110
2020-06-21 00:00:07.230702,148.26.83.117,120.18.164.170,443,2495,TCP,277
2020-06-21 00:00:08.522702,120.18.164.170,148.26.83.117,2495,443,TCP,198

It's pretty long:

root@kali:/media/sf_CTFs/shabak/DataScience# wc -l challenge.csv
24456694 challenge.csv

We found a great reference for these kind of challenges here, and our solutions are based on it with some modifications.

We'll use pandas for the challenge. The reference suggests starting with some basic preprocessing in order to make queries more efficient:

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import gc

df = pd.read_csv(
    'challenge.csv',
    header = 0, 
    names= ["timestamp", "src_ip", "dst_ip", "src_port", "dst_port", "protocol", "payload"]
)

df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S.%f')
df['hour']      = df.timestamp.dt.hour.astype('uint8')
df['minute']    = df.timestamp.dt.minute.astype('uint8')
df['src_port']  = df['src_port'].astype('uint16')
df['dst_port']  = df['dst_port'].astype('uint16')
df.head()

all_ips = set(df['src_ip'].unique()) | set(df['dst_ip'].unique())
print('Unique src:', df['src_ip'].nunique())
print('Unique dst:', df['dst_ip'].nunique())
print('Total Unique IPs:', len(all_ips))

ip_type = pd.CategoricalDtype(categories=all_ips)
df['src_ip'] = df['src_ip'].astype(ip_type)
df['dst_ip'] = df['dst_ip'].astype(ip_type)
gc.collect()

Output:

Unique src: 5497
Unique dst: 5497
Total Unique IPs: 5497

Nothing much here, just parsing some fields and setting up the source and destination IPs to categorical in order to save on memory.

Also, we'd like to identify the network prefix for our corporate network. Let's see which network prefixes are used most in the dump:

def get_network_prefix(s):
    return s.str.split(".").str[0]
df['src_network_prefix'] = get_network_prefix(df['src_ip'])
df.drop_duplicates("src_ip").groupby("src_network_prefix").size().sort_values(ascending=False)

Output:

src_network_prefix
120    523
142     38
131     32
100     30
7       30
      ... 
170     11
213     11
8       10
193     10
95       9
Length: 254, dtype: int64

From here it looks like our 120 is by far the most common network prefix, we'll assume it belongs to our corporation. We can add some convenient attributes via:

def is_internal(s):
    return s.str.startswith('120.') 

df['src_int'] = is_internal(df['src_ip'])
df['dst_int'] = is_internal(df['dst_ip'])

Flag A

For the first flag, we'll check traffic outside work hours.

First, we need to identify work hours:

df.groupby('hour').size()\
  .plot.bar(title='Activity per hour')\
  .set_ylabel('Connection counts').get_figure().savefig('hours.png')

Output:

It should be pretty clear from the graph that work hours are 06:00-23:00. So, let's see who are the most active external hosts outside work hours:

df[df['src_int'] & ~df['dst_int'] & ((df['hour'] >= 0) & (df['hour'] < 6))]\
  .groupby('dst_ip')\
  .size()\
  .pipe(lambda x: x[x > 0])\
  .sort_values(ascending=False).reset_index()

Output:

      dst_ip             0
0     148.26.83.117     3475
1     5.183.60.54       3467
2     251.139.203.226   355
3     250.42.245.153    330
4     8.8.8.8           279
5     234.8.152.205     218
6     55.107.247.62     51
7     24.161.197.98     34
8     198.70.42.36      12
9     15.254.213.206    5
10     80.213.210.227   1
11     65.64.19.34      1
12     21.91.160.44     1
13     112.146.59.67    1
14     242.138.45.182   1

We can see that that the number of requests to 148.26.83.117 and 5.183.60.54 is dramatically higher than the rest. So, let's gather up that hosts that made requests to these two hosts:

df[df['src_int'] & ~df['dst_int'] & df['dst_ip'].isin(["148.26.83.117", "5.183.60.54"]) & ((df['hour'] >= 0) & (df['hour'] < 6))]\
  .drop_duplicates("src_ip")["src_ip"]

Output:

2       120.18.164.170
31      120.18.213.107
65      120.18.192.121
186     120.18.206.150
220     120.18.120.241
240      120.18.118.50
318      120.18.104.36
398     120.18.216.242
432      120.18.48.222
480      120.18.18.124
498      120.18.45.124
521       120.18.26.88
592       120.18.57.96
797     120.18.162.115
963      120.18.143.87
996     120.18.182.116
1005     120.18.110.15
1119     120.18.66.102
1242     120.18.23.222
1335      120.18.71.37
1353      120.18.88.93
1506    120.18.159.247
1614    120.18.178.138
1629     120.18.143.73
1775    120.18.218.147
1777    120.18.132.135
1856    120.18.173.154
1902    120.18.168.250
2005     120.18.157.13
2318     120.18.103.76
2385    120.18.205.198
2406      120.18.96.34
2416    120.18.171.175
2433     120.18.55.138
2460      120.18.9.187
2852     120.18.207.47
2988    120.18.195.179
3105     120.18.251.35
3187      120.18.27.78
3281     120.18.235.29
3314     120.18.202.14
3481    120.18.156.202
Name: src_ip, dtype: category
Categories (5497, object): ['24.133.117.23', '138.62.106.170', '92.150.22.192', '209.69.254.129', ..., '173.119.192.16', '177.66.191.51', '178.195.62.138', '120.121.209.92']

We can export everything nicely to a CSV file according to the requested format:

df[df['src_int'] & ~df['dst_int'] & df['dst_ip'].isin(["148.26.83.117", "5.183.60.54"]) & ((df['hour'] >= 0) & (df['hour'] < 6))]\
  .drop_duplicates("src_ip")["src_ip"].to_csv("flag_a.csv", index = False, header = ["ip"])

Output:

ip
120.18.164.170
120.18.213.107
120.18.192.121
120.18.206.150
120.18.120.241
120.18.118.50
120.18.104.36
120.18.216.242
120.18.48.222
120.18.18.124
120.18.45.124
120.18.26.88
120.18.57.96
120.18.162.115
120.18.143.87
120.18.182.116
120.18.110.15
120.18.66.102
120.18.23.222
120.18.71.37
120.18.88.93
120.18.159.247
120.18.178.138
120.18.143.73
120.18.218.147
120.18.132.135
120.18.173.154
120.18.168.250
120.18.157.13
120.18.103.76
120.18.205.198
120.18.96.34
120.18.171.175
120.18.55.138
120.18.9.187
120.18.207.47
120.18.195.179
120.18.251.35
120.18.27.78
120.18.235.29
120.18.202.14
120.18.156.202

And we get our first flag:

Congratulations! You achieved the goal. Here is your flag: 

flag{hsjhjskdhjkhdjksd673276327sb}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data_Science_1.md

Data_Science_1.md

Data Science 1

Description

Solution:

Introduction

Flag A

Files

Data_Science_1.md

Latest commit

History

Data_Science_1.md

File metadata and controls

Data Science 1

Description

Solution:

Introduction

Flag A