
Project Sonar datasets are a compromise between file size and ease of analysis. With the exception of the SSL and More SSL datasets, Sonar data is designed to be processed as concurrent streams on a single medium-spec multi-core server.

Introduction

The key tenets for Sonar data processing:

  1. File input and output should always be compressed to reduce disk IO.
  2. Every line is a unique record and the order of the dataset is not relevant.
  3. Parallel-friendly utilities should be used whenever possible.
  4. GNU Parallel should be used with non-parallel tools such as DAP.

For tenet #1, consider large datasets such as HTTP, where a single file can be 75 GB or more. Decompressing such a file to disk is possible, but it wastes space, and on-the-fly decompression is usually faster than reading the equivalent uncompressed data back from disk.

All Sonar data is gzip compressed and we use the pigz utility to decompress datasets using multiple cores. The output of a processing pipeline should also be piped back into a compression utility whenever possible. In the past we have used the BZ2 algorithm with pbzip2, but the parallel mode of BZ2 is not compatible with many data processing tools (Hadoop, 7z, etc). The gzip format is not particularly efficient, but it is widely compatible with the data processing ecosystem.
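
For example, a one-line sketch (the file name is hypothetical) that recompresses an older BZ2 archive into the more widely compatible gzip format:

$ pbzip2 -dc data01.csv.bz2 | pigz -c > data01.csv.gz   # file names are placeholders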

The example below demonstrates an efficient way to extract all results for the 1.1.0.0/16 IP range from the 2014-10-20 UDP NetBIOS scan. This will use all available hardware threads and complete much faster than using gzip or zcat on its own.

$ pigz -dc 20141020-netbios-137.csv.gz | grep ',1\.1\.' | pigz -c > 20141020-netbios-137_1.1.0.0.csv.gz

Tenet #2 means that, since the results are unordered, your own processing does not need to preserve record sequence either. This matters when using GNU Parallel: there is no need for the -k option, which preserves output order at the cost of slower processing. One caveat is that you may need to use grep -v to exclude non-record lines such as comments or CSV headers.
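
As a sketch of that caveat, the earlier NetBIOS extraction can be made header-safe by dropping the header line first (the '^timestamp' pattern is an assumption about the header's first field; adjust it to match your file):

$ pigz -dc 20141020-netbios-137.csv.gz | grep -v '^timestamp' | \
  grep ',1\.1\.' | pigz -c > 20141020-netbios-137_1.1.0.0.csv.gz   # '^timestamp' is an assumed header prefix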

For tenet #3, you may be surprised by how many standard Unix tools support parallel processing modes. The example below uses GNU sort to sort a CSV file by its second field in IPv4 order, using 12 cores, 16 GB of RAM, and an SSD for temporary space:

$ pigz -dc data01.gz | sort -t , -k 2 -V --parallel=12 -S 16G -T /mnt/ssd/tmp | pigz -c > data01.sorted.gz

Finally, for tenet #4, keep in mind that utilities without native parallel support can still be parallelized with GNU Parallel. The example below runs 12 copies of grep to speed up filtering across a large dataset. If the utility has a long start-up time (such as DAP), it's best to pass a large value to --block-size (100000000 or so) to keep each worker busy longer; a DAP-specific example appears at the end of the next section. RAM usage grows with block size and the number of jobs, so keep that in mind when selecting parameters for your system.

$ pigz -dc data01.gz | parallel --gnu -j 12 --pipe "grep ',1\.1\.'" | pigz -c > data01_1.1.0.0.csv.gz

DAP: The Data Analysis Pipeline

We built DAP specifically for processing Sonar datasets. DAP is essentially a command line Map-Reduce utility that can handle all sorts of input formats and data encodings. DAP consumes data and pipes each record through a series of transforms, filters, and annotations before emitting the modified record. DAP prioritizes flexibility over speed, but works great on multi-core systems with GNU Parallel. The example below appends GeoIP information to each record in a CSV dataset.

$ pigz -dc data01.csv.gz | dap csv - header=yes + geo_ip saddr + json | head -n 1
{
  "saddr.longitude": "139.69000244140625",
  "saddr.latitude": "35.689998626708984",
  "saddr.country_name": "Japan",
  "saddr.country_code3": "JPN",
  "saddr.country_code": "JP",
  "timestamp-ts": "1413836995",
  "saddr": "1.0.100.195",
  "sport": "137",
  "daddr": "198.143.173.180",
  "dport": "36997",
  "ipid": "11091",
  "ttl": "50",
  "data": "e5d88400000000010000000020434b41414141414141414141414141414141414141414141414141414141414100002100010000000000410131393520202020202020202020202000640000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000"
}

Tweaking this example, we can find the first result located in the United States, hex-decode the data field, parse the NetBIOS response, select just the IP address, state, city, and decoded NetBIOS hostname, and emit the result in CSV format:

$ pigz -dc data01.csv.gz | dap csv - header=yes + \
  geo_ip saddr + include saddr.country_code=US + \
  transform data=hexdecode + decode_netbios_status_reply data + \
  select saddr saddr.region_name saddr.city data.netbios_hname + \
  csv | head -n 1

100.0.110.103,Massachusetts,Randolph,ROB-PC
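
Following the advice from tenet #4, DAP itself can be wrapped in GNU Parallel with a large --block-size. The sketch below (file names are hypothetical) applies the same GeoIP annotation to a JSON-formatted copy of the data; JSON records are self-describing, which avoids the CSV-header caveat from tenet #2:

$ pigz -dc data01.json.gz | \
  parallel --gnu -j 12 --pipe --block-size 100000000 \
  "dap json - + geo_ip saddr + json" | \
  pigz -c > data01.geo.json.gz   # data01.json.gz is a placeholder for a JSON-formatted dataset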

This barely scratches the surface of what DAP can do. To find out more, grab a copy from GitHub and look at the samples directory for example scripts. In addition to processing Sonar data, DAP supports the WARC format used by CommonCrawl and archive.org.