- Introduction (2 human-minutes)
- Configuration and Installation (1 human-minute, 6 compute-minutes)
  - Dependency (3 compute-minutes)
  - Build IPMES (3 compute-minutes)
- Reproduce and Validate Experiment Results (15 human-minutes, 7 compute-days)
  - Preparation (5 compute-minutes)
  - Matching efficiency (7 compute-days)
  - Window size (5 compute-minutes)
- Execution / How to reuse beyond paper (12 human-minutes, 1 compute-minute)
  - Command-line Syntax
  - Input Graph Format
  - Pattern Format
  - Output and Side Effects
- Authors (1 human-minute)
IPMES (Incremental Behavioral Pattern Matching Algorithm over the System Audit Event Stream for APT Detection) is a system that performs incremental pattern matching over event streams.
The core concept of IPMES is to decompose a target behavioral pattern into multiple subpatterns, match events against these subpatterns, and then combine the subpattern match results into complete instances. The figure below illustrates an overview of IPMES. Our implementation adopts a layered structure tailored for efficient streaming data processing, in which each layer takes the output of the preceding layer as its input. IPMES consists of three layers: the Matching Layer, the Composition Layer, and the Join Layer. The target behavioral pattern is decomposed during preprocessing. Upon reading an event, the Matching Layer splits it based on its interval timestamps and reorders events according to the event patterns of the subpatterns they match. The Composition Layer then matches these events against the subpatterns. Finally, the Join Layer collects the match results of all subpatterns and merges them into complete instances as output.
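To make the final merging step concrete, here is a toy Python sketch of joining the match results of two subpatterns. It is not the IPMES implementation: `PartialMatch`, `join`, and the sample values are hypothetical, and the sketch only checks that two partial matches use different input events and fit in one time window. It ignores everything else a real join would have to check, such as consistency of shared subjects and objects.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PartialMatch:
    """Hypothetical container for one subpattern match."""
    events: frozenset  # IDs of the input events used by this match
    start: int         # earliest event timestamp in the match
    end: int           # latest event timestamp in the match


def join(left, right, window):
    """Merge every compatible pair of partial matches.

    A pair is compatible here when it shares no input event and the
    merged time span does not exceed `window`.
    """
    merged = []
    for a, b in product(left, right):
        if a.events & b.events:
            continue  # the same input event cannot be used twice
        start, end = min(a.start, b.start), max(a.end, b.end)
        if end - start <= window:
            merged.append(PartialMatch(a.events | b.events, start, end))
    return merged


left = [PartialMatch(frozenset({1, 2}), 0, 5)]
right = [
    PartialMatch(frozenset({3}), 8, 9),      # compatible
    PartialMatch(frozenset({2}), 6, 7),      # reuses event 2
    PartialMatch(frozenset({4}), 100, 101),  # outside the window
]
print(join(left, right, window=10))
```

Only the first candidate on the right survives, which is the kind of pruning a time window provides as partial matches accumulate.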
- Java (JDK) >= 11
- Apache Maven >= 3.6.0
Install on Ubuntu/Debian:
sudo apt-get update
sudo apt-get install openjdk-11-jdk maven
With Maven, IPMES can be built with a simple command:
cd ipmes-java
mvn compile
The first build takes longer because Maven downloads the dependencies.
This section describes how to reproduce the experiment results in our paper.
We use Python scripts to automate the experiment. The experiment environment requires:
- RAM >= 100 GB
  - Running the experiment on the SPADE dataset requires 32 to 48 GB
  - Running the experiment on the DARPA dataset requires 100 GB
- Unix-like environment (tested on Ubuntu 18.04 and 22.04)
- GNU bash >= 4.4.20
- Python >= 3.6.9
- with pip installed
- The Python packages listed in ipmes-java/requirements.txt installed
For Ubuntu/Debian:
sudo apt-get update
sudo apt-get install python3 python3-pip
cd ipmes-java
pip3 install -r requirements.txt
It is recommended to install the packages in a Python virtual environment such as conda, venv, or virtualenv to avoid package collisions.
The RAM requirement is high because the experiment involves running IPMES with and without optimizations for comparison. The unoptimized version may need a significant amount of memory to achieve results comparable to the optimized one. However, the optimized IPMES should be runnable on a personal computer with 32 GB of memory.
Our experiments use the preprocessed data graph as the input to IPMES. You can download the preprocessed provenance graph for our experiment at link.
Extract the file to a location of your choice. In the following example, we assume that the preprocessed data graphs are located at <root of source files>/data/preprocessed/.
unzip preprocessed.zip -d <root of source files>/data/
This experiment compares different implementations of IPMES (the proposed method, the naive method, and CEP) across different patterns on different data graphs.
The following command uses a Python script to automate the experiment. The script outputs tables similar to Table IV and Table V in the paper and saves them as CSV files in ../results/ipmes-java/ (the save path can be changed with the -o option). For convenience, the script also prints the command it is currently running.
cd ipmes-java
python3 run.py -D all \
-d ../data/preprocessed/ \
-o ../results/ipmes-java/
Example output:
Running: bash -c time -p -- mvn -q exec:java -Dexec.args="-w 1800 ../data/universal_patterns/SP1_regex.json ../data/preprocessed/attack.csv "
...
SPADE CPU Time (sec)
attack attack-naive attack-cep mix mix-naive mix-cep benign benign-naive benign-cep
pattern
SP1_regex 8.3500 8.260000 16.260000 9.980000 9.640000 18.950000 6.650000 7.390000 11.4600
SP2_regex 8.4600 9.440000 12.840000 9.240000 9.250000 12.700000 7.060000 6.660000 9.8900
SP3_regex 9.7200 86.960000 18.510000 10.440000 93.010000 20.970000 7.180000 7.470000 12.8400
SP4_regex 9.1200 9.040000 12.760000 9.100000 9.860000 14.530000 7.060000 6.690000 9.8600
SP5_regex 9.1600 9.120000 17.460000 10.480000 10.540000 38.590000 7.730000 7.550000 181.7600
SP6_regex 9.0700 9.430000 14.250000 9.630000 10.570000 17.060000 7.500000 7.220000 11.8400
SP7_regex 8.7700 268.510000 13.810000 11.350000 2827.390000 19.860000 7.610000 7.840000 12.6600
SP8_regex 11.5000 16.460000 15.460000 11.410000 44.070000 16.970000 7.370000 44.110000 15.5700
SP9_regex 10.7300 10.900000 14.740000 11.690000 11.630000 29.160000 7.810000 7.370000 11.9600
SP10_regex 10.1400 9.610000 13.090000 10.400000 10.720000 15.670000 7.460000 7.300000 11.3900
SP11_regex 9.8400 10.810000 13.440000 10.870000 10.420000 16.310000 7.430000 7.620000 11.8500
SP12_regex 8.6300 9.120000 11.790000 9.570000 8.930000 11.810000 7.480000 6.910000 9.8700
Average 9.4575 38.138333 14.534167 10.346667 254.669167 19.381667 7.361667 10.344167 25.9125
This table is saved to ../results/ipmes-java/spade_cpu_time.csv
SPADE Memory Usage (MB)
attack attack-naive attack-cep mix mix-naive mix-cep benign benign-naive benign-cep
pattern
SP1_regex 944.0 768.0 1424.000000 1144.000000 1056.000000 1384.000000 792.000000 776.000000 1536.000000
SP2_regex 968.0 1192.0 1848.000000 1416.000000 1064.000000 1632.000000 792.000000 768.000000 1080.000000
SP3_regex 760.0 4424.0 4816.000000 1520.000000 5360.000000 2336.000000 1016.000000 784.000000 2288.000000
SP4_regex 1032.0 1032.0 1960.000000 1064.000000 1280.000000 2424.000000 744.000000 752.000000 1288.000000
SP5_regex 800.0 1192.0 2032.000000 1232.000000 1360.000000 2448.000000 1536.000000 1504.000000 4696.000000
SP6_regex 752.0 1016.0 2224.000000 768.000000 1192.000000 2096.000000 752.000000 1136.000000 1520.000000
SP7_regex 1192.0 15888.0 1984.000000 1248.000000 22488.000000 2184.000000 808.000000 1104.000000 1528.000000
SP8_regex 1560.0 1552.0 2208.000000 1024.000000 3720.000000 2368.000000 848.000000 1232.000000 1432.000000
SP9_regex 1304.0 1280.0 2416.000000 1424.000000 1784.000000 2272.000000 800.000000 792.000000 2208.000000
SP10_regex 1432.0 1144.0 1696.000000 1056.000000 1176.000000 4688.000000 768.000000 872.000000 1488.000000
SP11_regex 1216.0 1824.0 1576.000000 1152.000000 1104.000000 2288.000000 752.000000 752.000000 2232.000000
SP12_regex 1264.0 1256.0 976.000000 1192.000000 1224.000000 1344.000000 800.000000 792.000000 848.000000
Average 1102.0 2714.0 2096.666667 1186.666667 3567.333333 2288.666667 867.333333 938.666667 1845.333333
This table is saved to ../results/ipmes-java/spade_mem_usage.csv
Running: bash -c time -p -- mvn -q exec:java -Dexec.args="-w 1000 ../data/universal_patterns/DP1_regex.json ../data/preprocessed/dd1.csv "
...
Each output column is named <Dataset Name>[-<IPMES Setting>]. Available settings:
- When no setting is specified, the default setting is the proposed method in our paper.
- naive: naive implementation in the Join Layer.
- cep: use a CEP tool to implement the Composition Layer.
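If you post-process the tables yourself, the column naming scheme can be unpacked with a few lines of Python. This is a minimal sketch, not part of the runner script: `split_column` and `group_row` are hypothetical helpers, the label "proposed" for columns without a suffix is our own choice, and the code assumes dataset names contain no hyphen. The sample values are the SP7_regex CPU times on the attack graph from the example output above.

```python
def split_column(name):
    """Split '<Dataset Name>[-<IPMES Setting>]' into (dataset, setting)."""
    dataset, sep, setting = name.partition("-")
    # No suffix means the default setting, labeled "proposed" here.
    return dataset, setting if sep else "proposed"


def group_row(row):
    """Regroup one table row as {dataset: {setting: value}}."""
    grouped = {}
    for column, value in row.items():
        dataset, setting = split_column(column)
        grouped.setdefault(dataset, {})[setting] = value
    return grouped


sp7 = {"attack": 8.77, "attack-naive": 268.51, "attack-cep": 13.81}
print(group_row(sp7))
```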
Note that collecting all the data points may be time-consuming. You can use the -D option of the runner script to select the dataset (spade or darpa). The SPADE dataset takes less time to run than the DARPA dataset. The following command runs only on the SPADE dataset and outputs results similar to Table IV in our paper:
python3 run.py -D spade \
-d ../data/preprocessed/ \
-o ../results/ipmes-java/
If your computer does not have enough memory, you can use the -M option to set the memory limit of the JVM (the script sets it to 100 GB by default):
python3 run.py -D all \
-d ../data/preprocessed/ \
-o ../results/ipmes-java/ \
-M 48
However, lowering the limit may cause the JVM to report an OOM error, which makes IPMES exit, and may reduce performance.
The following command reproduces Fig. 6 and Fig. 7 in Section IV.B of our paper. It runs IPMES to match SP7 on the mix graph with different window sizes.
python3 age_limit_experiment.py \
-l '2,4,6,8,9,20,40,80,200,400,800,1600,3200,6400,12800,25600,51200' \
-p ../data/universal_patterns/SP7_regex.json \
-d ../data/preprocessed/mix.csv
Example output:
WindowSize, AvgCpuTime, PoolSize, NumResults, NumClusters
2 8.62 6981 0 0
4 8.44 2665 5 5
6 8.94 3079 24 12
8 9.57 4100 35 12
9 8.98 4439 38 13
20 9.23 5930 197 13
40 8.85 7034 415 13
80 11.00 13488 1690 13
200 11.33 13932 1690 13
400 12.46 15257 1690 13
800 11.75 16346 1690 13
1600 11.31 16958 1690 13
3200 11.42 29256 1690 13
6400 14.21 40663 1690 13
12800 22.07 64812 1690 13
25600 37.50 130907 1690 13
51200 51.34 235111 1690 13
The meaning of each output column:
- WindowSize: the window size in seconds.
- AvgCpuTime: the CPU time (in seconds) it took to run on the given configuration.
- PoolSize: the maximum number of instances in the pool during matching.
- NumResults: the number of match results.
- NumClusters: the number of attack clusters. Each cluster represents a single attack behavior.
usage: ipmes-java [-h] [--dump-trigger-counts] [--dump-results] [-w WINDOWSIZE] [--debug] [--cep] [--naive-join] pattern_file data_graph
IPMES implemented in Java.
positional arguments:
pattern_file The path to the pattern, e.g. ../data/universal_patterns/TTP11.json
data_graph The path to the preprocessed data graph
named arguments:
-h, --help show this help message and exit
--dump-trigger-counts Output trigger counts. (default: false)
--dump-results Output match results. (default: false)
-w WINDOWSIZE, --window-size WINDOWSIZE
Time window size (sec) when joining. (default: 1800)
--debug Output debug information. (default: false)
--cep Use CEP (Siddhi) in composition layer. (default: false)
--naive-join Use naive join in join layer. (default: false)
Note that running some patterns on large data graphs may require a lot of memory; thus, it is recommended to set the heap allocation limit of JVM to a higher value to avoid OOM error. For example, setting it to 100 GB:
export MAVEN_OPTS=-Xmx100G
mvn -q exec:java -Dexec.args="-w 1000 ../data/universal_patterns/DP1_regex.json ../data/preprocessed/dd1.csv "
In addition, the --cep and --naive-join options are intended for experiments only. They significantly degrade performance and should not be enabled in production use.
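When IPMES is driven from a script, the same invocation can be assembled programmatically. The sketch below only builds the Maven command and the environment shown above. `build_ipmes_command` and its parameters are hypothetical names rather than part of IPMES, and the function assumes the paths contain no spaces, since the arguments are joined into a single `-Dexec.args` string.

```python
import os


def build_ipmes_command(pattern_file, data_graph, window_size=1800,
                        heap_gb=None, dump_results=False):
    """Return (argv, env) for running IPMES through Maven."""
    args = ["-w", str(window_size), pattern_file, data_graph]
    if dump_results:
        args.append("--dump-results")
    argv = ["mvn", "-q", "exec:java", "-Dexec.args=" + " ".join(args)]

    env = dict(os.environ)
    if heap_gb is not None:
        # Same effect as `export MAVEN_OPTS=-Xmx100G` for heap_gb=100.
        env["MAVEN_OPTS"] = f"-Xmx{heap_gb}G"
    return argv, env


argv, env = build_ipmes_command(
    "../data/universal_patterns/DP1_regex.json",
    "../data/preprocessed/dd1.csv",
    window_size=1000,
    heap_gb=100,
)
print(argv)
```

The result can be passed to `subprocess.run(argv, env=env, cwd="ipmes-java", check=True)`, with the working directory adjusted to wherever the Maven project lives on your machine.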
Data graphs are in CSV format. The columns in the CSV are: [start_time, end_time, event_sig, eid, start_id, end_id], which represent:
- start_time: the event start time
- end_time: the event end time
- event_sig: the event signature, in the format {edge label}#{start node label}#{end node label}
- eid: edge id
- start_id: id of the start node
- end_id: id of the end node
A preprocessed provenance graph can be downloaded at link.
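As an illustration of the column layout, the following Python sketch reads such a CSV and splits the signature into its three labels. It rests on several assumptions that the format description does not confirm: the file has no header row, the edge label and start node label contain no `#`, and all fields can be kept as strings. `Event`, `read_events`, and the sample row are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class Event(NamedTuple):
    start_time: str
    end_time: str
    edge_label: str
    start_label: str
    end_label: str
    eid: str
    start_id: str
    end_id: str


def read_events(lines):
    """Yield one Event per CSV row of a preprocessed data graph."""
    for row in csv.reader(lines):
        start_time, end_time, event_sig, eid, start_id, end_id = row
        # Split on the first two '#' only, so any further '#' stays
        # inside the end node label.
        edge_label, start_label, end_label = event_sig.split("#", 2)
        yield Event(start_time, end_time, edge_label, start_label,
                    end_label, eid, start_id, end_id)


# A made-up row, not taken from the real dataset.
sample = io.StringIO("100,105,read#proc#file,7,10,11\n")
for event in read_events(sample):
    print(event)
```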
{
"Version": "0.1.0",
"UseRegex": false,
"Events": [
{
"ID": 0,
"Signature": "aaa",
"SubjectID": 123,
"ObjectID": 456,
"Parents": []
},
{
"ID": 1,
"Signature": "bbb",
"SubjectID": 789,
"ObjectID": 456,
"Parents": [ 0 ]
}
]
}
The pattern is represented in JSON format. The root object contains 3 keys:
- Version: the version of the pattern format. The latest version is 0.1.0.
- UseRegex: whether the Signature fields in this pattern are treated as regular expressions. We use re2j to handle regular expressions; the supported regex syntax can be found here.
- Events: an array of Pattern Event Objects.
Pattern Event Object:
- ID: the unique ID of this pattern event
- Signature: the signature of this pattern event. It matches input events in the data graph with the same signature. If UseRegex is set to true, the signature is treated as a regular expression that is matched against the signatures of input events in the data graph.
- SubjectID: the subject of this event. If 2 events arise from the same subject, they share the subject ID.
- ObjectID: the object of this event. If 2 events act on the same object, they share the object ID.
- Parents: an array of pattern event ids. The pattern event should be matched after all of its parents are matched.
Current limitations:
- Pattern event ids must be assigned in the range [0, num_id), where num_id is the number of unique ids.
See the files in universal_patterns/ for more information.
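Before running a hand-written pattern, it can help to check it against the rules above. The following Python sketch is not shipped with IPMES: `validate_pattern` is a hypothetical helper that checks only what this section states (the three root keys, the ID range limitation, and that parents refer to other existing events).

```python
def validate_pattern(pattern):
    """Return a list of problems found in a parsed pattern (empty if none)."""
    problems = []
    for key in ("Version", "UseRegex", "Events"):
        if key not in pattern:
            problems.append(f"missing root key: {key}")

    events = pattern.get("Events", [])
    ids = [event.get("ID") for event in events]
    if len(set(ids)) != len(ids):
        problems.append("pattern event IDs are not unique")
    elif set(ids) != set(range(len(ids))):
        problems.append(f"IDs must cover the range [0, {len(ids)})")

    known = set(ids)
    for event in events:
        for parent in event.get("Parents", []):
            if parent not in known or parent == event.get("ID"):
                problems.append(
                    f"event {event.get('ID')} has invalid parent {parent}")
    return problems


# The example pattern from this section.
example = {
    "Version": "0.1.0",
    "UseRegex": False,
    "Events": [
        {"ID": 0, "Signature": "aaa", "SubjectID": 123, "ObjectID": 456,
         "Parents": []},
        {"ID": 1, "Signature": "bbb", "SubjectID": 789, "ObjectID": 456,
         "Parents": [0]},
    ],
}
print(validate_pattern(example))
```

For a pattern stored on disk, parse it first with `json.load` and pass the resulting dictionary to the function.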
The output of the program is in JSON format. The output will be printed to stdout. To get the match results, you need to turn on the --dump-results option.
mvn -q exec:java -Dexec.args="-w 1800 ../data/universal_patterns/SP7_regex.json testcase/expect_ttp7.csv --dump-results"
The command above will output:
{
"PeakHeapSize": 522190848,
"MatchResults": [{
"EndTime": 1637269341652,
"StartTime": 1637269279976,
"MatchIDs": [
6546604,
6204163,
7112806,
7112865,
7109083,
6549533,
5630794,
5196484,
5630793,
5196483,
5630795,
5196481
]
}],
"NumResults": 1,
"PeakPoolSize": 11
}
The meaning of each key:
- PeakHeapSize: the maximum heap allocation size reported by the JVM, in bytes.
- MatchResults: an array of match results. The keys of each result are:
  - EndTime: the timestamp of the latest event of this match instance
  - StartTime: the timestamp of the earliest event of this match instance
  - MatchIDs: the matched input event IDs. The index of each ID in this array corresponds to the pattern event it is matched by. In this example, 6546604 is located at index 0, hence it matches the pattern event with id 0.
- NumResults: the number of match results
- PeakPoolSize: the maximum number of instances in the pool
  - This number is meaningless when the --cep option is enabled, since the CEP tool doesn't allow us to obtain the pool size information
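The index-to-pattern-event correspondence of MatchIDs can be made explicit when consuming the output. A minimal Python sketch, assuming stdout contains exactly the JSON document described above; `pattern_event_bindings` is a hypothetical helper, and the sample is an abbreviated version of the example output with only the first two matched IDs.

```python
import json


def pattern_event_bindings(output_text):
    """Map pattern event id -> matched input event ID for each result."""
    output = json.loads(output_text)
    return [dict(enumerate(result["MatchIDs"]))
            for result in output["MatchResults"]]


# Abbreviated from the example output above.
sample_output = """
{
  "PeakHeapSize": 522190848,
  "MatchResults": [{
    "EndTime": 1637269341652,
    "StartTime": 1637269279976,
    "MatchIDs": [6546604, 6204163]
  }],
  "NumResults": 1,
  "PeakPoolSize": 11
}
"""
for binding in pattern_event_bindings(sample_output):
    print(binding)
```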
- Hong-Wei Li (Research Center for Information Technology Innovation, Academia Sinica, Taiwan) g6_7893000@hotmail.com
- Ping-Ting Liu (Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan) xyfc128@gmail.com
- Bo-Wei Lin (Department of Computer Science, National Yang Ming Chiao Tung University, Taiwan) 0800680274united@gmail.com
- Yi-Chun Liao (Department of Computer Science and Information Engineering, National Taiwan University, Taiwan) lyck92@gmail.com
- Yennun Huang (Research Center for Information Technology Innovation, Academia Sinica, Taiwan) yennunhuang@citi.sinica.edu.tw
