Update the protocol inference test infra with Mongo changes (#1758)
Summary: Previously, the TShark command in the `dataset_generation`
script could not decode Mongo pcap files and insert them into the
dataset for evaluation. This PR adds a flag to the TShark command to
decode traffic on port 27017 as Mongo. The readme is also updated to
describe the bidirectional connection level dataset.
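
As a rough illustration of the flag described above, the sketch below
assembles a TShark argument list that ends with TShark's "decode as"
option, `-d tcp.port==27017,mongo`. The helper name `build_tshark_args`
and the handful of fields it selects are hypothetical and do not mirror
the real `gen_tshark_cmd` in `dataset_generation.py`; only the `-d` rule
is taken from this change.

```python
from typing import List


def build_tshark_args(pcap_path: str, mongo_port: int = 27017) -> List[str]:
    """Assemble a tshark invocation (hypothetical helper, not the script's code)."""
    return [
        "tshark",
        "-r", pcap_path,   # read packets from a capture file
        "-T", "fields",    # print only the fields selected with -e
        "-e", "tcp.srcport",
        "-e", "tcp.dstport",
        # "Decode as": dissect TCP traffic on this port as Mongo.
        "-d", f"tcp.port=={mongo_port},mongo",
    ]


if __name__ == "__main__":
    print(" ".join(build_tshark_args("capture.pcap")))
```

Building the command as a list rather than one long string keeps each
option separate, which makes it easy to add further `-d` rules later.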

**Updates to the confusion matrix**
In the previous image, the connections per protocol in the dataset seem
to have been duplicated, leading to an inflated number of connections
per protocol. This may have been due to the `dataset_generation` script
appending data to the `.tsv` files each time it was run, even though the
underlying pcap file content and counts had not changed.

Running the `dataset_generation` script on empty `.tsv` files with the
same pcap files, followed by the `eval` script, produced a matrix
showing far fewer connections per protocol, suggesting that there may
have been duplication in the dataset previously.
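
The suspected cause can be reproduced in miniature. The sketch below
assumes the generator opens its output in append mode, which is only a
hypothesis here since the script's file handling is not part of this
diff. The rows and helper names are made up for illustration.

```python
import csv
import os
import tempfile

ROWS = [("mongo", "payload-1"), ("tls", "payload-2")]  # made-up sample rows


def write_rows(path: str, mode: str) -> None:
    """Write ROWS to a TSV file; mode is "a" (append) or "w" (truncate)."""
    with open(path, mode, newline="") as f:
        csv.writer(f, delimiter="\t").writerows(ROWS)


def count_rows(path: str) -> int:
    """Count the rows currently stored in the TSV file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter="\t"))


def rows_after_runs(mode: str, runs: int) -> int:
    """Simulate running a generator `runs` times against the same output file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "conn_dataset.tsv")
        for _ in range(runs):
            write_rows(path, mode)
        return count_rows(path)


if __name__ == "__main__":
    print(rows_after_runs("a", 4))  # appending: rows accumulate across runs
    print(rows_after_runs("w", 4))  # truncating: only the last run's rows remain
```

Under this assumption, four runs in append mode leave four copies of
every row, while starting from an empty file each time leaves one.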

The connection counts for each protocol in the older dataset seem to be
4x or 8x the counts in the new dataset, which would explain why the
inference accuracy remained constant between the old and new matrices.
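
To see why uniform duplication leaves accuracy untouched, consider the
sketch below. It scales every row of a small confusion matrix by a
per-protocol factor and recomputes per-protocol accuracy. The counts
and factors are invented for illustration and are not taken from the
real dataset.

```python
from typing import Dict, Tuple

# (true_label, predicted_label) -> connection count
Confusion = Dict[Tuple[str, str], int]


def per_protocol_accuracy(matrix: Confusion) -> Dict[str, float]:
    """Fraction of each true protocol's connections that were labeled correctly."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for (truth, pred), n in matrix.items():
        totals[truth] = totals.get(truth, 0) + n
        if truth == pred:
            correct[truth] = correct.get(truth, 0) + n
    return {proto: correct.get(proto, 0) / totals[proto] for proto in totals}


def duplicate(matrix: Confusion, factors: Dict[str, int]) -> Confusion:
    """Model each protocol's connections appearing `factors[protocol]` times."""
    return {(truth, pred): n * factors[truth] for (truth, pred), n in matrix.items()}


if __name__ == "__main__":
    base = {("mongo", "mongo"): 90, ("mongo", "unknown"): 10, ("tls", "tls"): 50}
    inflated = duplicate(base, {"mongo": 4, "tls": 8})
    print(per_protocol_accuracy(base))
    print(per_protocol_accuracy(inflated))
```

Because the numerator and denominator for a protocol are multiplied by
the same factor, the ratio is unchanged even when different protocols
are duplicated a different number of times.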

The TLS connection count dropped in the new matrix by the previous
number of Mongo connections (432) because the new TShark command decodes
Mongo connections. The Mongo captures in the old dataset may have been
recorded during one of the early runs of the `dataset_generation`
script and not updated since.

**New Mongo additions**
In the old dataset, the Mongo pcap files were mainly of type `OP_QUERY`,
an opcode that Stirling does not currently process. More Mongo pcap
files of type `OP_MSG` were added to test the existing inference rule.
This resulted in 0.9% being mislabeled as `unknown` because request-side
data was missing from the connection and the existing rule does not
support response-side inference for `OP_MSG` packets. A further 0.7% was
mislabeled as `pgsql`, again because request-side data was missing from
the connection and the packet's opcode is one that Stirling does not
recognize.
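
For context on the opcodes mentioned above, the sketch below reads the
`opCode` field from the standard 16-byte MongoDB wire protocol header
(four little-endian int32 values: `messageLength`, `requestID`,
`responseTo`, `opCode`). It is not Stirling's inference rule. The helper
names are hypothetical, and `build_header` exists only to produce sample
bytes.

```python
import struct
from typing import Optional

# Opcode values defined by the MongoDB wire protocol.
OP_QUERY = 2004
OP_MSG = 2013
HEADER_LEN = 16  # four little-endian int32 fields


def mongo_opcode(buf: bytes) -> Optional[int]:
    """Return the opCode from a message header, or None if buf is too short."""
    if len(buf) < HEADER_LEN:
        return None
    _length, _request_id, _response_to, opcode = struct.unpack_from("<iiii", buf)
    return opcode


def build_header(opcode: int, body_len: int = 0,
                 request_id: int = 1, response_to: int = 0) -> bytes:
    """Pack a sample header for testing (hypothetical helper)."""
    return struct.pack("<iiii", HEADER_LEN + body_len, request_id, response_to, opcode)


if __name__ == "__main__":
    for name, op in (("OP_QUERY", OP_QUERY), ("OP_MSG", OP_MSG)):
        print(name, mongo_opcode(build_header(op)))
```

A classifier that only accepts a fixed set of opcodes on the request
side would, like the rule described above, fail to label a connection
whose captured packets carry an opcode outside that set or contain only
responses.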

Related issues: #640

Type of change: /kind test-infra

Test Plan: Ran the dataset generation and evaluation scripts with the
new TShark flag and verified the `.tsv` files were created appropriately
and the confusion matrix was as expected.

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
kpattaswamy authored Nov 2, 2023
1 parent b868d8c commit 05fb849
Showing 3 changed files with 10 additions and 2 deletions.
9 changes: 8 additions & 1 deletion src/stirling/protocol_inference/README.md
@@ -62,9 +62,16 @@ which is defined uniquely by `src_addr`, `dst_addr`, `src_port`, and `dst_port`.
  a series of packets in a connection. The goal is to evaluate if a connection is eventually correctly
  classified over a period over time.

+ #### bidirectional-connection-level dataset
+
+ One row in the bidirectional-connection-level dataset contains a series of packets over time in a bidrectional connection.
+ Packets on both directions of a connection are merged by their `src_addr`, `dst_addr`, `src_port`, and `dst_port` and grouped to
+ make the direction agnostic. This enables protocol inference on a series of packets in a bidirectional connection. The goal is
+ to evaluate if at least one side of a connection can be classified to infer the protocol of the entire bidirectional connection.
+
  ## Protocol Inference Eval

- There should be two tsv files `packet_dataset.tsv` and `conn_dataset.tsv` in the dataset folder.
+ There should be three tsv files `packet_dataset.tsv`, `conn_dataset.tsv` and `bi_dir_conn_dataset.tsv` in the dataset folder.
  Right now, available models are {ruleset_basic, ruleset_basic_conn}.
  ```shell script
  bazel run src/stirling/protocol_inference:eval -- --dataset <packet_dataset.tsv> --num_workers 8
3 changes: 2 additions & 1 deletion src/stirling/protocol_inference/dataset_generation.py
@@ -86,7 +86,8 @@ def gen_tshark_cmd():
  -e tcp.srcport \
  -e udp.srcport \
  -e tcp.dstport \
- -e udp.dstport"
+ -e udp.dstport \
+ -d tcp.port==27017,mongo"

  for protocol, spec in ProtocolParsingSpecs.items():
      field_name = spec["length_field"]