Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Update the protocol inference test infra with Mongo changes (#1758)
Summary: Previously, the TShark command in the `dataset_generation` script was not able to decode Mongo pcap files and insert them to the dataset for evaluation. This PR adds a flag to the TShark command to decode traffic running through port 27017 as Mongo. The readme is also updated to provide information about the bidirectional connection level dataset. **Updates to the confusion matrix** In the previous image, the connections per protocol in the dataset seem to have been duplicated leading to a large number of connections per protocol. This may have been due to the `dataset_generation` script appending data to the `.tsv` files each time it was ran even though the underlying pcap file content/counts not being altered. Running the `dataset_generation` script with empty `.tsv` files with the same pcap files followed by the `eval` script resulted in a matrix showing much fewer number of connections per protocol, suggesting that there may have been duplication in the dataset previously. The connection counts for each protocol in the older dataset seem to have increased by a factor of 4x or 8x the count as the new dataset and makes sense as to why the inference accuracy remained constant between the old/new matrix. The TLS connection count had dropped in the new matrix by the previous number of Mongo connections (432) due to the new TShark command decoding mongo connections. The Mongo captures may have been previously captured in one of the early iterations of running the `dataset_generation` script and not updated since in the old dataset. **New mongo additions** In the old dataset, the Mongo pcap files were mainly of type `OP_QUERY` which is an opcode that Stirling does not currently process. More mongo pcap files of type `OP_MSG` were added to test the existing inference rule and this resulted in 0.9% being mislabeled as `unknown` due to request side data missing from the connection and the existing rule not supporting response side inference for `OP_MSG` packets. 0.7% was mislabeled as `pgsql` due to request side data also missing from the connection and the opcode of the packet being one which is not is not recognizable by Stirling. Related issues: #640 Type of change: /kind test-infra Test Plan: Ran the dataset generation and evaluation scripts with the new TShark flag and verified the `.tsv` files were created appropriately and the confusion matrix was as expected. Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
- Loading branch information