Commit 54fae00

Adding Kafka streaming example script
This script implements flow data aggregation and streams the results to a Kafka topic in CBOR format. It can easily be changed to use a different format or to send different data. Consider it an example of what is possible and modify it to your needs.
1 parent 52f6a70 commit 54fae00

File tree

3 files changed (+245, −0)


kafka-streaming/README.md

Lines changed: 114 additions & 0 deletions
# Flow data to Kafka streaming

This script implements flow data aggregation and streams the results to a Kafka topic in CBOR format. It can easily be changed to use a different format or to send different data. Consider it an example of what is possible and modify it to your needs.

It has been tested with Flowmon system version 12.3.
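As a sketch of how the output format could be swapped: the script hands `KafkaProducer` a `value_serializer` callable that turns each record into bytes, so changing the format only means changing that callable. The JSON variant below is a hypothetical alternative, not part of the committed script:

```python
import json

# Hypothetical alternative serializer: JSON instead of CBOR.
# The committed script uses: value_serializer=lambda m: cbor.dumps(m)
# A JSON-producing equivalent would be:
serialize = lambda m: json.dumps(m).encode("utf-8")

# It would be passed to the producer as:
#   KafkaProducer(bootstrap_servers=..., value_serializer=serialize)
print(serialize({"dst_ctr": "203", "flows": "196"}))
```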

## Prerequisites

To run it you need to create a Python virtual environment on your Flowmon appliance and add the necessary libraries, which aren't present on the Flowmon system. Create the environment with:

    python3 -m venv kafka

Here `kafka` is the name of the virtual environment; if you use a different name, you will need to change it in the script as well. Next, change into the directory (using the name of the virtual environment created above), activate the environment, and install the libraries. The script then needs to be placed in this folder.

    cd kafka
    source bin/activate
    pip3 install kafka-python
    pip3 install cbor
## Using the script

The script is made to run every five minutes; you can add it to the Flowmon user's crontab by editing it with `crontab -e`. The script keeps its last processed timestamp in a file called `last`, creating the file if it doesn't exist.
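A crontab entry for this could look like the following. The script path is an assumption based on the virtual-environment location used in this README, and the broker host and topic are placeholders; adjust all three to your setup:

```
*/5 * * * * /home/flowmon/kafka/stream-flow.py -i <broker-host> -p 9092 -t network-metadata
```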
When you need to test the script multiple times, delete this file between runs; without it, the script rounds the current time down to the previous 5-minute interval and uses that timestamp for the analysis run by the nfdump console command.
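The round-down itself is simple datetime arithmetic; a minimal sketch of the behaviour (mirroring the script's `roundDownDateTime` function):

```python
import datetime

def round_down_5min(dt):
    # Drop the minutes past the last 5-minute boundary; seconds and
    # microseconds are discarded by reconstructing the datetime.
    return datetime.datetime(dt.year, dt.month, dt.day,
                             dt.hour, dt.minute - dt.minute % 5)

print(round_down_5min(datetime.datetime(2023, 11, 30, 11, 33, 42)))
# → 2023-11-30 11:30:00
```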

The command to get the aggregation is present in the function `get_data`:

    command = f"/usr/local/bin/nfdump -M /data/nfsen/profiles-data/live/'127-0-0-1_p3000:127-0-0-1_p2055' -r {timestamp} -A 'dstctry' -o 'fmt:%ts,%dcc,%td,%pkt,%byt,%pps,%bps,%fl' -6 --no-scale-number"

The result of running this command on the SSH command line could look like the following:

    Date first seen          Dst Ctry  Duration  Packets  Bytes  pps  bps  Flows
    2023-11-30 11:29:22.585, 203, 302.210, 760, 58348, 2, 1544, 196
    2023-11-30 11:29:39.261, 826, 271.966, 55, 4541, 0, 133, 15
    2023-11-30 11:30:08.502, 372, 227.322, 189, 81984, 0, 2885, 13
    2023-11-30 11:30:54.374, 250, 150.388, 351, 195125, 2, 10379, 22
    2023-11-30 11:27:04.546, 840, 468.700, 4592, 1172714, 9, 20016, 1486
    2023-11-30 11:30:06.511, 276, 200.593, 84, 10942, 0, 436, 5
    2023-11-30 11:32:00.974, 100, 0.000, 1, 76, 0, 0, 1
    2023-11-30 11:25:03.508, 0, 594.975, 829590, 893719438, 1394, 12016900, 14558
    2023-11-30 11:29:44.087, 528, 297.434, 676, 204732, 2, 5506, 42
    Summary: total flows: 16338, total bytes: 895447900, total packets: 836298, avg bps: 12040141, avg pps: 1405, avg bpp: 1070
    Time window: 2023-11-30 11:25:03 - 2023-11-30 11:35:00
    Total flows processed: 16338, Blocks skipped: 0, Bytes read: 5883516
    Sys: 0.028s flows/second: 569427.0 Wall: 0.010s flows/second: 1603966.2

The easiest way to get the command for aggregation is to run the query in the Monitoring Center Analysis until you get the results you are after. Do not forget to select all fields you want to use for aggregation, filter the data (if needed), select the proper output format, and limit the number of results to those which interest you. Also select the right profile and channels from which you want to get the data.

![Monitoring Center Analysis query](media/image1.png)

Once you click on the black terminal window icon, it will give you the statistics command. It will look like the example above, so you can replace the command between the quotes. Just change `-R` to `-r {timestamp}` as in the example, so that the timestamp of the analyzed data changes with each run.

When you modify the command, you will also need to modify the function `process_records`, as the records will have a different format based on the output fields you selected.
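As a hypothetical illustration of such a modification, suppose the aggregation were switched to source IP addresses (`-A 'srcip'` with output `'fmt:%ts,%sa,%byt,%fl'`); the parser would then index the comma-separated fields accordingly. The field names and format string here are assumptions, not the committed configuration:

```python
# Hypothetical adaptation of process_records for an aggregation by
# source IP ('srcip') with output format 'fmt:%ts,%sa,%byt,%fl'.
def process_records(data):
    records = []
    lines = data.splitlines()
    lines.pop(0)   # column header
    lines.pop()    # nfdump processing stats ("Sys: ...")
    lines.pop()    # "Total flows processed: ..."
    lines.pop()    # time window line
    lines.pop()    # summary line
    for line in lines:
        rows = line.split(',')
        records.append({'first_seen': rows[0],
                        'src_ip': rows[1].strip(),
                        'bytes': rows[2].strip(),
                        'flows': rows[3].strip()})
    return records

# Illustrative sample in the same layout as real nfdump output
sample = "\n".join([
    "Date first seen, Src IP Addr, Bytes, Flows",
    "2023-11-30 11:29:22.585, 192.0.2.10, 58348, 196",
    "2023-11-30 11:29:39.261, 198.51.100.7, 4541, 15",
    "Summary: total flows: 211, total bytes: 62889",
    "Time window: 2023-11-30 11:25:03 - 2023-11-30 11:35:00",
    "Total flows processed: 211, Blocks skipped: 0, Bytes read: 1024",
    "Sys: 0.002s flows/second: 105500.0 Wall: 0.001s",
])
print(process_records(sample))
```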

The script supports three arguments:

    -i HOST, --host HOST     IP address/hostname of the bootstrap server
    -p PORT, --port PORT     Port of the running bootstrap server
    -t TOPIC, --topic TOPIC  Kafka topic to stream to
There is a log file located in the script folder (by default kafka/kafka-stream.log) which can help you with troubleshooting. The appliance does require a connection from its external IP to the configured bootstrap Kafka server so that it can connect and send data to the specified topic.

kafka-streaming/media/analysis.png

43.8 KB

kafka-streaming/stream-flow.py

Lines changed: 131 additions & 0 deletions
#!/home/flowmon/kafka/bin/python3
# -*- coding: utf-8 -*-
"""
Aggregate flow data with nfdump and stream the results to Kafka.
"""

import argparse
import datetime
import logging
import shlex
import subprocess

import cbor
from kafka import KafkaProducer

LOGGING_FORMAT = '%(asctime)s - %(module)s - %(levelname)s : %(message)s'
logging.basicConfig(filename='/home/flowmon/kafka/kafka-stream.log', format=LOGGING_FORMAT, level=logging.DEBUG)

def parse_arguments():
    parser = argparse.ArgumentParser(prog='stream-flow.py')
    parser.add_argument("-i", "--host", action='store', type=str, help="IP address/hostname of the bootstrap server", required=True)
    parser.add_argument("-p", "--port", action='store', type=int, help="Port of the running bootstrap server", default=9092)
    parser.add_argument("-t", "--topic", action='store', type=str, help="Kafka topic to stream to", default='network-metadata')
    arguments = vars(parser.parse_args())
    return arguments

def roundDownDateTime(dt):
    # Round down to the previous 5-minute boundary
    delta_min = dt.minute % 5
    return datetime.datetime(dt.year, dt.month, dt.day,
                             dt.hour, dt.minute - delta_min)

def run_command(command_line):
    arguments = shlex.split(command_line)
    p = subprocess.Popen(arguments, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.PIPE)
    std_data = p.communicate()
    output = (p.returncode, std_data[0].decode("utf8"), std_data[1].decode("utf8"))
    return output

def process_records(data):
    records = []
    lines = data.splitlines()
    # Get rid of the header line
    lines.pop(0)
    # Get the processing statistics and send them to the log
    sysstat = lines.pop()
    logging.debug(f"nfdump processing stats: {sysstat}")
    # Number of processed flows
    totals = lines.pop()
    logging.debug(totals)
    # Time window information we just drop
    lines.pop()
    # Same for the summary stats
    lines.pop()
    # Now only the data lines remain
    for line in lines:
        rows = line.split(',')
        record = {'first_seen': rows[0],
                  'dst_ctr': rows[1].strip(),
                  'duration': rows[2].strip(),
                  'packets': rows[3].strip(),
                  'bytes': rows[4].strip(),
                  'pps': rows[5].strip(),
                  'bps': rows[6].strip(),
                  'flows': rows[7].strip()}
        records.append(record)
    return records

def get_data(timestamp):
    # Get the data from the collector
    command = f"/usr/local/bin/nfdump -M /data/nfsen/profiles-data/live/'127-0-0-1_p3000:127-0-0-1_p2055' -r {timestamp} -A 'dstctry' -o 'fmt:%ts,%dcc,%td,%pkt,%byt,%pps,%bps,%fl' -6 --no-scale-number"
    logging.debug(command)
    output = run_command(command)
    if output[0] == 0:
        logging.debug("Command processed successfully.")
        return output[1]
    else:
        logging.error(output)
        return None

def on_success(metadata):
    logging.info(f"Message produced to topic '{metadata.topic}' at offset {metadata.offset}")

def on_error(e):
    logging.error(f"Error sending message: {e}")

def get_timestamp():
    try:
        file = open('/home/flowmon/kafka/last', 'r')
        datestamp = file.read()
        file.close()
        dateobj = datetime.datetime.strptime(datestamp, "%Y%m%d%H%M")
        dateob_5 = dateobj + datetime.timedelta(minutes=5)
        # Persist the new timestamp so the next run advances again
        file = open('/home/flowmon/kafka/last', 'w')
        file.write(dateob_5.strftime("%Y%m%d%H%M"))
        file.close()
        return dateob_5
    except IOError:
        # No 'last' file yet: start from the previous 5-minute interval
        current = datetime.datetime.now()
        file = open('/home/flowmon/kafka/last', 'w+')
        dateob_5 = current - datetime.timedelta(minutes=5)
        rounded = roundDownDateTime(dateob_5)
        str_time = rounded.strftime("%Y%m%d%H%M")
        file.write(str_time)
        file.close()
        return rounded

def kafka_stream(args, records):
    producer = KafkaProducer(bootstrap_servers=f"{args['host']}:{args['port']}",
                             value_serializer=lambda m: cbor.dumps(m))

    for record in records:
        stream = producer.send(args['topic'], record)
        stream.add_callback(on_success)
        stream.add_errback(on_error)

    producer.flush()
    producer.close()

def main():
    logging.info('------- New run -------')
    args = parse_arguments()
    timestamp = get_timestamp()
    str_time = timestamp.strftime("%Y%m%d%H%M")
    full_path = timestamp.strftime("%Y/%m/%d/") + "nfcapd." + str_time
    logging.debug('Processing {}'.format(full_path))
    data = get_data(full_path)
    if data is None:
        logging.error('No data returned by nfdump, aborting this run')
        return
    records = process_records(data)
    kafka_stream(args, records)

    logging.info('Everything is done')

if __name__ == "__main__":
    main()
