
Streaming Text Files to Kafka

Ahmed Abdul Hamid edited this page Dec 18, 2019 · 19 revisions

Overview

In this use case, we create Brooklin datastreams to publish text file contents to a locally deployed instance of Kafka.

Summary

Prerequisites

Brooklin requires Java Development Kit 8+. Here are some options:
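Whichever JDK you install, it can help to confirm which Java version is on your PATH before starting Brooklin. The following is a minimal sketch, not part of the Brooklin distribution: `java_major` is a hypothetical helper that reduces a version string to its major number, on the assumption that `java -version` prints a quoted version string such as "1.8.0_292" or "11.0.2".

```shell
# Hypothetical helper: reduce a Java version string to its major number.
# Legacy strings start with "1." (1.8.0_292 -> 8); newer ones do not (11.0.2 -> 11).
java_major() {
  case "$1" in
    1.*) set -- "${1#1.}" ;;      # drop the legacy "1." prefix
  esac
  printf '%s\n' "${1%%[._+-]*}"   # keep everything before the first separator
}

# Check the Java executable on the PATH, if there is one.
# Note: java -version writes its report to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  version=$(java -version 2>&1 | awk -F'"' '/version/ { print $2; exit }')
  major=$(java_major "$version")
  if [ "${major:-0}" -ge 8 ] 2>/dev/null; then
    echo "Found Java $version (major $major): meets the 8+ requirement"
  else
    echo "Found Java '$version': Brooklin requires JDK 8+" >&2
  fi
else
  echo "No java executable found on PATH" >&2
fi
```

This only inspects the first `java` found on the PATH; it does not distinguish a full JDK from a runtime-only installation.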

Instructions

1. Set up Kafka

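  1. Download the latest Kafka tarball and untar it. The commands below assume kafka_2.12-2.2.0; adjust the file and directory names to match the version you downloaded.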
    tar -xzf kafka_2.12-2.2.0.tgz
    cd kafka_2.12-2.2.0
  2. Start a ZooKeeper server
    bin/zookeeper-server-start.sh config/zookeeper.properties >/dev/null &
  3. Start a Kafka server
    bin/kafka-server-start.sh config/server.properties >/dev/null &
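Both servers above start in the background with their output discarded, so nothing signals when they are ready to accept connections. A minimal sketch of a retry helper follows. `wait_until` is a hypothetical function, and the usage lines assume that `nc` (netcat) with `-z` support is installed, that ZooKeeper listens on port 2181 (the value in the stock config/zookeeper.properties), and that Kafka listens on port 9092.

```shell
# Hypothetical helper: retry a command once per second until it succeeds
# or the given number of attempts is used up.
#   $1 = maximum number of attempts, remaining arguments = command to run
wait_until() {
  local tries="$1"
  shift
  while [ "$tries" -gt 0 ]; do
    if "$@"; then
      return 0
    fi
    tries=$((tries - 1))
    if [ "$tries" -gt 0 ]; then
      sleep 1
    fi
  done
  return 1
}

# Example usage (assumes nc is available and the default ports are unchanged):
#   wait_until 30 nc -z localhost 2181 && echo "ZooKeeper is accepting connections"
#   wait_until 30 nc -z localhost 9092 && echo "Kafka is accepting connections"
```

An open port only shows that the process is listening; it is a rough readiness signal rather than a guarantee that the broker has finished starting.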

2. Set up Brooklin

  1. Download the latest tarball (tgz) from Brooklin releases.
  2. Untar the Brooklin tarball. The commands below assume brooklin-1.0.0; adjust the names to match the release you downloaded.
    tar -xzf brooklin-1.0.0.tgz
    cd brooklin-1.0.0
  3. Run Brooklin
    bin/brooklin-server-start.sh config/server.properties >/dev/null 2>&1 &
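Because the server runs in the background with both stdout and stderr discarded, a failed start is silent. One way to check that it is reachable is to request the health monitoring endpoint of the Datastream Management Service. This is a sketch under assumptions: `http_status` and `brooklin_is_up` are hypothetical helpers, and treating HTTP 200 as "up" is an assumption rather than documented Brooklin behavior.

```shell
# Hypothetical helper: print the HTTP status code returned for a URL.
# curl prints 000 when it gets no HTTP response at all.
http_status() {
  curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Hypothetical helper: succeed if the health endpoint answers with HTTP 200.
#   $1 = base URI of the Datastream Management Service
brooklin_is_up() {
  [ "$(http_status "${1%/}/health")" = "200" ]
}

# Example usage:
#   if brooklin_is_up http://localhost:32311/; then
#     echo "Brooklin is reachable"
#   else
#     echo "Brooklin is not answering yet" >&2
#   fi
```

If the check fails, rerunning the start command without the output redirection shows the server log on the terminal.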

3. Create a datastream

  1. Create a datastream to stream the contents of any file of your choice to Kafka.

    # Replace NOTICE below with a file path of your choice or leave it as 
    # is if you would like to use the NOTICE file as an example text file
    bin/brooklin-rest-client.sh -o CREATE -u http://localhost:32311/ -n first-file-datastream -s NOTICE -c file -p 1 -t kafkaTransportProvider -m '{"owner":"test-user"}'

    Here are the options we used to create this datastream:

    -o CREATE                      The operation is datastream creation
    -u http://localhost:32311/     Datastream Management Service URI
    -n first-file-datastream       Datastream name
    -s NOTICE                      Datastream source URI (source file path in this case)
    -c file                        Connector name ("file" refers to FileConnector)
    -p 1                           Number of source partitions
    -t kafkaTransportProvider      Transport provider name
    -m '{"owner":"test-user"}'     Datastream metadata (specifying datastream owner is mandatory)
    
  2. Verify the datastream creation by requesting all datastream metadata from Brooklin using the command line REST client.

    bin/brooklin-rest-client.sh -o READALL -u http://localhost:32311/
  3. You can also view the streaming progress by querying the diagnostics REST endpoint of the Datastream Management Service.

    curl -s "http://localhost:32311/diag?scope=file&type=connector&q=status&content=position?"
  4. You can also inspect the state of the individual Datastreams and DatastreamTasks by querying the health monitoring REST endpoint of the Datastream Management Service.

    curl -s "http://localhost:32311/health"
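The creation command from step 1 can be wrapped in a small function so that the datastream name and source path become parameters. This is a sketch only: `create_file_datastream`, `BROOKLIN_CLIENT`, and `DMS_URI` are hypothetical names introduced here, while every flag and value passed to the client is copied from the command shown in step 1.

```shell
# Hypothetical settings; the defaults match the commands used in this walkthrough.
BROOKLIN_CLIENT="${BROOKLIN_CLIENT:-bin/brooklin-rest-client.sh}"
DMS_URI="${DMS_URI:-http://localhost:32311/}"

# Create a single-partition file datastream that publishes to Kafka.
#   $1 = datastream name, $2 = source file path
create_file_datastream() {
  if [ "$#" -ne 2 ]; then
    echo "usage: create_file_datastream <name> <file>" >&2
    return 2
  fi
  "$BROOKLIN_CLIENT" -o CREATE -u "$DMS_URI" \
    -n "$1" -s "$2" \
    -c file -p 1 -t kafkaTransportProvider \
    -m '{"owner":"test-user"}'
}

# Example usage, run from the Brooklin directory:
#   create_file_datastream first-file-datastream NOTICE
```

Keeping the connector, partition count, and transport provider fixed mirrors the example above; they could be turned into parameters in the same way.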

4. Verify the data transfer to Kafka

  1. Verify a Kafka topic has been created to hold the data of your newly created datastream. The topic name will have the datastream name (i.e. first-file-datastream) as a prefix.

    # Adjust the path if kafka_2.12-2.2.0 is not under your current directory
    cd kafka_2.12-2.2.0
    bin/kafka-topics.sh --list --bootstrap-server localhost:9092
  2. Print the Kafka topic contents

    # Replace <topic-name> below with the name of the Kafka topic listed by the previous command
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic <topic-name> --from-beginning
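Since the topic name starts with the datastream name, the lookup in the two steps above can be scripted instead of copied by hand. A minimal sketch follows; `topics_with_prefix` is a hypothetical helper, and the usage lines assume the first matching topic is the one you want.

```shell
# Hypothetical helper: print the lines on stdin that start with the literal
# prefix given as $1 (the prefix is not treated as a regular expression).
topics_with_prefix() {
  awk -v prefix="$1" 'index($0, prefix) == 1'
}

# Example usage, run from the Kafka directory:
#   topic=$(bin/kafka-topics.sh --list --bootstrap-server localhost:9092 \
#             | topics_with_prefix first-file-datastream | head -n 1)
#   bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
#     --topic "$topic" --from-beginning
```

If several datastreams share a name prefix, more than one topic will match, so it is worth printing the filtered list before consuming from it.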

5. Create more datastreams

Feel free to create more datastreams to publish more files to Kafka.

  • If you wish to delete the datastream you created, you can do so by running:

    bin/brooklin-rest-client.sh -o DELETE -u http://localhost:32311/ -n first-file-datastream
  • You can also explore the various operations you can perform on datastreams using the REST client utility.

    bin/brooklin-rest-client.sh --help

6. Stop Brooklin, Kafka, and ZooKeeper

When you are done, run the following commands to stop Brooklin, Kafka, and ZooKeeper. Run each cd from the directory that contains the corresponding installation.

cd brooklin-1.0.0
bin/brooklin-server-stop.sh

cd kafka_2.12-2.2.0
bin/kafka-server-stop.sh
bin/zookeeper-server-stop.sh