TwitterTrends

This streaming project obtains Twitter's current trending topics and shows them on Google Maps, like this:

twitterTrends

The project consists of the following 4 processes (each one is a Jupyter Notebook). Here's a brief description; detailed information can be found inside each notebook:

Requirements

Starting Kafka and MongoDB servers

To run TwitterTrends-2-FileToKafka we need to start the Kafka server. For TwitterTrends-3-KafkaToMongoDB we need to start both the Kafka server and the MongoDB server.

Here are some pointers on how to do it:

Starting Kafka server

First we need to add the Kafka binaries directory to our system PATH and then run the following commands from the Kafka directory:

  1. Start a ZooKeeper instance: zookeeper-server-start.bat config/zookeeper.properties
  2. Start the Kafka server: kafka-server-start.bat config/server.properties
  3. Create a topic (first run only): kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic tweeterTopic

A detailed explanation, as well as the equivalent Unix commands, can be found in the Kafka Quickstart.

Starting MongoDB server

We need to add the MongoDB binaries directory to our system PATH and then type mongod in a command line.

Once it has started, if we want a GUI for MongoDB, we can use MongoDB Compass.
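
Before running the notebooks, it can help to confirm that the servers are actually accepting connections. The following is a minimal sketch, not part of the project: `ServerCheck`, `isListening` and `report` are hypothetical names, and the ports are the defaults (2181 for ZooKeeper, 9092 for Kafka, 27017 for MongoDB), which assumes none of them were changed in the configuration files. It only tests that a TCP connection can be opened; it does not verify that the topic exists.

```scala
import java.io.IOException
import java.net.{InetSocketAddress, Socket}

object ServerCheck {
  // Returns true if a TCP connection to host:port succeeds within timeoutMs.
  def isListening(host: String, port: Int, timeoutMs: Int = 1000): Boolean = {
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(host, port), timeoutMs)
      true
    } catch {
      case _: IOException => false // refused, timed out, or unknown host
    } finally {
      socket.close()
    }
  }

  // Default ports (assumption: the shipped config files were not modified).
  val services: Seq[(String, Int)] = Seq(
    "ZooKeeper" -> 2181,
    "Kafka"     -> 9092,
    "MongoDB"   -> 27017
  )

  // One (service name, reachable?) pair per service.
  def report(host: String = "localhost"): Seq[(String, Boolean)] =
    services.map { case (name, port) => name -> isListening(host, port) }
}
```

In a notebook cell, `ServerCheck.report().foreach(println)` prints one pair per service, so a `false` entry points at the server that still needs to be started.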

Some things to consider

  1. Jupyter Scala: Scala kernel for Jupyter.

    There's no need to use Jupyter (we can use our favourite Scala development environment) but in order to explain the project in a more interactive way I prefer to use it.

    If we plan to use Jupyter Scala, we have to keep in mind that dependency management (adding external libraries) works differently in a Jupyter Scala notebook than in a standard Scala IDE/IntelliJ IDEA project that uses SBT (Simple Build Tool). For example:

    In a standard Scala IDE/IntelliJ IDEA project that uses SBT, we manage libraries by adding the following line to our build.sbt file:

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"

    In a Jupyter Scala notebook, we manage libraries by executing the following statement in the notebook:

    import $ivy.`org.apache.spark::spark-sql:2.2.0`

    In either case we need to load the libraries as we normally do, for example: import org.apache.spark
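
    To make the correspondence between the two notations concrete, here is a sketch of a minimal build.sbt. The project name and the Scala version are assumptions for illustration (Spark 2.2.0 artifacts are published for Scala 2.11); only the spark-sql line comes from this README. In SBT, `%%` appends the Scala binary version to the artifact name, and the double colon `::` in the `$ivy` import plays the same role in the notebook.

    ```scala
    // build.sbt (sketch; name and scalaVersion are assumptions)
    name := "twitter-trends"      // hypothetical project name
    scalaVersion := "2.11.12"     // assumed; must be a version Spark 2.2.0 is built for

    // %% appends the Scala binary version, so this resolves to spark-sql_2.11:2.2.0
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.2.0"
    ```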


  2. Spark's logging level: By default, a newly created Spark session logs at every level (including INFO). We can change this by setting our desired logging level.

    To do this, we copy the log4j.properties.template file included in our spark/conf folder to spark/conf/log4j.properties and change the following line:

    log4j.rootCategory=INFO, console

    to

    log4j.rootCategory=WARN, console

    Now we can import the log4j library and point it at our properties file. For example:

    import org.apache.log4j.PropertyConfigurator
    PropertyConfigurator.configure("C:/spark/conf/log4j.properties")
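
As an alternative to the properties file (a different technique from the one above, shown only as a sketch), the level can also be set on the running session with `SparkContext.setLogLevel`. The application name and the `local[*]` master are assumptions for illustration, and the cell assumes the spark-sql dependency has already been loaded as described in the Jupyter Scala item. Messages emitted while the session is starting, before this call runs, still appear at the default level.

```scala
import org.apache.spark.sql.SparkSession

// Create (or reuse) a session; appName and master are assumptions for a local run.
val spark = SparkSession.builder()
  .appName("TwitterTrends")   // hypothetical application name
  .master("local[*]")         // assumption: run Spark locally on all cores
  .getOrCreate()

// Only WARN and more severe messages are logged from this point on.
spark.sparkContext.setLogLevel("WARN")
```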

Pending tasks