Project/README.md
25 additions & 17 deletions
@@ -1,41 +1,49 @@
1
1
# Run the code with Docker container
2
-
If you want to run the Jupyter notebooks using Docker container, you can download this repository onto your local machine. And run `$docker run -it -v yourPathToDownloadedFolder:/home/jovyan -p 8888:8888 xianlai/spark_project`. Then it will map the folder onto the container and you can access the files inside container.
2
+
If you want to run the Jupyter notebooks in a Docker container, download this repository to your local machine and run `$ docker run -it -v /your/Path/To/Downloaded/Folder:/home/jovyan -p 8888:8888 xianlai/spark_project`. This maps the folder into the container so that you can access the files from inside it.
3
3
4
4
# Project Outline
5
5
6
-
## Create a Twitter Stream and send tweets to Spark
7
-
Firstly we use tweepy API to pull tweets streams regard tracks #MAGA and #resist. Then the stream is directed into Spark Streaming through the TCP socket.
6
+
The main project file is the IPython notebook [Tweets_analysis_SparkStreaming.ipynb](./Tweets_analysis_SparkStreaming.ipynb).
8
7
9
-
We set up the Spark context in local mode with 3 CPU's running simulating 3 different machines. And build a Spark streaming context based on Spark context and set the time interval to 5 seconds. So the incoming tweets will be collect into 1 RDD every 5 seconds.
8
+
To receive tweets, you will need to add your own Twitter API keys to the [TweetsListener.py](./TweetsListener.py) file. The variables that need to be set are at the top of the file and are as follows:
10
9
```python
11
-
conf = SparkConf().setMaster('local[3]')
12
-
sc = SparkContext(conf=conf)
13
-
ssc = StreamingContext(sc, 5)
10
+
consumer_key = None  # Replace with your consumer key
11
+
consumer_secret = None  # Replace with your consumer secret
12
+
access_token = None  # Replace with your access token
13
+
access_secret = None  # Replace with your access secret
14
14
```
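If you would rather not write the keys into the file, one option is to read them from environment variables. The following is a minimal sketch, not part of the project: the helper `load_twitter_credentials` and the environment variable names are hypothetical and only illustrate the idea.

```python
import os

# Hypothetical environment variable names (assumptions, not defined by the project).
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_twitter_credentials(environ=None):
    """Return the four credentials as a dict, raising if any is missing or empty."""
    environ = os.environ if environ is None else environ
    missing = [env for env in ENV_NAMES.values() if not environ.get(env)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(sorted(missing)))
    return {name: environ[env] for name, env in ENV_NAMES.items()}
```

The returned values could then be assigned to `consumer_key`, `consumer_secret`, `access_token` and `access_secret` in place of `None`.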
15
15
16
-
We build an app called TweetRead.py to pull tweet streaming from Twitter by using library called Tweepy and use socket to send streaming into Spark Streaming
16
+
## Startup, Twitter Stream, and Send Tweets to Spark
17
+
First, we use the Tweepy API to pull a stream of tweets containing the words 'MAGA' or 'resist'.
17
18
19
+
We built a module called TweetRead.py containing a class that uses the Tweepy API to stream tweets containing the words 'MAGA' or 'resist' from Twitter. The stream is then directed into Spark Streaming through a TCP socket.
18
20
```python
19
-
host ="localhost"# Get local machine name
21
+
host = "172.17.0.2"  # Address to bind to (a hard-coded IP, not the machine name)
20
22
port = 5555  # Reserve a port for your service.
21
23
s = socket.socket() # Create a socket object
22
24
s.bind((host, port)) # Bind to the port
23
25
s.listen(5) # Now wait for client connection.
24
26
c, addr = s.accept() # Establish connection with client.
25
-
sendData( c,['maga', 'resist'] ) #Function to send data
27
+
sendData(c, ['maga', 'resist'])  # Start receiving tweets from Twitter and send them to Spark
26
28
```
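The body of `sendData` is not shown here. As a rough illustration of the socket side only, the sketch below sends a few hard-coded strings instead of real tweets. The helpers `send_lines`, `serve_once` and `demo` are hypothetical names, and the loopback address with an OS-chosen port is an assumption made so the example runs anywhere. Each text is sent as one newline-terminated UTF-8 line, which is the format `socketTextStream` expects.

```python
import socket
import threading


def send_lines(conn, texts):
    """Send each text as one UTF-8, newline-terminated line (hypothetical helper)."""
    for text in texts:
        # Collapse embedded whitespace and newlines so one tweet stays on one line.
        line = " ".join(text.split())
        conn.sendall((line + "\n").encode("utf-8"))


def serve_once(server, texts):
    """Accept a single client, send the texts, then close the connection."""
    conn, _addr = server.accept()
    with conn:
        send_lines(conn, texts)


def demo():
    """Run the server in a thread and read the lines back with a plain client."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    texts = ["#MAGA first\ntweet", "second #resist tweet"]
    worker = threading.Thread(target=serve_once, args=(server, texts))
    worker.start()
    with socket.create_connection(("127.0.0.1", port)) as client:
        with client.makefile("r", encoding="utf-8") as reader:
            received = reader.read()  # reads until the server closes the connection
    worker.join()
    server.close()
    return received.splitlines()
```

In the project, the client end of this connection is Spark Streaming rather than the plain socket client used in `demo`.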
27
-
We use Streaming Context API socketTextStream to receive tweet Streaming through port and transfer into DStream, which is the Streaming of RDD
29
+
30
+
We set up the Spark context in local mode with 3 worker threads, allowing up to 3 tasks to run in parallel. The Spark streaming context is built on top of the Spark context with a batch interval of 5 seconds. Thus, incoming tweets are collected into a new batch of the DStream every 5 seconds.
31
+
```python
32
+
conf = SparkConf().setMaster('local[3]')
33
+
sc = SparkContext(conf=conf)
34
+
ssc = StreamingContext(sc, 5)
35
+
```
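To make the effect of the 5-second batch interval concrete, here is a small pure-Python sketch that groups timestamped items into fixed-width batches. It only imitates the idea and is not how Spark implements it. The function `batch_by_interval` is a hypothetical name, and unlike Spark it skips intervals in which nothing arrived.

```python
from collections import defaultdict


def batch_by_interval(events, interval=5):
    """Group (timestamp_seconds, text) pairs into consecutive fixed-width batches.

    Intervals with no events are omitted from the result.
    """
    batches = defaultdict(list)
    for timestamp, text in events:
        # Events whose timestamps fall in the same interval share a batch index.
        batches[int(timestamp // interval)].append(text)
    return [batches[index] for index in sorted(batches)]
```

For example, items arriving at 0.5 s and 4.9 s end up in the same batch, while an item arriving at 5.0 s starts the next one.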
36
+
37
+
On the Spark side, we use the StreamingContext method `socketTextStream` to receive the tweet stream through the previously initialized socket and turn it into a DStream.
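A minimal sketch of what the receiving side could look like is given below. It assumes that `ssc` is the `StreamingContext` created above and that the host and port match the values used when binding the socket. The helpers `extract_tags` and `build_tag_counts` are hypothetical and are not taken from the notebook.

```python
def extract_tags(line, tracked=("maga", "resist")):
    """Return the tracked words that occur in a tweet line (case-insensitive)."""
    lowered = line.lower()
    return [word for word in tracked if word in lowered]


def build_tag_counts(ssc, host, port):
    """Wire a socket text stream into per-batch counts of the tracked words."""
    lines = ssc.socketTextStream(host, port)  # DStream of newline-delimited text
    counts = (
        lines.flatMap(extract_tags)           # one record per tracked word found
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)      # counts within each batch
    )
    counts.pprint()  # print the first elements of every batch
    return counts
```

After calling `build_tag_counts(ssc, "172.17.0.2", 5555)`, the stream would be started with `ssc.start()` followed by `ssc.awaitTermination()`.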