Skip to content

Commit 880d465

Browse files
authored
Fixed typos, added api key mention
1 parent f847fb6 commit 880d465

File tree

1 file changed

+25
-17
lines changed

1 file changed

+25
-17
lines changed

Project/README.md

Lines changed: 25 additions & 17 deletions
Original file line numberDiff line numberDiff line change
@@ -1,41 +1,49 @@
11
# Run the code with Docker container
2-
If you want to run the Jupyter notebooks using Docker container, you can download this repository onto your local machine. And run `$docker run -it -v yourPathToDownloadedFolder:/home/jovyan -p 8888:8888 xianlai/spark_project`. Then it will map the folder onto the container and you can access the files inside container.
2+
If you want to run the Jupyter notebooks using Docker container, you can download this repository onto your local machine. And run `$docker run -it -v /your/Path/To/Downloaded/Folder:/home/jovyan -p 8888:8888 xianlai/spark_project`. Then it will map the folder onto the container and you can access the files inside container.
33

44
# Project Outline
55

6-
## Create a Twitter Stream and send tweets to Spark
7-
Firstly we use tweepy API to pull tweets streams regard tracks #MAGA and #resist. Then the stream is directed into Spark Streaming through the TCP socket.
6+
The main project file is contained in the ipython notebook [Tweets_analysis_SparkStreaming.ipynb](./Tweets_analysis_SparkStreaming.ipynb)
87

9-
We set up the Spark context in local mode with 3 CPU's running simulating 3 different machines. And build a Spark streaming context based on Spark context and set the time interval to 5 seconds. So the incoming tweets will be collect into 1 RDD every 5 seconds.
8+
In order to receive Tweets, you will need to add your own Twitter API keys in the [TweetsListener.py](./TweetsListener.py) file. The variables that need to be set are are the top and are as follows:
109
```python
11-
conf = SparkConf().setMaster('local[3]')
12-
sc = SparkContext(conf=conf)
13-
ssc = StreamingContext(sc, 5)
10+
consumer_key = None # Replace with your Consumer key
11+
consumer_secret = None # Replace with your Consumer secret
12+
access_token = None # Replace with your acces token
13+
access_secret = None # Replace with your access secret
1414
```
1515

16-
We build an app called TweetRead.py to pull tweet streaming from Twitter by using library called Tweepy and use socket to send streaming into Spark Streaming
16+
## Startup, Twitter Stream, and Send Tweets to Spark
17+
Firstly we use Tweepy API to pull tweets streams containg the words 'MAGA' or 'resist'.
1718

19+
We built a module called TweetRead.py containing a class to pull tweets streaming from Twitter containging the words 'MAGA' or 'resist' by using the Tweepy API. Then the stream is directed into Spark Streaming through a TCP socket.
1820
```python
19-
host = "localhost" # Get local machine name
21+
host = "172.17.0.2" # Get local machine name
2022
port = 5555 # Reserve a port for your service.
2123
s = socket.socket() # Create a socket object
2224
s.bind((host, port)) # Bind to the port
2325
s.listen(5) # Now wait for client connection.
2426
c, addr = s.accept() # Establish connection with client.
25-
sendData( c,['maga', 'resist'] ) #Function to send data
27+
sendData( c,['maga', 'resist'] ) #Function to start receiving data from Twitter, and send them to Spark
2628
```
27-
We use Streaming Context API socketTextStream to receive tweet Streaming through port and transfer into DStream, which is the Streaming of RDD
29+
30+
We set up the Spark context in local mode with 3 thread's allowing up to 3 simultaneous processes. The Spark streaming context is initialized based on Spark context and set the update time interval to 5 seconds. Thus, incoming tweets will be added to a DStream every 5 seconds.
31+
```python
32+
conf = SparkConf().setMaster('local[3]')
33+
sc = SparkContext(conf=conf)
34+
ssc = StreamingContext(sc, 5)
35+
```
36+
37+
On the Spark side we use the Streaming Context method `socketTextStream` to receive tweet Streaming through the previously initialized socket, and transfer into DStream.
2838
```python
2939
raw_tweets = ssc.socketTextStream('localhost',5555)
3040
```
3141

3242
## Clean Tweets
3343

34-
The tweets come in as a DStream object, which can be thought of as a list of strings, each one corresponding to a single tweet.
35-
```python
36-
raw_tweets = ssc.socketTextStream('172.17.0.2',5555)
37-
```
38-
Next, the `raw_tweets` are written out to a log file to be accessed for any future analysis.
44+
`raw_tweets`, from above, cna be though of as a list of strings, each one corresponding to a single tweet.
45+
46+
First, the `raw_tweets` are written out to a log file to be accessed for any future analysis.
3947
```python
4048
def writeRDD(rdd):
4149
global logs
@@ -45,7 +53,7 @@ def writeRDD(rdd):
4553
raw_tweets.foreachRDD(writeRDD)
4654
```
4755

48-
The following line pre-processes the incoming tweets by applying an operation to each RDD in the DStream:
56+
Then, we pre-processes the incoming tweets by applying an operation to each RDD in the DStream:
4957
1. Split the string by whitespace
5058
```
5159
clean_tweets = raw_tweets\

0 commit comments

Comments
 (0)