Finalize Streaming, Cleaning, and Visualization

cmhulbert · web-flow · commit 0a43433da7df · 2017-12-20T15:11:26.000-05:00
diff --git a/Project/README.md b/Project/README.md
@@ -23,16 +23,91 @@ We use Streaming Context API socketTextStream to receive tweet Streaming through
 raw_tweets = ssc.socketTextStream('localhost',5555)
 ```
 
+## Clean Tweets
 
+The tweets arrive as a DStream object: a sequence of RDDs, one per batch interval, in which each element is a string corresponding to a single tweet.
+```python
+raw_tweets = ssc.socketTextStream('172.17.0.2',5555)
+```
+Next, a sample of the `raw_tweets` (the first tweet of each batch) is written to a log file so it can be accessed in any future analysis.
+```python
+def writeRDD(rdd):
+    global logs
+    logs.write(timeHeader())
+    logs.write("\nRaw Tweets:\n{}".format(rdd.take(num=1)))
+    logs.flush()
+raw_tweets.foreachRDD(writeRDD)
+```
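The logging callback can be exercised without Spark. The sketch below rests on assumptions: `timeHeader` is not shown in this README, so a hypothetical version is defined here; `logs` is an in-memory buffer rather than a real file; and `FakeRDD` is a stand-in that mimics only the `take` method of an RDD.

```python
import io
from datetime import datetime

# Assumption: an in-memory buffer stands in for the real log file handle.
logs = io.StringIO()


def timeHeader():
    """Hypothetical helper; the project's own version is not shown here."""
    return "\n==== {} ====".format(datetime.now().isoformat())


class FakeRDD:
    """Stand-in for a Spark RDD that supports only take()."""

    def __init__(self, items):
        self._items = list(items)

    def take(self, num):
        # Like RDD.take, return the first `num` elements as a list.
        return self._items[:num]


def writeRDD(rdd):
    logs.write(timeHeader())
    # take(num=1) returns a list holding at most one tweet from the batch.
    logs.write("\nRaw Tweets:\n{}".format(rdd.take(num=1)))
    logs.flush()


writeRDD(FakeRDD(["first tweet #resist", "second tweet #maga"]))
print(logs.getvalue())
```

Only the first tweet of the batch reaches the log, because `take(num=1)` returns a one-element list.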
 
-## Clean Tweets
+The following chained expression pre-processes the incoming tweets by applying a series of transformations to every tweet in the DStream:
+1. Split the string on whitespace
+```python
+clean_tweets = raw_tweets\
+  .map(lambda x: x.split())\
+```
+2. Use a regular expression to remove any non-alphanumeric characters
+```python
+  .map(lambda x: [re.sub(r'([^\s\w]|_)+', '', y) for y in x])\
+```
+3. Ensure all words are lower case
+```python
+  .map(lambda x: [word.lower() for word in x])\
+```
+4. Remove any empty strings
+```python
+  .map(lambda x: [word for word in x if word != ''])\
+```
+5. Filter out any words that appear in STOPWORDS, which is loaded from a file during initialization
+```python
+  .map(lambda x: [word for word in x if word not in STOPWORDS])\
+```
+6. If any tweets are empty now, remove them
+```python
+  .filter(lambda x: x != [])\
+```
+7. Assign a label to each tweet
+```python
+  .map(assign_label)\
+```
+8. Remove tweets that don't match any label
+```python
+  .filter(lambda x: x is not None)
+```
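The eight steps above can be tried on an ordinary list of strings, without Spark. This is a sketch under assumptions: the `STOPWORDS` set is a small made-up sample (the project loads its own from a file), and `clean_tweet` is a hypothetical helper that applies the per-tweet steps in a single function.

```python
import re

# Assumption: a tiny sample stop-word set; the project loads its own from a file.
STOPWORDS = {"the", "a", "is", "to", "and", "we"}


def assign_label(words):
    """Label a tweet; 'resist' is checked before 'maga', as in the README."""
    for label in ("resist", "maga"):
        if label in words:
            return (label, [w for w in words if w != label])
    return None


def clean_tweet(text):
    """Hypothetical helper: steps 1-5, then steps 6-7 on the result."""
    words = text.split()                                      # 1. split
    words = [re.sub(r'([^\s\w]|_)+', '', w) for w in words]   # 2. strip symbols
    words = [w.lower() for w in words]                        # 3. lower case
    words = [w for w in words if w != '']                     # 4. drop empties
    words = [w for w in words if w not in STOPWORDS]          # 5. stop words
    # 6-7. An empty tweet is dropped; otherwise try to label it.
    return assign_label(words) if words else None


raw = ["We will #RESIST the ban!", "Lunch is great", "#MAGA rally_today", "!!!"]
# 8. Keep only tweets that received a label.
labeled = [t for t in map(clean_tweet, raw) if t is not None]
print(labeled)
# [('resist', ['will', 'ban']), ('maga', ['rallytoday'])]
```

Note that the regular expression also removes underscores, so `rally_today` becomes `rallytoday`.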
+    
+The `assign_label` function used above is defined as follows:
+
+```python
+def assign_label(words):
+    """Assign a label to a tweet.
+
+    If the tweet contains the word 'resist', label it 'resist'. Otherwise,
+    if it contains the word 'maga', label it 'maga'. If it contains neither
+    word, return None. The label word itself is removed from the word list.
+
+    Input: words: one tweet in the form of a list of cleaned words.
+    Output: labeled tweet (label, words), or None if no label applies.
+    """
+    if 'resist' in words:
+        words = [x for x in words if x != 'resist']
+        return ('resist', words)
+    if 'maga' in words:
+        words = [x for x in words if x != 'maga']
+        return ('maga', words)
+    return None
+```
+        
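+This takes a tweet's word list as input and returns the tuple ('resist', tweet_word_list) if 'resist' is in the tweet, or ('maga', tweet_word_list) if 'maga' is in the tweet; in both cases the label word is removed from the list. Because 'resist' is checked first, a tweet containing both words is labeled 'resist'. A tweet containing neither word yields None.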
+
+The final output is a DStream of tuples with the structure (label, [tweet words]).
 
+  
 ## Extract Feature Words
 
 ## Calculate Informativeness of Features
 
 ## Visualization
 
+The TweetsStreamingPlot.py file defines the streaming plotting class. We import this file, use it to initialize an object called `plot`, and create a streaming plot with the method `plot.start(data)`. The method takes the input data as a new column data source and pushes the updated plot to the Jupyter notebook.
+![streaming_plot](../images/streaming_plot.png)
+
 ---
 
 # Project Contributors: