The following statement pre-processes the incoming tweets by chaining a series of operations, each applied to every RDD in the DStream:
1. Split the string by whitespace
```
clean_tweets = raw_tweets\
    .map(lambda x: x.split())\
```
2. Use a regular expression to remove any non-alphanumeric characters (including underscores)
```
    .map(lambda x: [re.sub(r'([^\s\w]|_)+', '', y) for y in x])\
```
3. Ensure all words are lower case
```
    .map(lambda x: [word.lower() for word in x])\
```
4. Remove any empty strings
```
    .map(lambda x: [word for word in x if word != ''])\
```
5. Filter out any words included in STOPWORDS, a file loaded during initialization
```
    .map(lambda x: [word for word in x if word not in STOPWORDS])\
```
6. If any tweets are empty now, remove them:
```
    .filter(lambda x: x != [])\
```
7. Assign a label to each tweet
```
    .map(assign_label)\
```
8. Remove tweets that don't belong to any label
```
    .filter(lambda x: x is not None)
```
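
Steps 1 to 6 above can be tried on plain Python lists, without a Spark streaming context. The following is a minimal sketch under stated assumptions: `clean_tweet`, the sample tweets, and the tiny `STOPWORDS` set are hypothetical stand-ins chosen for illustration, not the project's actual helper, data, or stopword file.

```python
import re

# Hypothetical stand-in for the stopword file loaded during initialization.
STOPWORDS = {"the", "a", "is", "to", "and", "rt"}


def clean_tweet(text):
    """Apply steps 1-5 to a single raw tweet string (illustrative helper)."""
    words = text.split()                                      # 1. split on whitespace
    words = [re.sub(r'([^\s\w]|_)+', '', w) for w in words]   # 2. strip punctuation and underscores
    words = [w.lower() for w in words]                        # 3. lower-case every word
    words = [w for w in words if w != '']                     # 4. drop empty strings
    words = [w for w in words if w not in STOPWORDS]          # 5. drop stopwords
    return words


# Made-up sample tweets; the second one consists of punctuation only.
raw = ["RT The #Resist movement is growing!", "!!! ...", "MAGA rally to-day"]

cleaned = [clean_tweet(t) for t in raw]
cleaned = [words for words in cleaned if words != []]         # 6. drop tweets that became empty
print(cleaned)
```

Note that the regular expression deletes characters rather than replacing them with spaces, so a hyphenated token such as `to-day` collapses into the single word `today`.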

The `assign_label` function used above is defined as follows:

```python
def assign_label(words):
    """ Assign a label to a tweet. If the tweet contains the word resist,
    we assign the label resist to it. Otherwise, if it contains the word
    maga, we label it as maga. If it contains neither word, we return None.
    At the same time, we remove the label word from the words.

    Inputs: words: one tweet in the form of a list of cleaned words.
    Output: labeled tweet: (label, words), or None if no label applies.
    """
    if 'resist' in words:
        words = [x for x in words if x != 'resist']
        return ('resist', words)
    if 'maga' in words:
        words = [x for x in words if x != 'maga']
        return ('maga', words)
    # Neither label word is present; step 8 filters these tweets out.
    return None
```
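
This function takes a tweet word list as input. It returns the tuple `('resist', tweet_word_list)` if 'resist' is contained in the tweet, and `('maga', tweet_word_list)` if 'maga' is contained in the tweet, with the label word removed from the list in both cases. Because 'resist' is checked first, a tweet containing both words is labeled 'resist'. If neither word is present, it returns `None`.

The final output is a DStream containing tuples of the form `(label, [tweet words])`.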
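
Steps 7 and 8 can be checked the same way on plain lists. The sketch below repeats the labeling logic so that it runs on its own; the `cleaned` sample data is made up for illustration and does not come from the project.

```python
def assign_label(words):
    # Same logic as the function above, repeated so this snippet is self-contained.
    if 'resist' in words:
        return ('resist', [w for w in words if w != 'resist'])
    if 'maga' in words:
        return ('maga', [w for w in words if w != 'maga'])
    return None


# Hypothetical cleaned tweets, as they would look after steps 1-6.
cleaned = [
    ['resist', 'movement', 'growing'],
    ['maga', 'rally', 'today'],
    ['resist', 'maga', 'debate'],   # contains both label words
    ['weather', 'sunny'],           # contains neither label word
]

labeled = [assign_label(words) for words in cleaned]   # 7. assign a label to each tweet
labeled = [t for t in labeled if t is not None]        # 8. drop unlabeled tweets
print(labeled)
```

In this sketch the third tweet is labeled `resist` and still carries the word `maga` in its word list, since only the matched label word is removed.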

## Extract Feature Words

## Calculate Informativeness of Features

## Visualization

The `TweetsStreamingPlot.py` file defines the streaming plot class. We import this file, use it to initialize an object called `plot`, and start the streaming plot with the method `plot.start(data)`. The method takes `data` as a new column data source and pushes the updated plot to the Jupyter notebook.