In this work, we answer the following questions:
What were the top 50 most popular subreddits in terms of the number of active users?
What does the Probability Density Function (PDF) of the number of active users per subreddit look like for all subreddits?
What is the ratio of the number of users in the i-th most popular subreddit to the number in the (i + 1)-th, for i ∈ [1...100]? Comment on how fast the popularity drops and how this ratio changes with i.
How many comments does each of these subreddits receive in a given hour of the day (i.e., 12 AM, 1 AM, 2 AM, ..., 11 PM)?
When you plot these curves where the x-axis is hours from 0 to 23 and the y-axis is counts, can you see patterns in these curves? How do these curves compare to each other? Do they have offsets relative to each other?
If you treat /r/unitedkingdom as being on UTC, what can you say about the timezones of the users in the other subreddits?
What are the top 10 most frequent words in each of the five subreddits above? Do you see differences/similarities?
What does the word-frequency distribution look like? Plot the relative frequencies of the words as a probability density function. How do the word frequencies you observed compare with those predicted by Zipf’s Law (https://en.wikipedia.org/wiki/Zipf%27s_law)?
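
The word-frequency and rank-ratio questions above could be approached along the following lines. This is only a minimal sketch, not the analysis actually used here: the function names (word_frequencies, zipf_prediction, consecutive_ratios), the simple regex tokenizer, the Zipf exponent s = 1, and the toy sample text are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (word, relative_frequency) pairs, most frequent first.

    Assumption: words are runs of lowercase letters and apostrophes.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, count / total) for word, count in counts.most_common()]


def zipf_prediction(n_ranks, s=1.0):
    """Relative frequencies predicted by Zipf's law for ranks 1..n_ranks.

    Assumption: exponent s = 1 (the classic form), normalized to sum to 1.
    """
    weights = [1.0 / (rank ** s) for rank in range(1, n_ranks + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def consecutive_ratios(values):
    """Ratio of the i-th value to the (i+1)-th for a descending list."""
    return [a / b for a, b in zip(values, values[1:]) if b > 0]


# Toy text standing in for the comments of one subreddit (hypothetical).
sample = "the cat and the dog and the bird"
observed = word_frequencies(sample)
predicted = zipf_prediction(len(observed))

for (word, freq), expected in zip(observed, predicted):
    print(f"{word:>5}  observed={freq:.3f}  zipf={expected:.3f}")
```

The same consecutive_ratios helper can be applied to a descending list of active-user counts per subreddit. Under Zipf's law with s = 1, the ratio between rank i and rank i + 1 is (i + 1) / i, so it starts at 2 and approaches 1 as i grows, which gives a baseline against which to compare the observed drop in popularity.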