Edureka Python Project Documentation

Problem Statement

IMDB provides a list of celebrities born on the current date. Below is the link: http://m.imdb.com/feature/bornondate

Get the list of these celebrities from this webpage using web scraping (only the ones that are displayed, i.e. the top 10). You have to extract the following information:

Name of the celebrity
Celebrity Image
Profession
Best Work

Once you have this list, run a sentiment analysis on Twitter for each celebrity. The final output should be in the format below:

Name of the celebrity:
Celebrity Image:
Profession:
Best Work:
Overall Sentiment on Twitter: Positive, Negative or Neutral
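The output format above can be produced with a small formatting helper. This is a minimal sketch, not the project's actual code: the function name `format_celebrity`, the dictionary keys, and the sample values (a made-up person and an example.com URL) are all assumptions for illustration. It uses `str.format` rather than f-strings so that it stays compatible with Python 3.4.

```python
def format_celebrity(record):
    """Render one celebrity record in the five-line output format.

    `record` is a hypothetical dict with the keys used in the template below.
    """
    template = (
        "Name of the celebrity: {name}\n"
        "Celebrity Image: {image}\n"
        "Profession: {profession}\n"
        "Best Work: {best_work}\n"
        "Overall Sentiment on Twitter: {sentiment}"
    )
    return template.format(**record)


# Made-up sample data, only to show the shape of the output.
sample = {
    "name": "Jane Doe",
    "image": "http://example.com/jane.jpg",
    "profession": "Actress",
    "best_work": "Some Film",
    "sentiment": "Neutral",
}
formatted = format_celebrity(sample)
print(formatted)
```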

Hint: Use the IMDB scraping sample as a reference for scraping the mentioned web page. For sentiment analysis, use the Twitter sentiment code as a reference.

Please note that I am using Python 3.4.

Tools and Packages Used

• Version: Python 3.4 [VERY IMPORTANT]

• Tweepy  Tweepy is an open-source library, hosted on GitHub, that enables Python to communicate with the Twitter platform and use its API. Here's the documentation.

• Codecs  The codecs module provides stream and file interfaces for transcoding data in your program. In this project I use the module for storing the tweets as Unicode text. Here's the documentation.

• String (punctuation)  Used to strip all punctuation from the tweets.

• BeautifulSoup  Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree using Python parsers like lxml and html5lib. It automatically converts incoming documents to Unicode and outgoing documents to UTF-8. Here's the documentation.

• Selenium  The webdriver kit emulates a web browser (I chose Firefox) and executes the page's JavaScript to load the dynamic content.
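To illustrate how `string.punctuation` fits into the tweet processing, here is a minimal sketch that strips punctuation and then scores tweets against word lists. The word-list approach, the tiny sets `POSITIVE_WORDS` and `NEGATIVE_WORDS`, and the function names are assumptions for illustration; the project's actual scoring logic may differ.

```python
import string

# Tiny made-up word lists; a real run would use much larger lists.
POSITIVE_WORDS = {"good", "great", "love"}
NEGATIVE_WORDS = {"bad", "awful", "hate"}

# Translation table that deletes every character in string.punctuation.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


def classify(tweets):
    """Return 'Positive', 'Negative' or 'Neutral' for a list of tweet strings."""
    score = 0
    for tweet in tweets:
        for word in strip_punctuation(tweet).lower().split():
            if word in POSITIVE_WORDS:
                score += 1
            elif word in NEGATIVE_WORDS:
                score -= 1
    if score > 0:
        return "Positive"
    if score < 0:
        return "Negative"
    return "Neutral"


print(classify(["I love it!", "Great film."]))
```

Stripping punctuation first matters because otherwise a token such as `love!` would not match the entry `love` in the word list.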

Challenges Faced during the project

Tweepy has an issue with Python 3

Error message (raised inside tweepy\streaming.py): TypeError: Can't convert 'bytes' object to str implicitly

Solution:
The fix is described in tweepy/tweepy#615. In streaming.py, I changed line 161 to

self._buffer += self._stream.read(read_len).decode('ascii')

and line 171 to

self._buffer += self._stream.read(self._chunk_size).decode('ascii')

and then reinstalled Tweepy.
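The underlying problem can be reproduced without Tweepy: in Python 3, bytes read from a stream cannot be appended to a `str` buffer until they are decoded. The sketch below is a standalone illustration, not Tweepy's code; `append_chunk` and the sample `chunk` are hypothetical stand-ins for the buffer update and for `self._stream.read(...)`.

```python
def append_chunk(buffer, chunk, encoding="ascii"):
    """Append raw bytes to a str buffer, decoding them explicitly first."""
    return buffer + chunk.decode(encoding)


buffer = ""
chunk = b'{"text": "hello"}\r\n'  # stand-in for bytes read from the stream

# Mixing str and bytes raises TypeError in Python 3.
try:
    buffer += chunk
except TypeError as exc:
    print("Without decode:", exc)

# Decoding first makes the concatenation valid.
buffer = append_chunk(buffer, chunk)
print(repr(buffer))
```

Note that decoding with `'ascii'` raises `UnicodeDecodeError` if the chunk contains any non-ASCII byte, so this fix relies on the stream delivering ASCII-only data.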

The IMDB website has dynamic content:

Reference: http://fruchter.co/post/53164489086/python-headless-web-browser-scraping-on-amazon

Description: I had to use Selenium's webdriver to emulate a Firefox browser and execute the JavaScript functions that dynamically fetch the details of celebrities born on the current day.
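The approach described above can be sketched roughly as follows. This is an outline under stated assumptions, not the project's actual script: `fetch_rendered_html` is a hypothetical helper, and `ready_selector` must be a CSS selector for an element that only appears once the dynamic content has loaded (the right selector depends on the page's markup, which is not specified here). Running it requires Selenium, Firefox, and a matching driver to be installed.

```python
def fetch_rendered_html(url, ready_selector, timeout=15):
    """Load `url` in Firefox and return the HTML after the scripts have run.

    `ready_selector` is a caller-supplied CSS selector (hypothetical here)
    used to detect that the dynamically loaded content is present.
    """
    # Imported inside the function so that defining it does not require Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until the expected element exists, or raise after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, ready_selector))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

The returned HTML string can then be handed to Beautiful Soup, for example `BeautifulSoup(html, "html.parser")`, to extract the name, image, profession and best work of each celebrity.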


bensooraj/webScraping_twitterSentimentAnalysis
