This application is a custom web scraper that uses the Beautiful Soup library to extract information from Wikipedia pages, specifically pages about movie actors. For each actor, the scraper reads the actor’s Wikipedia page and collects the movies the actor appeared in. Wikipedia does not use a uniform layout for actor pages, so the scraper implements several fallback cases for locating an actor’s movies. After collecting the movies, the application visits each movie’s page and scrapes details such as release date, box office revenue, and other logistical information.
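The fallback idea can be illustrated with a minimal sketch. It is not the application's actual code: the function name `extract_film_titles`, the `Filmography` section id, and the assumption that titles appear in italics are all hypothetical choices made for illustration. The sketch tries a table layout first and falls back to a bulleted list.

```python
from __future__ import annotations

from bs4 import BeautifulSoup


def extract_film_titles(html: str) -> list[str]:
    """Return film titles from an actor page, trying a table first, then a list."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the filmography section is marked by an element with this id.
    heading = soup.find(id="Filmography")
    if heading is None:
        return []

    # Case 1: the filmography is a table whose titles are italicised.
    table = heading.find_next("table")
    if table is not None:
        return [cell.get_text(strip=True) for cell in table.find_all("i")]

    # Case 2: the filmography is a bulleted list, one film per item.
    listing = heading.find_next("ul")
    if listing is not None:
        titles = []
        for item in listing.find_all("li"):
            italic = item.find("i")
            source = italic if italic is not None else item
            titles.append(source.get_text(strip=True))
        return titles

    return []
```

A real page would likely need more cases than these two, for example a filmography that lives on a separate page, but each additional layout can be added as another branch in the same style.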
A graph structure stores the actor and movie information. Basic retrieval functions on the graph speed up lookups of specific movies and actors. The graph also lets a user find movies by release date, actors by date of birth, and entries by other basic movie or actor attributes.
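One way such a graph could be organised is sketched below. This is an assumption, not the project's actual design: the class `FilmGraph`, the `Actor` and `Movie` records, and every method name are hypothetical. Nodes are kept in dictionaries keyed by name so that a lookup of a specific actor or movie takes constant time, and edges are stored in both directions.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Actor:
    name: str
    birth_year: int


@dataclass(frozen=True)
class Movie:
    title: str
    release_year: int


class FilmGraph:
    """Bipartite graph with actor nodes, movie nodes, and 'acted in' edges."""

    def __init__(self) -> None:
        self.actors: dict[str, Actor] = {}
        self.movies: dict[str, Movie] = {}
        self._cast: dict[str, set[str]] = defaultdict(set)     # title -> actor names
        self._credits: dict[str, set[str]] = defaultdict(set)  # name -> movie titles

    def add_credit(self, actor: Actor, movie: Movie) -> None:
        """Add both nodes if needed and connect them with an edge."""
        self.actors[actor.name] = actor
        self.movies[movie.title] = movie
        self._cast[movie.title].add(actor.name)
        self._credits[actor.name].add(movie.title)

    def movies_of(self, actor_name: str) -> list[str]:
        return sorted(self._credits.get(actor_name, ()))

    def cast_of(self, title: str) -> list[str]:
        return sorted(self._cast.get(title, ()))

    def movies_released_in(self, year: int) -> list[str]:
        return sorted(t for t, m in self.movies.items() if m.release_year == year)

    def actors_born_in(self, year: int) -> list[str]:
        return sorted(n for n, a in self.actors.items() if a.birth_year == year)
```

The attribute queries here scan every node. If those queries were frequent, an extra index from year to names would avoid the scan at the cost of more bookkeeping in `add_credit`.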
The scraper starts from a single actor, then follows the other actors who appeared in the same movies and continues until it reaches limits of 125 movies and 250 actors. In addition to the graph structure for retrieving the scraped data, a RESTful API implemented in Flask serves movie and actor information from a JSON document. The API returns a specific movie or actor when the query names one, and a list of movies or actors otherwise. It also supports AND and OR operators in queries. For example, a user can ask for actors who were born in 1980 and whose names start with Brad, or for all movies released in either 1990 or 2000. Finally, the API lets users edit the information for an existing actor or movie and post new actor and movie entries to the JSON document.
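The AND/OR filtering can be sketched independently of the web framework. The query syntax below is an assumption, since the text does not specify it: `&` joins terms that must all hold, `|` separates alternatives, and `&` binds more tightly than `|`. The field names (`name`, `birth_year`, `year`) and the functions `parse_query`, `matches`, and `filter_records` are likewise hypothetical. String fields are matched by case-insensitive prefix, and other fields by equality.

```python
from __future__ import annotations

from urllib.parse import unquote_plus


def parse_query(query: str) -> list[list[tuple[str, str]]]:
    """Split a query into OR-groups, each a list of (field, value) AND-terms."""
    groups = []
    for alternative in query.split("|"):
        terms = []
        for term in alternative.split("&"):
            if "=" in term:  # silently skip malformed terms
                field, value = term.split("=", 1)
                terms.append((field, unquote_plus(value)))
        groups.append(terms)
    return groups


def matches(record: dict, field: str, value: str) -> bool:
    """Prefix match for string fields, equality on the string form otherwise."""
    actual = record.get(field)
    if actual is None:
        return False
    if isinstance(actual, str):
        return actual.lower().startswith(value.lower())
    return str(actual) == value


def filter_records(records: list[dict], query: str) -> list[dict]:
    """Keep records that satisfy every term of at least one OR-group."""
    groups = parse_query(query)
    return [
        record
        for record in records
        if any(all(matches(record, f, v) for f, v in group) for group in groups)
    ]


# Placeholder data standing in for entries of the JSON document.
actors = [
    {"name": "Brad Example", "birth_year": 1980},
    {"name": "Brad Sample", "birth_year": 1975},
    {"name": "Ann Example", "birth_year": 1980},
]
brads_born_1980 = filter_records(actors, "name=Brad&birth_year=1980")
```

With this syntax an empty query produces a single group with no terms, which every record satisfies, so the full list is returned. That mirrors the behaviour of listing all movies or actors when no specific name is given.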