Skip to content

Latest commit

 

History

History

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 

README.md

Summary

Web Scraping

Go Back to Summary

Check What I Can Scrape

Go Back to Summary

  • Most of the website has a filed called robots.txt that describes all the data that we "can" scrape

      #                      ///////
      #                     //     //
      #                    //       //
      #                   //         //                           ///             ///                      ///
      #                  //           //                                          ///                      ///
      #                 //     ///     //               //// ///  ///  /// ////   /// ////     /// ////    /// ////
      #                //   ///   ///   //            //////////  ///  ////////// ///////////  //////////  ///////////
      #               //   //       //   //          ///     ///  ///  ///        ///      /// ///     /// ///      ///
      #              //    //       //    //        ///      ///  ///  ///        ///      /// ///     /// ///      ///
      #             //      //     //      //        ///     ///  ///  ///        ///     ///  ///     /// ///     ///
      #            //        //   //        //        //////////  ///  ///        //////////   ///     /// //////////
      #            //         /////         //
      #            //         /////         //
      #             //      ///   ///      //
      #               //////         //////
      #
      #
      #    We thought you'd never make it!
      #    We hope you feel right at home in this file...unless you're a disallowed subfolder.
      #    And since you're here, read up on our culture and team: https://www.airbnb.com/careers/departments/engineering
      #    There's even a bring your robot to work day.
    
      User-agent: Googlebot
      Allow: /calendar/ical/
      Allow: /.well-known/amphtml/apikey.pub
      Disallow: /account
      Disallow: /alumni
      Disallow: /associates/click
      Disallow: /api/v1/trebuchet
      Disallow: /api/v3

BeautifulSoup

Go Back to Summary

  • Before we start our project we need to install the following libraries

      pip3 install beautifulsoup4
      pip3 install requests
    • The requests library allows us to download the data
    • The beautifulsoup library allows us to manipulate the data
  • To check the available modules, we can use pip list

      Package                       Version
      ----------------------------- ---------
      alabaster                     0.7.12
      appnope                       0.1.0
      astroid                       2.3.3
      attrs                         19.3.0
      Babel                         2.7.0

Basics

Go Back to Summary

  import requests
  from bs4 import BeautifulSoup

  response = requests.get('https://news.ycombinator.com/news')
  soup = BeautifulSoup(response.text, 'html.parser')
  print(soup.body)
  print(soup.body.contents)
  print(soup.find_all('div'))
  print(soup.find_all('a'))
  print(soup.title)
  print(soup.a)
  print(soup.find('a'))
  print(soup.find(id='score_14123123'))
  • the requests library is just like fetch in JavaScript, in this case we are making a GET request

  • The we get the response of the request using .text, and we use beautifulsoup to to convert into an object that we can manipulate. In this case we are using the default parser (html.parser)

  • We can select/target different elements

  • soup.body, soup.title, soup.a, soup.find('a'), soup.find(id='score_14123123') - returns the first element body

  • soup.body.contents - returns all the content inside contents

  • soup.find_all() - returns all the elements of a specific type

  • soup.select('.score') - returns all the elements that has score class. the soup.select uses css selectors to target the information. Just like document.querySelector()

  • With beautifulSoup we can chain our data

  import requests
  from bs4 import BeautifulSoup

  response = requests.get('https://news.ycombinator.com/news')
  soup = BeautifulSoup(response.text, 'html.parser')
  links = soup.select('.storylink') # Get all the links
  votes = soup.select('.score') # Get all the votes
  print(votes[0].get('id')) # Get the id of the first element
  # score_24469921

Simple Scraper

Go Back to Summary

  • with this simple exercise, we are going to use:

    • getText() the the content of the tag <tag>value</tag>
    • python enumerate to do a for loop using the item and idx
    • python replace('value', 'new value')
    • python int() convert into integer
    • python str() convert into string
      import requests
      from bs4 import BeautifulSoup
      import pprint
    
      def sort_stories_by_votes(list):
          return sorted(list, key=lambda k: k['votes'], reverse = True);
    
      def create_custom_hn():
          hn = []
    
          for i in range(1, 2):
              response = requests.get('https://news.ycombinator.com/news?p=' + str(i))
              soup = BeautifulSoup(response.text, 'html.parser')
              links = soup.select('.storylink')
              votes = soup.select('.score')
              subtext = soup.select('.subtext')
    
              for idx, item in enumerate(links):
                  title = item.getText()
                  href = item.get('href', '')
                  vote = subtext[idx].select('.score')
    
                  if len(vote):
                      points = int(vote[0].getText().replace(' points', ''))
    
                      if points > 99:
                          hn.append({'title': title, 'link': href, 'votes': points})
    
          return sort_stories_by_votes(hn)
    
      pprint.pprint(create_custom_hn())