# -*- coding: utf-8 -*-
"""web_scraping_tutorial.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/github/virtualmarioe/Web_scraping_tutorial/blob/main/web_scraping_tutorial.ipynb
<p><img alt="Web scraping tutorial" height="45px"
src="https://aiconica.net/previews/spider-web-icon-1027.png"
align="left" hspace="10px" vspace="0px"></p>
<h1>Web scraping tutorial</h1>
This notebook presents an introduction to Web scraping.
Web scraping is the process of extracting data from
websites or other online sources and copying the data
into a structured form (e.g., a database), enabling
further retrieval and analysis.
For this particular tutorial, we are going to extract
demographic information (e.g., country, state and
population) of Colombian towns from <a href =
"https://es.wikipedia.org/wiki/Municipios_de_Colombia">
Wikipedia</a>.
The tutorial is written in Python and will use two of the
many available methods for pulling the data,
<a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">
Beautiful Soup</a> and <a href = "https://pandas.pydata.org/docs/"
> Pandas</a>.
The tutorial is divided into the following 4 sections:
- **Section 1: Method Beautiful Soup**
- **Section 2: Method Pandas**
- **Section 3: Structuring and cleaning the data**
- **Section 4: Data saving**
____
<h2> Setup </h2>
First, we will import all the required libraries.
"""
# Importing libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen
import re
import seaborn as sns
"""<h3> Section 1: Method Beautiful Soup </h3>
The data we are interested in is distributed across
multiple Wikipedia pages and tables. Therefore, we first
need to read and parse the main table, which lists all the
states along with a link per state to the page where the
actual demographic information of its towns is located.
We will go through the following steps:
- 1.1. Building the main table and parsing its content
- 1.2. Extracting all data contained in tables
- 1.3. Building lists to hold the extracted data
- 1.4. Structuring the extracted data
**1.1. Building the main table and parsing its content**
"""
# 1. Building the URL and parsing it with Beautiful Soup
wiki_es = 'https://es.wikipedia.org'
mun_col = '/wiki/Municipios_de_Colombia'
url = wiki_es + mun_col
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
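"""As an optional sanity check (not part of the original notebook), we can
confirm that the page was fetched and parsed correctly by printing its title."""
# Quick check: print the parsed page's title
print(soup.title.get_text())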
"""**1.2. Extracting all data contained in tables**
Extracting all data contained in the webpage's sections
labeled with the tag `'table'`.
"""
# 2. Finding all data with tag 'table'
tables = soup.find_all('table')
"""**1.3. Building lists to hold the extracted data**
To extract the links contained in the tables it is necessary
to cycle across all rows, labeled with the tag `'tr'`, and
cells, labeled with the tag `'td'`. Finally, at each cell the
link of interest, stored in the attribute `'href'`, will be
appended to the `links_anex` list, which will be used to build
the final URLs for calling the webpages we are interested in.
"""
# 3. Building lists to hold the extracted data
# Preallocating a variable for each list
departamentos = []
numero_de_municipios = []
links_anex = []
# Cycling through the table rows
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        # The main page contains multiple tables.
        # Finding the table with more than 2 cells, which is
        # the one we are interested in.
        if len(cells) > 2:
            # Building a list with the state names
            departamento = cells[0]
            departamentos.append(departamento.text.strip())
            # Building a list with the number of towns per state
            municipio = cells[1]
            numero_de_municipios.append(municipio.text.strip())
            # Building a list with the state's link
            link = cells[1]
            links_anex.append(link.contents[0]['href'])
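"""A quick optional check (not in the original notebook): the three lists
should all have the same length, one entry per state listed in the main table."""
# All three lists should be the same length (one entry per state)
print(len(departamentos), len(numero_de_municipios), len(links_anex))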
"""**1.4. Structuring the extracted data**
In order to store the extracted data in a format and a
structure that can be used for further analysis, we will
put all the data in a pandas `DataFrame`. To do this we
will create a pandas `Series` from each list created in
step 1.3 and then concatenate all series into a single df.
"""
# Building DataFrame with name of States and number of towns
# Creating pandas series from scraped list created in 1.3
deptos_serie = pd.Series(departamentos,name='Departamento')
num_mun_serie = pd.Series(numero_de_municipios,name='# Municipios')
links_serie = pd.Series(links_anex, name='Link')
# Building all series into a single df
df_municipios_info = pd.concat([deptos_serie,num_mun_serie,links_serie],axis=1)
"""Let's check how the current `DataFrame` looks kike."""
# Checking df dimensions and head
print('The dimensions of the df_municipios_info are: ' +
      str(df_municipios_info.shape))
print('Here are the first 5 rows:')
df_municipios_info.head()
"""So, now we have a `df` with the following information
for each of the 33 states. The state's name, number of towns
and the URL where the info for all State's town can be pulled.
<h3> Section 2: Method Pandas</h3>
We will use `Pandas` to pull the demographic data of each
town across all states.
For this extraction we will use the function <a href =
"https://pandas.pydata.org/docs/user_guide/io.html#io-read-html">
`pd.read_html()`</a>, which takes HTML (a URL or raw text) and parses
its tables into a list of `DataFrames`. We will fetch each page with
the function `get` from the <a href =
"https://docs.python-requests.org/en/latest/">Requests</a> library
and pass the response text to `pd.read_html()`.
"""
# Looping through the list of states to scrape the available population data
# Preallocating lists for all dfs with town info
df_list_municipios = []
df_habitantes_info = []
df_habitantes_info_all = []
# Loop for data collection
for muni_link in enumerate((df_municipios_info.iloc[:]['Link']).tolist()):
    curr_link = muni_link[1]
    # Current state's name
    dept_name = df_municipios_info.iloc[muni_link[0]]['Departamento']
    curr_r = requests.get(wiki_es + curr_link)
    # Scraping the data from the current URL using Pandas
    curr_list_dfs = pd.read_html(curr_r.text)
    # Loop for selecting and extracting data for each town
    for df_idx in enumerate(curr_list_dfs):
        # Checking for the town-name field. This can be either 'Nombre' or
        # 'Municipio', so we make them homogeneous by using 'Municipio' in all.
        if True in curr_list_dfs[df_idx[0]].columns.astype(str).str.contains(
                pat='Nombre'):
            # Changing 'Nombre' to 'Municipio'
            df_habitantes_info = pd.DataFrame(list(
                curr_list_dfs[df_idx[0]]['Nombre']), columns=['Municipio'])
            # Adding the state's name as a column
            df_habitantes_info['Departamento'] = dept_name
            # Population information can be stored in columns called either
            # 'Habitantes' or 'Población', so we need to make them homogeneous.
            # Checking if the current df has a column called 'Habitantes'
            if True in curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Habitantes'):
                # Getting the index of the column named 'Habitantes'
                col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Habitantes')).index(True)
                col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
                df_habitantes_info['Habitantes'] = (
                    curr_list_dfs[df_idx[0]][col_name])
            # Checking if the current df has a column called 'Población'
            elif True in curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población'):
                # Getting the index of the column named 'Población'
                col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población')).index(True)
                col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
                df_habitantes_info['Habitantes'] = (
                    curr_list_dfs[df_idx[0]][col_name])
        # Special case: the demographic info of Bogotá is subdivided,
        # therefore it needs to be aggregated.
        elif (True in curr_list_dfs[df_idx[0]].columns.astype(
                str).str.contains(pat='Localidad')) and (True in
                curr_list_dfs[df_idx[0]].columns.astype(str).str.contains(
                pat='Población')):
            col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                str).str.contains(pat='Localidad')).index(True)
            col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
            df_habitantes_info = pd.DataFrame(list(
                curr_list_dfs[df_idx[0]][col_name]), columns=['Municipio'])
            # Adding the state's name as a column
            df_habitantes_info['Departamento'] = dept_name
            # Checking if the 'Población' info exists in the current df
            if True in curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población'):
                col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población')).index(True)
                col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
                df_habitantes_info['Habitantes'] = (
                    curr_list_dfs[df_idx[0]][col_name])
    # Appending the current df to the list with all dfs
    df_habitantes_info_all.append(df_habitantes_info)
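"""Optional check (not in the original notebook): one DataFrame should have
been collected per state link processed above."""
print('Number of per-state DataFrames collected: ' +
      str(len(df_habitantes_info_all)))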
"""<h3>Section 3: Structuring and cleaning the data</h3>
There are some <a href = "https://pandas.pydata.org/docs/user_guide/io.html#io-html-gotchas">
issues</a> when parsing HTML tables with pandas. In our
case the function leaves some non-numeric characters
in the population column. Therefore, in order to be able
to analyse the data further, we first need to make the
numeric variables homogeneous. This can be done by finding
and replacing the undesired characters in the population column
using regular expressions (`regex`).
"""
# Formatting the final df 'all_data'
all_data = pd.concat(df_habitantes_info_all)
all_data = all_data.reset_index()
all_data.shape
# Removing non-numeric characters
all_data.Habitantes = all_data.Habitantes.replace(u'\xa0', '', regex=True)
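"""A further optional step (a sketch, not part of the original notebook): after
removing the non-breaking spaces, the population column can be cast to a
numeric dtype with `pd.to_numeric`, coercing any remaining non-numeric cells
to NaN so that the `dropna` step below would also remove them. The name
`habitantes_numeric` is only illustrative."""
# Example: coercing the population column to a numeric dtype
habitantes_numeric = pd.to_numeric(all_data.Habitantes, errors='coerce')
print(habitantes_numeric.dtype)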
"""After organising the the data into the the final DataFrame
`all_data`, we can check the `df` before saving it."""
# Checking df dimensions and head
print('The dimensions of all_data are: ' +
      str(all_data.shape))
print('Here are the first 5 rows of the final df (all_data):')
all_data.head()
# Checking df's tail
print('Here are the last 5 rows:')
all_data.tail()
"""The final df contains, for each of the country's 1726 towns, the town's name, the state to which the town belongs
to and the town's population. There are however some cells with invalid or no information that will need to be
cleaned, so let's do that with pandas `dropna` function and creating a new, clean, DataFrame without NaNs."""
# Dropping NaNs
all_data_clean = all_data.dropna()
print('Here are the last 5 rows of the clean df:')
all_data_clean.tail()
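"""Optional check (not in the original notebook): comparing the shapes before
and after `dropna` shows how many rows were removed."""
print('Rows dropped by dropna: ' +
      str(all_data.shape[0] - all_data_clean.shape[0]))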
"""Now we have the cleaned data that can be used for
further analysis. So, let's save it!
_______
<h3>Section 4: Data saving</h3>
"""
# Saving the final clean df to a CSV file
all_data_clean.to_csv('habitantes_municipios_colombia_2021.csv')
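"""Optional usage sketch (not part of the original notebook): the saved CSV can
be read back with `pd.read_csv` for further analysis; `index_col=0` prevents
the index written by `to_csv` from becoming an extra data column. The name
`df_reloaded` is only illustrative."""
# Reading the saved file back into a DataFrame
df_reloaded = pd.read_csv('habitantes_municipios_colombia_2021.csv',
                          index_col=0)
print(df_reloaded.shape)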