# -*- coding: utf-8 -*-
"""web_scraping_tutorial.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/github/virtualmarioe/Web_scraping_tutorial/blob/main/web_scraping_tutorial.ipynb
<p><img alt="Web scraping tutorial" height="45px"
src="https://aiconica.net/previews/spider-web-icon-1027.png"
align="left" hspace="10px" vspace="0px"></p>
<h1>Web scraping tutorial</h1>
This notebook presents an introduction to Web scraping.
Web scraping is the process of extracting data from
websites or other online sources and copying the data
into a structured form (e.g., a database), enabling
further retrieval and analysis.
For this particular tutorial, we are going to extract
demographic information (e.g., country, state and
population) of Colombian towns from <a href =
"https://es.wikipedia.org/wiki/Municipios_de_Colombia">
Wikipedia</a>.
The tutorial is written in Python and will use two of the
many available methods for pulling the data,
<a href = "https://www.crummy.com/software/BeautifulSoup/bs4/doc/">
Beautiful Soup</a> and <a href = "https://pandas.pydata.org/docs/"
> Pandas</a>.
The tutorial is divided into the following 4 sections:
- **Section 1: Method Beautiful Soup**
- **Section 2: Method Pandas**
- **Section 3: Structuring and cleaning the data**
- **Section 4: Data saving**
____
<h2> Setup </h2>
First, we will import all the required libraries.
"""
# Importing libraries
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from urllib.request import urlopen
import re
import seaborn as sns
"""<h3> Section 1: Method Beautiful Soup </h3>
The data we are interested in is distributed across
multiple Wikipedia pages and tables. Therefore, we first
need to read and parse the main table, which lists all the
states along with a link per state to the page where the
actual demographic information of its towns is located.
We will go through the following steps:
- 1.1. Building the main table and parsing its content
- 1.2. Extracting all data contained in tables
- 1.3. Building lists to hold the extracted data
- 1.4. Structuring the extracted data
**1.1. Building the main table and parsing its content**
"""
# 1. Building the URL and parsing it with Beautiful Soup
wiki_es = 'https://es.wikipedia.org'
mun_col = '/wiki/Municipios_de_Colombia'
url = wiki_es + mun_col
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
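"""As an optional sanity check (not part of the original notebook), we can
confirm that the page was fetched and parsed correctly by printing its title."""
# Quick check: print the parsed page's title
print(soup.title.get_text())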
"""**1.2. Extracting all data contained in tables**
Extracting all data contained in the webpage's sections
labeled with the tag `'table'`.
"""
# 2. Finding all data with tag 'table'
tables = soup.find_all('table')
"""**1.3. Building lists to hold the extracted data**
To extract the links contained in the tables it is necessary
to cycle across all rows, labeled with the tag `'tr'`, and
cells, labeled with the tag `'td'`. Finally, at each cell the
link of interest, stored in the attribute `'href'`, will be
appended to the `links_anex` list, which will be used to build
the final URLs for calling the webpages we are interested in.
"""
# 3. Building lists to hold the extracted data
# Preallocating a variable for each list
departamentos = []
numero_de_municipios = []
links_anex = []
# Cycling through the table rows
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cells = row.find_all('td')
        # The main page contains multiple tables.
        # Finding the table with more than 2 cells, which is
        # the one we are interested in.
        if len(cells) > 2:
            # Building a list with the state names
            departamento = cells[0]
            departamentos.append(departamento.text.strip())
            # Building a list with the number of towns per state
            municipio = cells[1]
            numero_de_municipios.append(municipio.text.strip())
            # Building a list with the state's link
            link = cells[1]
            links_anex.append(link.contents[0]['href'])
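"""A quick optional check (not in the original notebook): the three lists
should all have the same length, one entry per state listed in the main table."""
# All three lists should be the same length (one entry per state)
print(len(departamentos), len(numero_de_municipios), len(links_anex))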
"""**1.4. Structuring the extracted data**
In order to store the extracted data in a format and a
structure that can be used for further analysis, we will
put all the data in a pandas `DataFrame`. To do this we
will create a pandas `Series` from each list created in
step 1.3 and then concatenate all series into a single df.
"""
# Building DataFrame with name of States and number of towns
# Creating pandas series from scraped list created in 1.3
deptos_serie = pd.Series(departamentos,name='Departamento')
num_mun_serie = pd.Series(numero_de_municipios,name='# Municipios')
links_serie = pd.Series(links_anex, name='Link')
# Building all series into a single df
df_municipios_info = pd.concat([deptos_serie,num_mun_serie,links_serie],axis=1)
"""Let's check how the current `DataFrame` looks kike."""
# Checking df dimensions and head
print('The dimensions of the df_municipios_info are: ' +
      str(df_municipios_info.shape))
print('Here are the first 5 rows:')
df_municipios_info.head()
"""So, now we have a `df` with the following information
for each of the 33 states. The state's name, number of towns
and the URL where the info for all State's town can be pulled.
<h3> Section 2: Method Pandas</h3>
We will use `Pandas` to pull the demographic data of each
town across all states.
For this extraction we will use the function <a href =
"https://pandas.pydata.org/docs/user_guide/io.html#io-read-html">
`pd.read_html()`</a>, which takes HTML (a URL or raw text) and parses
its tables into a list of `DataFrames`. We will fetch each page with
the function `get` from the <a href =
"https://docs.python-requests.org/en/latest/">Requests</a> library
and pass the response text to `pd.read_html()`.
"""
# Looping through the list of states to scrape the available population data
# Preallocating lists for all dfs with town info
df_list_municipios = []
df_habitantes_info = []
df_habitantes_info_all = []
# Loop for data collection
for muni_link in enumerate((df_municipios_info.iloc[:]['Link']).tolist()):
    curr_link = muni_link[1]
    # Current state's name
    dept_name = df_municipios_info.iloc[muni_link[0]]['Departamento']
    curr_r = requests.get(wiki_es + curr_link)
    # Scraping the data from the current URL using Pandas
    curr_list_dfs = pd.read_html(curr_r.text)
    # Loop for selecting and extracting data for each town
    for df_idx in enumerate(curr_list_dfs):
        # Checking for the town-name field. This can be either 'Nombre' or
        # 'Municipio', so we make them homogeneous by using 'Municipio' in all.
        if True in curr_list_dfs[df_idx[0]].columns.astype(str).str.contains(
                pat='Nombre'):
            # Changing 'Nombre' to 'Municipio'
            df_habitantes_info = pd.DataFrame(list(
                curr_list_dfs[df_idx[0]]['Nombre']), columns=['Municipio'])
            # Adding the state's name as a column
            df_habitantes_info['Departamento'] = dept_name
            # Population information can be stored in columns called either
            # 'Habitantes' or 'Población', so we need to make them homogeneous.
            # Checking if the current df has a column called 'Habitantes'
            if True in curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Habitantes'):
                # Getting the index of the column named 'Habitantes'
                col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Habitantes')).index(True)
                col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
                df_habitantes_info['Habitantes'] = (
                    curr_list_dfs[df_idx[0]][col_name])
            # Checking if the current df has a column called 'Población'
            elif True in curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población'):
                # Getting the index of the column named 'Población'
                col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población')).index(True)
                col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
                df_habitantes_info['Habitantes'] = (
                    curr_list_dfs[df_idx[0]][col_name])
        # Special case: the demographic info of Bogotá is subdivided,
        # therefore it needs to be aggregated.
        elif (True in curr_list_dfs[df_idx[0]].columns.astype(
                str).str.contains(pat='Localidad')) and (True in
                curr_list_dfs[df_idx[0]].columns.astype(str).str.contains(
                pat='Población')):
            col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                str).str.contains(pat='Localidad')).index(True)
            col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
            df_habitantes_info = pd.DataFrame(list(
                curr_list_dfs[df_idx[0]][col_name]), columns=['Municipio'])
            # Adding the state's name as a column
            df_habitantes_info['Departamento'] = dept_name
            # Checking if the 'Población' info exists in the current df
            if True in curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población'):
                col_idx = list(curr_list_dfs[df_idx[0]].columns.astype(
                    str).str.contains(pat='Población')).index(True)
                col_name = curr_list_dfs[df_idx[0]].columns[col_idx]
                df_habitantes_info['Habitantes'] = (
                    curr_list_dfs[df_idx[0]][col_name])
    # Appending the current df to the list with all dfs
    df_habitantes_info_all.append(df_habitantes_info)
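"""Optional check (not in the original notebook): one DataFrame should have
been collected per state link processed above."""
print('Number of per-state DataFrames collected: ' +
      str(len(df_habitantes_info_all)))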
"""<h3>Section 3: Structuring and cleaning the data</h3>
There are some <a href = "https://pandas.pydata.org/docs/user_guide/io.html#io-html-gotchas">
issues</a> when parsing HTML tables with pandas. In our
case the function leaves some non-numeric characters
in the population column. Therefore, in order to be able
to analyse the data further, we first need to make the
numeric variables homogeneous. This can be done by finding
and replacing the undesired characters in the population column
using regular expressions (`regex`).
"""
# Formatting the final df 'all_data'
all_data = pd.concat(df_habitantes_info_all)
all_data = all_data.reset_index()
all_data.shape
# Removing non-numeric characters
all_data.Habitantes = all_data.Habitantes.replace(u'\xa0', '', regex=True)
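"""A further optional step (a sketch, not part of the original notebook): after
removing the non-breaking spaces, the population column can be cast to a
numeric dtype with `pd.to_numeric`, coercing any remaining non-numeric cells
to NaN so that the `dropna` step below would also remove them. The name
`habitantes_numeric` is only illustrative."""
# Example: coercing the population column to a numeric dtype
habitantes_numeric = pd.to_numeric(all_data.Habitantes, errors='coerce')
print(habitantes_numeric.dtype)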
"""After organising the the data into the the final DataFrame
`all_data`, we can check the `df` before saving it."""
# Checking df dimensions and head
print('The dimensions of all_data are: ' +
      str(all_data.shape))
print('Here are the first 5 rows of the final df (all_data):')
all_data.head()
# Checking df's tail
print('Here are the last 5 rows:')
all_data.tail()
"""The final df contains, for each of the country's 1726 towns, the town's name, the state to which the town belongs
to and the town's population. There are however some cells with invalid or no information that will need to be
cleaned, so let's do that with pandas `dropna` function and creating a new, clean, DataFrame without NaNs."""
# Dropping NaNs
all_data_clean = all_data.dropna()
print('Here are the last 5 rows of the clean df:')
all_data_clean.tail()
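"""Optional check (not in the original notebook): comparing the shapes before
and after `dropna` shows how many rows were removed."""
print('Rows dropped by dropna: ' +
      str(all_data.shape[0] - all_data_clean.shape[0]))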
"""Now we have the cleaned data that can be used for
further analysis. So, let's save it!
_______
<h3>Section 4: Data saving</h3>
"""
# Saving the final clean df to a CSV file
all_data_clean.to_csv('habitantes_municipios_colombia_2021.csv')
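"""Optional usage sketch (not part of the original notebook): the saved CSV can
be read back with `pd.read_csv` for further analysis; `index_col=0` prevents
the index written by `to_csv` from becoming an extra data column. The name
`df_reloaded` is only illustrative."""
# Reading the saved file back into a DataFrame
df_reloaded = pd.read_csv('habitantes_municipios_colombia_2021.csv',
                          index_col=0)
print(df_reloaded.shape)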