Primary objective: To design and develop a Data Acquisition, Pre-processing and Data Analysis Pipeline
Data Acquisition Using Web Scraping with the Help of Beautiful Soup.
This is a Jupyter notebook for web scraping with the Beautiful Soup Python module. The code extracts data from a web page and saves it in different Python data structures for further data analysis and manipulation.
Original file is located at https://colab.research.google.com/github/manojdhadke/data_analysis/blob/main/Data%20Auquisition%20Using%20Webscraping%20Beautiful%20Soap%20Python%20Module.ipynb
Repository Link https://github.com/manojdhadke/data_analysis.git
Language: Python
Packages: jovian, beautifulsoup4, pandas, requests, sqlalchemy, pymysql
Install with pip
!pip install jovian --upgrade --quiet # jovian: for saving the notebook
!pip install beautifulsoup4 --upgrade --quiet # beautifulsoup4: for web scraping
!pip install pandas --upgrade --quiet # pandas: for data manipulation and analysis
!pip install requests --upgrade --quiet # requests: for HTTP requests
!pip install sqlalchemy # sqlalchemy: for connecting to the MySQL database
!pip install pymysql # pymysql: for connecting to the MySQL database using Python
The code scrapes data from the web page https://www.value.today/world/world-top-500-companies and extracts the following information for each company in the list:
- Company name
- Company rank
- Headquarters location
- CEO name
- Market capitalization
- Total number of employees
- Sector
- Company page URL
The data is saved in different Python data structures such as lists, dictionaries and Pandas dataframes.
import pandas as pd
import jovian
from bs4 import BeautifulSoup
import requests
import time
topic_url = "https://www.value.today/world/world-top-500-companies"
Use the requests library to send a GET request to the URL and store the response in a variable "response"
response = requests.get(topic_url)
response.status_code
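A non-200 status code means the download failed; a minimal defensive check (an addition, not in the original notebook) stops the pipeline early instead of scraping an error page:
if response.status_code != 200:
    raise Exception(f"Failed to load page {topic_url}: HTTP {response.status_code}")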
Extract the content of the response as a string using the "text" attribute and store it in a variable "page_content"
page_content = response.text
len(page_content) # Checking the length of the page_content
Let's have a look at the first 1,000 characters of the HTML that was served for the web page.
page_content[:1000]
with open("word's-top-500-companies-by-narket-capitalization.html", 'w') as file:
file.write(page_content)
Read back the HTML file we saved above so we can extract the data we need from the web page.
with open("word's-top-500-companies-by-narket-capitalization.html", 'r') as f:
html_source = f.read()
This code uses the BeautifulSoup library to create a BeautifulSoup object doc by parsing the HTML content stored in html_source. Passing 'html.parser' explicitly avoids BeautifulSoup's no-parser warning and keeps the parser choice deterministic.
doc = BeautifulSoup(html_source, 'html.parser')
type(doc)
title_tag = doc.title
title_tag
title_tag.text
jovian.commit(project="web-scraping-final")
This code block uses BeautifulSoup to find all div tags with the class 'row well views-row' in the page's HTML; each such div holds one company's details. We then look at the first element of the resulting list.
company_block_tag = doc.find_all('div', {'class':'row well views-row'})
company_block_tag[:1]
len(company_block_tag)
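Before writing the field extractors, it helps to inspect one block's markup; prettify() is a standard BeautifulSoup method that re-indents a tag's HTML.
print(company_block_tag[0].prettify()[:500])  # first 500 characters of the first company block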
companies_name = []
for tag in company_block_tag:
    company_name_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
    companies_name.append(company_name_tag.find('a').text)
Let's look at the first five company names to confirm they were extracted correctly, and check the length of the list to verify we captured every name on the page.
print(companies_name[:5])
len(companies_name)
rank = []
# Extracting company rank from the company block tag
for tag in company_block_tag:
    rank_tag = tag.find('div', {'class':'views-field views-field-field-world-rank-jan-2020 clearfix col-sm-12'})
    rank.append(int(rank_tag.find('span').text))
hq_location = []
# Extracting company headquarters location from the company block tag
for tag in company_block_tag:
    hq_tag = tag.find('div', {'class':'views-field views-field-field-headquarters-of-company clearfix col-sm-12'})
    hq_location.append(hq_tag.find('span').text)
ceo_name = []
# Extracting CEO name from the company block tag
for tag in company_block_tag:
    ceo_tag = tag.find('div', {'class':'views-field views-field-field-ceo clearfix col-sm-12'})
    try:
        ceo = ceo_tag.find('span', {'class':'field-content'})
        ceo_name.append(ceo.text)
    except AttributeError:
        ceo_name.append(None)
market_cap = []
# Extracting company market capitalization from the company block tag
for tag in company_block_tag:
    market_cap_tag = tag.find('div', {'class':'views-field views-field-field-market-value-jan-2020 clearfix col-sm-12'})
    try:
        market_cap_value = market_cap_tag.find('span', {'class':'field-content'})
        market_cap.append(market_cap_value.text)
    except AttributeError:
        market_cap.append(None)
total_employee = []
# Extracting total number of employees from the company block tag
for tag in company_block_tag:
    employee_tag = tag.find('div',{'class':'views-field views-field-field-employee-count clearfix col-sm-12'})
    try:
        total_employee.append(employee_tag.find('span').text)
    except AttributeError:
        total_employee.append(None)
sectors = []
# Extracting the primary sector of the company from the company block tag
for tag in company_block_tag:
    sector_tag = tag.find('div', {'class':'views-field views-field-field-company-category-primary clearfix col-sm-12'})
    sectors.append(sector_tag.find('span').text)
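The field-extraction loops above all repeat the same find-a-div-then-read-its-span pattern. As a sketch (extract_field is a hypothetical helper, not part of the original notebook), the pattern can be factored into one function:
def extract_field(blocks, div_class, default=None):
    # Collect the text of the first <span> inside a div with the given class, one value per block
    values = []
    for block in blocks:
        div = block.find('div', {'class': div_class})
        span = div.find('span') if div else None
        values.append(span.text if span else default)
    return values

# For example, sectors could equivalently be collected as:
# extract_field(company_block_tag, 'views-field views-field-field-company-category-primary clearfix col-sm-12')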
# Check that every list has the same length, i.e. one entry per company
print(len(rank),len(ceo_name),len(market_cap),len(hq_location), len(total_employee), len(sectors))
base_url = 'https://value.today'
url = []
for tag in company_block_tag:
    url_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
    url.append(base_url + url_tag.find('a')['href'])
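Plain concatenation works here because the hrefs on this page are root-relative; urllib.parse.urljoin from the standard library is a more robust alternative that also handles absolute hrefs:
from urllib.parse import urljoin
urljoin(base_url, '/some/relative/path')  # hypothetical path; returns 'https://value.today/some/relative/path'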
This code creates a Python dictionary companies_name_dict which holds the scraped data in a structured format.
companies_name_dict = {
    'Name': companies_name,
    'Rank': rank,
    'Headquarter': hq_location,
    'CEO': ceo_name,
    'Market Capitalization': market_cap,
    'Total No. Of Employee': total_employee,
    'Sectors': sectors,
    'url': url
}
# Build a dataframe with one column per key in the dictionary above
companies_df = pd.DataFrame(companies_name_dict)
Clean up the "Market Capitalization" and "Total No. Of Employee" columns in the companies_df dataframe
# Remove the " Billion USD$" text from the "Market Capitalization" values using regex
companies_df['Market Capitalization'] = companies_df['Market Capitalization'].replace(" Billion USD$", "", regex=True)
# Remove commas from the "Total No. Of Employee" values using regex
companies_df['Total No. Of Employee'] = companies_df['Total No. Of Employee'].replace(',', '', regex=True)
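After stripping the text, both columns still hold strings; as written, the notebook only recovers numeric dtypes through the CSV round trip below. Converting explicitly (an added safeguard, not in the original) is more robust, and errors='coerce' turns any malformed entry into NaN instead of raising:
companies_df['Market Capitalization'] = pd.to_numeric(companies_df['Market Capitalization'], errors='coerce')
companies_df['Total No. Of Employee'] = pd.to_numeric(companies_df['Total No. Of Employee'], errors='coerce')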
# Display the cleaned-up dataframe
companies_df
from google.colab import drive
drive.mount('/content/drive')
Save the contents of the companies_df dataframe to a CSV file named "companies.csv". The file will be stored in the "DBS/CA" directory in the user's Google Drive
companies_df.to_csv('/content/drive/MyDrive/DBS/CA/companies.csv',index=False)
companies_df = pd.read_csv("/content/drive/MyDrive/DBS/CA/companies.csv")
companies_df.shape
companies_df.dtypes
companies_df.head(5)
companies_df.describe()
del companies_df["url"]
del companies_df["Sectors"]
sorted(companies_df["Name"])[0:15]
companies_df["Name"].describe()
companies_df["Headquarter"][0:15]
companies_df["Headquarter"].describe()
dummy_vector = pd.Series([1,None,3,None,7,8])
dummy_vector.isnull()
companies_df.isnull().sum()
# Drop all rows containing any NaN or NaT values
df1 = companies_df.dropna()
print(df1)
df1.isnull().sum()
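Dropping rows discards information; an alternative worth knowing (illustrative only, the notebook continues with notnull filtering below) is to impute missing numeric values, e.g. with the column median:
df2 = companies_df.copy()
df2['Total No. Of Employee'] = df2['Total No. Of Employee'].fillna(df2['Total No. Of Employee'].median())
df2.isnull().sum()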
import numpy as np
from sklearn.linear_model import LinearRegression
# Remove any rows from the companies_df dataframe where the "Total No. Of Employee" or "Market Capitalization" columns are null
# This ensures that we have complete data to work with, and that the model can be properly trained
companies_df = companies_df[companies_df['Total No. Of Employee'].notnull()]
companies_df = companies_df[companies_df['Market Capitalization'].notnull()]
# The feature matrix includes only the "Total No. Of Employee" column
# The target variable is the "Market Capitalization" column
X = companies_df[['Total No. Of Employee']]
y = companies_df['Market Capitalization']
# Create a LinearRegression object and fit the data to it
# This trains the model to predict market capitalization from the number of employees
model = LinearRegression()
model.fit(X, y)
new_employee_count = np.array([500]).reshape(-1, 1)
predicted_market_cap = model.predict(new_employee_count)
print(f"Predicted market capitalization for a company with 500 employees: {predicted_market_cap[0]:,.2f} billion USD")
import matplotlib.pyplot as plt
companies_df['Market Capitalization'].hist(bins=50)
plt.xlabel('Market Capitalization (in billions USD)')
plt.ylabel('Count')
plt.show()
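A scatter plot of the two regression variables (an illustrative addition, not in the original notebook) makes the strength of the relationship visible at a glance:
plt.scatter(companies_df['Total No. Of Employee'], companies_df['Market Capitalization'], alpha=0.5)
plt.xlabel('Total No. Of Employee')
plt.ylabel('Market Capitalization (in billions USD)')
plt.show()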
nlargest returns the 10 rows of the companies_df dataframe with the largest "Total No. Of Employee" values, ordered from largest to smallest.
top_10_employees = companies_df.nlargest(10, 'Total No. Of Employee')
print(top_10_employees[['Name', 'Total No. Of Employee']])
import pandas as pd
from sqlalchemy import create_engine
Set up the database connection information, including the username, password, database name and connection string
db_username = 'root'
db_password = 'root'
db_name = 'organizations'
db_connection_string = f'mysql+pymysql://{db_username}:{db_password}@localhost/{db_name}'
engine = create_engine(db_connection_string)
table_name = 'companies'
companies_df.to_sql(table_name, con=engine, if_exists='replace', index=False)
query = 'SELECT Name, `Total No. Of Employee` FROM companies ORDER BY `Total No. Of Employee` DESC LIMIT 10'
top_10_employees = pd.read_sql(query, con=engine)
print(top_10_employees)
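With the table in MySQL, other aggregations are a single query away; for example (a hypothetical follow-up, not in the original notebook), counting companies per headquarters location:
hq_query = 'SELECT Headquarter, COUNT(*) AS company_count FROM companies GROUP BY Headquarter ORDER BY company_count DESC LIMIT 10'
print(pd.read_sql(hq_query, con=engine))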