Primary objective: To design and develop a Data Acquisition, Pre-processing and Data Analysis Pipeline
Data Acquisition Using Web Scraping with the Help of Beautiful Soup.
This is a Jupyter notebook for web scraping with the Beautiful Soup Python module. The code extracts data from a web page and saves it in different Python data structures for further data analysis and manipulation.
Original file is located at https://colab.research.google.com/github/manojdhadke/data_analysis/blob/main/Data%20Auquisition%20Using%20Webscraping%20Beautiful%20Soap%20Python%20Module.ipynb
Repository Link https://github.com/manojdhadke/data_analysis.git
Language: Python
Packages: jovian, beautifulsoup4, pandas, requests, sqlalchemy, pymysql
Install with pip
!pip install jovian --upgrade --quiet # jovian: for saving the notebook
!pip install beautifulsoup4 --upgrade --quiet # beautifulsoup4: for web scraping
!pip install pandas --upgrade --quiet # pandas: for data manipulation and analysis
!pip install requests --upgrade --quiet # requests: for HTTP requests
!pip install sqlalchemy # sqlalchemy: for connecting to the MySQL database
!pip install pymysql # pymysql: for connecting to the MySQL database using Python
The code scrapes data from the web page https://www.value.today/world/world-top-500-companies and extracts the following information for each company in the list:
- Company name
- Company rank
- Headquarters location
- CEO name
- Market capitalization
- Total number of employees
- Sector
- Company page URL
The data is saved in different Python data structures such as lists, dictionaries and Pandas dataframes.
import pandas as pd
import jovian
from bs4 import BeautifulSoup
import requests
import time
topic_url = "https://www.value.today/world/world-top-500-companies"
Use the requests library to send a GET request to the URL and store the response in a variable "response"
response = requests.get(topic_url)
response.status_code
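A non-200 status code means the download failed; a minimal defensive check (an addition, not in the original notebook) stops the pipeline early instead of scraping an error page:
if response.status_code != 200:
    raise Exception(f"Failed to load page {topic_url}: HTTP {response.status_code}")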
Extract the content of the response as a string using the "text" attribute and store it in a variable "page_content"
page_content = response.text
len(page_content) # Checking the length of the page_content
Let's have a look at the first 1,000 characters of the HTML that was served for the web page.
page_content[:1000]
with open("word's-top-500-companies-by-narket-capitalization.html", 'w') as file:
file.write(page_content)
Read back the HTML file we saved above so we can extract the data we need from the web page.
with open("word's-top-500-companies-by-narket-capitalization.html", 'r') as f:
html_source = f.read()
This code uses the BeautifulSoup library to create a BeautifulSoup object doc by parsing the HTML content stored in html_source. Passing 'html.parser' explicitly avoids BeautifulSoup's no-parser warning and keeps the parser choice deterministic.
doc = BeautifulSoup(html_source, 'html.parser')
type(doc)
title_tag = doc.title
title_tag
title_tag.text
jovian.commit(project="web-scraping-final")
This code block uses BeautifulSoup to find all div tags with the class 'row well views-row' in the page's HTML; each such div holds one company's details. We then look at the first element of the resulting list.
company_block_tag = doc.find_all('div', {'class':'row well views-row'})
company_block_tag[:1]
len(company_block_tag)
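Before writing the field extractors, it helps to inspect one block's markup; prettify() is a standard BeautifulSoup method that re-indents a tag's HTML.
print(company_block_tag[0].prettify()[:500])  # first 500 characters of the first company block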
companies_name = []
for tag in company_block_tag:
    company_name_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
    companies_name.append(company_name_tag.find('a').text)
Let's look at the first five company names to confirm they were extracted correctly, and check the length of the list to verify we captured every name on the page.
print(companies_name[:5])
len(companies_name)
rank = []
# Extracting company rank from the company block tag
for tag in company_block_tag:
    rank_tag = tag.find('div', {'class':'views-field views-field-field-world-rank-jan-2020 clearfix col-sm-12'})
    rank.append(int(rank_tag.find('span').text))
hq_location = []
# Extracting company headquarters location from the company block tag
for tag in company_block_tag:
    hq_tag = tag.find('div', {'class':'views-field views-field-field-headquarters-of-company clearfix col-sm-12'})
    hq_location.append(hq_tag.find('span').text)
ceo_name = []
# Extracting CEO name from the company block tag
for tag in company_block_tag:
    ceo_tag = tag.find('div', {'class':'views-field views-field-field-ceo clearfix col-sm-12'})
    try:
        ceo = ceo_tag.find('span', {'class':'field-content'})
        ceo_name.append(ceo.text)
    except AttributeError:
        ceo_name.append(None)
market_cap = []
# Extracting company market capitalization from the company block tag
for tag in company_block_tag:
    market_cap_tag = tag.find('div', {'class':'views-field views-field-field-market-value-jan-2020 clearfix col-sm-12'})
    try:
        market_cap_value = market_cap_tag.find('span', {'class':'field-content'})
        market_cap.append(market_cap_value.text)
    except AttributeError:
        market_cap.append(None)
total_employee = []
# Extracting total number of employees from the company block tag
for tag in company_block_tag:
    employee_tag = tag.find('div',{'class':'views-field views-field-field-employee-count clearfix col-sm-12'})
    try:
        total_employee.append(employee_tag.find('span').text)
    except AttributeError:
        total_employee.append(None)
sectors = []
# Extracting the primary sector of the company from the company block tag
for tag in company_block_tag:
    sector_tag = tag.find('div', {'class':'views-field views-field-field-company-category-primary clearfix col-sm-12'})
    sectors.append(sector_tag.find('span').text)
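The field-extraction loops above all repeat the same find-a-div-then-read-its-span pattern. As a sketch (extract_field is a hypothetical helper, not part of the original notebook), the pattern can be factored into one function:
def extract_field(blocks, div_class, default=None):
    # Collect the text of the first <span> inside a div with the given class, one value per block
    values = []
    for block in blocks:
        div = block.find('div', {'class': div_class})
        span = div.find('span') if div else None
        values.append(span.text if span else default)
    return values

# For example, sectors could equivalently be collected as:
# extract_field(company_block_tag, 'views-field views-field-field-company-category-primary clearfix col-sm-12')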
# Check that every list has the same length, i.e. one entry per company
print(len(rank),len(ceo_name),len(market_cap),len(hq_location), len(total_employee), len(sectors))
base_url = 'https://value.today'
url = []
for tag in company_block_tag:
    url_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
    url.append(base_url + url_tag.find('a')['href'])
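Plain concatenation works here because the hrefs on this page are root-relative; urllib.parse.urljoin from the standard library is a more robust alternative that also handles absolute hrefs:
from urllib.parse import urljoin
urljoin(base_url, '/some/relative/path')  # hypothetical path; returns 'https://value.today/some/relative/path'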
This code creates a Python dictionary companies_name_dict which holds the scraped data in a structured format.
companies_name_dict = {
    'Name': companies_name,
    'Rank': rank,
    'Headquarter': hq_location,
    'CEO': ceo_name,
    'Market Capitalization': market_cap,
    'Total No. Of Employee': total_employee,
    'Sectors': sectors,
    'url': url
}
# Build a dataframe with one column per key in the dictionary above
companies_df = pd.DataFrame(companies_name_dict)
Clean up the "Market Capitalization" and "Total No. Of Employee" columns in the companies_df dataframe
# Remove the " Billion USD$" text from the "Market Capitalization" values using regex
companies_df['Market Capitalization'] = companies_df['Market Capitalization'].replace(" Billion USD$", "", regex=True)
# Remove commas from the "Total No. Of Employee" values using regex
companies_df['Total No. Of Employee'] = companies_df['Total No. Of Employee'].replace(',', '', regex=True)
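After stripping the text, both columns still hold strings; as written, the notebook only recovers numeric dtypes through the CSV round trip below. Converting explicitly (an added safeguard, not in the original) is more robust, and errors='coerce' turns any malformed entry into NaN instead of raising:
companies_df['Market Capitalization'] = pd.to_numeric(companies_df['Market Capitalization'], errors='coerce')
companies_df['Total No. Of Employee'] = pd.to_numeric(companies_df['Total No. Of Employee'], errors='coerce')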
# Display the cleaned-up dataframe
companies_df
from google.colab import drive
drive.mount('/content/drive')
Save the contents of the companies_df dataframe to a CSV file named "companies.csv". The file will be stored in the "DBS/CA" directory in the user's Google Drive
companies_df.to_csv('/content/drive/MyDrive/DBS/CA/companies.csv',index=False)
companies_df = pd.read_csv("/content/drive/MyDrive/DBS/CA/companies.csv")
companies_df.shape
companies_df.dtypes
companies_df.head(5)
companies_df.describe()
del companies_df["url"]
del companies_df["Sectors"]
sorted(companies_df["Name"])[0:15]
companies_df["Name"].describe()
companies_df["Headquarter"][0:15]
companies_df["Headquarter"].describe()
dummy_vector = pd.Series([1,None,3,None,7,8])
dummy_vector.isnull()
companies_df.isnull().sum()
# Drop all rows containing any NaN or NaT values
df1 = companies_df.dropna()
print(df1)
df1.isnull().sum()
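Dropping rows discards information; an alternative worth knowing (illustrative only, the notebook continues with notnull filtering below) is to impute missing numeric values, e.g. with the column median:
df2 = companies_df.copy()
df2['Total No. Of Employee'] = df2['Total No. Of Employee'].fillna(df2['Total No. Of Employee'].median())
df2.isnull().sum()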
import numpy as np
from sklearn.linear_model import LinearRegression
# Remove any rows from the companies_df dataframe where the "Total No. Of Employee" or "Market Capitalization" columns are null
# This ensures that we have complete data to work with, and that the model can be properly trained
companies_df = companies_df[companies_df['Total No. Of Employee'].notnull()]
companies_df = companies_df[companies_df['Market Capitalization'].notnull()]
# The feature matrix includes only the "Total No. Of Employee" column
# The target variable is the "Market Capitalization" column
X = companies_df[['Total No. Of Employee']]
y = companies_df['Market Capitalization']
# Create a LinearRegression object and fit the data to it
# This trains the model to predict market capitalization from the number of employees
model = LinearRegression()
model.fit(X, y)
new_employee_count = np.array([500]).reshape(-1, 1)
predicted_market_cap = model.predict(new_employee_count)
print(f"Predicted market capitalization for a company with 500 employees: {predicted_market_cap[0]:,.2f} billion USD")
import matplotlib.pyplot as plt
companies_df['Market Capitalization'].hist(bins=50)
plt.xlabel('Market Capitalization (in billions USD)')
plt.ylabel('Count')
plt.show()
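A scatter plot of the two regression variables (an illustrative addition, not in the original notebook) makes the strength of the relationship visible at a glance:
plt.scatter(companies_df['Total No. Of Employee'], companies_df['Market Capitalization'], alpha=0.5)
plt.xlabel('Total No. Of Employee')
plt.ylabel('Market Capitalization (in billions USD)')
plt.show()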
nlargest returns the 10 rows of the companies_df dataframe with the largest "Total No. Of Employee" values, ordered from largest to smallest.
top_10_employees = companies_df.nlargest(10, 'Total No. Of Employee')
print(top_10_employees[['Name', 'Total No. Of Employee']])
import pandas as pd
from sqlalchemy import create_engine
Set up the database connection information, including the username, password, database name and connection string
db_username = 'root'
db_password = 'root'
db_name = 'organizations'
db_connection_string = f'mysql+pymysql://{db_username}:{db_password}@localhost/{db_name}'
engine = create_engine(db_connection_string)
table_name = 'companies'
companies_df.to_sql(table_name, con=engine, if_exists='replace', index=False)
query = 'SELECT Name, `Total No. Of Employee` FROM companies ORDER BY `Total No. Of Employee` DESC LIMIT 10'
top_10_employees = pd.read_sql(query, con=engine)
print(top_10_employees)
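With the table in MySQL, other aggregations are a single query away; for example (a hypothetical follow-up, not in the original notebook), counting companies per headquarters location:
hq_query = 'SELECT Headquarter, COUNT(*) AS company_count FROM companies GROUP BY Headquarter ORDER BY company_count DESC LIMIT 10'
print(pd.read_sql(hq_query, con=engine))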