As an enthusiastic collector of antique coins, I have always been fascinated by the rich history each piece embodies. These coins are not only currency, but also links to our past. Some may have been used in times of war, others for everyday transactions like buying food, books, or clothes, or acquiring a new home. Each coin holds a story, a glimpse into the lives and times of those who once held them.
This is Part IV of my Web Scraping series; if you want to see the first three parts, here are the links:
- Web_Scraping_IMDB_Most_Popular_Movies
- Web_Scraping_X_Feed_Selenium
- Web_Scraping_Practice_w_Instagram_and_GitHub
python get-pip.py
You need to find your Python executable's location and add it to Path. You can generally find it under C:\Python, or at a path that looks like this:
C:\Users\USER\AppData\Local\Programs\Python
Then search for “Edit the system environment variables”, click on “Environment Variables”, and add this path there.
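To double-check that the interpreter picked up from Path is the one you just added, here is a small sanity check of my own (not part of the original steps) you can run in Python after reopening the terminal:
import sys  # quick check of which interpreter is actually being used
print(sys.executable)  # should point inside the install folder you added to Path
print(sys.version)     # the version that interpreter reports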
python -m venv venv
When you create a virtual environment named venv, you will find a Scripts folder inside it, and in that folder there is a file called activate. This is a batch file, and we will activate our environment with it:
venv\Scripts\activate
Now we are ready to use our virtual environment.
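If you want to confirm the activation worked, a minimal check (my own addition, not from the original steps) is to ask Python whether it is running from inside the venv:
import sys
print(sys.prefix)                     # points into the venv folder when it is active
print(sys.prefix != sys.base_prefix)  # True inside a virtual environment, False outside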
If you run into an “Execution_Policies” problem, you can run the following command in PowerShell:
Set-ExecutionPolicy RemoteSigned
python -m pip install "package-name"
That’s it. We can install any package we want, with any version we need, without making our environment messier or dealing with problems caused by all the other modules sharing one environment.
deactivate
When we are done, we can simply close our virtual environment with deactivate.
You can check this documentation for more: Virtual Environments and Packages
Let’s install our packages:
pip install Scrapy
pip install pandas
pip install numpy
Scrapy is, simply put, a high-level web crawling and scraping framework that helps us extract structured data from websites. It can be used for various purposes, including data mining, monitoring, and automated testing.
Documentation => Scrapy 2.11 documentation
scrapy startproject "project_name"
After we start our project, the folder contains files like settings.py (where you can configure your spider settings), items.py, pipelines.py, and so on. The most important part is the spiders folder: that is where we configure our spiders, and we can create different spiders for different jobs. To begin with, I suggest you check the documentation above; it really helps you understand what is going on.
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
        self.log(f"Saved file {filename}")
You can find examples like this in the documentation; a spider's structure looks like the one above. The important thing is that name must be unique, since you will use the spider's name to run the crawl.
There is a shortcut for the start_requests method:
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
scrapy crawl quotes  # the quotes spider will get into action
scrapy crawl quotes -o quotes.json  # with this you can save the data in JSON format
scrapy shell "url"  # with this you can load a page directly into the shell to analyze it
- From the scrapy shell you can analyze the response directly and see the output:
response.css("title::text").get()
# Output:
'Quotes to Scrape'
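You can keep experimenting with other selectors in the same shell session; for example, assuming the usual quotes.toscrape.com markup (div.quote blocks containing span.text and small.author), something like this should return the quotes and their authors:
# selectors assume the quotes.toscrape.com page structure
response.css("div.quote span.text::text").getall()      # all quote texts on the page
response.css("div.quote small.author::text").getall()   # the matching author names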
That’s it for now; for more information you can check the documentation linked above, which is enough to give you a deeper understanding.
Pandas is an open-source Python library that has changed the game in data analysis and manipulation. Think of pandas as a Swiss Army knife: it’s powerful yet user-friendly, complex yet approachable, and it’s the tool for anyone looking to make sense of data.
With pandas, tasks like reading data from various sources, cleaning it to a usable format, exploring it to find trends, and even visualizing it for presentations are simplified.
Why pandas? Because it streamlines complex processes into one or two lines of code — processes that otherwise would have taken countless steps in traditional programming languages. It’s especially popular in academic research, finance, and commercial data analytics because of its ability to handle large datasets efficiently and intuitively.
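Just as a taste of that claim (installation and a proper walk-through follow below), reading a data file and summarizing it really is about two lines; the file name here is only a placeholder:
import pandas as pd

df = pd.read_csv("some_file.csv")  # placeholder file name: load a CSV into a DataFrame
print(df.describe())               # summary statistics for the numeric columns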
pip install pandas
import pandas as pd
import numpy as np
df_exe = pd.DataFrame(
    {
        "One": 1.0,
        "Time data": pd.Timestamp("20130102"),
        "Series": pd.Series(1, index=list(range(4)), dtype="float32"),
        "Numpy Array": np.array([3] * 4, dtype="int32"),
        "Catalog": pd.Categorical(["Chair", "Tv", "Mirror", "Sofa"]),
        "F": "example",
    }
)
df_exe
df_exe[df_exe["Catalog"]=="Mirror"]
We will explore more in the project, so for now we are done with it; for detailed information you can check the pandas documentation.
When I checked the website, I saw its structure and decided to get the seller name, the coin (money) itself, and its price, so I inspected the HTML structure to see what to extract.
scrapy shell "https://www.vcoins.com/en/coins/world-1945.aspx"
response.css("div.item-link a::text").extract()
response.css("p.description a::text").extract()
response.css("div.prices span.newitemsprice::text").extract()[::2]
response.css("div.prices span.newitemsprice::text").extract()[1::2]
This is the data from a single page, and I want my spider to go through all the available pages and return the data, so I will check the pagination section at the bottom.
It keeps going until there are no more pages. I first did this project with plain text output, then with CSV (comma-separated values) output for data analysis.
import scrapy  # Import the scrapy library
import csv  # Import the csv library


# Define a new spider class which inherits from scrapy.Spider.
class MoneySpider(scrapy.Spider):
    name = "moneyspider_csv"
    page_count = 0
    money_count = 1
    start_urls = ["https://www.vcoins.com/en/coins/world-1945.aspx"]

    def start_requests(self):
        self.file = open('money.csv', 'w', newline='', encoding='UTF-8')  # Open a new CSV file in write mode.
        self.writer = csv.writer(self.file)  # Create a CSV writer object.
        self.writer.writerow(['Count', 'Seller', 'Money', 'Price'])  # Write the header row in the CSV file.
        return [scrapy.Request(url=url) for url in self.start_urls]  # Return a list of scrapy.Request objects for each URL.

    # This method processes the response from each URL
    def parse(self, response):
        # Extract the names
        money_names = response.css("div.item-link a::text").extract()
        # Extract the years
        money_years = response.css("p.description a::text").extract()
        # Extract the currency symbols
        money_symbols = response.css("div.prices span.newitemsprice::text").extract()[::2]
        # Extract the prices
        money_prices = response.css("div.prices span.newitemsprice::text").extract()[1::2]
        # Combine the currency symbols and prices
        combined_prices = [money_symbols[i] + money_prices[i] for i in range(len(money_prices))]
        # Loop through the extracted items and write each to a row in the CSV file.
        for i in range(len(money_names)):
            self.writer.writerow([self.money_count, money_names[i], money_years[i], combined_prices[i]])
            self.money_count += 1
        # Extract the URL for the next page
        next_page_url = response.css("div.pagination a::attr(href)").extract_first()
        # If there is a URL for the next page, construct the full URL and continue scraping.
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            self.page_count += 1
            if self.page_count != 10:
                yield scrapy.Request(url=absolute_next_page_url, callback=self.parse, dont_filter=True)
            else:
                self.file.close()
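As a side note, instead of managing the CSV file by hand, the spider could also yield items and let Scrapy's feed export (the -o flag shown earlier) write the file. A minimal sketch of that variant, reusing the same selectors; the spider name is just an example and pagination is omitted for brevity:
import scrapy


class MoneyItemsSpider(scrapy.Spider):
    # Hypothetical variant of the spider above: yield dictionaries and run
    #   scrapy crawl moneyspider_items -o money.csv
    # so that Scrapy handles the CSV writing itself.
    name = "moneyspider_items"
    start_urls = ["https://www.vcoins.com/en/coins/world-1945.aspx"]

    def parse(self, response):
        sellers = response.css("div.item-link a::text").extract()
        coins = response.css("p.description a::text").extract()
        prices = response.css("div.prices span.newitemsprice::text").extract()
        symbols, amounts = prices[::2], prices[1::2]
        for seller, coin, symbol, amount in zip(sellers, coins, symbols, amounts):
            yield {"Seller": seller, "Money": coin, "Price": symbol + amount}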
We successfully extracted the CSV data; now it is time to analyze it.
import pandas as pd
import numpy as np
test = pd.read_csv("money.csv",index_col="Count")
test
test.shape
test.info()
test.describe()
test.isnull().sum()
test.drop_duplicates().sort_values(by="Price",ascending=False).head(25)
# Regular expression(Regex) pattern to match 2, 3, or 4 consecutive digits.
pattern = r'(\b\d{4}\b|\b\d{3}\b|\b\d{2}\b)'
test['Extracted_Year'] = test['Money'].str.extract(pattern, expand=False)
test['Extracted_Year'] = pd.to_numeric(test['Extracted_Year'], errors='coerce').fillna(-1).astype(int)
test.drop_duplicates().sort_values(by='Extracted_Year',ascending=False).head(60)
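To see what that pattern actually picks up, here is a tiny standalone check with Python's re module (the sample strings are made up for illustration; like str.extract, re.search returns only the first match):
import re

pattern = r'(\b\d{4}\b|\b\d{3}\b|\b\d{2}\b)'
samples = ["Elizabeth II 1966 Gillick Sovereign MS64", "50 Cents ND", "Medal, no date"]
for s in samples:
    match = re.search(pattern, s)
    print(s, "->", match.group(1) if match else None)
# Elizabeth II 1966 Gillick Sovereign MS64 -> 1966
# 50 Cents ND -> 50
# Medal, no date -> None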
def clean_price(price):
    price = price.replace('US$', '').replace('€', '').replace('£', '').replace('NOK', '')
    price = price.replace(',', '').replace('.', '')
    price = price.strip()
    return price
# Apply the cleaning function to Price Column
test['Price'] = test['Price'].apply(clean_price)
test['Price'] = pd.to_numeric(test['Price'], errors='coerce')
test[["Money", "Price"]].drop_duplicates().sort_values(by="Price", ascending=False).head(40)
- This is not the correct thing to do, but I wanted to show you the cleaning process; I haven’t decided how to correctly sort my values yet because there are different currencies.
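One possible way forward would be to convert every price to a single currency before sorting. Here is a rough sketch of that idea, applied to the raw price strings before clean_price strips the symbols; the exchange rates below are made-up placeholder values, not real data, and US-style number formatting is assumed:
# placeholder exchange rates to US$ (illustrative only, not real rates)
rates_to_usd = {"US$": 1.0, "€": 1.1, "£": 1.3, "NOK": 0.09}

def price_to_usd(raw_price):
    # find the currency prefix, strip it, and convert the remaining amount
    for symbol, rate in rates_to_usd.items():
        if raw_price.startswith(symbol):
            amount = float(raw_price.replace(symbol, "").replace(",", "").strip())
            return amount * rate
    return None  # unknown currency

# Example (hypothetical column kept before cleaning): "US$1,250.00" -> 1250.0
# test["Price_USD"] = test["Raw_Price"].apply(price_to_usd)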
test[test["Money"].apply(lambda x : x.startswith("Elizabeth"))]
test[test["Seller"].apply(lambda x : x.startswith("Sovereign"))]
test[test["Money"].isin(["Elizabeth II 1966 Gillick Sovereign MS64"])]
If you want to understand this in simpler language, you can check my Medium article published on Level Up Coding.