AWS Web Scraping Pipeline: ss.com

Overview

ss.com is a leading classifieds marketplace for real estate, cars, and other goods. This project scrapes and aggregates its real estate listings so that asking-price data is available for analysis and can inform a future purchase decision.

The pipeline has two goals: (1) get the job done, and (2) do it as cheaply as possible without sacrificing sanity.

Architecture

Development happens in a local environment; individual components are tested in the production environment.
Code is pushed to the GitHub repo.
An EventBridge cron job triggers the Lambda function.
The Lambda function wakes up the EC2 instance and, through Systems Manager, runs the CI/CD scripts on it and triggers the ETL script.
The EC2 instance does the scraping and dumps the data in S3, after which it shuts down.

This setup keeps the total cost at roughly 3.5 USD per month, and the data remains easily accessible.

Services

EventBridge

Purpose: triggers the Lambda function. This is a simple cron job that fires once a day.
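As an illustration of the once-a-day trigger, here is a minimal sketch of how such a rule could be created with boto3. The rule name, target ID, firing hour, and the helper names `daily_schedule_expression` and `create_daily_rule` are assumptions made for this example; the project itself is deployed manually, so this is not necessarily how the rule was set up.

```python
def daily_schedule_expression(hour_utc: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression that fires once a day (UTC)."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour_utc must be 0-23 and minute must be 0-59")
    # EventBridge cron has six fields:
    #   minute hour day-of-month month day-of-week year
    # Day-of-month and day-of-week cannot both be set, hence the "?".
    return f"cron({minute} {hour_utc} * * ? *)"


def create_daily_rule(events_client, rule_name, lambda_arn, hour_utc=3):
    """Create (or update) a daily rule and point it at the Lambda function.

    events_client is expected to be boto3.client("events").
    rule_name, lambda_arn and hour_utc are placeholders for this sketch.
    """
    events_client.put_rule(
        Name=rule_name,
        ScheduleExpression=daily_schedule_expression(hour_utc),
        State="ENABLED",
    )
    events_client.put_targets(
        Rule=rule_name,
        # "scraper-lambda" is a hypothetical target ID.
        Targets=[{"Id": "scraper-lambda", "Arn": lambda_arn}],
    )
```

The Lambda function also needs a resource-based permission that allows `events.amazonaws.com` to invoke it; that step is omitted here.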

Lambda

Purpose: orchestrates the EC2 instance. It starts the instance, waits until it is running, and then sends it shell commands that update the scraper script from GitHub and execute it.
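The orchestration described above could look roughly like the sketch below. It is not the project's actual handler: the environment variable names (`INSTANCE_ID`, `REPO_DIR`, `SCRIPT_PATH`), the helper names, and the exact shell commands are assumptions. The boto3 calls used (`start_instances`, the `instance_running` waiter, and `send_command` with the `AWS-RunShellScript` document) are standard.

```python
import os


def build_commands(repo_dir, script_path):
    """Shell lines that update the scraper from GitHub and then run it."""
    return [
        "set -e",                 # stop at the first failing command
        f"cd {repo_dir}",
        "git pull --ff-only",     # fetch the latest scraper code
        f"python3 {script_path}",
    ]


def run_scraper(ec2, ssm, instance_id, commands):
    """Start the instance, wait until it is running, send the commands."""
    ec2.start_instances(InstanceIds=[instance_id])
    # The waiter polls DescribeInstances until the state is "running".
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
    return response["Command"]["CommandId"]


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    # All three variable names are hypothetical configuration for this sketch.
    commands = build_commands(os.environ["REPO_DIR"], os.environ["SCRIPT_PATH"])
    command_id = run_scraper(
        boto3.client("ec2"),
        boto3.client("ssm"),
        os.environ["INSTANCE_ID"],
        commands,
    )
    return {"command_id": command_id}
```

One caveat: an instance in the "running" state may not yet have its SSM agent online, in which case `send_command` can be rejected. Waiting on `instance_status_ok` instead, or retrying the call, is a possible workaround.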

Systems Manager

Purpose: ties the AWS services together. It allows Lambda to turn on the EC2 instance and send commands to it.
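Once a command has been sent, its outcome can be checked through Systems Manager as well. The following is a minimal sketch, assuming a command ID returned by an earlier `send_command` call; the function name `wait_for_command` and the polling parameters are made up for illustration and are not part of the project.

```python
import time

# Statuses after which the invocation will not change any more.
TERMINAL_STATES = {"Success", "Failed", "Cancelled", "TimedOut"}


def wait_for_command(ssm, command_id, instance_id,
                     poll_seconds=15, max_polls=240, sleep=time.sleep):
    """Poll Systems Manager until the command reaches a terminal status.

    ssm is expected to be boto3.client("ssm"). The sleep argument is
    injectable so the function can be exercised without real delays.
    """
    for _ in range(max_polls):
        invocation = ssm.get_command_invocation(
            CommandId=command_id,
            InstanceId=instance_id,
        )
        status = invocation["Status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"command {command_id} did not finish in time")
```

Because a Lambda invocation is limited to 15 minutes, a check like this is probably better suited to an ad hoc script than to the orchestrating function itself if the scrape runs long.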

EC2

Purpose: does the main scraping job.
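Since the instance is meant to shut down once scraping is done, the cost target depends on that shutdown happening even when the scrape fails. One way to guarantee it is a `try`/`finally` wrapper, sketched below. Both function names are hypothetical, and the use of `sudo shutdown` is an assumption about how the instance stops itself.

```python
import subprocess


def run_then_shut_down(scrape, shut_down):
    """Run the scrape and always call shut_down afterwards.

    The finally block runs whether scrape() returns or raises, so a
    crashed run does not leave the instance running and accruing cost.
    """
    try:
        return scrape()
    finally:
        shut_down()


def shut_down_host():
    """Power off the machine (assumes passwordless sudo on the instance)."""
    subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
```

On an EBS-backed instance with the default shutdown behavior, an operating-system shutdown stops the instance rather than terminating it, so it can be started again the next day.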

S3

Purpose: persistent storage for the scraped data.
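To make the storage step concrete, here is a minimal sketch of writing one day's scrape to S3 under a date-based key. The key layout, the `real-estate` prefix, the JSON format, and the function names are all assumptions for illustration; the README does not specify how objects are named or encoded.

```python
import datetime as dt
import json


def build_object_key(run_date, prefix="real-estate"):
    """Date-partitioned key, so one day's data is easy to locate later."""
    return (
        f"{prefix}/year={run_date:%Y}/month={run_date:%m}/"
        f"{run_date:%Y-%m-%d}.json"
    )


def upload_listings(s3, bucket, listings, run_date):
    """Serialize the listings as JSON and store them as a single object.

    s3 is expected to be boto3.client("s3"); bucket is a placeholder name.
    """
    key = build_object_key(run_date)
    # ensure_ascii=False keeps non-ASCII characters in listings readable.
    body = json.dumps(listings, ensure_ascii=False).encode("utf-8")
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType="application/json",
    )
    return key
```

A call such as `upload_listings(s3, "my-bucket", listings, dt.date.today())` would write one object per daily run.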

Deployment TBU

This project is deployed manually using AWS-managed services, with infrastructure defined separately from application code.

Security & Ethics TBU

Scraping

TBU

Use of LLM

LLMs are used throughout the project to help me learn new skills and to generate ideas and entry points for the things I want to achieve. No line of code in the project was generated by an LLM without being researched and understood by me.
