diodiom/ss_scraper

AWS Web Scraping Pipeline: ss.com

Overview

ss.com is the leading marketplace for real estate, cars, and other goods. This project scrapes and aggregates its real estate listings so that asking-price data is available for analysis and for making an informed decision in the future.

The main goal of this pipeline is to (1) get the job done and (2) do so as cheaply as possible without sacrificing sanity.

Architecture

A diagram of pipeline architecture

  1. Development happens in a local environment; individual components are tested in the production environment.
  2. Code is pushed to the GitHub repo.
  3. An EventBridge cron job fires on schedule.
  4. The cron job triggers a Lambda function.
  5. Lambda wakes up the EC2 instance, runs the CI/CD scripts on it, and triggers the ETL script through Systems Manager.
  6. EC2 does the scraping and dumps the data in S3, after which it shuts down.

This setup keeps the total cost at roughly 3.5 USD per month, and the data is easily accessible.

Services

EventBridge

Purpose: triggers the Lambda function. This is a simple cron job that fires once a day.
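
To illustrate what a once-a-day schedule looks like, here is a minimal sketch that builds an EventBridge cron expression. The helper name `daily_cron` and the 06:00 UTC run time are assumptions made for the example; the actual schedule used by this project is not stated here.

```python
def daily_cron(hour_utc: int, minute: int = 0) -> str:
    """Build an EventBridge cron expression that fires once a day (UTC).

    EventBridge cron expressions have six fields: minutes, hours,
    day-of-month, month, day-of-week, year. One of the two day fields
    must be '?', so day-of-week is left unspecified here.
    """
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"cron({minute} {hour_utc} * * ? *)"


if __name__ == "__main__":
    # Hypothetical run time: every day at 06:00 UTC.
    print(daily_cron(6))
```

The resulting string is the kind of value that goes into the `ScheduleExpression` argument of boto3's `events.put_rule`.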

Lambda

Purpose: orchestrates the EC2 instance. It starts the instance, waits until it is running, and then sends it shell commands that update the scraper script from GitHub and execute it.
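
The start, wait, and send sequence described above could look roughly like the sketch below. It is not the project's actual handler: the environment variable `SCRAPER_INSTANCE_ID`, the repository path, and the script name `scraper.py` are hypothetical. The clients are passed in as arguments so the orchestration logic can be exercised without AWS access.

```python
import os


def start_and_run(ec2, ssm, instance_id, commands):
    """Start the instance, wait until it is running, then send shell commands.

    `ec2` and `ssm` are expected to be boto3 clients, e.g.
    boto3.client("ec2") and boto3.client("ssm").
    Returns the ID of the Systems Manager command that was sent.
    """
    ec2.start_instances(InstanceIds=[instance_id])
    # Block until EC2 reports the instance state as "running".
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    response = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": list(commands)},
    )
    return response["Command"]["CommandId"]


def lambda_handler(event, context):
    # boto3 ships with the AWS Lambda Python runtime.
    import boto3

    # Hypothetical environment variable holding the EC2 instance ID.
    instance_id = os.environ["SCRAPER_INSTANCE_ID"]
    # Hypothetical path and script name, for illustration only.
    commands = ["cd /home/ec2-user/ss_scraper && git pull && python3 scraper.py"]
    command_id = start_and_run(
        boto3.client("ec2"), boto3.client("ssm"), instance_id, commands
    )
    return {"instance_id": instance_id, "command_id": command_id}
```

Two practical caveats: the `instance_running` waiter only confirms the EC2 state, and the SSM agent may need a little longer to register, so a real handler may need a retry around `send_command`. The Lambda timeout also has to be raised well above its default of 3 seconds to cover the wait.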

Systems Manager

Purpose: ties the AWS services together. It allows Lambda to turn on the EC2 instance and send commands to it.
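
As a rough sketch of what "sending commands" means in practice, the helper below builds the request for the `AWS-RunShellScript` document, which is what boto3's `ssm.send_command` expects. The function name, the repository directory, the script name, and the one-hour timeout are all assumptions for illustration.

```python
import shlex


def build_shell_command_request(instance_id, repo_dir, script, timeout_s=3600):
    """Return keyword arguments for boto3's ssm.send_command(**request).

    AWS-RunShellScript takes its parameters as lists of strings:
    - workingDirectory: where the commands run on the instance
    - executionTimeout: seconds before the command is stopped
    - commands: the shell lines to execute
    """
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {
            "workingDirectory": [repo_dir],
            "executionTimeout": [str(timeout_s)],
            # '&&' makes sure the scraper only runs if the update succeeded.
            "commands": [f"git pull --ff-only && python3 {shlex.quote(script)}"],
        },
    }


if __name__ == "__main__":
    # Hypothetical instance ID, path, and script name.
    request = build_shell_command_request(
        "i-0123456789abcdef0", "/home/ec2-user/ss_scraper", "scraper.py"
    )
    print(request["Parameters"]["commands"][0])
```

With a real client this would be used as `boto3.client("ssm").send_command(**request)`. The instance needs the SSM agent running and an instance profile that permits Systems Manager access for the command to arrive.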

EC2

Purpose: does the main scraping job.

S3

Purpose: persistent storage for the scraped data.
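
One way the scraper could dump a day's results into S3 is sketched below. The key layout, the `real-estate` prefix, the CSV format, and the column names in the usage example are assumptions, not the project's actual schema. The `s3` argument is expected to be a boto3 S3 client.

```python
import csv
import io
from datetime import date


def snapshot_key(day: date, prefix: str = "real-estate") -> str:
    """Build a date-partitioned object key, one file per scraping day."""
    return f"{prefix}/year={day:%Y}/month={day:%m}/day={day:%d}/listings.csv"


def rows_to_csv(rows) -> bytes:
    """Serialize a list of dicts to UTF-8 CSV with a sorted header row."""
    fieldnames = sorted({key for row in rows for key in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return buffer.getvalue().encode("utf-8")


def upload_snapshot(s3, bucket: str, rows, day: date) -> str:
    """Upload one day's rows with s3.put_object and return the key used."""
    key = snapshot_key(day)
    s3.put_object(Bucket=bucket, Key=key, Body=rows_to_csv(rows))
    return key
```

A call such as `upload_snapshot(boto3.client("s3"), "my-bucket", rows, date.today())` would write one object per day, which keeps earlier snapshots intact and makes it easy to load a date range later.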

Deployment TBU

This project is deployed manually using AWS-managed services, with infrastructure defined separately from application code.

Security & Ethics TBU

Scraping

TBU

Use of LLM

LLMs are used throughout the project to help me learn new skills and to generate ideas and entry points into the things I want to achieve. The project does not contain a single line of LLM-generated code that I have not researched and understood myself.
