ss.com is the leading marketplace for real estate, cars, and other goods. This project scrapes and aggregates its real-estate data to support an informed decision in the future, or at least to have the asking-price data on hand for analysis.
The main goal of this pipeline is (1) to get the job done and (2) to do so as cheaply as possible without sacrificing sanity.
- Development happens in the local environment; individual components are tested out in the production environment.
- Code pushed to GitHub repo
- EventBridge cron job
- Triggers Lambda function
- Wakes up the EC2 instance, runs the CI/CD scripts on it & triggers the ETL script through Systems Manager
- EC2 does the scraping & dumps the data in S3, after which it shuts down.
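A minimal sketch of the shell commands that could be sent to the instance over Systems Manager to cover the last two steps above; the repo path, script name & shutdown behaviour are assumptions, not the actual deployment:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical repo location -- the real path & script name may differ.
REPO_DIR="/home/ec2-user/scraper"

# Pull the latest scraper code from GitHub (the "CI/CD" step).
cd "$REPO_DIR"
git pull --ff-only

# Run the ETL script; it dumps its output to S3 before returning.
python3 etl.py

# Power off so the instance only accrues cost while working.
sudo shutdown -h now
```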
This setup keeps total costs minimal, at roughly 3.5 USD / month, and the data is easily accessible.
Purpose: triggers the Lambda function. This is a simple cron job that fires once a day.
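As a sketch, the daily trigger could be defined with a schedule expression like the one below; the rule name & time of day are illustrative placeholders, not the actual values:

```shell
# Hypothetical rule name & schedule -- fires once a day at 03:00 UTC.
aws events put-rule \
  --name daily-scrape-trigger \
  --schedule-expression "cron(0 3 * * ? *)" \
  --state ENABLED
```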
Purpose: orchestrates the EC2 instance – starts it, waits until it is running & then sends shell commands to it that update the scraper script from GitHub & execute it.
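A minimal sketch of what such a handler might look like, assuming boto3. The instance ID & shell commands are placeholders, and the clients can be passed in so the orchestration logic is exercisable without AWS:

```python
def orchestrate(instance_id, commands, ec2=None, ssm=None):
    """Start the instance, wait until it is running, then run shell
    commands on it via Systems Manager. Returns the SSM command ID."""
    if ec2 is None or ssm is None:
        import boto3  # only needed when real AWS clients are used
        ec2 = ec2 or boto3.client("ec2")
        ssm = ssm or boto3.client("ssm")

    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",  # built-in SSM document
        Parameters={"commands": commands},
    )
    return resp["Command"]["CommandId"]


def handler(event, context):
    # Placeholder instance ID & commands -- not the real deployment values.
    return orchestrate(
        "i-0123456789abcdef0",
        ["cd /home/ec2-user/scraper && git pull && python3 etl.py"],
    )
```

Injecting the clients keeps the start–wait–send sequence testable with stubs, which is handy given the "components tested out in production" constraint.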
Purpose: ties together the AWS services – allows Lambda to turn on the EC2 instance & send commands to it.
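The role's permissions boil down to something like the policy sketch below; the actions & wildcard resources are illustrative, and the real policy may be scoped more tightly:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:StartInstances", "ec2:DescribeInstances"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "ssm:SendCommand",
      "Resource": "*"
    }
  ]
}
```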
Purpose: does the main scraping job.
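The scraping itself is not reproduced here, but its core is plain HTML parsing. Below is a stdlib-only sketch under assumed markup – the real ss.com structure differs, and the column meanings are invented for illustration:

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collects each table row as a list of cell texts."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_td = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row[-1] += data.strip()


def parse_listings(html):
    """Assumed column order: description, area, price -- illustrative only."""
    p = ListingParser()
    p.feed(html)
    return [
        {"description": r[0], "area": r[1], "price": r[2]}
        for r in p.rows
        if len(r) >= 3
    ]
```

Sticking to the standard library keeps the EC2 bootstrap simple; swapping in BeautifulSoup or lxml would be a natural upgrade if the markup gets messier.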
Purpose: persistent storage for the scraped data.
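Dumping the day's data might look like the sketch below; the bucket name & key layout are assumptions. The key-building helper is pure, so it is easy to test without AWS:

```python
from datetime import date


def build_key(day, prefix="raw"):
    """Partition objects by date so downstream reads can prune by day."""
    return f"{prefix}/{day:%Y/%m/%d}/listings.csv"


def upload(body, day, bucket="my-scrape-bucket"):  # bucket name is a placeholder
    import boto3  # imported lazily so build_key stays AWS-free
    boto3.client("s3").put_object(Bucket=bucket, Key=build_key(day), Body=body)
```

Date-partitioned keys make it cheap to fetch one day's dump or to point a query engine at a date range later.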
This project is deployed manually using AWS-managed services, with infrastructure defined separately from application code.
TBU
LLMs are used throughout the project to help me learn new skills & to generate ideas and entry points into the things I want to achieve. That said, there is not a single line of code in the project that was generated by an LLM without being researched & understood by myself.
