
Football Players Web Scraper & Stats Processor ⚽📊


This automated ETL pipeline extracts player statistics from Transfermarkt, transforms the data, and loads it into a REST API using message queuing for scalable processing.

Project Structure 🗂️

/project
├── main.py            # Producer: sends tasks to RabbitMQ
├── consumer.py        # Consumer: processes scraping tasks
├── scrapper.py        # Web scraping logic with BeautifulSoup
├── config.py          # Configuration (environment variables)
├── auth.py            # API authentication
├── data_manager.py    # Data processing (Pandas)
├── logger_config.py   # Logging configuration
├── save_data.py       # API saving functions
├── requirements.txt   # Dependencies
└── .github/workflows  # CI/CD with GitHub Actions

Prerequisites 📋

  • Python 3.8+
  • CloudAMQP account (free plan available)

Configuration ⚙️

  1. Environment variables: Create a .env file based on this example (see the loading sketch after this list):
# .env.example
API_AUTH_URL=https://your-api.com/auth/
API_PLAYERS_URL=https://your-api.com/players/
API_PLAYER_STATS_URL=https://your-api.com/stats/
API_STATS_BY_POSITION_URL=https://your-api.com/position-stats/
EMAIL=your_email@api.com
PASSWORD=your_password
CLOUDAMQP_URL=amqps://user:pass@host/vhost
  2. Install dependencies:
pip install -r requirements.txt
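
A minimal sketch of how config.py might read these variables, assuming the python-dotenv package (the variable names mirror .env.example; the actual loading code may differ):

# Illustrative sketch; assumes python-dotenv is in requirements.txt
import os
from dotenv import load_dotenv

load_dotenv()  # read variables from the local .env file

API_AUTH_URL = os.getenv("API_AUTH_URL")
API_PLAYERS_URL = os.getenv("API_PLAYERS_URL")
API_PLAYER_STATS_URL = os.getenv("API_PLAYER_STATS_URL")
API_STATS_BY_POSITION_URL = os.getenv("API_STATS_BY_POSITION_URL")
EMAIL = os.getenv("EMAIL")
PASSWORD = os.getenv("PASSWORD")
CLOUDAMQP_URL = os.getenv("CLOUDAMQP_URL")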

Usage 🚀

Local Execution

  1. Producer (sends tasks):
python main.py
  2. Consumer (processes tasks):
python consumer.py

Production Execution

  • The pipeline runs automatically every day at midnight UTC via GitHub Actions.
  • Consumers can be scaled in parallel for better performance.

🧩 Core Components

Producer (main.py)

  • Creates scraping tasks
  • Manages RabbitMQ connections
  • Handles task prioritization
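
A hedged sketch of this producer pattern, assuming the pika client; the queue name "scraping_tasks" and the task payload are illustrative, not necessarily the project's actual values:

# Illustrative producer sketch (pika; queue name and payload assumed)
import json
import pika
from config import CLOUDAMQP_URL  # assumed to be exposed by config.py

connection = pika.BlockingConnection(pika.URLParameters(CLOUDAMQP_URL))
channel = connection.channel()

# Durable, priority-enabled queue so tasks survive broker restarts
channel.queue_declare(queue="scraping_tasks", durable=True,
                      arguments={"x-max-priority": 10})

task = {"player": "illustrative-player-slug"}
channel.basic_publish(
    exchange="",
    routing_key="scraping_tasks",
    body=json.dumps(task),
    properties=pika.BasicProperties(delivery_mode=2,  # persistent message
                                    priority=5),
)
connection.close()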

Consumer (consumer.py)

  • Processes tasks from queue
  • Implements exponential backoff for retries
  • Manages API authentication
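
A hedged sketch of the consumer loop, again assuming pika; the retry limit, delays, and the process_task placeholder are illustrative:

# Illustrative consumer sketch with exponential backoff (pika)
import json
import time
import pika
from config import CLOUDAMQP_URL  # assumed to be exposed by config.py

def process_task(task):
    """Placeholder for the real work: scrape, transform, POST to the API."""
    ...

def handle(ch, method, properties, body):
    task = json.loads(body)
    for attempt in range(5):
        try:
            process_task(task)
            ch.basic_ack(delivery_tag=method.delivery_tag)
            return
        except Exception:
            time.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, 4s, 8s, 16s
    ch.basic_nack(delivery_tag=method.delivery_tag, requeue=False)  # give up

connection = pika.BlockingConnection(pika.URLParameters(CLOUDAMQP_URL))
channel = connection.channel()
channel.queue_declare(queue="scraping_tasks", durable=True,
                      arguments={"x-max-priority": 10})
channel.basic_qos(prefetch_count=1)  # one task at a time per worker
channel.basic_consume(queue="scraping_tasks", on_message_callback=handle)
channel.start_consuming()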

Architecture 🏗️

sequenceDiagram
    Producer->>+RabbitMQ: Enqueue scraping task
    RabbitMQ->>+Consumer: Deliver task
    Consumer->>+Transfermarkt: Scrape data
    Transfermarkt-->>-Consumer: Return HTML
    Consumer->>+DataProcessor: Clean/transform
    DataProcessor-->>-Consumer: Structured data
    Consumer->>+API: POST processed data
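
The Clean/transform step in the diagram is handled by data_manager.py with Pandas. A minimal sketch of what such a transform might look like; the column names and types are assumptions, not the project's actual schema:

# Illustrative transform sketch (Pandas; columns are assumed)
import pandas as pd

def transform(raw_rows):
    """Turn scraped rows into clean records ready for the API POST."""
    df = pd.DataFrame(raw_rows)
    df = df.dropna(subset=["player"])         # drop incomplete rows
    df["player"] = df["player"].str.strip()   # normalize scraped strings
    df["goals"] = pd.to_numeric(df["goals"], errors="coerce").fillna(0).astype(int)
    return df.to_dict(orient="records")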

GitHub Actions Workflow 🤖

name: Cron Job
on:
  schedule:
    - cron: '0 0 * * *'  # Daily execution at midnight UTC
jobs:
  run_producer:
    runs-on: ubuntu-latest      # runner/steps filled in for illustration
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: python main.py     # Sends tasks to RabbitMQ
  run_consumer:
    needs: run_producer         # Sequential dependency
    runs-on: ubuntu-latest
    strategy:
      matrix:
        worker: [1, 2, 3]       # Processes tasks with 3 parallel workers
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt && python consumer.py

Error Handling ⚠️

  • Automatic retries on connection failures
  • Persistent queues in RabbitMQ
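
The backoff loop in the consumer sketch covers task-level retries; connection failures can be handled the same way. A sketch of what retry-on-connection-failure might look like with pika (the attempt count and delays are illustrative):

# Illustrative connection-retry sketch (pika)
import time
import pika
from config import CLOUDAMQP_URL  # assumed to be exposed by config.py

def connect_with_retry(max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return pika.BlockingConnection(pika.URLParameters(CLOUDAMQP_URL))
        except pika.exceptions.AMQPConnectionError:
            time.sleep(2 ** attempt)  # back off before reconnecting
    raise RuntimeError("could not reach RabbitMQ")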

Future Improvements 🔮

  • Add monitoring with Prometheus/Grafana
  • Implement dead-letter queue for failed messages
  • Dockerize the application

Contributing 🤝

  1. Fork the project
  2. Create your branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Open a Pull Request
