
This repository contains an end-to-end data pipeline for football league analysis and predictive modeling.


Football-League-Analytics (AWS and GCP Implementation)

  • Note: This project is deployed independently on AWS and GCP.

Objective:

To develop a robust data pipeline that automatically collects, stores, and preprocesses data for 950+ football leagues, facilitating advanced analysis and predictive modeling.

Stages:

  • Data Collection
  • Cloud Service Selection and Architecture Finalization
  • Data Modeling
  • Data Preprocessing
  • Data Updation Policy and Functions Mapping
  • Data Storage and Warehousing
  • Data Visualization and Predictive Modeling

Data Collection:

Football API Architecture:

[Image: Football API architecture diagram]

  • Python scripts are run on GCP with Cloud Functions and on AWS with Lambda functions.
  • Initially, all raw data from 2008-2022 was collected on a local system using a Python script.
  • Only fixtures data changes frequently (approx. 600+ matches are played per day).
  • Fixtures data is extracted automatically from the Football API (RapidAPI) by a serverless Cloud Function or Lambda function and staged in the raw data bucket (see the sketch after this list).
  • In GCP, Cloud Scheduler invokes the Cloud Function that performs the fixtures extraction.
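A minimal sketch of the scheduled fixtures-extraction function is shown below. The endpoint path, header names, environment variables, and bucket name are illustrative assumptions, not the project's exact configuration.

```python
import json
import os
from datetime import date

import requests
from google.cloud import storage

RAW_BUCKET = os.environ.get("RAW_BUCKET", "football-raw-data")       # hypothetical bucket name
API_KEY = os.environ["RAPIDAPI_KEY"]                                  # assumed environment variable
FIXTURES_URL = "https://api-football-v1.p.rapidapi.com/v3/fixtures"   # assumed RapidAPI endpoint


def extract_fixtures(request):
    """HTTP-triggered Cloud Function, invoked daily by Cloud Scheduler."""
    today = date.today().isoformat()
    resp = requests.get(
        FIXTURES_URL,
        headers={
            "X-RapidAPI-Key": API_KEY,
            "X-RapidAPI-Host": "api-football-v1.p.rapidapi.com",
        },
        params={"date": today},
        timeout=60,
    )
    resp.raise_for_status()

    # Stage the raw JSON payload in the raw data bucket, partitioned by date.
    blob_name = f"fixtures/raw/{today}.json"
    storage.Client().bucket(RAW_BUCKET).blob(blob_name).upload_from_string(
        json.dumps(resp.json()), content_type="application/json"
    )
    return f"Staged gs://{RAW_BUCKET}/{blob_name}", 200
```

The same logic ports to AWS Lambda by swapping the google-cloud-storage client for boto3 and triggering the function from EventBridge instead of Cloud Scheduler.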

Cloud Service Selection and Architecture Finalization:

Cloud Service Selection Criterion:

  • Serverless Architecture
  • Easy to Automate and Ease of Scheduling
  • Less Overhead
  • Real-time data processing
  • Easy to integrate with third party visualization and machine learning tools.

GCP Architecture:

[Image: GCP architecture diagram]

AWS Architecture:

[Image: AWS architecture diagram]

Data Modeling:

Schema

[Image: data model schema diagram]

Data Preprocessing:

  • Every incoming raw data file (.json) is preprocessed, converted into tabular data (.parquet), and written to the corresponding preprocessed Cloud Storage or S3 bucket (see the sketch after this list).
  • The preprocessing functions are triggered whenever new data arrives in the raw data bucket.
  • Audit log trigger: the GCP preprocessing Cloud Function is triggered when a new file is added to the raw data bucket.
  • S3 PUT event notification trigger: the AWS preprocessing Lambda function is triggered when a new file is added to the S3 raw data bucket.
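A minimal sketch of the event-triggered preprocessing function on GCP, assuming a Cloud Storage finalize trigger and a raw JSON payload whose records sit under a "response" key; the bucket names and payload layout are illustrative assumptions (Parquet output requires pyarrow).

```python
import json
import os

import pandas as pd
from google.cloud import storage

PREPROCESSED_BUCKET = os.environ.get("PREPROCESSED_BUCKET", "football-preprocessed")  # hypothetical


def preprocess_raw_file(event, context):
    """Triggered when a new object is finalized in the raw data bucket."""
    client = storage.Client()
    raw_blob = client.bucket(event["bucket"]).blob(event["name"])
    payload = json.loads(raw_blob.download_as_text())

    # Flatten the nested JSON records into a tabular DataFrame.
    df = pd.json_normalize(payload.get("response", []))

    # Write the table as Parquet into the preprocessed bucket, mirroring the raw path.
    local_path = "/tmp/out.parquet"
    df.to_parquet(local_path, index=False)
    out_name = event["name"].replace(".json", ".parquet")
    client.bucket(PREPROCESSED_BUCKET).blob(out_name).upload_from_filename(local_path)
```

The AWS counterpart reads the bucket and key from the S3 PUT event record and writes the Parquet file back with boto3.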

Data Updation Policy and Functions Mapping:

[Image: data updation policy and function mapping table]

Data Storage and Warehousing:

Storage (Object Storage)

  • GCP: Cloud Storage
  • AWS: S3
  • Preferred option for storing unstructured data; supports all kinds of file formats.
  • Used for staging the raw data and preprocessed data.

Data Warehousing - GCP: BigQuery, AWS: Redshift

  • Robust SQL querying capabilities; easy data transfer and integration.
  • Data is transferred on a schedule from the preprocessed bucket to BigQuery or Redshift (see the load sketch after this list).
  • SQL queries are run as required to produce the analytics data.
  • Easy to connect to third-party visualization tools such as Tableau.
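A minimal sketch of the scheduled load from the preprocessed bucket into BigQuery; the Redshift path would instead use a COPY from S3. The dataset, table, and bucket names are illustrative assumptions.

```python
from google.cloud import bigquery


def load_fixtures_to_bigquery():
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    uri = "gs://football-preprocessed/fixtures/*.parquet"  # hypothetical bucket/path
    table_id = "football_analytics.fixtures"               # hypothetical dataset.table
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the load job completes
    print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```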

For AWS Only

  • Glue Service: For Data Cataloging and ETL processes
  • Athena: leveraging the AWS Glue Data Catalog, Athena enables seamless querying of data from various sources. Used for ad hoc querying.

Data Visualization and Predictive Modeling

Visualization:

  • Tools: Looker, Tableau
  • Looker is connected directly to BigQuery.
  • Tableau is connected to BigQuery directly through Google account authentication.
  • Tableau is connected to Redshift or Athena through a JDBC or ODBC driver to access the data.

Predictive Modeling (Machine Learning):

  • GCP: Vertex AI
  • AWS: SageMaker
  • These services connect directly to the data warehouse to pull the updated data, perform predictive modeling, and deploy machine learning models at scale (see the sketch below).
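A minimal sketch of pulling the latest features from the warehouse and fitting a baseline classifier; the table and column names are illustrative assumptions, and the same flow can run inside a Vertex AI or SageMaker training job.

```python
from google.cloud import bigquery
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

client = bigquery.Client()
sql = """
    SELECT home_goals_avg, away_goals_avg, home_win   -- hypothetical feature/target columns
    FROM `football_analytics.match_features`          -- hypothetical table
"""
df = client.query(sql).to_dataframe()

X = df[["home_goals_avg", "away_goals_avg"]]
y = df["home_win"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression().fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```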

Live Dashboard Link:

Tableau Dashboard: Click Here

[Image: Tableau dashboard screenshot]

Looker Dashboard: Click Here

[Image: Looker dashboard screenshot]