- Note: This project is deployed independently on AWS and GCP.
To develop a robust data pipeline that automatically collects, stores, and preprocesses data from 950+ football leagues, facilitating advanced analysis and predictive modeling.
- Data Collection
- Cloud Service Selection and Architecture Finalization
- Data Modeling
- Data Preprocessing
- Data Update Policy and Functions Mapping
- Data Storage and Warehousing
- Data Visualization and Predictive Modeling
- To run the Python scripts: GCP uses Cloud Functions, AWS uses Lambda functions.
- Initially, all raw data from 2008-2022 was collected on a local system using a Python script.
- Only the fixtures data changes frequently (approx. 600+ matches are played per day).
- Fixtures data is extracted automatically from the Football API (RapidAPI) by a serverless Cloud Function or Lambda function and staged in the raw data bucket.
- In GCP, Cloud Scheduler invokes the Cloud Function for fixtures extraction (see the sketch below).
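A minimal sketch of the scheduled fixtures-extraction function is shown below, assuming an HTTP-triggered GCP Cloud Function invoked by Cloud Scheduler; the bucket name, object path, and environment variable names are illustrative assumptions, not the project's actual values.

```python
# Sketch of the GCP fixtures-extraction Cloud Function (HTTP trigger).
# RAW_BUCKET and RAPIDAPI_KEY are assumed names, set via environment variables.
import datetime
import json
import os

import requests
from google.cloud import storage

RAW_BUCKET = os.environ.get("RAW_BUCKET", "football-raw-data")          # assumed bucket name
API_URL = "https://api-football-v1.p.rapidapi.com/v3/fixtures"          # RapidAPI Football endpoint


def extract_fixtures(request):
    """Invoked by Cloud Scheduler; stages today's fixtures as raw JSON."""
    today = datetime.date.today().isoformat()
    headers = {
        "X-RapidAPI-Key": os.environ["RAPIDAPI_KEY"],
        "X-RapidAPI-Host": "api-football-v1.p.rapidapi.com",
    }
    response = requests.get(API_URL, headers=headers, params={"date": today}, timeout=60)
    response.raise_for_status()

    # Write the raw payload to the raw data bucket; preprocessing is triggered from there.
    blob = storage.Client().bucket(RAW_BUCKET).blob(f"fixtures/{today}.json")
    blob.upload_from_string(json.dumps(response.json()), content_type="application/json")
    return f"Staged fixtures for {today}", 200
```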
- Serverless Architecture
- Easy to automate and schedule
- Less Overhead
- Real-time data processing
- Easy to integrate with third-party visualization and machine learning tools.
- All incoming raw data files (.json) are preprocessed, converted into tabular data (.parquet), and written to the respective preprocessed Cloud Storage or S3 bucket.
- The preprocessing function is triggered whenever new data arrives in the raw data bucket (see the sketch after this list).
- Audit log trigger: the GCP preprocessing Cloud Function is triggered when a new file is added to the raw data bucket.
- S3 Put notification trigger: the AWS preprocessing Lambda function is triggered when a new file is added to the S3 raw data bucket.
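The sketch below outlines the GCP preprocessing function, assuming a Cloud Storage finalize-style trigger (an audit-log/Eventarc trigger would receive a CloudEvent payload instead). Bucket names and the flattening logic are assumptions, since the exact schema depends on the Football API response; writing Parquet with pandas requires pyarrow.

```python
# Sketch of the preprocessing Cloud Function (Cloud Storage finalize trigger).
# PREPROCESSED_BUCKET and the JSON layout are assumptions for illustration.
import json

import pandas as pd
from google.cloud import storage

PREPROCESSED_BUCKET = "football-preprocessed-data"  # assumed bucket name


def preprocess_raw_file(event, context):
    """Triggered when a new .json file lands in the raw data bucket."""
    client = storage.Client()
    raw_blob = client.bucket(event["bucket"]).blob(event["name"])
    payload = json.loads(raw_blob.download_as_text())

    # Flatten the nested JSON response into a tabular frame.
    df = pd.json_normalize(payload.get("response", []))

    # Write the Parquet output to the preprocessed bucket under the same key.
    out_name = event["name"].rsplit(".", 1)[0] + ".parquet"
    tmp_path = f"/tmp/{out_name.replace('/', '_')}"
    df.to_parquet(tmp_path, index=False)
    client.bucket(PREPROCESSED_BUCKET).blob(out_name).upload_from_filename(tmp_path)
```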
- GCP: Cloud Storage
- AWS: S3
- Preferred option for storing unstructured data; supports all kinds of file formats.
- Stages both the raw data and the preprocessed data.
- Robust SQL querying capabilities, ease of data transfer and integration.
- Scheduled transfer of data from the preprocessed bucket to BigQuery or Redshift (see the load-job sketch after this list).
- SQL queries are run as required to produce the analytics data.
- Easy to connect third-party visualization tools such as Tableau.
- Glue Service: For Data Cataloging and ETL processes
- Athena: leveraging the AWS Glue Data Catalog, Athena enables querying data from various sources seamlessly. Used for ad-hoc querying.
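As a sketch of the scheduled transfer and ad-hoc querying on the GCP side, the snippet below loads the preprocessed Parquet files into BigQuery and runs a sample analytics query; the project, dataset, table, and column names are illustrative assumptions.

```python
# Sketch: load preprocessed Parquet files from Cloud Storage into BigQuery,
# then run an ad-hoc analytics query. Names below are assumed, not real.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.football.fixtures"                        # assumed project.dataset.table
uri = "gs://football-preprocessed-data/fixtures/*.parquet"       # assumed bucket/prefix

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to finish

# Example analytics query (column names are illustrative).
query = """
    SELECT league_name, COUNT(*) AS matches
    FROM `my-project.football.fixtures`
    GROUP BY league_name
    ORDER BY matches DESC
"""
for row in client.query(query).result():
    print(row.league_name, row.matches)
```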
- Tools: Looker, Tableau
- Looker is directly connected to BigQuery.
- Tableau is connected to BigQuery directly through google account authentication.
- Tableau is connected to Redshift or Athena through JDBC or ODBC Driver to access the data.
- GCP: Vertex AI
- AWS: SageMaker
- Connects directly to the data warehouse to get the updated data, perform predictive modeling, and deploy the machine learning model at scale (a minimal sketch follows).
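A minimal sketch of the warehouse-to-model flow is shown below: it pulls features from BigQuery into a DataFrame and fits a baseline classifier with scikit-learn. The query, feature columns, and model choice are illustrative assumptions; in the project this step runs on Vertex AI or SageMaker for training and deployment at scale.

```python
# Sketch: fetch training data from the warehouse and fit a baseline
# match-outcome model. Table and column names are assumed for illustration.
from google.cloud import bigquery
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

client = bigquery.Client()
query = """
    SELECT home_goals_avg, away_goals_avg, home_rank, away_rank, outcome
    FROM `my-project.football.match_features`   -- assumed feature table
"""
df = client.query(query).to_dataframe()

# Split features and label, then hold out a test set for a quick sanity check.
X = df.drop(columns=["outcome"])
y = df["outcome"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```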
Tableau Dashboard: Click Here
Looker Dashboard: Click Here