
An end‑to‑end clickstream analytics pipeline for e‑commerce, from ingestion and processing to warehouse analysis and dashboarding


kkh1902/User-Behavior-Analytics-Pipeline


Overview

This project builds a data engineering pipeline that analyzes e-commerce clickstream data to derive business insights. It collects, processes, and analyzes event logs to understand purchase conversion patterns, product interest, and user journeys, supporting marketing and UX improvements.

Problems Addressed

  • Identify user purchase journeys and drop-off points
  • Analyze view/cart/purchase conversion rates by product
  • Detect time-based traffic patterns and peak hours
  • Track user interest by category/brand
  • Analyze session-based behavior (session counts, durations, and paths)

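As a concrete illustration of the per-product conversion analysis above, the sketch below counts view/cart/purchase events per product and derives funnel rates. The field names (`product_id`, `event_type`) and event values are assumptions for illustration, not necessarily the project's actual schema:

```python
from collections import defaultdict

def funnel_by_product(events):
    """Count view/cart/purchase events per product and derive conversion rates.

    `events` is an iterable of dicts with `product_id` and `event_type`
    keys (field names are illustrative assumptions).
    """
    counts = defaultdict(lambda: {"view": 0, "cart": 0, "purchase": 0})
    for e in events:
        if e["event_type"] in counts[e["product_id"]]:
            counts[e["product_id"]][e["event_type"]] += 1
    rates = {}
    for pid, c in counts.items():
        rates[pid] = {
            "view_to_cart": c["cart"] / c["view"] if c["view"] else 0.0,
            "cart_to_purchase": c["purchase"] / c["cart"] if c["cart"] else 0.0,
        }
    return rates

events = [
    {"product_id": "p1", "event_type": "view"},
    {"product_id": "p1", "event_type": "view"},
    {"product_id": "p1", "event_type": "cart"},
    {"product_id": "p1", "event_type": "purchase"},
]
print(funnel_by_product(events)["p1"])  # {'view_to_cart': 0.5, 'cart_to_purchase': 1.0}
```

In the pipeline itself this aggregation would run in Spark or BigQuery over the full dataset; the pure-Python version just makes the funnel logic explicit.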
📌 Project Goals

Build an end-to-end pipeline for clickstream analytics using modern cloud and open-source tools.

✅ Technical Goals

  • Terraform: Provision GCP resources automatically
  • Kestra: Orchestrate data ingestion workflows
  • Apache Spark: Process and transform large datasets
  • BigQuery: Load and analyze warehouse data
  • Looker Studio: Visualize dashboards

Tech Stack

| Area            | Technology                  |
| --------------- | --------------------------- |
| Cloud           | Google Cloud Platform (GCP) |
| Infrastructure  | Terraform                   |
| Orchestration   | Kestra                      |
| Data Processing | Apache Spark                |
| Data Warehouse  | BigQuery                    |
| Storage         | Google Cloud Storage        |
| Visualization   | Looker Studio               |

Analysis Summary (BigQuery)

Based on BigQuery/clickstream_analysis.sql:

  • Periodic (hourly) event-type distribution: event_type_by_hour
  • Intended visualizations: donut and line charts showing hourly event ratios and trends

For details, see BigQuery/README.md.
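The `event_type_by_hour` aggregation groups events by hour of day and event type. A minimal Python equivalent of that grouping (the `(timestamp, event_type)` record layout is an assumption for illustration; the real query runs in BigQuery):

```python
from collections import Counter
from datetime import datetime

def event_type_by_hour(events):
    """Tally (hour, event_type) pairs, mirroring an hourly event-type distribution.

    `events` is an iterable of (iso_timestamp, event_type) tuples;
    this record layout is an illustrative assumption.
    """
    return Counter(
        (datetime.fromisoformat(ts).hour, etype) for ts, etype in events
    )

sample = [
    ("2019-11-01T09:15:00", "view"),
    ("2019-11-01T09:40:00", "cart"),
    ("2019-11-01T21:05:00", "view"),
]
print(event_type_by_hour(sample))
# Counter({(9, 'view'): 1, (9, 'cart'): 1, (21, 'view'): 1})
```

In SQL this corresponds to grouping on `EXTRACT(HOUR FROM event_time)` and `event_type` with a `COUNT(*)`.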

Folder Structure

clickstream-pipeline/
├── data/        # Raw data
├── spark/       # Spark jobs
├── kestra/      # Workflow definitions
├── BigQuery/    # SQL analysis and docs
└── terraform/   # Infrastructure code

How to Run

1) Provision Infrastructure (Terraform)

cd terraform
terraform init
terraform plan
terraform apply
  • The service account key path is set to cred/clickstream-sa.json in terraform/main.tf.
  • Adjust variables in terraform/variables.tf or via a *.tfvars file.
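A *.tfvars override might look like the sketch below. The variable names and values are illustrative assumptions; check terraform/variables.tf for the actual variable names:

```hcl
# terraform.tfvars -- illustrative values only
project     = "my-gcp-project-id"
region      = "us-central1"
credentials = "cred/clickstream-sa.json"
```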

2) Configure and Run Kestra

  • Register KV/Secret values and upload flows (see kestra/README.md).
  • Enable flows in the Kestra UI and run with inputs as needed.

3) Run Spark Processing

  • Follow the execution steps in spark/README.md.

4) Load and Analyze in BigQuery

  • Load Parquet outputs into BigQuery and run queries.
  • See BigQuery/README.md for details.

5) Visualization (Looker Studio)

  • Connect BigQuery tables to Looker Studio and build dashboards.

Architecture

(Architecture diagram)

Data Visualization with Looker Studio

(Looker Studio dashboard screenshot)

Why Not (Yet)

  • Why dbt is not used: With only one analytics table, adding dbt felt unnecessary. It can be added later in dbt/ if modeling expands.
  • Why Spark jobs are not run via Kestra: There wasn’t enough reliable reference material to build a stable Kestra–Spark integration in the current environment. Spark orchestration is planned for a future Airflow project.
