This project builds a data engineering pipeline that analyzes e-commerce clickstream data to derive business insights. It collects, processes, and analyzes event logs to understand purchase conversion patterns, product interest, and user journeys, supporting marketing and UX improvements.
- Identify user purchase journeys and drop-off points
- Analyze view/cart/purchase conversion rates by product
- Detect time-based traffic patterns and peak hours
- Track user interest by category/brand
- Analyze session-based user behavior
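As a minimal sketch of the funnel analysis in the goals above (the event schema and field names here are assumptions for illustration, not the project's actual format), distinct users per funnel stage can be counted per product:

```python
from collections import defaultdict

# Hypothetical minimal event schema: (user_id, product_id, event_type),
# where event_type is one of "view", "cart", "purchase".
events = [
    ("u1", "p1", "view"), ("u1", "p1", "cart"), ("u1", "p1", "purchase"),
    ("u2", "p1", "view"),
    ("u3", "p1", "view"), ("u3", "p1", "cart"),
    ("u1", "p2", "view"), ("u2", "p2", "view"),
]

def funnel_by_product(events):
    """Count distinct users per funnel stage for each product and
    derive view->cart and cart->purchase conversion rates."""
    stages = defaultdict(lambda: {"view": set(), "cart": set(), "purchase": set()})
    for user, product, etype in events:
        stages[product][etype].add(user)
    report = {}
    for product, s in stages.items():
        views, carts, buys = len(s["view"]), len(s["cart"]), len(s["purchase"])
        report[product] = {
            "views": views,
            "view_to_cart": carts / views if views else 0.0,
            "cart_to_purchase": buys / carts if carts else 0.0,
        }
    return report

print(funnel_by_product(events))
```

In the actual pipeline this aggregation runs in Spark and BigQuery over the full event log; the logic is the same, only the engine differs.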
Build an end-to-end pipeline for clickstream analytics using modern cloud and open-source tools.
- Terraform: Provision GCP resources automatically
- Kestra: Orchestrate data ingestion workflows
- Apache Spark: Process and transform large datasets
- BigQuery: Store processed data and run warehouse analytics
- Looker Studio: Visualize dashboards
| Area | Technology |
|---|---|
| Cloud | Google Cloud Platform (GCP) |
| Infrastructure | Terraform |
| Orchestration | Kestra |
| Data Processing | Apache Spark |
| Data Warehouse | BigQuery |
| Storage | Google Cloud Storage |
| Visualization | Looker Studio |
Based on `BigQuery/clickstream_analysis.sql`:
- Hourly event-type distribution: `event_type_by_hour` (visualization: donut/line charts showing hourly event ratios and trends)
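The actual aggregation is defined in `BigQuery/clickstream_analysis.sql`; the same hour-of-day grouping can be sketched in plain Python (the timestamps and row schema below are assumptions for illustration):

```python
from collections import Counter
from datetime import datetime

# Hypothetical raw rows: (ISO timestamp, event_type).
rows = [
    ("2020-10-01T09:15:00", "view"),
    ("2020-10-01T09:40:00", "cart"),
    ("2020-10-01T10:05:00", "view"),
    ("2020-10-01T10:20:00", "view"),
    ("2020-10-01T10:45:00", "purchase"),
]

def event_type_by_hour(rows):
    """Count events per (hour-of-day, event_type) pair, mirroring the
    GROUP BY the SQL query performs in the warehouse."""
    return dict(Counter(
        (datetime.fromisoformat(ts).hour, etype) for ts, etype in rows
    ))

print(event_type_by_hour(rows))
```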
For details, see BigQuery/README.md.
```
clickstream-pipeline/
├── data/       # Raw data
├── spark/      # Spark jobs
├── kestra/     # Workflow definitions
└── terraform/  # Infrastructure code
```
```
cd terraform
terraform init
terraform plan
terraform apply
```
- The service account key path `cred/clickstream-sa.json` is set in `terraform/main.tf`.
- Adjust variables in `terraform/variables.tf` or with `*.tfvars` files.
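For orientation, below is a hypothetical fragment of the kind of resources `terraform/main.tf` provisions; the resource names and variables are illustrative assumptions, and the real configuration lives in `terraform/`:

```hcl
# Illustrative sketch only; not the project's actual main.tf.
provider "google" {
  project     = var.project_id
  region      = var.region
  credentials = file("cred/clickstream-sa.json")
}

# Raw-data bucket for clickstream event files (name is illustrative).
resource "google_storage_bucket" "raw_events" {
  name     = "${var.project_id}-clickstream-raw"
  location = var.region
}

# Warehouse dataset that the Spark outputs are loaded into.
resource "google_bigquery_dataset" "clickstream" {
  dataset_id = "clickstream"
  location   = var.region
}
```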
- Register KV/Secret values and upload flows (see `kestra/README.md`).
- Enable flows in the Kestra UI and run with inputs as needed.
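As a hedged sketch of what an ingestion flow can look like in Kestra: the task types below come from Kestra's core HTTP and GCP plugins, but the flow id, namespace, bucket, and input are illustrative assumptions, not the project's actual flows (those are in `kestra/`):

```yaml
id: ingest_clickstream        # illustrative flow id
namespace: demo.clickstream   # illustrative namespace

inputs:
  - id: source_url
    type: STRING

tasks:
  # Download the raw event file over HTTP.
  - id: download
    type: io.kestra.plugin.core.http.Download
    uri: "{{ inputs.source_url }}"

  # Upload it to the raw-data bucket (bucket name is a placeholder).
  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.download.uri }}"
    to: "gs://example-clickstream-raw/events.csv"
```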
- Follow the execution steps in `spark/README.md`.
- Load Parquet outputs into BigQuery and run queries.
- See `BigQuery/README.md` for details.
- Connect BigQuery tables to Looker Studio and build dashboards.
- Why dbt is not used: with only one analytics table, adding dbt felt unnecessary. It can be added later in `dbt/` if modeling expands.
- Why Spark jobs are not run via Kestra: there wasn't enough reliable reference material to build a stable Kestra–Spark integration in the current environment; Spark orchestration is planned for a future Airflow project.

