The Data Engineering Zoomcamp covers essential skills in containerization, workflow orchestration, data warehousing, analytics engineering, batch, and streaming processing. It includes tools like Docker, Terraform, BigQuery, dbt, Spark, Kafka, Kestra, Postgres, Google Data Studio, and Metabase.

nathadriele/data-engineering-zoomcamp

Data Engineering Zoomcamp

The Data Engineering Zoomcamp teaches the essential concepts, tools, and hands-on skills required for modern data engineering. It covers a broad range of topics, including containerization, infrastructure as code, and batch and streaming processing, and takes a practical, project-based approach: participants learn the theory and then apply it by building real-world data pipelines.

Featured Tools and Technologies

  • Docker: Containerization platform for building, shipping, and running applications.
  • Terraform: Infrastructure as code tool for building, changing, and versioning infrastructure.
  • Google BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data warehouse.
  • dbt (data build tool): Analytics engineering tool for transforming data inside the warehouse with version-controlled SQL models.
  • Apache Spark: Open-source distributed computing system for big data processing.
  • Apache Kafka: Distributed event streaming platform for building real-time data pipelines and streaming applications.
  • Kestra: Flexible and scalable workflow orchestration and automation tool.
  • PostgreSQL: Powerful open-source relational database system.
  • Google Data Studio (now Looker Studio): Data visualization and reporting tool for turning data into dashboards and reports.
  • Metabase: Open-source business intelligence and analytics tool for easy data visualization and exploration.

Module 1: Containerization and Infrastructure as Code

  • GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment

Module 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Workflow orchestration with Kestra

Workshop 1: Data Ingestion

  • Reading from APIs
  • Building scalable pipelines
  • Normalising data
  • Incremental loading
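
As a rough illustration of the last two topics, the sketch below flattens nested records and loads only rows newer than a stored cursor. It is a minimal, standard-library example rather than the workshop's actual code: the `flatten` and `incremental_load` helpers, the `updated_at` cursor field, and the in-memory `destination` list are all assumptions made for illustration.

```python
from typing import Any, Iterable


def flatten(record: dict[str, Any], parent: str = "", sep: str = "__") -> dict[str, Any]:
    """Normalise a nested dict into a single-level dict with joined keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def incremental_load(
    source: Iterable[dict[str, Any]],
    destination: list[dict[str, Any]],
    state: dict[str, Any],
    cursor_field: str = "updated_at",  # hypothetical field name
) -> int:
    """Append only records newer than the stored cursor; return the number loaded."""
    last_seen = state.get("cursor")
    newest = last_seen
    loaded = 0
    for record in source:
        row = flatten(record)
        value = row[cursor_field]
        # Skip anything at or before the cursor saved by the previous run.
        if last_seen is not None and value <= last_seen:
            continue
        destination.append(row)
        loaded += 1
        if newest is None or value > newest:
            newest = value
    # Persist the high-water mark so the next run starts where this one ended.
    state["cursor"] = newest
    return loaded
```

Calling `incremental_load` twice with overlapping batches appends each record once, provided the cursor values (here ISO-formatted date strings) sort in chronological order.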

Module 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • BigQuery Machine Learning

Module 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

Module 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins
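
To give a feel for the GroupBy internals topic, here is a simplified pure-Python model of a two-stage aggregation: each partition is pre-aggregated locally, the partial results are shuffled by key, and the partials are merged per key. This is only a sketch of the general idea and does not use Spark itself; the function names, the `(key, amount)` row shape, and the `num_reducers` parameter are assumptions made for illustration.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Stage 1: sum amounts per key inside one partition (no data movement)."""
    totals = defaultdict(float)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)


def shuffle(partials, num_reducers):
    """Route each (key, partial_sum) pair to a reducer chosen by hashing the key."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for key, subtotal in partial.items():
            buckets[hash(key) % num_reducers].append((key, subtotal))
    return buckets


def final_aggregate(bucket):
    """Stage 2: merge the partial sums that arrived at one reducer."""
    totals = defaultdict(float)
    for key, subtotal in bucket:
        totals[key] += subtotal
    return dict(totals)


def group_by_sum(partitions, num_reducers=2):
    """Run both stages and collect the per-key totals into one dict."""
    partials = [partial_aggregate(p) for p in partitions]
    result = {}
    for bucket in shuffle(partials, num_reducers):
        # All partials for a given key land in the same bucket, so keys never collide here.
        result.update(final_aggregate(bucket))
    return result
```

Because the first stage collapses each partition to one row per key, only the partial sums cross the shuffle boundary, which is why pre-aggregation reduces the amount of data moved between workers.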

Module 6: Streaming

  • Introduction to Kafka
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

Project

  • Week 1 and 2: working on your project
  • Week 3: reviewing your peers

https://github.com/DataTalksClub/data-engineering-zoomcamp
