The Data Engineering Zoomcamp teaches the core concepts, tools, and hands-on skills required for modern data engineering. It covers a broad range of topics, including containerization, infrastructure as code, and advanced batch and stream processing, and takes a practical, project-based approach: participants learn the theory and then apply it by building real-world data pipelines.
- Docker: Containerization platform for building, shipping, and running applications.
- Terraform: Infrastructure as code tool for building, changing, and versioning infrastructure.
- Google BigQuery: Serverless, highly scalable, and cost-effective multi-cloud data warehouse.
- dbt (data build tool): Analytics engineering tool for defining, running, and testing SQL transformations inside the data warehouse.
- Apache Spark: Open-source distributed computing system for big data processing.
- Apache Kafka: Distributed event streaming platform for building real-time data pipelines and streaming applications.
- Kestra: Flexible and scalable workflow orchestration and automation tool.
- PostgreSQL: Powerful open-source relational database system.
- Google Data Studio (now Looker Studio): Data visualization and reporting tool for turning data into dashboards and reports.
- Metabase: Open-source business intelligence and analytics tool for easy data visualization and exploration.
- GCP
- Docker and docker-compose
- Running Postgres locally with Docker
- Setting up infrastructure on GCP with Terraform
- Preparing the environment
- Data Lake
- Workflow orchestration
- Workflow orchestration with Kestra
- Reading from APIs
- Building scalable pipelines
- Normalizing data
- Incremental loading
- Data Warehouse
- BigQuery
- Partitioning and clustering
- BigQuery best practices
- Internals of BigQuery
- BigQuery Machine Learning
- Basics of analytics engineering
- dbt (data build tool)
- BigQuery and dbt
- Postgres and dbt
- dbt models
- Testing and documenting
- Deployment to the cloud and locally
- Visualizing the data with Google Data Studio and Metabase
- Batch processing
- What is Spark
- Spark Dataframes
- Spark SQL
- Internals: GroupBy and joins
- Introduction to Kafka
- Schemas (Avro)
- Kafka Streams
- Kafka Connect and KSQL
- Weeks 1 and 2: working on your project
- Week 3: reviewing your peers' projects