A comprehensive, self-paced bootcamp delivered as interactive Marimo notebooks covering modern Python fundamentals, local data stack, and distributed computing.
This bootcamp provides a structured pathway for staff and developers at the University of Idaho OIT to acquire modern data science skills. The content spans local development (Polars and DuckDB) through distributed computing (PySpark and Databricks).
- Module 0: Environment & Tooling - Set up your development environment with Python 3.14+, UV, and modern tooling
- Module 1: Modern Python - Learn Python 3.14+ patterns, type hints, and professional coding practices
- Module 2: Local Data Stack - Master Polars and DuckDB for efficient local data analysis
- Module 3: Data Acquisition - Collect data from APIs, web scraping, and various sources
- Module 4: Data Cleaning - Clean and validate messy data with Pydantic and Pandera
- Module 5: Feature Engineering - Build reproducible data pipelines and engineer features
- Module 6: Visualization - Create effective visualizations to communicate insights
- Module 7: Machine Learning - Build classification models with scikit-learn
- Module 8: Databricks - Scale up to distributed computing with PySpark and Databricks
- Python 3.14 or later
- UV package manager installed
- Git
- Clone the repository:

  ```bash
  git clone https://github.com/ncolesummers/data-engineering-bootcamp.git
  cd data-engineering-bootcamp
  ```

- Set up the environment:

  ```bash
  uv sync
  ```

- Verify the installation:

  ```bash
  python -c "from bootcamp import __version__; print(__version__)"
  ```

  You should see `0.1.0` printed to the console.
Notebooks are delivered using Marimo, a reactive Python notebook platform.
To run a notebook:
```bash
marimo edit notebooks/module_00_environment/00_01_welcome.py
```

Repository layout:

```
data-engineering-bootcamp/
├── pyproject.toml        # Project configuration and dependencies
├── uv.lock               # Locked dependencies for reproducibility
├── README.md             # This file
├── docs/                 # Documentation and guides
│   ├── 01-curriculum-architecture.md
│   ├── 02-prd.md
│   ├── 03-premortem.md
│   └── 04-backlog-structure.md
├── src/
│   └── bootcamp/         # Python package for shared code
│       ├── datasets/     # Sample and synthetic datasets
│       ├── solutions/    # Exercise solutions
│       └── utils/        # Shared utilities
├── notebooks/            # Interactive Marimo notebooks
│   ├── module_00_environment/
│   ├── module_01_modern_python/
│   ├── module_02_local_data_stack/
│   ├── module_03_data_acquisition/
│   ├── module_04_data_cleaning/
│   ├── module_05_feature_engineering/
│   ├── module_06_visualization/
│   ├── module_07_machine_learning/
│   └── module_08_databricks/
└── tests/                # Test files
```
Start with Module 0: Environment & Tooling to set up your development environment and get familiar with the tools used throughout the bootcamp.
Each module builds on previous modules, so it's recommended to progress sequentially. However, experienced learners can skip ahead using the prerequisite information in each notebook's metadata.
- Product Requirements Document - Detailed specifications and requirements
- Curriculum Architecture - Learning objectives and structure
- Backlog Structure - GitHub issues and development tracking
This bootcamp is under active development. Contributions are welcome!
- Use the Epic template for major deliverable proposals
- Use the User Story template for individual tasks or features
See the GitHub Issues for current development status.
- Runtime: Python 3.14+
- Package Manager: UV
- Notebooks: Marimo
- DataFrames: Polars
- SQL Engine: DuckDB
- Validation: Pydantic, Pandera
- ML: scikit-learn
- Distributed: PySpark (Databricks)
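To illustrate the validation layer of the stack, here is a minimal Pydantic sketch in the spirit of Module 4's data cleaning; the `Enrollment` model and its fields are hypothetical examples, not part of the bootcamp package:

```python
from pydantic import BaseModel, ValidationError, field_validator


class Enrollment(BaseModel):
    """A hypothetical record illustrating schema validation."""

    student_id: int
    credits: int

    @field_validator("credits")
    @classmethod
    def credits_in_range(cls, v: int) -> int:
        # Reject obviously bad values before they enter a pipeline
        if not 0 < v <= 21:
            raise ValueError("credits must be between 1 and 21")
        return v


# A valid record parses cleanly; numeric strings are coerced to ints
ok = Enrollment(student_id="1001", credits="15")

# An invalid record raises ValidationError with a readable report
try:
    Enrollment(student_id=1002, credits=40)
except ValidationError as e:
    print(e)
```

Pandera plays a similar role at the DataFrame level, validating whole columns rather than individual records.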
MIT License - See LICENSE file for details.
Nathan Summers - nsummers72@gmail.com
Project Link: https://github.com/ncolesummers/data-engineering-bootcamp