Collaborative workshop project: clean a movie dataset, explore genre trends, analyze profitability, and train a predictive ratings model.
- Full workshop specification: see docs/SPEC.md
- Environment/setup: concise instructions live in SETUP.md

uv sync
uv run python scripts/01_clean_data.py
uv run python scripts/02_analyze_genres.py
uv run python scripts/03_analyze_financials.py
uv run python scripts/04_build_model.py

The scripts are designed to run in order; each writes its outputs for the next step.
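
Because the scripts depend on each other's outputs, it can help to run them through a small wrapper that stops at the first failure. The following is a minimal sketch, not part of the workshop repository: `build_commands`, `run_pipeline`, and the `runner` hook are hypothetical names introduced here, and the sketch assumes `uv` is on the PATH and that it is run from the project root.

```python
import subprocess
from typing import Callable, List, Optional, Sequence

# Script order taken from the commands listed above.
SCRIPTS = [
    "scripts/01_clean_data.py",
    "scripts/02_analyze_genres.py",
    "scripts/03_analyze_financials.py",
    "scripts/04_build_model.py",
]


def build_commands(scripts: Sequence[str] = SCRIPTS) -> List[List[str]]:
    """Return one `uv run python <script>` command per script, in order."""
    return [["uv", "run", "python", script] for script in scripts]


def run_pipeline(
    scripts: Sequence[str] = SCRIPTS,
    runner: Optional[Callable[[List[str]], int]] = None,
) -> List[str]:
    """Run each script in order and stop at the first non-zero exit code.

    `runner` is a hypothetical hook that takes a command and returns its
    exit code. It defaults to subprocess, and can be swapped out so the
    ordering logic can be checked without executing anything.
    Returns the scripts that completed successfully.
    """
    if runner is None:
        def runner(cmd: List[str]) -> int:
            return subprocess.run(cmd).returncode

    completed: List[str] = []
    for script, cmd in zip(scripts, build_commands(scripts)):
        if runner(cmd) != 0:
            break  # later steps would read stale or missing outputs
        completed.append(script)
    return completed
```

Calling `run_pipeline()` with no arguments executes the four scripts for real; comparing the length of its return value with `len(SCRIPTS)` shows whether the whole chain finished.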

- 00_refresh_raw.py – optional refresh of the TMDB subset (data/movies_raw.csv).
- 01_clean_data.py – feature engineering -> results/movies_clean.csv.
- 02_analyze_genres.py – decade/genre area chart -> outputs/genres_by_decade.png.
- 03_analyze_financials.py – ROI & profitability summary -> outputs/roi_by_budget_category.png.
- 04_build_model.py – scikit-learn regression with cross-val + holdout metrics.

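
As an illustration of the ROI summary produced by the financial analysis step, here is a minimal standard-library sketch. It is not the code in 03_analyze_financials.py: the `budget` and `revenue` field names, the budget thresholds, and the choice of the median as the summary statistic are all assumptions made for this example. ROI is taken as (revenue - budget) / budget.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Mapping

# Hypothetical budget thresholds in USD; the project's real cut-offs
# are not stated in this README.
BUDGET_BINS = [(10_000_000, "low"), (50_000_000, "medium")]


def budget_category(budget: float) -> str:
    """Map a budget to 'low', 'medium', or 'high' using BUDGET_BINS."""
    for upper, label in BUDGET_BINS:
        if budget < upper:
            return label
    return "high"


def roi(budget: float, revenue: float) -> float:
    """Return on investment as a fraction: 1.0 means revenue is 2x budget."""
    return (revenue - budget) / budget


def median_roi_by_category(rows: Iterable[Mapping[str, float]]) -> Dict[str, float]:
    """Group rows by budget category and return the median ROI per group.

    Rows with a non-positive budget are skipped, since ROI is undefined
    for them. The 'budget' and 'revenue' keys are assumed field names.
    """
    grouped = defaultdict(list)
    for row in rows:
        budget, revenue = row["budget"], row["revenue"]
        if budget <= 0:
            continue
        grouped[budget_category(budget)].append(roi(budget, revenue))
    return {label: median(values) for label, values in grouped.items()}
```

The median is used here because a few runaway hits can dominate a mean; whether the actual script uses a mean, a median, or both is not stated in this README.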
- Clean dataset: results/movies_clean.csv
- Plots: outputs/genres_by_decade.png, outputs/roi_by_budget_category.png
- Model metrics: printed by scripts/04_build_model.py

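
After a full run, a quick check that the expected files exist can catch a step that failed silently. The sketch below is one possible way to do that, assuming it is run from the project root; `missing_outputs` is a hypothetical helper, not something shipped in scripts/ or tests/.

```python
from pathlib import Path
from typing import List, Sequence

# File outputs named in the list above; model metrics are only printed,
# so there is no file to check for them.
EXPECTED_OUTPUTS = [
    "results/movies_clean.csv",
    "outputs/genres_by_decade.png",
    "outputs/roi_by_budget_category.png",
]


def missing_outputs(
    root: str = ".", expected: Sequence[str] = EXPECTED_OUTPUTS
) -> List[str]:
    """Return the expected output paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]
```

An empty list from `missing_outputs()` means every file output is present; it says nothing about whether the contents are correct.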
movie-analysis/
├── README.md
├── SETUP.md
├── docs/
│ └── SPEC.md
├── data/
│ ├── README.md
│ └── movies_raw.csv
├── scripts/
├── outputs/
├── results/
└── tests/

- Commit after each script so teammates can re-run and review.
- Document notable findings (ROI shifts, genre insights, feature importances) in your PR or the shared report.
- Need more context? The spec in docs/SPEC.md covers roles, timeline, and stretch goals.