feat(examples): Modernize example data loading with Parquet and YAML configs #36538

rusackas · 2025-12-11T16:49:25Z

Summary

This PR modernizes the Superset example data loading system by migrating to a Parquet-based approach with YAML configuration files, organized by dashboard for better developer experience.

Key Changes

New Directory Structure by Dashboard

- Each dashboard is self-contained in its own directory
- Data and configuration co-located for easy maintenance
- Shared configs in _shared/ directory
- Support for both simple (single dataset) and complex (multiple datasets) examples

superset/examples/
├── _shared/
│   ├── database.yaml      # Database connection config
│   └── metadata.yaml      # Import metadata
├── birth_names/           # Simple example (single dataset)
│   ├── data.parquet
│   ├── dataset.yaml
│   ├── dashboard.yaml
│   └── charts/
├── deck_gl/               # Complex example (multiple datasets)
│   ├── dashboard.yaml
│   ├── charts/
│   ├── datasets/          # Multiple dataset configs
│   │   ├── long_lat.yaml
│   │   ├── flights.yaml
│   │   ├── bart_lines.yaml
│   │   └── sf_population_polygons.yaml
│   └── data/              # Multiple parquet files
│       ├── long_lat.parquet
│       ├── flights.parquet
│       ├── bart_lines.parquet
│       └── sf_population_polygons.parquet
└── ... (11 example dashboards total)
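
As a rough illustration of how a loader could walk this layout, here is a minimal sketch. The function name `discover_examples` and the constant `SHARED_DIR` are hypothetical and not taken from the PR; the only assumptions drawn from the tree above are that simple examples keep a `data.parquet` at the top of their directory, complex ones keep files under `data/`, and `_shared/` is not itself an example.

```python
from pathlib import Path

SHARED_DIR = "_shared"  # holds shared configs; not an example (name from the tree above)


def discover_examples(root: Path) -> dict[str, list[Path]]:
    """Map each example directory name to the Parquet files found in it.

    Hypothetical helper: the PR's real discovery code may differ.
    """
    found: dict[str, list[Path]] = {}
    for child in sorted(root.iterdir()):
        if not child.is_dir() or child.name == SHARED_DIR:
            continue
        files: list[Path] = []
        # Simple layout: <example>/data.parquet
        single = child / "data.parquet"
        if single.is_file():
            files.append(single)
        # Complex layout: <example>/data/*.parquet
        data_dir = child / "data"
        if data_dir.is_dir():
            files.extend(sorted(data_dir.glob("*.parquet")))
        if files:
            found[child.name] = files
    return found
```

Directories without any Parquet file are skipped in this sketch; whether the real loader treats config-only directories differently is not stated here.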

Migrated to Parquet Storage Format

- Converted all example datasets to compressed Parquet files (Snappy compression)
- Reduced total data size from 79MB to 58MB (27% smaller)
- Parquet is an Apache project - ideal fit for ASF codebase

Auto-Discovery System

- Just drop a data.parquet file in a new directory to add an example
- YAML configs are auto-discovered and imported
- No Python code changes needed to add new examples

Generic Loading System

- Implemented load_parquet_table() for unified data loading
- Removed dataset-specific Python modules (birth_names.py, flights.py, energy.py, etc.)
- Added robust error handling with directory traversal prevention

Export as Example Feature ✨ NEW

- Added "Export as Example" option to the Dashboard header Download menu
- Exports any dashboard in the new Parquet + YAML format
- Makes it easy for developers to create new examples from existing dashboards
- Includes CLI command: superset export-example --dashboard-id <id> --name <name> --output-dir <dir>
- API endpoint: GET /api/v1/dashboard/<pk>/export_as_example/
- Protected by the same export permission as the regular YAML export

Showtime/Ephemeral Environment Support 🐤 NEW

- Removed dependency on pre-built examples.duckdb file from external repo
- Showtime environments now load examples directly from Parquet files at runtime
- Removed LOAD_EXAMPLES_DUCKDB build argument from Dockerfile
- Examples stay automatically in sync with YAML configs
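
The directory traversal prevention mentioned under Generic Loading System could look roughly like the sketch below. This is not the PR's `load_parquet_table()` (its signature is not shown here); `resolve_parquet_path` is a hypothetical helper that only illustrates the general technique of resolving a path and checking that it stays inside the examples root.

```python
from pathlib import Path


def resolve_parquet_path(
    examples_root: Path, example: str, filename: str = "data.parquet"
) -> Path:
    """Return a Parquet path guaranteed to live under examples_root.

    Hypothetical helper illustrating directory traversal prevention.
    """
    root = examples_root.resolve()
    # resolve() collapses ".." segments and symlinks before the containment check.
    candidate = (root / example / filename).resolve()
    if not candidate.is_relative_to(root):  # Python 3.9+
        raise ValueError(f"path escapes examples directory: {example}/{filename}")
    if candidate.suffix != ".parquet":
        raise ValueError(f"not a Parquet file: {candidate.name}")
    return candidate
```

A validated path like this could then be handed to a reader such as `pandas.read_parquet`; whether the PR does exactly that is an assumption.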

Export as Example UI

The "Export as Example" option appears in the Download submenu alongside "Export YAML":

- Export YAML: standard export format for import/export workflows
- Export as Example: new format with Parquet data for the examples system
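
For scripted use, the same export is reachable through the endpoint listed under Key Changes. The sketch below only builds the request; `build_export_request` is a hypothetical helper, and the bearer-token header assumes the usual Superset REST API authentication. The response format is not described in this PR text, so nothing is assumed about it.

```python
from urllib.parse import urljoin
from urllib.request import Request


def build_export_request(base_url: str, dashboard_id: int, token: str) -> Request:
    """Build (but do not send) a GET request for the export_as_example endpoint.

    Hypothetical helper; the endpoint path is the one named in the PR description.
    """
    url = urljoin(
        base_url.rstrip("/") + "/",
        f"api/v1/dashboard/{dashboard_id}/export_as_example/",
    )
    # Assumption: the caller already holds a valid access token.
    return Request(url, method="GET", headers={"Authorization": f"Bearer {token}"})
```

Passing the result to `urllib.request.urlopen` would send it; the caller needs the same export permission as for the regular YAML export.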

Example Dashboards (10 total)

| Dashboard | Datasets | Description |
|---|---|---|
| birth_names | 1 | Baby names data |
| cleaned_sales_data | 1 | Sales data for Featured Charts |
| deck_gl | 4 | Geospatial visualizations (long_lat, flights, bart_lines, sf_population_polygons) |
| fcc_2018_survey | 1 | FCC survey data |
| featured_charts | 2 | SQL virtual datasets (hierarchical_dataset, project_management) |
| misc_charts | 2 | Miscellaneous chart types (birth_france_by_region, energy_usage) |
| slack_dashboard | 11 | Slack analytics (channels, users, messages, etc.) |
| video_game_sales | 1 | Video game sales data |
| wb_health_population | 1 | World Bank health & population data |

Why Parquet?

- Apache-friendly: Parquet is an Apache project, making it ideal for ASF codebases
- Compressed: Built-in Snappy compression reduces storage by ~27%
- Widely supported: Compatible with pandas, pyarrow, DuckDB, Spark, and many other tools
- Self-describing: Schema is embedded in the file
- Industry standard: De facto standard for columnar data storage

Benefits

- Better DevEx: Examples grouped by dashboard, data and configs together
- Smaller footprint: 27% reduction in example data size
- Maintainability: YAML configs are easier to update than Python code
- Consistency: Single source of truth for example data across tests and production
- Security: Added validation to prevent directory traversal
- Extensibility: Easy to add new examples by dropping in a directory
- Easy contribution: "Export as Example" lets anyone create properly-formatted examples

Testing

- All Python unit and integration tests pass
- Cypress tests updated to use dynamic ID lookups instead of hardcoded IDs
- Some Cypress tests temporarily skipped (see Next Steps below)
- Playwright E2E tests added for Export as Example functionality

Cypress Test Status

| Test | Status | Notes |
|---|---|---|
| `box_plot.test.js` | ✅ Fixed | Dynamic dataset lookup |
| `bubble.test.js` | ✅ Fixed | Dynamic dataset lookup |
| `nativeFilters.test.ts` | ✅ Fixed | Dynamic dataset/chart lookup |
| `filter.test.ts` (chart_list) | ✅ Fixed | Datasets & dashboards exist in YAML |
| `_skip.tabs.test.ts` | ⏸️ Skipped | Needs charts not in YAML (Treemap, Box plot, etc.) |
| `_skip.AdhocMetrics.test.ts` | ⏸️ Skipped | Needs "Num Births Trend" chart |
| `_skip.advanced_analytics.test.ts` | ⏸️ Skipped | Needs "Num Births Trend" chart |
| `_skip.link.test.ts` | ⏸️ Skipped | Needs "Growth Rate" chart |
| `_skip.annotations.test.ts` | ⏸️ Skipped | Was already skipped |

Breaking Changes

None for end users. The superset load-examples command works exactly as before.

For developers:

- Python modules like superset.examples.birth_names are removed
- Test fixtures now use the config-based loading system
- Example data moved from superset/examples/data/ to superset/examples/{name}/data.parquet
- Docker: LOAD_EXAMPLES_DUCKDB build argument removed - examples are loaded from Parquet at runtime

Next Steps (Follow-up PRs)

The following items are out of scope for this PR but should be addressed in follow-up work:

1. Add Missing Charts to YAML Configs

The skipped Cypress tests depend on specific charts that were created by the old Python code but aren't in the YAML configs yet:

- Treemap - needed for tabs.test.ts
- Box plot - needed for tabs.test.ts
- Growth Rate - needed for tabs.test.ts, link.test.ts
- Num Births Trend - needed for AdhocMetrics.test.ts, advanced_analytics.test.ts
- Number of Girls - needed for tabs.test.ts
- Names Sorted by Num in California - needed for tabs.test.ts

2. Convert Tabbed Dashboard to YAML

The tabbed_dashboard.py creates a special dashboard for testing tab navigation. This should be converted to YAML format with all required charts.

3. Apply Dynamic ID Lookup Pattern to More Tests

The pattern introduced in this PR (getDatasetId(), getChartId()) can be applied to other tests that may have hardcoded IDs, making them more resilient to changes in example data.
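
The idea behind the lookup helpers is to resolve an object by name at test time rather than trusting a fixed ID. The PR's `getDatasetId()` and `getChartId()` are Cypress helpers; the Python sketch below shows the same idea for backend tests or scripts. `find_id_by_name` is a hypothetical function, and the payload shape (a `result` list of rows with an `id` key) is an assumption about the list-endpoint response.

```python
from typing import Any


def find_id_by_name(payload: dict[str, Any], name_field: str, name: str) -> int:
    """Return the id of the single row whose name_field equals name.

    Hypothetical helper; the payload is assumed to look like
    {"result": [{"id": 1, "table_name": "birth_names"}, ...]}.
    """
    matches = [
        row["id"]
        for row in payload.get("result", [])
        if row.get(name_field) == name
    ]
    if len(matches) != 1:
        # Fail loudly on missing or ambiguous names instead of guessing.
        raise LookupError(f"expected exactly one {name_field}={name!r}, found {len(matches)}")
    return matches[0]
```

Failing on zero or multiple matches keeps a test from silently running against the wrong dataset when example data changes.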

4. Remove Remaining Python Example Loaders

A few Python modules remain for backwards compatibility (birth_names.py, world_bank.py). Once the Cypress tests are fully migrated, these can be removed.

5. Deprecate Pre-built examples.duckdb

The pre-built examples.duckdb file in the apache-superset/examples-data repo is no longer used. It can be removed or marked as deprecated in a follow-up.

superset/cli/examples.py

superset/commands/importers/v1/utils.py

superset/examples/generic_loader.py

superset/examples/helpers.py


codeant-ai-for-open-source · 2026-01-21T18:25:00Z

CodeAnt AI is running Incremental review


github-actions · 2026-01-21T18:25:43Z

🎪 Showtime is building environment on GHA for d6f4b6f

codeant-ai-for-open-source · 2026-01-21T18:27:17Z

CodeAnt AI Incremental review completed.

betodealmeida

Yes!

My suggestion for future improvements: let's put all virtual datasets in per-DB directories, and use them depending on the examples DB:

datasets/virtual/postgres/
datasets/virtual/mysql/

We can have only Postgres initially, and add support for more over time.
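
One way this suggestion could work is sketched below, purely as an illustration of a proposal that is not implemented in this PR. The helper names, the alias table, and the fallback to `postgres` are all assumptions; only the `datasets/virtual/<db>/` layout comes from the comment above.

```python
from pathlib import Path

# Assumption: SQLAlchemy-style backend names are mapped onto directory names.
BACKEND_ALIASES = {"postgresql": "postgres"}


def backend_name(sqlalchemy_uri: str) -> str:
    """Extract the backend from a URI such as 'postgresql+psycopg2://...'."""
    scheme = sqlalchemy_uri.split("://", 1)[0]
    backend = scheme.split("+", 1)[0].lower()
    return BACKEND_ALIASES.get(backend, backend)


def virtual_dataset_dir(
    datasets_root: Path, sqlalchemy_uri: str, fallback: str = "postgres"
) -> Path:
    """Pick datasets/virtual/<backend>/, falling back when none exists."""
    base = datasets_root / "virtual"
    preferred = base / backend_name(sqlalchemy_uri)
    return preferred if preferred.is_dir() else base / fallback
```

Starting with only a `postgres/` directory, as suggested, every other backend would resolve to the fallback until its own directory is added.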

sfirke · 2026-01-22T16:25:31Z

I love the decision to go with Parquet instead of DuckDB for this use case. Nicely done!

The modernized example loading (apache#36538) routes through import_database() which checks PREVENT_UNSAFE_DB_CONNECTIONS. This blocks SQLite examples URIs in environments where this safety flag is enabled. Skip the check when ignore_permissions=True, since system imports (like examples) use URIs from server config, not untrusted user input. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
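
The decision described in that commit message can be sketched as follows. This is not Superset's actual code: `validate_uri` and `UNSAFE_BACKENDS` are hypothetical names, and the two boolean parameters stand in for the `PREVENT_UNSAFE_DB_CONNECTIONS` setting and the `ignore_permissions` argument mentioned above.

```python
# Assumption for illustration: SQLite is the backend treated as unsafe.
UNSAFE_BACKENDS = {"sqlite"}


def validate_uri(
    uri: str, *, prevent_unsafe: bool, ignore_permissions: bool = False
) -> None:
    """Raise ValueError for unsafe URIs unless the import is a trusted system import."""
    # System imports (such as examples) take their URI from server config,
    # so the safety check is skipped for them.
    if ignore_permissions or not prevent_unsafe:
        return
    backend = uri.split("://", 1)[0].split("+", 1)[0].lower()
    if backend in UNSAFE_BACKENDS:
        raise ValueError(f"{backend} connections are not allowed")
```

User-supplied URIs still go through the check in this sketch, because ordinary imports would leave `ignore_permissions` at its default of `False`.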

github-actions bot added the doc Namespace | Anything related to documentation label Dec 11, 2025

pull-request-size bot added the size/XXL label Dec 11, 2025

dosubot bot added the doc:examples Related to example datasets and dashboards label Dec 11, 2025

codeant-ai-for-open-source bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label Dec 11, 2025

github-actions bot added the preset-io label Dec 11, 2025

codeant-ai-for-open-source bot reviewed Dec 11, 2025


rusackas force-pushed the revamped-example-loading branch from fd5b47e to be5e9f1 Compare December 12, 2025 23:12

rusackas requested review from eschutho, geido, kgabryje, michael-s-molina and villebro as code owners December 12, 2025 23:12

rusackas force-pushed the revamped-example-loading branch 3 times, most recently from 2131e36 to cd9cf76 Compare December 16, 2025 22:30

rusackas changed the title ~~feat(examples): Revamp example data loading with DuckDB and fix chart issues~~ feat(examples): Modernize example data loading with Parquet and YAML configs Dec 16, 2025

apache deleted a comment from codeant-ai-for-open-source bot Dec 17, 2025

apache deleted a comment from bito-code-review bot Dec 17, 2025

apache deleted a comment from codeant-ai-for-open-source bot Dec 17, 2025

rusackas requested a review from betodealmeida December 17, 2025 06:11

rusackas added the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Dec 17, 2025

github-actions bot removed the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Dec 17, 2025

Merge remote-tracking branch 'origin/master' into revamped-example-lo…

d6f4b6f

…ading

github-actions bot added 🎪 d6f4b6f 🚦 building Environment d6f4b6f status: building 🎪 d6f4b6f 📅 2026-01-21T18-25 Environment d6f4b6f created at 2026-01-21T18-25 🎪 d6f4b6f 🤡 rusackas Environment d6f4b6f requested by rusackas labels Jan 21, 2026

betodealmeida approved these changes Jan 21, 2026


rusackas merged commit dee063a into master Jan 21, 2026
85 of 91 checks passed

rusackas deleted the revamped-example-loading branch January 21, 2026 20:42

rusackas mentioned this pull request Jan 21, 2026

feat(examples): Transpile virtual dataset SQL on import #37311

Merged

6 tasks

dpgaspar mentioned this pull request Jan 30, 2026

fix(examples): skip URI safety check for system imports #37577

Draft

6 tasks

5 participants
