@rusackas rusackas commented Dec 11, 2025

Summary

This PR modernizes the Superset example data loading system by migrating to a Parquet-based approach with YAML configuration files, organized by dashboard for better developer experience.

Key Changes

  1. New Directory Structure by Dashboard

    • Each dashboard is self-contained in its own directory
    • Data and configuration co-located for easy maintenance
    • Shared configs in _shared/ directory
    • Support for both simple (single dataset) and complex (multiple datasets) examples
    superset/examples/
    ├── _shared/
    │   ├── database.yaml      # Database connection config
    │   └── metadata.yaml      # Import metadata
    ├── birth_names/           # Simple example (single dataset)
    │   ├── data.parquet
    │   ├── dataset.yaml
    │   ├── dashboard.yaml
    │   └── charts/
    ├── deck_gl/               # Complex example (multiple datasets)
    │   ├── dashboard.yaml
    │   ├── charts/
    │   ├── datasets/          # Multiple dataset configs
    │   │   ├── long_lat.yaml
    │   │   ├── flights.yaml
    │   │   ├── bart_lines.yaml
    │   │   └── sf_population_polygons.yaml
    │   └── data/              # Multiple parquet files
    │       ├── long_lat.parquet
    │       ├── flights.parquet
    │       ├── bart_lines.parquet
    │       └── sf_population_polygons.parquet
    └── ... (11 example dashboards total)
    
  2. Migrated to Parquet Storage Format

    • Converted all example datasets to compressed Parquet files (Snappy compression)
    • Reduced total data size from 79MB to 58MB (27% smaller)
    • Parquet is an Apache project - ideal fit for ASF codebase
  3. Auto-Discovery System

    • Just drop a data.parquet file in a new directory to add an example
    • YAML configs are auto-discovered and imported
    • No Python code changes needed to add new examples
  4. Generic Loading System

    • Implemented load_parquet_table() for unified data loading
    • Removed dataset-specific Python modules (birth_names.py, flights.py, energy.py, etc.)
    • Added robust error handling with directory traversal prevention
  5. Export as Example Feature ✨ NEW

    • Added "Export as Example" option to the Dashboard header Download menu
    • Exports any dashboard in the new Parquet + YAML format
    • Makes it easy for developers to create new examples from existing dashboards
    • Includes CLI command: superset export-example --dashboard-id <id> --name <name> --output-dir <dir>
    • API endpoint: GET /api/v1/dashboard/<pk>/export_as_example/
    • Protected by the same export permission as the regular YAML export
  6. Showtime/Ephemeral Environment Support 🐤 NEW

    • Removed dependency on pre-built examples.duckdb file from external repo
    • Showtime environments now load examples directly from Parquet files at runtime
    • Removed LOAD_EXAMPLES_DUCKDB build argument from Dockerfile
    • Examples stay automatically in sync with YAML configs
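
The directory layout and auto-discovery rules listed above could be sketched roughly as follows. This is not the PR's implementation: `ExampleDir` and `discover_examples` are hypothetical names, and the only rules assumed are the ones the description states (directories such as `_shared/` hold shared config, a simple example has a top-level `data.parquet`, and a complex example keeps its Parquet files under `data/`).

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ExampleDir:
    """One discovered example (hypothetical structure, for illustration only)."""

    name: str
    parquet_files: list[Path] = field(default_factory=list)
    yaml_files: list[Path] = field(default_factory=list)


def discover_examples(root: Path) -> list[ExampleDir]:
    """Scan `root` for example directories, skipping shared config directories."""
    found: list[ExampleDir] = []
    for child in sorted(root.iterdir()):
        # Skip plain files and shared directories such as _shared/
        if not child.is_dir() or child.name.startswith("_"):
            continue
        simple = child / "data.parquet"
        data_dir = child / "data"
        if simple.is_file():
            parquet = [simple]  # simple example: single dataset
        elif data_dir.is_dir():
            parquet = sorted(data_dir.glob("*.parquet"))  # complex example
        else:
            parquet = []
        if not parquet:
            continue  # a directory without Parquet data is not treated as an example
        yamls = sorted(child.rglob("*.yaml"))
        found.append(ExampleDir(child.name, parquet, yamls))
    return found
```

Under these assumptions, adding an example needs no code change: a new directory containing `data.parquet` is picked up on the next scan.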

Export as Example UI

The "Export as Example" option appears in the Download submenu alongside "Export YAML":

  • Export YAML - Standard export format for import/export workflows
  • Export as Example - New format with Parquet data for the examples system
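
For scripted use, the API endpoint from the summary (`GET /api/v1/dashboard/<pk>/export_as_example/`) could be called along these lines. This is a sketch only: the base URL, the bearer-token authentication, and the helper names `export_as_example_url` and `build_export_request` are assumptions, and the response format is not specified in this description.

```python
from urllib.request import Request


def export_as_example_url(base_url: str, dashboard_id: int) -> str:
    """Build the endpoint URL named in the PR description."""
    return f"{base_url.rstrip('/')}/api/v1/dashboard/{dashboard_id}/export_as_example/"


def build_export_request(base_url: str, dashboard_id: int, access_token: str) -> Request:
    """Prepare (but do not send) an authenticated GET request.

    Assumes the caller already holds an API access token; how it was
    obtained is outside the scope of this sketch.
    """
    return Request(
        export_as_example_url(base_url, dashboard_id),
        headers={"Authorization": f"Bearer {access_token}"},
        method="GET",
    )
```

The prepared request can be passed to `urllib.request.urlopen` and the response bytes written to disk. The caller needs the same export permission as for the regular YAML export.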

Example Dashboards (10 total)

| Dashboard | Datasets | Description |
|---|---|---|
| birth_names | 1 | Baby names data |
| cleaned_sales_data | 1 | Sales data for Featured Charts |
| deck_gl | 4 | Geospatial visualizations (long_lat, flights, bart_lines, sf_population_polygons) |
| fcc_2018_survey | 1 | FCC survey data |
| featured_charts | 2 | SQL virtual datasets (hierarchical_dataset, project_management) |
| misc_charts | 2 | Miscellaneous chart types (birth_france_by_region, energy_usage) |
| slack_dashboard | 11 | Slack analytics (channels, users, messages, etc.) |
| video_game_sales | 1 | Video game sales data |
| wb_health_population | 1 | World Bank health & population data |

Why Parquet?

  • Apache-friendly: Parquet is an Apache project, making it ideal for ASF codebases
  • Compressed: With Snappy compression, the example data shrank from 79MB to 58MB (~27%)
  • Widely supported: Compatible with pandas, pyarrow, DuckDB, Spark, and many other tools
  • Self-describing: Schema is embedded in the file
  • Industry standard: De facto standard for columnar data storage

Benefits

  • Better DevEx: Examples grouped by dashboard, data and configs together
  • Smaller footprint: 27% reduction in example data size
  • Maintainability: YAML configs are easier to update than Python code
  • Consistency: Single source of truth for example data across tests and production
  • Security: Added validation to prevent directory traversal
  • Extensibility: Easy to add new examples by dropping in a directory
  • Easy contribution: "Export as Example" lets anyone create properly-formatted examples
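
The directory traversal validation mentioned under Security is not shown in this description. One common way to do such a check, given here as a minimal sketch with a hypothetical helper name `safe_example_path`, is to resolve the requested path and confirm it still lies inside the examples root:

```python
from pathlib import Path


def safe_example_path(examples_root: Path, relative_name: str) -> Path:
    """Resolve `relative_name` under `examples_root`, rejecting escapes.

    Hypothetical helper: raises ValueError for inputs such as
    "../secrets.parquet" or absolute paths outside the root.
    """
    root = examples_root.resolve()
    candidate = (root / relative_name).resolve()
    try:
        # relative_to() raises ValueError when candidate is not under root
        candidate.relative_to(root)
    except ValueError:
        raise ValueError(f"Path escapes examples directory: {relative_name!r}") from None
    return candidate
```

Resolving both sides before comparing means `..` segments and symlinked roots are normalized first, so a plain string-prefix comparison is not relied on.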

Testing

  • All Python unit and integration tests pass
  • Cypress tests updated to use dynamic ID lookups instead of hardcoded IDs
  • Some Cypress tests temporarily skipped (see Next Steps below)
  • Playwright E2E tests added for Export as Example functionality

Cypress Test Status

| Test | Status | Notes |
|---|---|---|
| box_plot.test.js | ✅ Fixed | Dynamic dataset lookup |
| bubble.test.js | ✅ Fixed | Dynamic dataset lookup |
| nativeFilters.test.ts | ✅ Fixed | Dynamic dataset/chart lookup |
| filter.test.ts (chart_list) | ✅ Fixed | Datasets & dashboards exist in YAML |
| _skip.tabs.test.ts | ⏸️ Skipped | Needs charts not in YAML (Treemap, Box plot, etc.) |
| _skip.AdhocMetrics.test.ts | ⏸️ Skipped | Needs "Num Births Trend" chart |
| _skip.advanced_analytics.test.ts | ⏸️ Skipped | Needs "Num Births Trend" chart |
| _skip.link.test.ts | ⏸️ Skipped | Needs "Growth Rate" chart |
| _skip.annotations.test.ts | ⏸️ Skipped | Was already skipped |

Breaking Changes

None for end users. The superset load-examples command works exactly as before.

For developers:

  • Python modules like superset.examples.birth_names are removed
  • Test fixtures now use the config-based loading system
  • Example data moved from superset/examples/data/ to superset/examples/{name}/data.parquet
  • Docker: LOAD_EXAMPLES_DUCKDB build argument removed - examples are loaded from Parquet at runtime

Next Steps (Follow-up PRs)

The following items are out of scope for this PR but should be addressed in follow-up work:

1. Add Missing Charts to YAML Configs

The skipped Cypress tests depend on specific charts that were created by the old Python code but aren't in the YAML configs yet:

  • Treemap - needed for tabs.test.ts
  • Box plot - needed for tabs.test.ts
  • Growth Rate - needed for tabs.test.ts, link.test.ts
  • Num Births Trend - needed for AdhocMetrics.test.ts, advanced_analytics.test.ts
  • Number of Girls - needed for tabs.test.ts
  • Names Sorted by Num in California - needed for tabs.test.ts

2. Convert Tabbed Dashboard to YAML

The tabbed_dashboard.py creates a special dashboard for testing tab navigation. This should be converted to YAML format with all required charts.

3. Apply Dynamic ID Lookup Pattern to More Tests

The pattern introduced in this PR (getDatasetId(), getChartId()) can be applied to other tests that may have hardcoded IDs, making them more resilient to changes in example data.
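
The `getDatasetId()` and `getChartId()` helpers live in the Cypress test code and are not reproduced here. The underlying idea, looking a record up by name in a list response instead of hardcoding its ID, can be sketched in Python as below. The response shape (a `result` list of records with an `id` key) and field names such as `table_name` are assumptions for illustration, and `get_id_by_name` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Any


def get_id_by_name(list_response: dict[str, Any], name_field: str, name: str) -> int:
    """Return the ID of the single record whose `name_field` equals `name`.

    `list_response` is assumed to look like {"result": [{"id": ..., ...}, ...]}.
    """
    matches = [
        record["id"]
        for record in list_response.get("result", [])
        if record.get(name_field) == name
    ]
    if len(matches) != 1:
        # Fail loudly rather than silently picking an arbitrary record
        raise LookupError(
            f"expected exactly one {name_field}={name!r}, found {len(matches)}"
        )
    return matches[0]
```

A test would then call, for example, `get_id_by_name(response, "table_name", "birth_names")` and stay valid even if the example data is reloaded with different IDs.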

4. Remove Remaining Python Example Loaders

A few Python modules remain for backwards compatibility (birth_names.py, world_bank.py). Once the Cypress tests are fully migrated, these can be removed.

5. Deprecate Pre-built examples.duckdb

The pre-built examples.duckdb file in the apache-superset/examples-data repo is no longer used. It can be removed or marked as deprecated in a follow-up.

@github-actions github-actions bot added the doc Namespace | Anything related to documentation label Dec 11, 2025
@dosubot dosubot bot added the doc:examples Related to example datasets and dashboards label Dec 11, 2025
@codeant-ai-for-open-source codeant-ai-for-open-source bot added the size:XXL This PR changes 1000+ lines, ignoring generated files label Dec 11, 2025
@rusackas rusackas force-pushed the revamped-example-loading branch from fd5b47e to be5e9f1 Compare December 12, 2025 23:12
@rusackas rusackas force-pushed the revamped-example-loading branch 3 times, most recently from 2131e36 to cd9cf76 Compare December 16, 2025 22:30
@rusackas rusackas changed the title feat(examples): Revamp example data loading with DuckDB and fix chart issues feat(examples): Modernize example data loading with Parquet and YAML configs Dec 16, 2025
@apache apache deleted 10 comments from codeant-ai-for-open-source bot and 1 comment from bito-code-review bot Dec 17, 2025
@rusackas rusackas added the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Dec 17, 2025
@github-actions github-actions bot removed the 🎪 ⚡ showtime-trigger-start Create new ephemeral environment for this PR label Dec 17, 2025
@github-actions github-actions bot added the Showtime labels "dda62d3 status: deploying" and "dda62d3 status: failed", and removed "dda62d3 status: building" and "dda62d3 status: deploying" Jan 21, 2026
@codeant-ai-for-open-source

CodeAnt AI is running Incremental review



@github-actions github-actions bot added the Showtime labels "d6f4b6f status: building", "d6f4b6f created at 2026-01-21T18-25", and "d6f4b6f requested by rusackas" Jan 21, 2026
@github-actions

🎪 Showtime is building environment on GHA for d6f4b6f

@codeant-ai-for-open-source

CodeAnt AI Incremental review completed.

@github-actions github-actions bot added the Showtime labels "d6f4b6f status: deploying" and "d6f4b6f status: failed", and removed "d6f4b6f status: building" and "d6f4b6f status: deploying" Jan 21, 2026
@rusackas rusackas removed the Showtime labels for environments dda62d3 (status: failed, created at 2026-01-21T17-51, requested by sadpandajoe) and d6f4b6f (status: failed, created at 2026-01-21T18-25, requested by rusackas), the "48h" expiry label, and the review:draft label Jan 21, 2026

@betodealmeida betodealmeida left a comment

Yes!

My suggestion for future improvements: let's put all virtual datasets in per-DB directories, and use them depending on the examples DB:

  • datasets/virtual/postgres/
  • datasets/virtual/mysql/

We can have only Postgres initially, and add support for more over time.
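
As a rough illustration of this suggestion (not something implemented in the PR), the directory could be chosen from the backend part of the examples database URI. The helper name `virtual_dataset_dir` and the `postgresql` to `postgres` alias are assumptions made for this sketch.

```python
from pathlib import Path

# Map SQLAlchemy backend names to the directory names proposed above (assumed mapping)
BACKEND_ALIASES = {"postgresql": "postgres"}


def virtual_dataset_dir(
    examples_uri: str, base: Path = Path("datasets/virtual")
) -> Path:
    """Pick the per-DB directory for an examples database URI.

    "postgresql+psycopg2://..." -> backend "postgresql" -> datasets/virtual/postgres
    """
    scheme = examples_uri.split("://", 1)[0]
    backend = scheme.split("+", 1)[0]  # drop the driver suffix, if any
    return base / BACKEND_ALIASES.get(backend, backend)
```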

@rusackas rusackas merged commit dee063a into master Jan 21, 2026
85 of 91 checks passed
@rusackas rusackas deleted the revamped-example-loading branch January 21, 2026 20:42
@sfirke

sfirke commented Jan 22, 2026

I love the decision to go with Parquet instead of DuckDB for this use case. Nicely done!

dpgaspar added a commit to preset-io/superset that referenced this pull request Jan 30, 2026
The modernized example loading (apache#36538) routes through import_database()
which checks PREVENT_UNSAFE_DB_CONNECTIONS. This blocks SQLite examples
URIs in environments where this safety flag is enabled.

Skip the check when ignore_permissions=True, since system imports (like
examples) use URIs from server config, not untrusted user input.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
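
The logic described in this commit message can be summarized in a small sketch. It is a simplified model, not Superset's actual code: `is_uri_allowed` is a hypothetical function, and the check is reduced to the SQLite case the message mentions.

```python
def is_uri_allowed(
    sqlalchemy_uri: str,
    prevent_unsafe_db_connections: bool,
    ignore_permissions: bool,
) -> bool:
    """Decide whether a database URI may be imported (simplified model).

    System imports (ignore_permissions=True) use URIs from server config,
    so the safety check is skipped for them.
    """
    if ignore_permissions:
        return True
    backend = sqlalchemy_uri.split("://", 1)[0].split("+", 1)[0]
    # Only SQLite is modelled here; the real check may cover more cases
    if prevent_unsafe_db_connections and backend == "sqlite":
        return False
    return True
```

With the flag enabled, a user-supplied SQLite URI is still rejected, while the examples import, which passes `ignore_permissions=True`, goes through.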
Labels

  • api (Related to the REST API)
  • data (Namespace | Anything related to data, including databases configurations, datasets, etc.)
  • doc:examples (Related to example datasets and dashboards)
  • doc (Namespace | Anything related to documentation)
  • preset-io
  • size/XXL
  • size:XXL (This PR changes 1000+ lines, ignoring generated files)
