[SIP-182] Semantic Layer Support in Apache Superset

# SIP: Semantic Layer Support in Apache Superset

## Abstract

This proposal introduces changes to Apache Superset to better support semantic layers and external data modeling systems. The changes include (1) the definition of an “Explorable” interface (a Python protocol), and (2) the introduction of a new class of connections for semantic layers and similar systems.

## Motivation

Semantic layers are a powerful way to expose well curated metric and related dimensions to users, allowing for an improved user experience when presenting data: instead of focusing on datasets, semantic layers in general operate on a higher level of abstraction, exposing curated **metrics** as first class citizens. Once a user has selected one or more metrics that they're interested in, most modern semantic layers allow them to slice and dice metrics by presenting associated **dimensions**, automatically performing joins between the underlying data sources. This workflow, where metrics are curated and their dimensions are freely available, allows users to focus on the metrics that matter to them, while providing confidence that the underlying data is correct and well defined.

Because Superset is fundamentally dataset-centric, integrations with semantic layers have been timid so far. When they exist, **they usually represent the semantic layer as pseudo-database**, exposing one or more **pseudo-datasets** that represent models in the semantic layer. [Cube](https://cube.dev/), for example, uses the Postgres wire protocol to expose cubes as pseudo-tables that can be queried in Superset. Minerva, an inhouse semantic layer from AirBnB, uses a different approach, exposing all metrics and dimensions as a single dataset, [with custom overrides to indicate to the users which metrics and dimensions are compatible](https://medium.com/airbnb-engineering/supercharging-apache-superset-b1a2393278bd). Other experimental integrations (with MetricFlow, Snowflake, and DataJunction) used a similar approach, though they haven’t been contributed to OSS yet.

There are a couple limitations in Superset's architecture that create friction when integrating with semantic layers and modern data platforms:

### Datasets *are* the Superset Semantic Layer

In order to explore data in Superset we need some kind of semantic layer that informs us which columns are available, which ones can be filtered, which ones are groupable, etc. In Superset, that semantic layer is the **dataset**, a thin abstraction that allows users to define metrics, declare derived columns, and add metadata that informs the UI and prevents expensive queries from running (group bys on high cardinality columns, for example).

Because the dataset is the native semantic layer, adding an external source from a semantic layer as a dataset to Superset is unlikely to work as expected, since we’re adding a second semantic layer on top of the first one. For example, the external semantic layer might not allow for adhoc metrics or computed columns, making it incompatible with the Superset dataset editor, as well as making the experience in Explore unintuitive and error-prone.

For these “semantic datasets” (or “semantic models”) we likely want to disable the Superset semantic layer, since the metadata is already defined externally:

1. Users shouldn't be able to add metrics nor calculated columns when editing them, unless the semantic layer supports adhoc expressions.
2. Similar to the "Sync columns from source", metrics should have a button to "Sync metrics from source", to fetch new or updated metrics from the semantic layer.
3. When explored, users should not be allowed to add adhoc metrics, derived columns, or use "Custom SQL", unless the semantic layer supports adhoc expressions.
4. When explored, not all metrics and dimensions might be compatible, requiring a mechanism for disabling metrics/columns as the user selects metrics/columns, similar to the inhouse approach used by Minerva.

This suggests that these **semantic models should not be represented as standard datasets in Superset**, given that in order to make it work we need to remove all the value that datasets provide — the semantics.

### Query generation

The current flow for chart creation is tightly coupled not only with SQL but with SQLAlchemy. The frontend sends a request to the backend, via the `QueryObject`, indicating which columns, metrics, filters, and other parameters the user has selected. The backend then generates a SQLAlchemy query from this payload. The query generation is highly dependent on the specific database engine, so the `get_sqla_query` method inspects several attributes of the DB engine spec. Once this query is generated, it is transpiled to the target dialect using SQLAlchemy, and passed to the DB engine spec for execution.

This approach works reasonably well for traditional databases, but it creates friction when integrating with semantic layers that do not use SQL or SQLAlchemy. For example, if a semantic layer uses GraphQL or REST APIs, the current flow requires parsing the generated SQL and building a custom request. This is the case for the experimental MetricFlow DB engine spec, which has this flow:

```
Explore controls
       ↓
  QueryObject
       ↓
SQLAlchemy query
       ↓
   pseudo-SQL
       ↓
Shillelagh handler
       ↓
 GraphQL request
```

For the new Snowflake semantic layer, even though it exposes a SQL interface to semantic views, the flow looks like this:

```
Explore controls
       ↓
  QueryObject
       ↓
SQLAlchemy query
       ↓
   pseudo-SQL
       ↓
  sqlglot AST
       ↓
   actual SQL
       ↓
    execute
```

The SQL parsing step is necessary for Snowflake because Explore builds the query on top of either a table or a subquery, but in Snowflake it should be a UDTF (user defined table function):

```sql
-- SQL generated by Explore
SELECT "Item.Brand", "Store.State", "StoreSales.TotalSalesQuantity"
FROM pseudo_table
GROUP BY "Item.Brand", "Store.State";

-- final SQL after parsing and manipulating the AST
SELECT * FROM SEMANTIC_VIEW(
    TPCDS_SEMANTIC_VIEW_SM
    DIMENSIONS Item.Brand, Store.State
    METRICS StoreSales.TotalSalesQuantity
);
```

This is not only inefficient and brittle but also limits the flexibility of how queries can be executed against different data sources. 

## Proposed Change

In order to properly support semantic layers we need to move away from the current solutions based on pseudo-databases and custom DB engine specs. Instead, we should implement first class support for semantic layers, bypassing the need for using a dataset when exploring data. This will allow users to choose between the semantics provided by Superset datasets, or an external system.

The first change will be [the introduction of an `Explorable` interface](https://github.com/apache/superset/pull/36245), defined as a Python protocol. This interface is in some ways similar to the existing `ExploreMixin`, that was added when we decided to support `Query` objects in explore. The problem with the current `ExploreMixin` is that it’s too tightly-coupled with SQLAlchemy and datasets, and has no clear separation between the functionality needed for data exploration vs. query generation.

The new `Explorable`, on the other hand, is concerned only with chart building:

```python
@runtime_checkable
class Explorable(Protocol):

    # =========================================================================
    # Core Query Interface
    # =========================================================================

    def get_query_result(self, query_object: QueryObject) -> QueryResult:
    def get_query_str(self, query_obj: QueryObjectDict) -> str:

    # =========================================================================
    # Identity & Metadata
    # =========================================================================

    @property
    def uid(self) -> str:

    @property
    def type(self) -> str:

    @property
    def metrics(self) -> list[Any]:

    @property
    def columns(self) -> list[Any]:

    @property
    def column_names(self) -> list[str]:

    @property
    def data(self) -> ExplorableData:

    # =========================================================================
    # Caching
    # =========================================================================

    @property
    def cache_timeout(self) -> int | None:

    @property
    def changed_on(self) -> datetime | None:

    def get_extra_cache_keys(self, query_obj: QueryObjectDict) -> list[Hashable]:

    # =========================================================================
    # Security
    # =========================================================================

    @property
    def perm(self) -> str:

    # =========================================================================
    # Time/Date Handling
    # =========================================================================

    @property
    def offset(self) -> int:

    # =========================================================================
    # Time Granularity
    # =========================================================================

    def get_time_grains(self) -> list[TimeGrainDict]:

    # =========================================================================
    # Drilling
    # =========================================================================

    def has_drill_by_columns(self, column_names: list[str]) -> bool:

    # =========================================================================
    # Optional Properties
    # =========================================================================

    @property
    def is_rls_supported(self) -> bool:

    @property
    def query_language(self) -> str | None:
```

Note the `get_query_result` method, which essentially returns a Pandas dataframe from a `QueryObject`. This allows exploring data from sources that are not SQL based, as well as decoupling the SQL generating from SQLAlchemy. This gives a much simpler flow:

```
Explore controls
       ↓
  QueryObject
       ↓
  Explorable
       ↓
   DataFrame
```

In addition to the `Explorable` we would also **add models for a new class of connections for semantic layers**. This would be similar to existing database connections, but with a few key differences:

1. There will be no concept of a SQLAlchemy URI. The connection info should have a well defined schema, which could vary wildly between different semantic layers. For example, Malloy could point to a GitHub repository and a database; Snowflake would require parameters to build a SQLAlchemy engine, just like today; and MetricFlow would require an API key and an optional customer URL. This is similar to how some DB engine specs today use the `BasicParametersMixin` for a better experience when defining the connection parameters.
2. The association between a given instance of a semantic layer and its implementation will be explicit. Today, there’s only an implicit mapping between a `Database` instance and the engine spec that should be used with it; we infer the DB engine spec based on the scheme of the SQLAlchemy URI, which has caused problems in the past, especially because early implementations used only the backend name, and not the driver. In this work, new semantic layers will be added as extensions, following the design from https://github.com/apache/superset/issues/31932.

An important note is that this change would not be very different from the initial versions of Superset, where we had different connectors for Druid (via REST and JSON, before its SQL API was introduced) and SQLAlchemy. Having different connectors offered some challenges, mostly when defining adhoc metrics (since for Druid the user would have to provide JSON). I hope that we can avoid these problems by ensuring a consistent interface in Explore that works across all types of `Explorable`, while still allowing for flexibility in how queries are executed.

Finally, we would also need a lightweight model for explorables, containing metadata about them: UUID, name, parent connection, timeout, default metric when exploring.

The introduction of semantic layers should open up new interesting workflows. A deployment of Superset could have a traditional connection to Snowflake, allowing powers users to run SQL directly, as well as defining semantic views in SQL Lab. These semantic views could then be exposed to other users via the semantic layer connection, providing a curated collection of metrics and dimensions that non power users would have access to. Both connections to Snowflake, via DB engine spec and via semantic layer, would have their purposes and target audiences.

## Implementation Plan

### Phase 0: implement the `Explorable` protocol

- This will be done regardless of this SIP, since it adds value to Superset regardless of semantic layers.

### Phase 1: introduce semantic layers

- Add feature flag for enabling semantic layer support.
- Add new models
- CRUDIE (create, read, update, delete, import, export) for semantic layers.
- DAR for semantic layers and semantic models.
- Allow users to explore semantic models.

### Phase 2: UI/UX Updates

- Update Explore to support semantic models:
    - No adhoc metrics, derived columns, or custom SQL when not supported.
    - Implement reactive UI for metric/dimension compatibility matrix.

All of these phases require considerable work, and should be done with constant feedback from the community when it comes to terminology, UI, and UX.

## Alternative Approaches Considered

We considered using the plugin architecture for semantic layers. While potentially valuable, the proposed approach provides the necessary flexibility without the complexity of a full plugin system, especially when taking in consideration the tight coupling between Superset and SQL/SQLAlchemy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SIP-182] Semantic Layer Support in Apache Superset #35003

SIP: Semantic Layer Support in Apache Superset

Abstract

Motivation

Datasets are the Superset Semantic Layer

Query generation

Proposed Change

Implementation Plan

Phase 0: implement the `Explorable` protocol

Phase 1: introduce semantic layers

Phase 2: UI/UX Updates

Alternative Approaches Considered

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[SIP-182] Semantic Layer Support in Apache Superset #35003

Description

SIP: Semantic Layer Support in Apache Superset

Abstract

Motivation

Datasets are the Superset Semantic Layer

Query generation

Proposed Change

Implementation Plan

Phase 0: implement the Explorable protocol

Phase 1: introduce semantic layers

Phase 2: UI/UX Updates

Alternative Approaches Considered

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Phase 0: implement the `Explorable` protocol