You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This proposal introduces changes to Apache Superset to better support semantic layers and external data modeling systems. The changes include (1) the definition of an “Explorable” interface (a Python protocol), and (2) the introduction of a new class of connections for semantic layers and similar systems.
Motivation
Semantic layers are a powerful way to expose well curated metric and related dimensions to users, allowing for an improved user experience when presenting data: instead of focusing on datasets, semantic layers in general operate on a higher level of abstraction, exposing curated metrics as first class citizens. Once a user has selected one or more metrics that they're interested in, most modern semantic layers allow them to slice and dice metrics by presenting associated dimensions, automatically performing joins between the underlying data sources. This workflow, where metrics are curated and their dimensions are freely available, allows users to focus on the metrics that matter to them, while providing confidence that the underlying data is correct and well defined.
Because Superset is fundamentally dataset-centric, integrations with semantic layers have been timid so far. When they exist, they usually represent the semantic layer as pseudo-database, exposing one or more pseudo-datasets that represent models in the semantic layer. Cube, for example, uses the Postgres wire protocol to expose cubes as pseudo-tables that can be queried in Superset. Minerva, an inhouse semantic layer from AirBnB, uses a different approach, exposing all metrics and dimensions as a single dataset, with custom overrides to indicate to the users which metrics and dimensions are compatible. Other experimental integrations (with MetricFlow, Snowflake, and DataJunction) used a similar approach, though they haven’t been contributed to OSS yet.
There are a couple limitations in Superset's architecture that create friction when integrating with semantic layers and modern data platforms:
Datasets are the Superset Semantic Layer
In order to explore data in Superset we need some kind of semantic layer that informs us which columns are available, which ones can be filtered, which ones are groupable, etc. In Superset, that semantic layer is the dataset, a thin abstraction that allows users to define metrics, declare derived columns, and add metadata that informs the UI and prevents expensive queries from running (group bys on high cardinality columns, for example).
Because the dataset is the native semantic layer, adding an external source from a semantic layer as a dataset to Superset is unlikely to work as expected, since we’re adding a second semantic layer on top of the first one. For example, the external semantic layer might not allow for adhoc metrics or computed columns, making it incompatible with the Superset dataset editor, as well as making the experience in Explore unintuitive and error-prone.
For these “semantic datasets” (or “semantic models”) we likely want to disable the Superset semantic layer, since the metadata is already defined externally:
Users shouldn't be able to add metrics nor calculated columns when editing them, unless the semantic layer supports adhoc expressions.
Similar to the "Sync columns from source", metrics should have a button to "Sync metrics from source", to fetch new or updated metrics from the semantic layer.
When explored, users should not be allowed to add adhoc metrics, derived columns, or use "Custom SQL", unless the semantic layer supports adhoc expressions.
When explored, not all metrics and dimensions might be compatible, requiring a mechanism for disabling metrics/columns as the user selects metrics/columns, similar to the inhouse approach used by Minerva.
This suggests that these semantic models should not be represented as standard datasets in Superset, given that in order to make it work we need to remove all the value that datasets provide — the semantics.
Query generation
The current flow for chart creation is tightly coupled not only with SQL but with SQLAlchemy. The frontend sends a request to the backend, via the QueryObject, indicating which columns, metrics, filters, and other parameters the user has selected. The backend then generates a SQLAlchemy query from this payload. The query generation is highly dependent on the specific database engine, so the get_sqla_query method inspects several attributes of the DB engine spec. Once this query is generated, it is transpiled to the target dialect using SQLAlchemy, and passed to the DB engine spec for execution.
This approach works reasonably well for traditional databases, but it creates friction when integrating with semantic layers that do not use SQL or SQLAlchemy. For example, if a semantic layer uses GraphQL or REST APIs, the current flow requires parsing the generated SQL and building a custom request. This is the case for the experimental MetricFlow DB engine spec, which has this flow:
The SQL parsing step is necessary for Snowflake because Explore builds the query on top of either a table or a subquery, but in Snowflake it should be a UDTF (user defined table function):
-- SQL generated by ExploreSELECT"Item.Brand", "Store.State", "StoreSales.TotalSalesQuantity"FROM pseudo_table
GROUP BY"Item.Brand", "Store.State";
-- final SQL after parsing and manipulating the ASTSELECT*FROM SEMANTIC_VIEW(
TPCDS_SEMANTIC_VIEW_SM
DIMENSIONS Item.Brand, Store.State
METRICS StoreSales.TotalSalesQuantity
);
This is not only inefficient and brittle but also limits the flexibility of how queries can be executed against different data sources.
Proposed Change
In order to properly support semantic layers we need to move away from the current solutions based on pseudo-databases and custom DB engine specs. Instead, we should implement first class support for semantic layers, bypassing the need for using a dataset when exploring data. This will allow users to choose between the semantics provided by Superset datasets, or an external system.
The first change will be the introduction of an Explorable interface, defined as a Python protocol. This interface is in some ways similar to the existing ExploreMixin, that was added when we decided to support Query objects in explore. The problem with the current ExploreMixin is that it’s too tightly-coupled with SQLAlchemy and datasets, and has no clear separation between the functionality needed for data exploration vs. query generation.
The new Explorable, on the other hand, is concerned only with chart building:
Note the get_query_result method, which essentially returns a Pandas dataframe from a QueryObject. This allows exploring data from sources that are not SQL based, as well as decoupling the SQL generating from SQLAlchemy. This gives a much simpler flow:
In addition to the Explorable we would also add models for a new class of connections for semantic layers. This would be similar to existing database connections, but with a few key differences:
There will be no concept of a SQLAlchemy URI. The connection info should have a well defined schema, which could vary wildly between different semantic layers. For example, Malloy could point to a GitHub repository and a database; Snowflake would require parameters to build a SQLAlchemy engine, just like today; and MetricFlow would require an API key and an optional customer URL. This is similar to how some DB engine specs today use the BasicParametersMixin for a better experience when defining the connection parameters.
The association between a given instance of a semantic layer and its implementation will be explicit. Today, there’s only an implicit mapping between a Database instance and the engine spec that should be used with it; we infer the DB engine spec based on the scheme of the SQLAlchemy URI, which has caused problems in the past, especially because early implementations used only the backend name, and not the driver. In this work, new semantic layers will be added as extensions, following the design from [SIP-151] The vision for a new Superset Plugins Architecture #31932.
An important note is that this change would not be very different from the initial versions of Superset, where we had different connectors for Druid (via REST and JSON, before its SQL API was introduced) and SQLAlchemy. Having different connectors offered some challenges, mostly when defining adhoc metrics (since for Druid the user would have to provide JSON). I hope that we can avoid these problems by ensuring a consistent interface in Explore that works across all types of Explorable, while still allowing for flexibility in how queries are executed.
Finally, we would also need a lightweight model for explorables, containing metadata about them: UUID, name, parent connection, timeout, default metric when exploring.
The introduction of semantic layers should open up new interesting workflows. A deployment of Superset could have a traditional connection to Snowflake, allowing powers users to run SQL directly, as well as defining semantic views in SQL Lab. These semantic views could then be exposed to other users via the semantic layer connection, providing a curated collection of metrics and dimensions that non power users would have access to. Both connections to Snowflake, via DB engine spec and via semantic layer, would have their purposes and target audiences.
Implementation Plan
Phase 0: implement the Explorable protocol
This will be done regardless of this SIP, since it adds value to Superset regardless of semantic layers.
Phase 1: introduce semantic layers
Add feature flag for enabling semantic layer support.
Add new models
CRUDIE (create, read, update, delete, import, export) for semantic layers.
DAR for semantic layers and semantic models.
Allow users to explore semantic models.
Phase 2: UI/UX Updates
Update Explore to support semantic models:
No adhoc metrics, derived columns, or custom SQL when not supported.
Implement reactive UI for metric/dimension compatibility matrix.
All of these phases require considerable work, and should be done with constant feedback from the community when it comes to terminology, UI, and UX.
Alternative Approaches Considered
We considered using the plugin architecture for semantic layers. While potentially valuable, the proposed approach provides the necessary flexibility without the complexity of a full plugin system, especially when taking in consideration the tight coupling between Superset and SQL/SQLAlchemy.
SIP: Semantic Layer Support in Apache Superset
Abstract
This proposal introduces changes to Apache Superset to better support semantic layers and external data modeling systems. The changes include (1) the definition of an “Explorable” interface (a Python protocol), and (2) the introduction of a new class of connections for semantic layers and similar systems.
Motivation
Semantic layers are a powerful way to expose well curated metric and related dimensions to users, allowing for an improved user experience when presenting data: instead of focusing on datasets, semantic layers in general operate on a higher level of abstraction, exposing curated metrics as first class citizens. Once a user has selected one or more metrics that they're interested in, most modern semantic layers allow them to slice and dice metrics by presenting associated dimensions, automatically performing joins between the underlying data sources. This workflow, where metrics are curated and their dimensions are freely available, allows users to focus on the metrics that matter to them, while providing confidence that the underlying data is correct and well defined.
Because Superset is fundamentally dataset-centric, integrations with semantic layers have been timid so far. When they exist, they usually represent the semantic layer as pseudo-database, exposing one or more pseudo-datasets that represent models in the semantic layer. Cube, for example, uses the Postgres wire protocol to expose cubes as pseudo-tables that can be queried in Superset. Minerva, an inhouse semantic layer from AirBnB, uses a different approach, exposing all metrics and dimensions as a single dataset, with custom overrides to indicate to the users which metrics and dimensions are compatible. Other experimental integrations (with MetricFlow, Snowflake, and DataJunction) used a similar approach, though they haven’t been contributed to OSS yet.
There are a couple limitations in Superset's architecture that create friction when integrating with semantic layers and modern data platforms:
Datasets are the Superset Semantic Layer
In order to explore data in Superset we need some kind of semantic layer that informs us which columns are available, which ones can be filtered, which ones are groupable, etc. In Superset, that semantic layer is the dataset, a thin abstraction that allows users to define metrics, declare derived columns, and add metadata that informs the UI and prevents expensive queries from running (group bys on high cardinality columns, for example).
Because the dataset is the native semantic layer, adding an external source from a semantic layer as a dataset to Superset is unlikely to work as expected, since we’re adding a second semantic layer on top of the first one. For example, the external semantic layer might not allow for adhoc metrics or computed columns, making it incompatible with the Superset dataset editor, as well as making the experience in Explore unintuitive and error-prone.
For these “semantic datasets” (or “semantic models”) we likely want to disable the Superset semantic layer, since the metadata is already defined externally:
This suggests that these semantic models should not be represented as standard datasets in Superset, given that in order to make it work we need to remove all the value that datasets provide — the semantics.
Query generation
The current flow for chart creation is tightly coupled not only with SQL but with SQLAlchemy. The frontend sends a request to the backend, via the
QueryObject, indicating which columns, metrics, filters, and other parameters the user has selected. The backend then generates a SQLAlchemy query from this payload. The query generation is highly dependent on the specific database engine, so theget_sqla_querymethod inspects several attributes of the DB engine spec. Once this query is generated, it is transpiled to the target dialect using SQLAlchemy, and passed to the DB engine spec for execution.This approach works reasonably well for traditional databases, but it creates friction when integrating with semantic layers that do not use SQL or SQLAlchemy. For example, if a semantic layer uses GraphQL or REST APIs, the current flow requires parsing the generated SQL and building a custom request. This is the case for the experimental MetricFlow DB engine spec, which has this flow:
For the new Snowflake semantic layer, even though it exposes a SQL interface to semantic views, the flow looks like this:
The SQL parsing step is necessary for Snowflake because Explore builds the query on top of either a table or a subquery, but in Snowflake it should be a UDTF (user defined table function):
This is not only inefficient and brittle but also limits the flexibility of how queries can be executed against different data sources.
Proposed Change
In order to properly support semantic layers we need to move away from the current solutions based on pseudo-databases and custom DB engine specs. Instead, we should implement first class support for semantic layers, bypassing the need for using a dataset when exploring data. This will allow users to choose between the semantics provided by Superset datasets, or an external system.
The first change will be the introduction of an
Explorableinterface, defined as a Python protocol. This interface is in some ways similar to the existingExploreMixin, that was added when we decided to supportQueryobjects in explore. The problem with the currentExploreMixinis that it’s too tightly-coupled with SQLAlchemy and datasets, and has no clear separation between the functionality needed for data exploration vs. query generation.The new
Explorable, on the other hand, is concerned only with chart building:Note the
get_query_resultmethod, which essentially returns a Pandas dataframe from aQueryObject. This allows exploring data from sources that are not SQL based, as well as decoupling the SQL generating from SQLAlchemy. This gives a much simpler flow:In addition to the
Explorablewe would also add models for a new class of connections for semantic layers. This would be similar to existing database connections, but with a few key differences:BasicParametersMixinfor a better experience when defining the connection parameters.Databaseinstance and the engine spec that should be used with it; we infer the DB engine spec based on the scheme of the SQLAlchemy URI, which has caused problems in the past, especially because early implementations used only the backend name, and not the driver. In this work, new semantic layers will be added as extensions, following the design from [SIP-151] The vision for a new Superset Plugins Architecture #31932.An important note is that this change would not be very different from the initial versions of Superset, where we had different connectors for Druid (via REST and JSON, before its SQL API was introduced) and SQLAlchemy. Having different connectors offered some challenges, mostly when defining adhoc metrics (since for Druid the user would have to provide JSON). I hope that we can avoid these problems by ensuring a consistent interface in Explore that works across all types of
Explorable, while still allowing for flexibility in how queries are executed.Finally, we would also need a lightweight model for explorables, containing metadata about them: UUID, name, parent connection, timeout, default metric when exploring.
The introduction of semantic layers should open up new interesting workflows. A deployment of Superset could have a traditional connection to Snowflake, allowing powers users to run SQL directly, as well as defining semantic views in SQL Lab. These semantic views could then be exposed to other users via the semantic layer connection, providing a curated collection of metrics and dimensions that non power users would have access to. Both connections to Snowflake, via DB engine spec and via semantic layer, would have their purposes and target audiences.
Implementation Plan
Phase 0: implement the
ExplorableprotocolPhase 1: introduce semantic layers
Phase 2: UI/UX Updates
All of these phases require considerable work, and should be done with constant feedback from the community when it comes to terminology, UI, and UX.
Alternative Approaches Considered
We considered using the plugin architecture for semantic layers. While potentially valuable, the proposed approach provides the necessary flexibility without the complexity of a full plugin system, especially when taking in consideration the tight coupling between Superset and SQL/SQLAlchemy.