Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-121] Proposal for Metrics Comparison base on the Temporal Dimension or/and Textual Dimension #27544

Closed
zhaoyongjie opened this issue Mar 17, 2024 · 7 comments
Labels
sip Superset Improvement Proposal

Comments

@zhaoyongjie
Copy link
Member

zhaoyongjie commented Mar 17, 2024

Background

The period-over-period analysis is a BI technique that compares metrics from a recent period to the same period in the past. We already have this feature in Advanced Analytics and apply it to time-series analytic visualization. This feature utilizes time_offsets and post_processing in the query_object to achieve period-over-period analysis. Practically, this feature, along with the entire Advanced Analytics section, should have supported all visualizations because the legacy Time Section has been removed and the Time Filter has already moved into Adhoc Filter.

As an experiment, as long as the time dimension is provided and corresponding time filter populate in Adhoc Filter, any charts should easily incorporate this feature. I've opened a toy pull request to demonstrate that only 5 frontend lines (excluding import statements) are needed to integrate this feature into the current Table chart.

image

Motivation

Currently, there are some limitations in time_offsets and Advanced Analytics:

  1. Theoretically, Period-over-period analysis is a method of analyzing different data slices base on specific dimensions within the same dataset. Therefore, the comparisons should not only be based on temporal dimensions, but also on textual dimensions.

  2. According to the first perspective, flexable filters should be provided when users compare data slices, not just time filters.

  3. By default, time_offset uses the first occurrence of a time column in query_object.columns as the basis for time shifted. It cannot specify which time column to use for generating shifted time range because there can be multiple time filters in current Adhoc Filter.

  4. The actual shifted time-range values cannot be displayed on the UI and there isn't an approach to provide it.

Based on the above analysis, I propose to enhance/refactoring the time_offsets field to achieve a more generalized period-over-period analysis.

Proposed Changes

To illustrate what needs to change, I will construct a time comparison using the current time_offsets field and the cleaned_sales_data dataset. This query_object will generate a time comparison based on the order_date, comparing data slices from 1 week ago, 1 month ago, and 1 year ago :

{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "time_offsets": [
    "1 week ago",
    "1 month ago",
    "1 year ago"
  ],
}

After the change, the time_offsets will replace with comparison field:

{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "comparisons": {
    "on": ["order_date"],
    "how": "left",
    "suffix": ["1 week ago", "1 month ago", "1 year ago"],
    "time_filter_offsets": [
      "1 week ago",
      "1 month ago",
      "1 year ago"
    ],
    "filters": [
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
      [
        {
          "col": "deal_size",
          "op": "==",
          "val": "Small"
        }
      ],
    ],
  },
}

on: specifies which columns will be used to join the data slices.
how: determines how the data slices will be joined, providing options "inner" and "left," with the default value being "left"
suffix : defines suffixes for compared metrics
time_filter_offsets: specifies the time offset delta if performing a time comparison; if performing a textual dimension comparison, it should be ignored
filters: indicates the filters applied on the specific data slice

The following query_object payload show that how to generate a textual dimension comparison

{
  "columns": [{
    "timeGrain": "P1D",
    "columnType": "BASE_AXIS",
    "sqlExpression": "order_date",
    "label": "order_date",
    "expressionType": "SQL"
  }],
  "metrics": ["count"],
  "filters": [
    {
      "col": "order_date",
      "op": "TEMPORAL_RANGE",
      "val": "2004-01-01 : 2004-01-31"
    },
    {
      "col": "deal_size",
      "op": "==",
      "val": "Small"
    }
  ],
  "comparisons": {
    "on": ["deal_size"],
    "how": "left",
    "suffix": ["Medium", "Large"],
    "filters": [
      [
        {
          "col": "order_date",
          "op": "TEMPORAL_RANGE",
          "val": "2004-01-01 : 2004-01-31"
        },
        {
          "col": "deal_size",
          "op": "==",
          "val": "Medium"
        }
      ],
      [
        {
          "col": "order_date",
          "op": "TEMPORAL_RANGE",
          "val": "2004-01-01 : 2004-01-31"
        },
        {
          "col": "deal_size",
          "op": "==",
          "val": "Large"
        }
      ],
    ],
  },
}

New or Changed Public Interfaces

New dependencies

No new dependencies

Migration Plan and Compatibility

Provide DB migration script

Rejected Alternatives

@zhaoyongjie zhaoyongjie added the sip Superset Improvement Proposal label Mar 17, 2024
@zhaoyongjie zhaoyongjie changed the title [SIP-WIP] Proposal for Metrics Comparison base on the Temporal Dimension or/and Textual Dimension in all visualization [SIP-WIP] Proposal for Metrics Comparison base on the Temporal Dimension or/and Textual Dimension Mar 17, 2024
@eschutho
Copy link
Member

eschutho commented Mar 21, 2024

@zhaoyongjie this looks great, and I think it should solve what we're trying to build. Do you have an example of what the resulting sql query would look like from this api schema?

@zhaoyongjie zhaoyongjie changed the title [SIP-WIP] Proposal for Metrics Comparison base on the Temporal Dimension or/and Textual Dimension [SIP-121] Proposal for Metrics Comparison base on the Temporal Dimension or/and Textual Dimension Mar 22, 2024
@Antonio-RiveroMartnez
Copy link
Member

Antonio-RiveroMartnez commented Mar 25, 2024

Hey @zhaoyongjie Thanks for putting this together, it looks really promising. On top of the question from @eschutho I would like to know if we should consider different databases capabilities like CTE and JOINS support when comparing and having a fallback as part of this new processing? Just to handle as mentioned before, some cases where some of those are not supported.

Thanks.

@zhaoyongjie
Copy link
Member Author

Hey @zhaoyongjie Thanks for putting this together, it looks really promising. On top of the question from @eschutho I would like to know if we should consider different databases capabilities like CTE and JOINS support when comparing and having a fallback as part of this new processing? Just to handle as mentioned before, there are some cases where some of those are not supported.

Also, how datasets that use for example Jinja templates would be compared and processed with the Joins and the offsets? Should we consider cases where we simply don't support comparison and offer the aforementioned fallback methods? or simply ignore them.

Thanks.

I'll reply the topic tonight, thanks for the mention.

@zhaoyongjie
Copy link
Member Author

zhaoyongjie commented Mar 26, 2024

Do you have an example of what the resulting sql query would look like from this api schema?

I would like to know if we should consider different databases capabilities like CTE and JOINS support when comparing and having a fallback as part of this new processing? Just to handle as mentioned before, some cases where some of those are not supported.

@eschutho @Antonio-RiveroMartnez
The SQL will be the same as the time comparison in the previous AA panel, consisting of multiple separate SQL queries. I have also been thinking about how to design a JOIN QueryObject to achieve this use case and replace the entire Pandas Join. However, I am stuck on how to generate an appropriate time expression that contains time delta, as the time delta expression varies in different databases.

In practice, the JOIN in Superset has been discussed many times in the community, but the original technical design doesn't seem to have considered this use case. The data modeling in Superset depends on SQLLab exposing a "virtual dataset" to achieve it.

@betodealmeida
Copy link
Member

However, I am stuck on how to generate an appropriate time expression that contains time delta, as the time delta expression varies in different databases.

@zhaoyongjie can you give more context on why you'd need to compute a time delta?

@zhaoyongjie
Copy link
Member Author

@betodealmeida

If the different data slices would be filtered in datasource, we have to filter the time filter + time delta in the datasource. different database have different time delta expression.

image

@cannmenus-vib
Copy link

This is incredibly useful and something I'm surprised isn't built in to superset's tables already. Thanks for putting this together!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sip Superset Improvement Proposal
Projects
Status: Denied / Closed / Discarded
Development

No branches or pull requests

5 participants