feat(advanced analysis): support MultiIndex column in post processing stage #19116
Codecov Report
@@            Coverage Diff             @@
##           master   #19116      +/-   ##
==========================================
- Coverage   66.65%   66.64%   -0.01%
==========================================
  Files        1672     1674       +2
  Lines       64611    64602       -9
  Branches     6505     6498       -7
==========================================
- Hits        43066    43057       -9
  Misses      19862    19862
  Partials     1683     1683
Really nice refactor! Code LGTM and makes the AA/post processing functionality more uniform. My main concern is the assumption that before we've run the pivot operation, the data is not indexed, but after the pivot operation the df is indexed. This could be slightly confusing for viz developers, as they need to be aware of this. But we can keep iterating on this later (maybe do a big breaking change on the post processing API on 3.0)
superset-frontend/packages/superset-ui-chart-controls/src/operators/flatOperator.ts
def flat(df: pd.DataFrame, reset_index: bool = True,) -> pd.DataFrame:
Just a thought (doesn't need to be done here): Should we introduce a separate operation index that sets the index without needing to do a full pivot along with aggregations on the metrics? Then we could perhaps call them set_index and flatten_index or something that clearly communicates that we're specifically changing the index.
This sounds like the stack and unstack functions in Pandas. I will add those in the future.
/testenv up
@jinghua-qa Ephemeral environment spinning up at http://18.237.75.154:8080. Credentials are
Thank you for the detailed test steps~ LGTM
I'm thinking that the |
Ephemeral environment shutdown and build artifacts deleted.
Summary
Superset uses pandas DataFrames to process query results. Currently, the DataFrames passed between operators only support 1-dimensional (flat) column labels, which has caused confusion in some calculations. This PR introduces MultiIndex DataFrames for operators, which simplifies those calculations.
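As a rough illustration of the MultiIndex column shape this refers to, here is a minimal pandas sketch. The column names (`city`, `max_temp`, `min_temp`) and the data are hypothetical and are not taken from this PR.

```python
import pandas as pd

# Hypothetical long-format query result: one row per (date, city) pair.
records = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02"]
        ),
        "city": ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "max_temp": [20.0, 28.0, 21.0, 29.0],
        "min_temp": [8.0, 19.0, 9.0, 20.0],
    }
)

# Pivoting several metrics on a dimension yields MultiIndex columns:
# level 0 holds the metric name, level 1 holds the dimension value.
pivoted = records.pivot_table(
    index="date",
    columns="city",
    values=["max_temp", "min_temp"],
    aggfunc="max",
)

print(pivoted.columns.nlevels)  # 2
print(pivoted[("max_temp", "Delhi")].tolist())  # [20.0, 21.0]
```

Each column is addressed by a tuple such as `("max_temp", "Delhi")`, which lets later steps work on one metric across all series without parsing string labels.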
Current PostProcessing
Typical PostProcessing operators are used in time-series charts when building queries, e.g.:
We face these challenges with the existing design:
- Operator order matters: rollingWindowOperator should come before timeCompareOperator.
- Adding an operator such as resample.py is difficult: we need to "flat" the multidimensional DataFrame to adapt it.
After Changed
The new operators support calculations on multidimensional DataFrames, so operators in QueryObject can change position at will, e.g.:
For example, the compare() operator supports a DataFrame shape like the following:
Adding a new operator, flat.py: this new operator transforms a multidimensional DataFrame into a flattened DataFrame.
The full examples:
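The PR's own examples are not reproduced in this text. As a rough illustration only, the sketch below shows one way MultiIndex columns could be flattened with plain pandas. The helper name `flatten_columns`, the `sep` parameter, and the sample data are hypothetical and are not taken from flat.py; only the `reset_index` flag mirrors the signature quoted in the review above.

```python
import pandas as pd


def flatten_columns(
    df: pd.DataFrame, reset_index: bool = True, sep: str = ", "
) -> pd.DataFrame:
    """Join MultiIndex column levels into single string labels.

    Illustrative helper only; not the implementation from this PR.
    """
    out = df.copy()
    if isinstance(out.columns, pd.MultiIndex):
        # ("max_temp", "Delhi") becomes "max_temp, Delhi".
        out.columns = [sep.join(str(level) for level in col) for col in out.columns]
    if reset_index:
        # Move the index (here: the date) back into an ordinary column.
        out = out.reset_index()
    return out


# Hypothetical wide frame with two column levels: metric and series name.
columns = pd.MultiIndex.from_product([["max_temp"], ["Delhi", "Mumbai"]])
wide = pd.DataFrame(
    [[20.0, 28.0], [21.0, 29.0]],
    index=pd.DatetimeIndex(["2022-01-01", "2022-01-02"], name="date"),
    columns=columns,
)

flat_df = flatten_columns(wide)
print(list(flat_df.columns))  # ['date', 'max_temp, Delhi', 'max_temp, Mumbai']
```

Keeping the flattening in one final step means the earlier operators can all rely on the same indexed, multi-level shape.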
TESTING INSTRUCTIONS
Let's use a new dataset to test.
1. Download DailyDelhiClimateTrain.csv to your local machine.
2. Import DailyDelhiClimateTrain into Superset:
a) click upload CSV to database under the blue plus sign at the top of Superset
b) fill in DailyDelhiClimateTrain in table name
c) select DailyDelhiClimateTrain.csv
d) fill in date in the Parse Dates line
e) click Save
3. Create a chart on the DailyDelhiClimateTrain dataset.
4. Select Line Chart; note that you should pick the ECharts version.
5. Set date in X-Axis.
6. Set meantemp in metrics and select max as the aggregate.
7. Set day as the time granularity.
8. Set the time comparison to 1 year ago and actual values.
9. Add a rolling window and validate the data.
Daily PCT with cumsum
Monthly comparison with quarter sum rolling
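To cross-check the chart values outside Superset, the same steps can be approximated in pandas. This is a minimal sketch under stated assumptions: the data is a synthetic stand-in for DailyDelhiClimateTrain.csv (only the `date` and `meantemp` column names come from the steps above), and the 7-day mean rolling window is an arbitrary choice, not a setting prescribed by this PR.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV; with the real file one could use
# pd.read_csv("DailyDelhiClimateTrain.csv", parse_dates=["date"]) instead.
dates = pd.date_range("2013-01-01", "2014-12-31", freq="D")
df = pd.DataFrame(
    {
        "date": dates,
        "meantemp": 20 + 10 * np.sin(np.arange(len(dates)) * 2 * np.pi / 365),
    }
)

# max(meantemp) at day granularity.
daily = df.groupby("date")["meantemp"].max()

# "1 year ago" with actual values: move every observation one year forward
# so it lines up with the date it is compared against.
year_ago = daily.copy()
year_ago.index = year_ago.index + pd.DateOffset(years=1)

compared = pd.concat(
    {"max_meantemp": daily, "1 year ago": year_ago.reindex(daily.index)},
    axis=1,
)

# Rolling mean over an assumed 7-day window, applied to both series at once.
rolled = compared.rolling(window=7, min_periods=1).mean()
print(rolled.tail())
```

Dates in the first year have no value for "1 year ago", so that column starts as NaN until a full year of history is available.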
ADDITIONAL INFORMATION