-
Notifications
You must be signed in to change notification settings - Fork 13.9k
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SIP-94] Support star schemas #12729
Comments
@wernerdaehn What do you think? |
@zhaoyongjie From a technical side, if all databases would support views with a virtual unlimited number of columns and do a perfect job from the SQL optimizer point of view, then yes. But the opposite is the case. Only Oracle And SAP Hana (Calculation Views) are even close to that and far from perfect. So no, this has nothing to do with OLAP, it is related to being a business intelligence tool. I see that as something fairly straight forward to implement and with lots of potential initially and later. |
Thanks for the comprehensive explanation @wernerdaehn ! Ping @amitmiran137 , here's a good motivation for adding support for join semantics. |
@villebro A generic join feature I would advise against. Things like a theta join, snowflake-like joins with intermediate tables etc. These can all be built in the database using views. So why implementing these things a second time? I would really support only the most trivial case with a single fact in the middle and directly linked dimension tables with PKs. As with the other proposals, let me know if I should create a comprehensive write-up. I would be willing to invest the time, but only if it is worth it. |
@wernerdaehn I was speaking in general terms, i.e. supporting joining tables beyond the current table/virtual table functionality. One thing that sticks out: if we have 100 FKs and thousands of columns, this can quickly become very burdensome, both for the application but even for the end user. So this needs to be properly scoped to make sure it improves the user experience rather than overwhelm it. |
Understood. My fear is just that when you implement that, you start with the SQL join clause in mind. I have seen multiple products fail because the join clause is extremely powerful and frankly, not suited for graphical tools. Use the SQL Editor of Superset for those. Anyway, you got my point, that is the important part. Regarding the second paragraph, while your statement is certainly correct, it is not related to the star schema feature as such. If all needs to be in a single table, then this single table has equally many columns and must not be overwhelming as well. Please let me know if I can do any work in that regards, in case it makes sense. |
I would love to see this feature go live, not sure if there is any progress. The description above is good enough, I just want to share my case.
|
@ducchetrongminh you can build a virtual dateset based on a view which join the fact to the dimension tables ? |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. For admin, please label this issue |
@jomach Thanks for the reply. The key point that distinguishes Superset from PBI/Tableau is SQL modeling. The Superset is dependent on SQLLab manually writing SQL to achieve the snowflake or star model. |
@zhaoyongjie that's fair point, but it is very hard to model multiple fact tables with different grains using just one flat table ? |
Hi @djouallah, the star modeling seems like 1 fact table and some dimension tables. Do you have more use cases for multiple fact tables? |
@zhaoyongjie thanks for Asking, all my workload since I started this BI thing was always multiple fact tables. |
Regarding the multiple fact tables: I would ask to support a single fact table with multiple dimensions only. First because this is significant improvement over the current situation already and second multiple fact tables is the same as a single fact table if a view is used. Imagine this example: So what you must do instead is aggregate the CO data to year and cost center group level, pick from the Fact P the most recent plan version and show just that data. Now both intermediate data sets have a 1:1 relationship and can be joined. This is a complex operation, best implemented as view. Then you would have a third Fact CO_vs_Plan, the created view, that takes care of the correctness of the comparison. Yes, supporting such out of the box would be the next evolution of this enhancement request but then you are at a full blown semantic layer, something superset tries to stay away from if I am reading correctly between the lines. |
Hey Guys, multiple fact tables is not a big issues imho. As people can work around it. Important would be that we can slice and dice data as the user needs. Like fact table I just build a datawarehouse and I end up using powerbi because there you can define the join dependencies. Imho this is crucial as feature. Otherwise if you datalake with multiple TB and dimensions with a couple of GB, you would have to join that into a single table which is huge. Thanks |
Star schema is not a showstopper, it can be replaced by a flat, it is a pain if your dimensions changes a lot but there is a solution, multi fact is the real problem, because there is no alternative, Look at the long term solution not a temporary fix. |
Do you have a link for that Tableu example ? Can you explain how you solve ghe star to flat without views ? |
https://help.tableau.com/current/online/en-us/datasource_datamodel.htm of course, you need a view or a materialized table what I meant is, there is a workaround, but for multi fact, there are none !!!, unless you support nested table which is a pain !!! or merin multiple fact with different grains which is not good IMO |
@djouallah Interesting, I am saying the exact opposite. Combining that into one table means adding 20 times 100 columns times 1m rows each. That is 2bn values, instead of 2010100 = 20k values. If your propose to store all in a physical table, what was a large table before becomes enormous. The only option would be a view with join pruning, meaning the SQL optimizer removes joins from the view that are not used by any selected column. This is possible but only under a very strict set of conditions often not met. The other aspect is the UI. What do you prefer? Selecting the columns from a list with 2000 column names or having a tree where you first select the type, e.g. Material, and then select the material related properties? So no, Star schema is the simpler thing to do and the by far more important. Combining multiple facts is harder and can be done with views as a workaround. It would be nice if superset supports both but let's be reasonable and start with the by far more important one. Agree? |
@wernerdaehn I am not disagreeing with you, of course, simple star is better, all what I am saying, people will ask for Galaxy schema from day 1. |
Why is there no workaround for multiple facts? It is not that difficult. And all that makes that SQL complicated are logical decisions, which the UI must provide as switches as well. So it won't be that much easier.
|
Amit from thoughtspot has a great blog, and he explain better what I am trying to say |
With this approach you need to read most of the things ans perform a distinct. Further more you are not leaving the user the option to play with your dashboard. It's kind of hardcoded |
No, there is no distinct clause in the SQL. Let me ask the question differently: How would you suggest the UI should look like to support combining multiple facts and what would happen under the cover, using my example? |
It seems this SIP proposal was never given an official number, nor was it taken to the Apache Superset dev mailing list for the official discussion as part of the SIP procedure. I'll number it as 94 (even if that's a little strange at this juncture), and request that it either get proposed officially as a step leading toward a vote, or that we close it if there's no interest in pursuing it further. If you want to attend the next Superset Town Hall, that would be a great place to gather a bit of sentiment around it, as well. |
Since there has been no substantive discussion here in about a year and a half, I'm going to close this issue and move it to discarded. If anyone wants to rekindle the effort and carry it forward in the SIP process (i.e. starting a DISCUSS thread on the Apache Superset dev mailing list), I'm more than happy to re-open it and adjust the status. |
Hi, More generally speaking, I wonder how to prevent an issue from being "closed due to inactivity" when it looks like everything that needed to be said was said, and waiting for superset side feedback on feasability. Should I for instance re-open a new issue in this case referencing this issue ? (and maybe add a "up!" commet every 6 months ;-) ) |
@squalou , some years ago I replied the issue. For now, I'm also thinking the initial design of Superset never consider data modeling, e.g. snowflake or star modeling. The workaround for data modeling you should consider other sematic/modeling tools e.g. DBT, and expose a "big flat table" to Superset. |
@zhaoyongjie thanks ! |
we decided not to use Superset because it doesn't support Star Schema |
@leonliuaptean To be honest, there are lots of workarounds for the Star/Snowflake/and so on data modeling. Superset is a viz tool, and its original design didn't focus too much on data modeling. |
@zhaoyongjie could you point us to the examples that you mean? I could not find a very good workaround for it. Take power bi as an example, the support this and it is very good why ? On larger teams where I'm working there are dedicated teams to building dashboards, which actually do not know it works in the background. Something like power bi modeling would be a huge step forward on superset |
@zhaoyongjie Superset should not allow data modeling but be able to consume data from query optimized tables. Let's see what the current method is, if you want to visualize the data from a data warehouse: You create a view All to workaround the fact that these queries can be built fully automatically by looking at the columns picked. Columns from the dimension go into the group by, dimensions that are not used are not added, columns with numbers from the fact table are aggregated. |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
[SIP] Support star schemas
Motivation
While I am absolutely fine with the decision to base a diagram on a single table/view only, most databases cannot cope with that. A typical view would join 100 tables and have 5000 columns.
Take sales orders as example: For a proper self service BI you would join order header, order line item, three times the customer mater table for soldto-shipto-billto, the material master table, tons of tiny lookup tables with e.g. statusid, statustext etc.
If the joined table has a foreign key defined and the column is not-null, then the database optimizer can ignore that join in case none of the columns are used. Only a few databases are that advanced.
Further more, for many visual clues like slice/dice and filters a distinct list of attributes is needed. What is faster? Finding the three order states in the state text table or a select on the text table itself?
Hence I would suggest to support star schema data models. A chart is based on a fact table. The list of columns shown are all columns of the fact table plus all columns of all tables this fact table has a FK relationship with.
Proposed Change
The text was updated successfully, but these errors were encountered: