perf: improve perf in SIP-68 migration #19416

Merged
merged 5 commits into from
Mar 30, 2022

Conversation

betodealmeida
Member

@betodealmeida betodealmeida commented Mar 29, 2022

SUMMARY

This PR improves the performance on the SIP-68 migration script by using sqloxide (https://pypi.org/project/sqloxide/) to parse the SQL when extracting dependencies. In case the parsing fails it falls back to using sqlparse.
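The fallback behaviour described above can be sketched as follows. This is only an illustration of the "try the fast parser, fall back on failure" pattern: `extract_tables_with_fallback`, `toy_fast` and `toy_slow` are hypothetical names, and the two toy extractors are regex stand-ins rather than sqloxide or sqlparse, so the actual migration code may look different.

```python
import re
from typing import Callable, Iterable, Set

# Naive pattern used by both toy extractors; real parsers build a full AST instead.
_TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def toy_fast(sql: str) -> Iterable[str]:
    """Hypothetical stand-in for a fast but strict parser."""
    if "::" in sql:  # pretend this syntax is not supported by the fast parser
        raise ValueError("unsupported syntax")
    return _TABLE_RE.findall(sql)


def toy_slow(sql: str) -> Iterable[str]:
    """Hypothetical stand-in for a slower, more lenient parser."""
    return _TABLE_RE.findall(sql)


def extract_tables_with_fallback(
    sql: str,
    fast_extract: Callable[[str], Iterable[str]],
    slow_extract: Callable[[str], Iterable[str]],
) -> Set[str]:
    """Try the fast extractor first; if it fails for any reason, use the slow one."""
    try:
        return set(fast_extract(sql))
    except Exception:  # the fast parser may reject SQL it does not understand
        return set(slow_extract(sql))


tables = extract_tables_with_fallback(
    "SELECT m.ts FROM messages m JOIN channels c ON m.channel_id = c.id",
    toy_fast,
    toy_slow,
)
```

Passing the extractors in as arguments keeps the sketch independent of any particular parsing library.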

It also addresses a few bugs:

  • Tables were assigned incorrectly to datasets because of the lack of the database id in the predicate (also fixed in perf(alembic): paginize db migration for new dataset models #19406).
  • Datasets were incorrectly flagged as virtual when their sql was an empty string.
  • Datasets incorrectly flagged as virtual would have all tables associated with them, since the predicate was empty.
  • Tables referenced in virtual datasets but not present as physical datasets were not being created.
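The first and third bugs above both come down to how the lookup predicate is built. The sketch below reproduces them with the standard-library `sqlite3` module. The `toy_tables` schema and the helper names are assumptions made for illustration, not the actual Superset models or queries.

```python
import sqlite3
from typing import List


def find_by_name_only(conn: sqlite3.Connection, names: List[str]) -> List[int]:
    """Buggy lookup: ignores the database, so same-named tables collide."""
    placeholders = ", ".join("?" for _ in names)
    query = f"SELECT id FROM toy_tables WHERE name IN ({placeholders}) ORDER BY id"
    return [row[0] for row in conn.execute(query, names)]


def find_tables(conn: sqlite3.Connection, database_id: int, names: List[str]) -> List[int]:
    """Scoped lookup: restricts matches to one database and guards the empty case."""
    if not names:
        # With no conditions, dropping the WHERE clause would match every row.
        return []
    placeholders = ", ".join("?" for _ in names)
    query = (
        "SELECT id FROM toy_tables "
        f"WHERE database_id = ? AND name IN ({placeholders}) ORDER BY id"
    )
    return [row[0] for row in conn.execute(query, [database_id, *names])]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_tables (id INTEGER PRIMARY KEY, database_id INTEGER, name TEXT)"
)
conn.executemany(
    "INSERT INTO toy_tables (id, database_id, name) VALUES (?, ?, ?)",
    [(1, 1, "users"), (2, 2, "users"), (3, 1, "channels")],
)
```

Here `find_by_name_only(conn, ["users"])` matches the `users` table in both databases, while `find_tables(conn, 1, ["users"])` matches only the one in database 1.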

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

N/A

TESTING INSTRUCTIONS

$ superset db downgrade 5afbb1a5849b
$ superset db upgrade

Migration still works and relationships are populated correctly. We can see all the datasets and the associated tables with this query:

SELECT
  sl_datasets.name AS dataset_name,
  sl_datasets.is_physical,
  sl_datasets.expression,
  ARRAY_AGG(sl_tables.name) AS table_names
FROM sl_datasets
JOIN sl_dataset_tables
  ON sl_datasets.id = sl_dataset_tables.dataset_id
JOIN sl_tables
  ON sl_dataset_tables.table_id = sl_tables.id
GROUP BY 1, 2, 3
ORDER BY 2 DESC;

And the results:

       dataset_name        | is_physical |                                                                                                expression                                                                                                 |           table_names
---------------------------+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------
 covid_vaccines            | t           | covid_vaccines                                                                                                                                                                                            | {covid_vaccines}
 bart_lines                | t           | bart_lines                                                                                                                                                                                                | {bart_lines}
 threads                   | t           | threads                                                                                                                                                                                                   | {threads}
 messages                  | t           | messages                                                                                                                                                                                                  | {messages}
 video_game_sales          | t           | video_game_sales                                                                                                                                                                                          | {video_game_sales}
 birth_france_by_region    | t           | birth_france_by_region                                                                                                                                                                                    | {birth_france_by_region}
 sf_population_polygons    | t           | sf_population_polygons                                                                                                                                                                                    | {sf_population_polygons}
 users                     | t           | users                                                                                                                                                                                                     | {users}
 users_channels            | t           | users_channels                                                                                                                                                                                            | {users_channels}
 long_lat                  | t           | long_lat                                                                                                                                                                                                  | {long_lat}
 wb_health_population      | t           | wb_health_population                                                                                                                                                                                      | {wb_health_population}
 FCC 2018 Survey           | t           | "FCC 2018 Survey"                                                                                                                                                                                         | {"FCC 2018 Survey"}
 birth_names               | t           | birth_names                                                                                                                                                                                               | {birth_names}
 flights                   | t           | flights                                                                                                                                                                                                   | {flights}
 channels                  | t           | channels                                                                                                                                                                                                  | {channels}
 exported_stats            | t           | exported_stats                                                                                                                                                                                            | {exported_stats}
 unicode_test              | t           | unicode_test                                                                                                                                                                                              | {unicode_test}
 channel_members           | t           | channel_members                                                                                                                                                                                           | {channel_members}
 users_channels-uzooNNtSRO | f           | SELECT uc1.name as channel_1, uc2.name as channel_2, count(*) AS cnt FROM users_channels uc1 JOIN users_channels uc2 ON uc1.user_id = uc2.user_id GROUP BY uc1.name, uc2.name HAVING uc1.name <> uc2.name+| {users_channels}
                           |             |                                                                                                                                                                                                           |
 new_members_daily         | f           | SELECT date, total_membership - lag(total_membership) OVER (ORDER BY date) AS new_members FROM exported_stats                                                                                             | {exported_stats}
 messages_channels         | f           | SELECT m.ts, c.name, m.text FROM messages m JOIN channels c ON m.channel_id = c.id                                                                                                                        | {messages,channels}
 members_channels_2        | f           | SELECT c.name AS channel_name, u.name AS member_name FROM channel_members cm JOIN channels c ON cm.channel_id = c.id JOIN users u ON cm.user_id = u.id                                                    | {channels,channel_members,users}
(22 rows)

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@betodealmeida betodealmeida requested a review from a team as a code owner March 29, 2022 18:35
@codecov

codecov bot commented Mar 29, 2022

Codecov Report

Merging #19416 (f92077c) into master (816a2c3) will decrease coverage by 0.09%.
The diff coverage is 93.05%.

❗ Current head f92077c differs from pull request most recent head 96d513c. Consider uploading reports for the commit 96d513c to get more accurate results

@@            Coverage Diff             @@
##           master   #19416      +/-   ##
==========================================
- Coverage   66.48%   66.39%   -0.10%     
==========================================
  Files        1670     1670              
  Lines       63968    63824     -144     
  Branches     6512     6510       -2     
==========================================
- Hits        42531    42374     -157     
- Misses      19748    19761      +13     
  Partials     1689     1689              
Flag Coverage Δ
hive ?
mysql 81.86% <92.68%> (+0.21%) ⬆️
postgres 81.91% <92.68%> (+0.21%) ⬆️
presto ?
python 82.00% <92.68%> (-0.13%) ⬇️
sqlite 81.67% <92.68%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ntrols/src/components/CertifiedIconWithTooltip.tsx 80.00% <ø> (ø)
...d/packages/superset-ui-chart-controls/src/index.ts 100.00% <ø> (ø)
...omponents/ColumnConfigControl/ColumnConfigItem.tsx 0.00% <ø> (ø)
...tiveFilters/FiltersConfigModal/DraggableFilter.tsx 71.87% <ø> (ø)
...t/annotation_layers/annotations/commands/update.py 88.23% <ø> (ø)
superset/annotation_layers/annotations/schemas.py 100.00% <ø> (ø)
superset/cli/examples.py 0.00% <ø> (ø)
superset/cli/importexport.py 80.00% <ø> (ø)
superset/cli/main.py 0.00% <ø> (ø)
superset/cli/thumbnails.py 0.00% <ø> (ø)
... and 111 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 816a2c3...96d513c. Read the comment docs.

Member

@ktmud ktmud left a comment

Thanks for the quick turnaround. The table name extraction function seems to be a useful utility. Can it be moved to a more widely shared place?

Comment on lines +105 to +108
try:
    model = getattr(Base.classes, table)
except AttributeError:
    continue
Member Author

This is needed to run the benchmark migration script on the SIP-68 migration.

@@ -2278,8 +2285,7 @@ def write_shadow_dataset( # pylint: disable=too-many-locals
)

# physical dataset
tables = []
- if dataset.sql is None:
+ if not dataset.sql:
Member Author

Some of our example datasets have .sql == '', which caused them to be marked as virtual during the migration. It's not a big deal, since they will still work when we switch to the new models (in the new Dataset model the difference between virtual and physical is greatly reduced).
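The difference between the two checks can be seen in a small standalone sketch. The function names `is_physical_old` and `is_physical_new` are hypothetical and exist only to contrast the two conditions.

```python
from typing import Optional


def is_physical_old(sql: Optional[str]) -> bool:
    """Old check: only a missing (None) SQL expression counts as physical."""
    return sql is None


def is_physical_new(sql: Optional[str]) -> bool:
    """New check: both None and the empty string count as physical."""
    return not sql


# An empty string is treated differently by the two checks.
old_result = is_physical_old("")
new_result = is_physical_new("")
```

With an empty string, the old check classifies the dataset as virtual and the new check classifies it as physical. Both agree on `None` and on a non-empty query.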

)
tables = session.query(NewTable).filter(predicate).all()
Member Author

We were only assigning tables that already exist. The load_or_create_tables function will create any tables that are referenced in the SQL but don't exist yet.
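As a rough sketch of the load-or-create idea, assuming an in-memory dictionary in place of the database session: the helper `load_or_create` below and its record layout are hypothetical and are not the implementation of `load_or_create_tables`.

```python
from typing import Dict, Iterable, List, Tuple

TableKey = Tuple[int, str]  # (database_id, table_name); an assumed key layout


def load_or_create(
    existing: Dict[TableKey, dict],
    database_id: int,
    names: Iterable[str],
) -> List[dict]:
    """Return one record per referenced table, creating any that are missing."""
    result = []
    for name in names:
        key = (database_id, name)
        if key not in existing:
            # Table is referenced in the SQL but has no record yet: create it.
            existing[key] = {"database_id": database_id, "name": name}
        result.append(existing[key])
    return result


known = {(1, "users"): {"database_id": 1, "name": "users"}}
records = load_or_create(known, 1, ["users", "channels"])
```

After the call, `known` also holds a record for `channels`, and the existing `users` record is reused rather than duplicated.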

@betodealmeida betodealmeida force-pushed the faster_parser_sip68 branch 3 times, most recently from bc59a0c to e863bb9 on March 30, 2022 01:41
@betodealmeida betodealmeida merged commit 63b5e2e into apache:master Mar 30, 2022
villebro pushed a commit that referenced this pull request Apr 3, 2022
* chore: improve perf in SIP-68 migration

* Small fixes

* Create tables referenced in SQL

* Update logic in SqlaTable as well

* Fix unit tests

(cherry picked from commit 63b5e2e)
villebro pushed a commit that referenced this pull request Apr 4, 2022
* chore: improve perf in SIP-68 migration

* Small fixes

* Create tables referenced in SQL

* Update logic in SqlaTable as well

* Fix unit tests

(cherry picked from commit 63b5e2e)
@betodealmeida betodealmeida mentioned this pull request Apr 6, 2022
9 tasks
philipher29 pushed a commit to ValtechMobility/superset that referenced this pull request Jun 9, 2022
* chore: improve perf in SIP-68 migration

* Small fixes

* Create tables referenced in SQL

* Update logic in SqlaTable as well

* Fix unit tests
@mistercrunch mistercrunch added 🍒 1.5.3 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 2.0.0 labels Mar 13, 2024
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels lts-v1 size/L 🍒 1.5.0 🍒 1.5.1 🍒 1.5.2 🍒 1.5.3 🚢 2.0.0
5 participants