Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: presto table schema expansion when nested rows contain columns of map and/or array types #26892

Open
wants to merge 158 commits into
base: master
Choose a base branch
from

Conversation

brouberol
Copy link
Contributor

@brouberol brouberol commented Jan 30, 2024

SUMMARY

When the PRESTO_EXPAND_DATA feature flag was set to True, we've noticed that some of our Presto tables were able to be previewed, while some other failed, with the following traceback

2024-01-30 14:08:00,598:ERROR:flask_appbuilder.api:list index out of range
Traceback (most recent call last):
  File "//path/to/virtualenv/site-packages/flask_appbuilder/api/__init__.py", line 110, in wraps
    return f(self, *args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/views/base_api.py", line 127, in wraps
    raise ex
  File "//path/to/virtualenv/site-packages/superset/views/base_api.py", line 121, in wraps
    duration, response = time_function(f, self, *args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/utils/core.py", line 1454, in time_function
    response = func(*args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/utils/log.py", line 255, in wrapper
    value = f(*args, **kwargs)
  File "//path/to/virtualenv/site-packages/superset/databases/api.py", line 740, in table_metadata
    table_info = get_table_metadata(database, table_name, schema_name)
  File "//path/to/virtualenv/site-packages/superset/databases/utils.py", line 67, in get_table_metadata
    columns = database.get_columns(table_name, schema_name)
  File "//path/to/virtualenv/site-packages/superset/models/core.py", line 847, in get_columns
    return self.db_engine_spec.get_columns(
  File "//path/to/virtualenv/site-packages/superset/db_engine_specs/presto.py", line 1008, in get_columns
    cls._parse_structural_column(column.Column, column.Type, result)
  File "//path/to/virtualenv/site-packages/superset/db_engine_specs/presto.py", line 922, in _parse_structural_column
    column_spec = cls.get_column_spec(field_info[1])
IndexError: list index out of range

The discriminating factor turned out to be the presence of map fields in nested rows, causing the parent_data_type argument passed to PrestoEngineSpec._parse_structural_column classmethod to be mis-parsed, as demonstrated:

(Pdb) w
...
-> spec._parse_structural_column("test_column", parent_data_type, accumulator)
> /Users/br/wmf/apache/superset/superset/db_engine_specs/presto.py(905)_parse_structural_column()
-> stack: list[tuple[str, str]] = []
(Pdb) parent_data_type
'row("protocol" varchar, "request_headers" map(varchar, varchar))'
(Pdb) data_types
['test_column row',
 '"protocol" varchar,
 "request_headers" map',
 'varchar, varchar))']

The "request_headers" map(varchar, varchar) token was parsed into ['"request_headers" map', 'varchar, varchar))'], which later on breaks on ['varchar, varchar))'][1], as demonstrated in the previous traceback.

I've decided to fix this bug by removing the key/value types of the map itself from the parent_data_type string, as assuming they were parsed correctly, the types do not seem to be used by SQLAlchemy anyway:

(Pdb) cls.get_column_spec('map')
ColumnSpec(sqla_type=Map(), generic_type=<GenericDataType.STRING: 1>, is_dttm=False, python_date_format=None)
(Pdb) cls.get_column_spec('map(varchar, varchar)')
ColumnSpec(sqla_type=Map(), generic_type=<GenericDataType.STRING: 1>, is_dttm=False, python_date_format=None)

This way, the parent_data_type string 'row("protocol" varchar, "request_headers" map(varchar, varchar))' becomes 'row("protocol" varchar, "request_headers" map)', which gets successfully parsed as ['test_column row', '"protocol" varchar, "request_headers" map)']`.

Note: We also had the same issue with array types as well. A Presto array(string) column corresponds to a sqlalchemy.types.ARRAY type, so we can drop the (string) part of the column type.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before After
image Screenshot 2024-01-30 at 17 16 50

TESTING INSTRUCTIONS

Preview the schema of a Presto table contained nested rows with the PRESTO_EXPAND_DATA feature flag enabled.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@brouberol brouberol changed the title fix: presto table schema expansion when nested rows contain map columns fix: presto table schema expansion when nested rows contain columns of map type Jan 30, 2024
@john-bodley john-bodley added the review:checkpoint Last PR reviewed during the daily review standup label Jan 30, 2024
@john-bodley
Copy link
Member

Thanks @brouberol for the PR. I've added a few reviewers who are likely familiar with this feature.

Copy link

codecov bot commented Jan 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.08%. Comparing base (76d897e) to head (ff43648).
Report is 467 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #26892      +/-   ##
==========================================
+ Coverage   60.48%   65.08%   +4.59%     
==========================================
  Files        1931      526    -1405     
  Lines       76236    37907   -38329     
  Branches     8568        0    -8568     
==========================================
- Hits        46114    24672   -21442     
+ Misses      28017    13235   -14782     
+ Partials     2105        0    -2105     
Flag Coverage Δ
hive 49.09% <0.00%> (-0.07%) ⬇️
javascript ?
presto 53.66% <0.00%> (-0.15%) ⬇️
python 65.08% <100.00%> (+1.59%) ⬆️
unit 59.99% <100.00%> (+2.36%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor Author

@brouberol brouberol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI flagged the following required change.

superset/db_engine_specs/presto.py Outdated Show resolved Hide resolved
@brouberol brouberol force-pushed the fix-presto-schema-preview-nested-map-columns branch from 0c3f6ea to ccbe4e9 Compare February 8, 2024 08:05
@pull-request-size pull-request-size bot added size/L and removed size/M labels Feb 9, 2024
@brouberol brouberol changed the title fix: presto table schema expansion when nested rows contain columns of map type fix: presto table schema expansion when nested rows contain columns of map and/or array types Feb 9, 2024
@brouberol brouberol force-pushed the fix-presto-schema-preview-nested-map-columns branch from e6879bd to e357679 Compare February 9, 2024 09:59
@john-bodley john-bodley removed the review:checkpoint Last PR reviewed during the daily review standup label Feb 12, 2024
michael-s-molina and others added 14 commits February 20, 2024 11:42
@github-actions github-actions bot added i18n Namespace | Anything related to localization risk:db-migration PRs that require a DB migration i18n:spanish Translation related to Spanish language i18n:italian Translation related to Italian language i18n:french Translation related to French language i18n:chinese Translation related to Chinese language i18n:japanese Translation related to Japanese language i18n:russian Translation related to Russian language i18n:korean Translation related to Korean language api Related to the REST API doc Namespace | Anything related to documentation plugins dependencies:npm github_actions Pull requests that update GitHub Actions code packages i18n:dutch i18n:slovak i18n:ukrainian i18n:portuguese i18n:brazilian labels Jul 19, 2024
@mistercrunch
Copy link
Member

mistercrunch commented Jul 22, 2024

Happy to review if/when this PR gets rebased

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api Related to the REST API dependencies:npm doc Namespace | Anything related to documentation github_actions Pull requests that update GitHub Actions code i18n:brazilian i18n:chinese Translation related to Chinese language i18n:dutch i18n:french Translation related to French language i18n:italian Translation related to Italian language i18n:japanese Translation related to Japanese language i18n:korean Translation related to Korean language i18n:portuguese i18n:russian Translation related to Russian language i18n:slovak i18n:spanish Translation related to Spanish language i18n:ukrainian i18n Namespace | Anything related to localization packages plugins risk:db-migration PRs that require a DB migration size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.