
fix(dataset-api): get_or_create creates a dataset for an existing table_name but different schema #30379

Open
wants to merge 2 commits into master

Conversation


@luizcapu luizcapu commented Sep 24, 2024

SUMMARY

At Pinterest we were trying to use the get_or_create endpoint to automate the integration between our MetricsLayer and Superset. During our tests we encountered the following issue: #30377

The issue happens because the get_or_create code currently searches by table_name only and does not take the payload's schema attribute into account.

This PR changes get_or_create to take the schema into account:

  • A new get_table_by_schema_and_name method was added to the DatasetDAO class
  • get_or_create now checks whether the dataset exists by calling DatasetDAO.get_table_by_schema_and_name (instead of DatasetDAO.get_table_by_name); a sketch of the idea follows this list
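
For reference, a minimal sketch of the idea behind the new lookup. This is not the exact PR diff; the import paths, parameter order, and query shape are assumptions based on the Superset codebase:

# Sketch only -- not the exact code added in this PR.
# Assumes the SqlaTable model with table_name, schema, and database_id
# columns, as in superset.connectors.sqla.models.
from typing import Optional

from superset import db
from superset.connectors.sqla.models import SqlaTable


def get_table_by_schema_and_name(
    database_id: int, schema: Optional[str], table_name: str
) -> Optional[SqlaTable]:
    # Unlike a table_name-only lookup, the schema is part of the filter,
    # so a dataset with the same table_name but a different schema is
    # not treated as an existing match.
    return (
        db.session.query(SqlaTable)
        .filter_by(
            database_id=database_id,
            schema=schema,
            table_name=table_name,
        )
        .first()  # avoids MultipleResultsFound if duplicates already exist
    )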

TESTING INSTRUCTIONS

  • Integration tests were added for the new behaviour
  • Manual tests: reproduced the steps described in the issue, as detailed below

Case 1 - False Positive

  1. Go to the datasets page.
  2. Pick any existing dataset name and prepare a payload as follows (example using the users dataset):
{
  "table_name": "users",
  "schema": "other",
  "database_id": 1
}
  3. Submit this payload via a POST request to /api/v1/dataset/get_or_create (a scripted version of this call is sketched after these steps).
  4. A new dataset is created and the API returns a 200 pointing to the new dataset. No more false positives.
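
For reference, the manual call above can be scripted. A minimal sketch using Python requests, assuming a locally running Superset at http://localhost:8088 and a valid access token (the URL and token handling are illustrative, not prescribed by this PR):

# Illustrative client-side call for the manual test; the Superset URL and
# token below are placeholders for a local test environment.
import requests

BASE_URL = "http://localhost:8088"  # assumed local Superset instance
ACCESS_TOKEN = "<jwt-from-/api/v1/security/login>"  # placeholder

payload = {
    "table_name": "users",
    "schema": "other",
    "database_id": 1,
}

response = requests.post(
    f"{BASE_URL}/api/v1/dataset/get_or_create",
    json=payload,
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
)

print(response.status_code)  # expected: 200
print(response.json())       # response body should point to the new dataset

The same script covers Case 2 and the Backward Compatibility check below by swapping in their respective payloads.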

Case 2 - Internal Server Error

  1. Create two or more datasets with the same table_name and different schemas (either via the UI or the create dataset API).
  2. Try to create a new dataset, again with the same table_name but a different schema. Payload example:
{
  "table_name": "users",
  "schema": "any_new_schema_name",
  "database_id": 1
}
  3. Submit this payload via a POST request to /api/v1/dataset/get_or_create.
  4. A new dataset is created and the API returns a 200 pointing to the new dataset. No more 500 errors.

Backward Compatibility

  1. Go to the datasets page.
  2. Pick any existing dataset name and prepare a payload as follows (without passing the schema):
{
  "table_name": "users",
  "database_id": 1
}
  3. Submit this payload via a POST request to /api/v1/dataset/get_or_create.
  4. No new dataset is created. The API returns a 200 with the response body pointing to the existing dataset.

ADDITIONAL INFORMATION

  • Has associated issue: #30377
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@github-actions github-actions bot added the api Related to the REST API label Sep 24, 2024

@github-actions github-actions bot left a comment


Congrats on making your first PR and thank you for contributing to Superset! 🎉 ❤️

We hope to see you in our Slack community too! Not signed up? Use our Slack App to self-register.

@michael-s-molina michael-s-molina added review:draft review:checkpoint Last PR reviewed during the daily review standup labels Sep 24, 2024

codecov bot commented Sep 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.91%. Comparing base (76d897e) to head (ed26eb6).
Report is 835 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #30379       +/-   ##
===========================================
+ Coverage   60.48%   83.91%   +23.42%     
===========================================
  Files        1931      533     -1398     
  Lines       76236    38627    -37609     
  Branches     8568        0     -8568     
===========================================
- Hits        46114    32412    -13702     
+ Misses      28017     6215    -21802     
+ Partials     2105        0     -2105     
Flag         Coverage Δ
hive         48.95% <40.00%> (-0.22%) ⬇️
javascript   ?
mysql        76.78% <100.00%> (?)
postgres     76.85% <100.00%> (?)
presto       53.44% <40.00%> (-0.37%) ⬇️
python       83.91% <100.00%> (+20.42%) ⬆️
unit         60.75% <40.00%> (+3.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

@michael-s-molina michael-s-molina removed the review:checkpoint Last PR reviewed during the daily review standup label Sep 25, 2024
@luizcapu
Author

Marking this PR as ready for review. test-sqlite is the only failing test.

However, it's failing with sqlalchemy.exc.IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: tables.table_name, which is exactly the duplication this PR is intended to allow.

Furthermore, there are some signals that this constraint should be dropped:

  • It doesn't align with the other metadata databases (MySQL/Postgres) or with overall Superset functionality, which does allow duplicate table_names across different databases/schemas
  • The GitHub Actions logs say: SQLite Database support for metadata databases will be removed in a future version of Superset.
  • There's this migration that removes the uniqueness constraint on table_name, but it does so for MySQLDialect only

I could use some guidance on whether dropping this constraint is the right decision and, if so, how to do it.

Thank you in advance.
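
For discussion purposes, a minimal sketch of what dropping such a constraint could look like in an Alembic migration using SQLite batch mode. The revision IDs and constraint name are placeholders (the failing constraint may be an unnamed column-level one, in which case a naming convention or table reflection would be needed); this is not an actual migration in this PR:

# Illustrative Alembic migration sketch -- not part of this PR.
# Assumes the unique constraint on superset's `tables` table is named
# "uq_tables_table_name"; the real constraint name may differ.
from alembic import op

revision = "xxxx_drop_table_name_unique"  # placeholder revision id
down_revision = None                      # placeholder


def upgrade():
    # SQLite cannot alter constraints in place, so batch mode recreates
    # the table without the unique constraint.
    with op.batch_alter_table("tables") as batch_op:
        batch_op.drop_constraint("uq_tables_table_name", type_="unique")


def downgrade():
    with op.batch_alter_table("tables") as batch_op:
        batch_op.create_unique_constraint("uq_tables_table_name", ["table_name"])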

@luizcapu luizcapu marked this pull request as ready for review September 26, 2024 09:48
@dosubot dosubot bot added the data:dataset Related to dataset configurations label Sep 26, 2024
Labels: api (Related to the REST API), data:dataset (Related to dataset configurations), size/M