Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid inferring type varchar #1188

Merged
merged 12 commits into from
Mar 21, 2022
Merged

Avoid inferring type varchar #1188

merged 12 commits into from
Mar 21, 2022

Conversation

vrutik2809
Copy link
Contributor

@vrutik2809 vrutik2809 commented Mar 17, 2022

Fixes #1121

While importing the data from .csv file, previously default datatype for the string was VARCHAR but now it is changed to TEXT (check the Screenshot)

Screenshots

image

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the master branch of the repository
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@vrutik2809 vrutik2809 requested review from a team and dmos62 March 17, 2022 08:42
@pavish pavish added the pr-status: review A PR awaiting review label Mar 17, 2022
Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See specific comment regarding the DAG.

@@ -15,6 +15,7 @@
MAX_INFERENCE_DAG_DEPTH = 100

TYPE_INFERENCE_DAG = {
base.PostgresType.TEXT.value: [],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is set up, you'll never infer a column to be any type except TEXT, if it starts as TEXT. You need to put TEXT upstream of all other types (i.e., where String is now).

You need to change the reverse type map function to avoid an infinite loop.

Copy link
Contributor

@dmos62 dmos62 Mar 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this is set up, you'll never infer a column to be any type except TEXT, if it starts as TEXT.

@mathemancer I'm fuzzy about how this works too, and I'm surprised that this didn't trigger a test failure. If I understand this correctly, we should update our test suite to cover this, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. So i need to add base.PostgresType.TEXT.value as first value in base.STRING and then change the reverse_type_function() to avoid an infinite loop.
Please confirm it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please elaborate what _get_reverse_type_map() function is doing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way the type inference algorithm works is:

  1. reflect a column.
  2. get its type.
  3. map that type to a string representation through _get_reverse_type_map.
  4. Look in the TYPE_INFERENCE_DAG dictionary for that string as a key
  5. Get the list value associated with that key.
  6. Attempt to cast the column to the type represented by each list value in turn.
  7. If all fail, you're done (leave the column as-is) if you succeed, go to (1).

@dmos62 Actually, there is a test that checks this, and it's failing, but for the wrong reasons. 💩 I interpreted this failure incorrectly in previous comments. It's actually failing just because it wasn't updated to the new regime where the default column type is TEXT.

We treat all string-like types as the same, except for the extra-restrictive CHAR. This is what the little dictionary update in _get_reverse_type_map is for. It maps all those types to the same string representation.

@vrutik2809 You should try to accomplish the following:

If a column is of type TEXT, in the database, it should have every type as a downstream option. Columns of type TEXT in the database will have an SQLAlchemy type of either sqlalchemy.TEXT or sqlalchemy.Text. (The former when reflected, and the latter when freshly created in the Python layer). To represent this, I prefer replacing the base.STRING key with base.PostgresType.TEXT.value in the TYPE_INFERENCE_DAG, and then modifying the _get_reverse_type_map function to point all of the python types Text, TEXT, and VARCHAR (as imported in that file) to base.PostgresType.TEXT.value in the reverse type map.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. I will look into that.
Thank you

@dmos62 dmos62 added pr-status: revision A PR awaiting follow-up work from its author after review and removed pr-status: review A PR awaiting review labels Mar 17, 2022
@vrutik2809 vrutik2809 requested a review from mathemancer March 17, 2022 18:45
@vrutik2809
Copy link
Contributor Author

These tests are failing:

1. mathesar/tests/api/test_table_api.py::test_table_type_suggestion - Ass...
2. mathesar/tests/api/test_table_api.py::test_table_patch_columns_invalid_type_with_name
3. mathesar/tests/api/test_table_api.py::test_table_patch_columns_invalid_type_with_type
4. mathesar/tests/api/test_table_api.py::test_table_patch_columns_invalid_type_with_drop
5. mathesar/tests/api/test_table_api.py::test_table_patch_columns_invalid_type_with_multiple_changes

The 1st one I got it that it is failing because the default is set to varchar. So that I need to change that to TEXT. But I can't figure out the rest. I'm assuming that they are failing because they all are using the _get_data_types_column_data() function in which VARCHAR is there.

@kgodey kgodey self-assigned this Mar 18, 2022
@kgodey
Copy link
Contributor

kgodey commented Mar 18, 2022

@vrutik2809 For many of the tests you listed, the data is being tested against _get_data_types_column_data() on line 1116, which sets types as VARCHAR, not TEXT. If you change those, I think your tests will pass.

def test_table_patch_columns_invalid_type_with_name(create_data_types_table, client, engine_email_type):
    table_name = 'PATCH columns 15'
    table = create_data_types_table(table_name)
    column_data = _get_data_types_column_data()
    column_data[1]['name'] = 'hello'
    column_data[3]['type'] = 'BOOLEAN'

    body = {
        'columns': column_data
    }
    response = client.patch(f'/api/db/v0/tables/{table.id}/', body)
    response_json = response.json()
    assert response.status_code == 400
    assert 'Pizza is not a boolean' in response_json[0]['message']

    current_table_response = client.get(f'/api/db/v0/tables/{table.id}/')
    # The table should not have changed
    original_column_data = _get_data_types_column_data()
    _check_columns(current_table_response.json()['columns'], original_column_data)

For example, see the penultimate line of this test, which is getting _get_data_types_column_data() and comparing that to the API response.

@kgodey kgodey removed their assignment Mar 18, 2022
- Change test types from `VARCHAR` to `TEXT`
@codecov-commenter
Copy link

codecov-commenter commented Mar 19, 2022

Codecov Report

Merging #1188 (475b067) into master (3ae5e0c) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1188   +/-   ##
=======================================
  Coverage   93.39%   93.39%           
=======================================
  Files         113      113           
  Lines        4332     4332           
=======================================
  Hits         4046     4046           
  Misses        286      286           
Flag Coverage Δ
pytest-backend 93.39% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
db/columns/operations/infer_types.py 97.14% <ø> (ø)
db/tables/operations/create.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3ae5e0c...475b067. Read the comment docs.

@kgodey kgodey removed the request for review from dmos62 March 19, 2022 19:37
@kgodey kgodey assigned mathemancer and unassigned dmos62 Mar 19, 2022
@kgodey kgodey added pr-status: review A PR awaiting review and removed pr-status: revision A PR awaiting follow-up work from its author after review labels Mar 19, 2022
@kgodey kgodey requested a review from dmos62 March 19, 2022 19:38
@kgodey kgodey removed the request for review from dmos62 March 19, 2022 19:38
Copy link
Contributor

@mathemancer mathemancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. Thank you for your effort on this, @vrutik2809 !

@mathemancer mathemancer enabled auto-merge March 21, 2022 07:24
@mathemancer mathemancer merged commit 37c32d0 into mathesar-foundation:master Mar 21, 2022
@vrutik2809 vrutik2809 deleted the avoid-inferring-type-VARCHAR branch March 21, 2022 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-status: review A PR awaiting review
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

The type inference algorithm should use TEXT rather than VARCHAR
6 participants