feat(python): improved dtype inference/refinement for read_database results #15126

Merged
merged 1 commit into from
Mar 18, 2024


@alexander-beedie alexander-beedie commented Mar 18, 2024

Ref: #15076 (comment)

Additional layer of dtype inference for query results that do not return Arrow data directly, and that do not have an explicit "schema_overrides" entry.

This update adds support for inferring more accurate dtypes from the two simplest flavours of cursor "type_code"[1] description, specifically simple python types (eg: datetime, int, str) and string descriptions (eg: "varchar", "double", "array[float4]"). The string-based inference is, by necessity, quite flexible/involved.
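To make the two "type_code" flavours concrete, here is a minimal sketch of how such a lookup could work. It is not the code from this PR: the function name `infer_dtype_name`, both lookup tables, and the use of plain strings for dtype names (so the sketch runs without Polars installed) are assumptions made for illustration, and the real string-based inference handles far more spellings.

```python
import re
from datetime import date, datetime, time, timedelta
from decimal import Decimal

# Hypothetical lookup tables (not taken from the PR); dtype names are strings.
PY_TYPE_TO_DTYPE = {
    bool: "Boolean",
    int: "Int64",
    float: "Float64",
    str: "String",
    bytes: "Binary",
    datetime: "Datetime",
    date: "Date",
    time: "Time",
    timedelta: "Duration",
    Decimal: "Decimal",
}

STRING_TO_DTYPE = {
    "bool": "Boolean", "boolean": "Boolean",
    "smallint": "Int16", "int2": "Int16",
    "int": "Int32", "integer": "Int32", "int4": "Int32",
    "bigint": "Int64", "int8": "Int64",
    "real": "Float32", "float4": "Float32",
    "double": "Float64", "float8": "Float64",
    "varchar": "String", "text": "String", "char": "String",
    "date": "Date", "timestamp": "Datetime",
}


def infer_dtype_name(type_code):
    """Return a dtype name for a python-type or string type_code, else None."""
    if isinstance(type_code, type):
        # Exact-type lookup, so bool is not mistaken for its parent class int.
        return PY_TYPE_TO_DTYPE.get(type_code)
    if isinstance(type_code, str):
        text = type_code.strip().lower()
        # Recurse into array types: "array[float4]" -> "List(Float32)".
        match = re.fullmatch(r"array\[(.+)\]", text)
        if match:
            inner = infer_dtype_name(match.group(1))
            return None if inner is None else f"List({inner})"
        # Drop a trailing size suffix: "varchar(255)" -> "varchar".
        text = re.sub(r"\(.*\)$", "", text).strip()
        return STRING_TO_DTYPE.get(text)
    # Driver-specific integer codes, enums, etc. are out of scope here.
    return None
```

Anything that is neither a python type nor a recognised string (for example a driver-specific integer code) returns None, which corresponds to the case where no dtype can be inferred at this level.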

If set in the cursor description, the "internal_size", "precision", and "scale" entries are also used to further refine the inferred dtypes (eg: if we have "type_code" = float and internal_size=4, we can infer the more accurate Float32, saving some memory and speeding things up later).
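The refinement step can be sketched as a small function that narrows a broad dtype when the optional description fields are present. This is an illustration only: `refine_dtype_name` is a hypothetical helper, dtype names are plain strings, and it assumes "internal_size" is reported in bytes, which is not guaranteed for every driver.

```python
def refine_dtype_name(base, internal_size=None, precision=None, scale=None):
    """Narrow a broad dtype name using optional cursor description fields.

    Assumes internal_size is a byte count (an assumption; drivers vary).
    """
    # A 4-byte integer fits Int32, a 2-byte integer fits Int16, and so on.
    if base == "Int64" and internal_size in (1, 2, 4, 8):
        return f"Int{internal_size * 8}"
    # A 4-byte float is single precision.
    if base == "Float64" and internal_size == 4:
        return "Float32"
    # Decimals carry their precision and scale when both are known.
    if base == "Decimal" and precision is not None and scale is not None:
        return f"Decimal({precision}, {scale})"
    # No usable extra information: keep the broad dtype.
    return base
```

When a field is missing (None), the broad dtype is returned unchanged, so the refinement never makes the inference worse than the type_code alone.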

Note that more sophisticated inference requires specific driver module knowledge in order to reverse-lookup bespoke integer codes, enums, and all manner of driver-specific custom type designations (the DBAPI2 spec did not solve this part of the interface at all... ;)

Example

Before:

Result from a SQLAlchemy query returning no rows, using pyodbc against MSSQL. (Previously we would only infer the column names, but not the dtypes).

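from sqlalchemy import create_engine
import polars as pl

# `odbc_string` is assumed to hold a valid ODBC connection string for the MSSQL server.
alchemy_conn = create_engine(
    f"mssql+pyodbc:///?odbc_connect={odbc_string}"
).connect()

# The WHERE clause is always false, so the query returns zero rows.
df = pl.read_database(
    query="SELECT TOP 1 * FROM test_table WHERE 1=0",
    connection=alchemy_conn,
)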
# shape: (0, 5)
# ┌──────┬───────┬───────┬───────┬──────────┐
# │ name ┆ value ┆ major ┆ minor ┆ revision │
# │ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---      │
# │ null ┆ null  ┆ null  ┆ null  ┆ null     │  << no dtype inference
# ╞══════╪═══════╪═══════╪═══════╪══════════╡
# └──────┴───────┴───────┴───────┴──────────┘

After:

While pyodbc does not provide especially detailed dtypes (eg: does not specify the size of int/floats, etc) we can infer the broad dtype, which is a notable improvement over "null":

# shape: (0, 5)
# ┌──────┬───────┬───────┬───────┬──────────┐
# │ name ┆ value ┆ major ┆ minor ┆ revision │
# │ ---  ┆ ---   ┆ ---   ┆ ---   ┆ ---      │
# │ str  ┆ bool  ┆ i64   ┆ i64   ┆ i64      │  << dtypes inferred
# ╞══════╪═══════╪═══════╪═══════╪══════════╡
# └──────┴───────┴───────┴───────┴──────────┘

(Note that arrow-odbc is strongly preferred over pyodbc in real-world use with Polars, due to significant performance -and typing- benefits)
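Because the zero-row query above returns no values to inspect, the only type information available is the cursor description. The sketch below shows how a schema could be derived from it. The description entries are made up to mirror the five columns in the example (the size, precision, and scale values are placeholders), `schema_from_description` is a hypothetical helper, and dtype names are plain strings so the code runs without a database or Polars.

```python
# Each PEP 249 description entry is a 7-item sequence:
# (name, type_code, display_size, internal_size, precision, scale, null_ok)

# Broad mapping for drivers that report python types as the type_code.
BROAD_DTYPES = {str: "String", bool: "Boolean", int: "Int64", float: "Float64"}


def schema_from_description(description):
    """Build a {column name: dtype name} mapping from a cursor description."""
    schema = {}
    for entry in description:
        name, type_code = entry[0], entry[1]
        # Fall back to "Null" when the type_code is not recognised,
        # which matches the "Before" output for an empty result.
        schema[name] = BROAD_DTYPES.get(type_code, "Null")
    return schema


# Hypothetical description for the empty result set in the example.
description = [
    ("name", str, None, 50, 50, 0, True),
    ("value", bool, None, 1, 1, 0, True),
    ("major", int, None, 10, 10, 0, True),
    ("minor", int, None, 10, 10, 0, True),
    ("revision", int, None, 10, 10, 0, True),
]

schema = schema_from_description(description)
```

A driver that leaves the type_code as None (or uses an opaque integer code) would produce "Null" for every column here, which is the situation the "Before" output illustrates.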

Also

Queries using the SQLAlchemy duckdb-engine dialect now automatically take the Arrow-aware duckdb fast-path.

Footnotes

  1. "type_code": https://peps.python.org/pep-0249/#cursor-attributes

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars labels Mar 18, 2024
@alexander-beedie alexander-beedie changed the title feat(python): cursor-level dtype inference/refinement for read_database results feat(python): improved dtype inference/refinement for read_database results Mar 18, 2024

codecov bot commented Mar 18, 2024

Codecov Report

Attention: Patch coverage is 71.11111%, with 39 lines in your changes missing coverage. Please review.

Project coverage is 81.20%. Comparing base (f8ade71) to head (f709520).
Report is 13 commits behind head on main.

Files                                    Patch %   Lines
py-polars/polars/io/database.py          48.88%    16 Missing and 7 partials ⚠️
py-polars/polars/datatypes/convert.py    82.22%    9 Missing and 7 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15126      +/-   ##
==========================================
+ Coverage   81.08%   81.20%   +0.12%     
==========================================
  Files        1342     1346       +4     
  Lines      174112   175233    +1121     
  Branches     2459     2506      +47     
==========================================
+ Hits       141178   142302    +1124     
+ Misses      32467    32451      -16     
- Partials      467      480      +13     


@ritchie46 ritchie46 merged commit 0b65c33 into pola-rs:main Mar 18, 2024
12 checks passed
@alexander-beedie alexander-beedie deleted the cursor-dtype-inference branch March 18, 2024 19:21
@alexander-beedie alexander-beedie added the A-io-database Area: reading/writing to databases label Apr 22, 2024