Inconsistent Data Frame Shape When Initializing with List Depending on Content Type #6968

Open
2 tasks done
20rk00 opened this issue Feb 17, 2023 · 2 comments
Labels
A-input-parsing Area: parsing input arguments bug Something isn't working P-medium Priority: medium python Related to Python Polars


20rk00 commented Feb 17, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

When creating a data frame with pl.DataFrame using list, the shape of the resulting data frame may vary depending on the content type of the list.

Reproducible example

>>> import polars as pl
>>> pl.__version__
'0.16.6'
>>> pl.DataFrame(
...     [[1, None], [2, None],  [3, None]]
... )
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ i64      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╡
│ 1        ┆ 2        ┆ 3        │
│ null     ┆ null     ┆ null     │
└──────────┴──────────┴──────────┘
>>> pl.DataFrame(
...     [[None, None], [2, None],  [3, None]]
... )
shape: (3, 2)
┌──────────┬──────────┐
│ column_0 ┆ column_1 │
│ ---      ┆ ---      │
│ i64      ┆ bool     │
╞══════════╪══════════╡
│ null     ┆ null     │
│ 2        ┆ null     │
│ 3        ┆ null     │
└──────────┴──────────┘

Expected behavior

>>> pl.DataFrame(
...     [[None, None], [2, None],  [3, None]]
... )
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ ---      ┆ ---      ┆ ---      │
│ i64      ┆ i64      ┆ i64      │
╞══════════╪══════════╪══════════╡
│ null     ┆ 2        ┆ 3        │
│ null     ┆ null     ┆ null     │
└──────────┴──────────┴──────────┘

Installed versions

---Version info---
Polars: 0.16.6
Index type: UInt32
Platform: Linux-4.15.0-202-generic-x86_64-with-glibc2.27
Python: 3.10.8 (main, Nov 17 2022, 14:32:18) [GCC 7.5.0]
---Optional dependencies---
pyarrow: 11.0.0
pandas: 1.5.3
numpy: 1.23.5
fsspec: <not installed>
connectorx: <not installed>
xlsx2csv: <not installed>
deltalake: <not installed>
matplotlib: 3.7.0

@20rk00 20rk00 added bug Something isn't working python Related to Python Polars labels Feb 17, 2023
Collaborator

alexander-beedie commented Feb 17, 2023

The orient param can help you here. There is no "right" answer to how to load the above frames, since the data may represent either rows or columns. Without any hints the orientation has to be guessed heuristically, so different data may lead the heuristic to different choices.

(In the case above the heuristic makes a different choice because the first "row" appears to be all null, so it has less type information to work with and reason about, and it defers to the default orientation for list data, which is row.)
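
To make that reasoning concrete, here is a toy pure-Python model of such a guess. It is only a sketch that happens to reproduce the two shapes reported above. It is not the Polars implementation, and the helper names guess_orient and frame_shape are hypothetical. The assumption is that only the first inner list is inspected, that None carries no type information, and that anything other than exactly one distinct type falls back to row.

```python
from typing import Any, Sequence, Tuple


def guess_orient(data: Sequence[Sequence[Any]]) -> str:
    """Toy orientation guess based only on the first inner list (hypothetical)."""
    first = data[0] if data else []
    # None carries no type information, so it is ignored here.
    seen_types = {type(v) for v in first if v is not None}
    # Exactly one distinct type looks like a homogeneous column; mixed types,
    # or no type information at all, fall back to the default: row.
    return "col" if len(seen_types) == 1 else "row"


def frame_shape(data: Sequence[Sequence[Any]], orient: str) -> Tuple[int, int]:
    """(rows, columns) of rectangular list data under the given orientation."""
    outer, inner = len(data), len(data[0])
    return (inner, outer) if orient == "col" else (outer, inner)


a = [[1, None], [2, None], [3, None]]
b = [[None, None], [2, None], [3, None]]
print(guess_orient(a), frame_shape(a, guess_orient(a)))  # col (2, 3)
print(guess_orient(b), frame_shape(b, guess_orient(b)))  # row (3, 2)
```

Under this model, swapping the leading 1 for None removes the only type hint in the first list, which is enough to flip the guess from col to row.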

In the absence of declared columns/schema, you can supply orient to always get the shape you expect, e.g.:

pl.DataFrame(
    [[None, None], [2, None],  [3, None]],
    orient = "col",
)
# shape: (2, 3)
# ┌──────────┬──────────┬──────────┐
# │ column_0 ┆ column_1 ┆ column_2 │
# │ ---      ┆ ---      ┆ ---      │
# │ f64      ┆ i64      ┆ i64      │
# ╞══════════╪══════════╪══════════╡
# │ null     ┆ 2        ┆ 3        │
# │ null     ┆ null     ┆ null     │
# └──────────┴──────────┴──────────┘
pl.DataFrame(
    [[None, None], [2, None],  [3, None]],
    orient = "row",
)
# shape: (3, 2)
# ┌──────────┬──────────┐
# │ column_0 ┆ column_1 │
# │ ---      ┆ ---      │
# │ i64      ┆ bool     │
# ╞══════════╪══════════╡
# │ null     ┆ null     │
# │ 2        ┆ null     │
# │ 3        ┆ null     │
# └──────────┴──────────┘

Author

20rk00 commented Feb 18, 2023

Thank you for your response!

Regarding the issue, while it doesn't seem to be a bug, I think that the behavior may be a bit misleading. Specifically, it appears that the inference result for a null column is being affected by the value of "1 or None" in another row, which is causing a change in the inference result's boolean value.

As for the type inference algorithm, I wasn't able to find any documentation on it. If you happen to know what algorithm is being used, I would appreciate it if you could share that information.

@stinodego stinodego added needs triage Awaiting prioritization by a maintainer P-high Priority: high A-input-parsing Area: parsing input arguments P-medium Priority: medium and removed needs triage Awaiting prioritization by a maintainer P-high Priority: high labels Jan 13, 2024