DataFrame construction from numpy arrays and polars.datatypes.Array schema #15745

Open
2 tasks done
dpinol opened this issue Apr 18, 2024 · 2 comments
Labels
A-dtype-list/array Area: list/array data type A-interop-numpy Area: interoperability with NumPy bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@dpinol
Contributor

dpinol commented Apr 18, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

pl.DataFrame({"A": [np.array([1], dtype=np.int64)]}, {"A": pl.Array(pl.Int64, 1)})

Log output

Traceback (most recent call last):
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/IPython/core/interactiveshell.py", line 3577, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-11-2a88ebee4fc0>", line 1, in <module>
    pl.DataFrame({"A": [np.array([1], dtype=np.int64)]}, {"A": pl.datatypes.Array(pl.datatypes.Int64, 1)})
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/polars/dataframe/frame.py", line 367, in __init__
    self._df = dict_to_pydf(
               ^^^^^^^^^^^^^
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 137, in dict_to_pydf
    for s in _expand_dict_values(
             ^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/polars/_utils/construction/dataframe.py", line 362, in _expand_dict_values
    updated_data[name] = pl.Series(
                         ^^^^^^^^^^
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/polars/series/series.py", line 311, in __init__
    self._s = sequence_to_pyseries(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/polars/_utils/construction/series.py", line 136, in sequence_to_pyseries
    pyseries = _construct_series_with_fallbacks(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dani/.local/share/virtualenvs/sc-api-DipHMMiG/lib/python3.11/site-packages/polars/_utils/construction/series.py", line 327, in _construct_series_with_fallbacks
    return constructor(name, values, strict)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: unexpected value while building Series of type Array(Int64, 1)
Hint: Try setting `strict=False` to allow passing data with mixed types.

Issue description

It's not possible to create columns of type Array by passing numpy arrays.

What works

with polars List

pl.DataFrame({"A": [np.array([1], dtype=np.int64)]}, {"A": pl.List(pl.Int64)})

with python array and polars Array

pl.DataFrame([[[1]]], {"A": pl.Array(pl.Int64, 1)}, orient="row")

What doesn't work

not specifying the type

pl.DataFrame([[np.array([1], dtype=np.int64)]])
It assigns polars List instead of Array. Why?

Row orientation

pl.DataFrame([[np.array([1], dtype=np.int64)]], {"A": pl.List(pl.Int64)}, orient="row")
You get nulls (also with pl.Array)

┌───────────────┐
│ A             │
│ ---           │
│ array[i64, 1] │
╞═══════════════╡
│ null          │
└───────────────┘

Expected behavior

1 Without schema, imho it should assign polars Array schema

pl.DataFrame({"A": [np.array([1], dtype=np.int64)]})

2 With schema, it should create the same result as passing a Python list

pl.DataFrame({"A": [np.array([1], dtype=np.int64)]}, {"A": pl.datatypes.Array(pl.datatypes.Int64, 1)}) == pl.DataFrame({"A": [[1]]}, {"A": pl.datatypes.Array(pl.datatypes.Int64, 1)})

Installed versions

Polars:               0.20.21
Index type:           UInt32
Platform:             Linux-6.5.0-9-generic-x86_64-with-glibc2.38
Python:               3.11.5 (main, Sep 20 2023, 13:23:03) [GCC 12.2.0]
@dpinol dpinol added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 18, 2024
@dpinol dpinol changed the title DataFrame construction from numpy arrays setting datatype Array DataFrame construction from numpy arrays and polars.datatypes.Array schema Apr 18, 2024
@deanm0000
Collaborator

deanm0000 commented Apr 18, 2024

Array, or FixedSizeList in Arrow terms, doesn't yet have all the methods and functions that List does (hopefully that "yet" holds). It was added later, so List was already the default, and changing the default would be a breaking change. Those two points are the answer to "why" List and not Array.

1 Without schema, imho it should assign polars Array schema

I, frankly, don't think this makes sense in the context of a 1d np array. A 1d np array should become a column of whatever numeric type it holds, not a nested type. For 2d np arrays passed to the DataFrame constructor, it makes more sense, to me at least, to return a df of the same shape. I think the issue is that I don't think you'd make a list of numpy arrays naturally, instead, you'd have a 2d np array. If for some reason you do have a list of np arrays, then wrap the list in np.vstack(list_of_np_arrays).
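
A minimal sketch of that np.vstack step, using a made-up `list_of_np_arrays` of equal-length 1d arrays (the values are only for illustration):

```python
import numpy as np

# Hypothetical input: a list of equal-length 1-D arrays, one per row.
list_of_np_arrays = [
    np.array([1, 2], dtype=np.int64),
    np.array([2, 3], dtype=np.int64),
]

# np.vstack stacks the 1-D arrays row-wise into a single 2-D array.
stacked = np.vstack(list_of_np_arrays)

print(stacked.shape)  # (2, 2)
print(stacked.dtype)  # int64
```

This only works when all the arrays have the same length, which is also what a fixed-width Array column requires.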

That said, you can get (nearly) what you want by passing a 2d np array to the Series constructor. For example:

pl.Series(np.array([[1,2],[2,3]]), dtype=pl.Array(pl.Int64,2))
# or
pl.Series(np.array([1]).reshape(1,1), dtype=pl.Array(pl.Int64,1))
# or 
pl.Series(np.array([[1]]), dtype=pl.Array(pl.Int64,1))

which can then be the input to a DataFrame constructor

pl.DataFrame({
    "A": pl.Series(np.array([1]).reshape(1,1), dtype=pl.Array(pl.Int64,1))
})

As an aside, notice that you don't need to go through pl.datatypes; the dtypes are all available at the top level of pl.

2 With schema, it should create the same as passing python

Confirmed bug:

To drill down...

Since this works:

pl.Series("A",[np.array([1], dtype=np.int64)]).cast(pl.Array(pl.Int64,1))

then so should

pl.Series("A",[np.array([1], dtype=np.int64)], dtype=pl.Array(pl.Int64,1))

but it doesn't.

I think addressing that would make your example work

@deanm0000 deanm0000 added P-low Priority: low A-dtype-list/array Area: list/array data type and removed needs triage Awaiting prioritization by a maintainer labels Apr 18, 2024
@stinodego stinodego added the A-interop-numpy Area: interoperability with NumPy label May 22, 2024
@dpinol
Contributor Author

dpinol commented Jul 19, 2024

I think the issue is that I don't think you'd make a list of numpy arrays naturally, instead, you'd have a 2d np array

This would work if there's a single column, but would not work if, e.g., you have an array column and a string column.

Update in polars 1.2.1

Single nested column

This now works

df=pl.DataFrame({"A": [np.array([1])]}, {"A": pl.Array(pl.Int64, 1)})

But it's not possible to load the data back into a DataFrame from the exported numpy array

pl.DataFrame(df.to_numpy(), {"A": pl.Array(pl.Int64, 1)}, orient="row")

ComputeError: cannot cast 'Object' type

unless the input is a python list

pl.DataFrame(df.to_numpy().tolist(), {"A": pl.Array(pl.Int64, 1)}, orient="row")

┌───────────────┐
│ A             │
│ ---           │
│ array[i64, 1] │
╞═══════════════╡
│ [1]           │
└───────────────┘

1 nested column and a scalar one

In this case, the nested column is filled with nulls even with strict=True

pl.DataFrame([[np.array([1]), 3], [np.array([4]), 6]], schema={"A": pl.List(pl.Int64), "B":pl.Int64}, nan_to_null=True, orient="row", strict=True)
Out[33]: 
shape: (2, 2)
┌───────────┬─────┐
│ A         ┆ B   │
│ ---       ┆ --- │
│ list[i64] ┆ i64 │
╞═══════════╪═════╡
│ null      ┆ 3   │
│ null      ┆ 6   │
└───────────┴─────┘

Series from numpy

This now works

pl.Series("A",[np.array([1], dtype=np.int64)], dtype=pl.Array(pl.Int64,1))

but this doesn't

pl.Series("A",[np.array([1], dtype=np.object_)], dtype=pl.Array(pl.Int64,1))

It would be nice to have this working, since when I slice an np.object_ numpy array by columns, I always get the np.object_ dtype.
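
To illustrate where the object dtype comes from, here is a numpy-only sketch with a made-up mixed `table`. Converting the slice with astype before handing it to Polars might sidestep the failure, but that part is an assumption and is not checked against Polars here.

```python
import numpy as np

# Hypothetical mixed table: an integer column next to a string column.
table = np.array([[1, "x"], [4, "y"]], dtype=object)

# Slicing a column out of an object array keeps the object dtype.
column = table[:, 0]
# An explicit conversion recovers a concrete integer dtype.
typed = column.astype(np.int64)

print(column.dtype)  # object
print(typed.dtype)   # int64
```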

Projects
Status: Ready
Development

No branches or pull requests

3 participants