DataFrame construction from numba typed list #27539

Open
@leohaim

Code Sample, a copy-pastable example if possible

import pandas as pd
import numba

a = numba.typed.List()
a.append(1)
a.append(2)

pd.DataFrame(a)

raises with

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-0844eae3ab56> in <module>
      6 a.append(2)
      7
----> 8 pd.DataFrame(a)

~/sandbox/pandas/pandas/core/frame.py in __init__(self, data, index, columns, dtype, copy)
    458                     mgr = arrays_to_mgr(arrays, columns, index, columns, dtype=dtype)
    459                 else:
--> 460                     mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
    461             else:
    462                 mgr = init_dict({}, index, columns, dtype=dtype)

~/sandbox/pandas/pandas/core/internals/construction.py in init_ndarray(values, index, columns, dtype, copy)
    158     # by definition an array here
    159     # the dtypes will be coerced to a single dtype
--> 160     values = prep_ndarray(values, copy=copy)
    161
    162     if dtype is not None:

~/sandbox/pandas/pandas/core/internals/construction.py in prep_ndarray(values, copy)
    279             values = values.copy()
    280
--> 281     if values.ndim == 1:
    282         values = values.reshape((values.shape[0], 1))
    283     elif values.ndim != 2:

AttributeError: 'List' object has no attribute 'ndim'

Problem description

Since version 0.45, Numba provides a typed list class (numba.typed.List) that allows fast manipulation of lists in compiled code.
Constructing a pandas DataFrame from such a typed list is not straightforward, however.

First, a typed list cannot be passed directly to the DataFrame constructor; it has to be converted to a Python list or a numpy array first.
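A minimal sketch of that workaround, for reference. The helper name `frame_from_sequence` is hypothetical and not part of pandas or numba; the `try`/`except` fallback to a plain `list` is only there so the snippet also runs where numba is not installed.

```python
import numpy as np
import pandas as pd

try:
    from numba.typed import List  # available in numba >= 0.45
except ImportError:
    List = list  # fallback so the sketch runs without numba


def frame_from_sequence(seq, dtype=None):
    """Hypothetical helper: copy any iterable into an ndarray, then build a DataFrame."""
    # list(seq) walks the sequence in the interpreter; this is the slow step
    values = np.asarray(list(seq), dtype=dtype)
    return pd.DataFrame(values)


a = List()
a.append(1)
a.append(2)

df = frame_from_sequence(a)
print(df)
```

This avoids the AttributeError because the constructor only ever sees an ndarray, but it does nothing about the conversion cost.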

Second, the conversion is slow. For a typed list of 100000 float32 values, it takes about one second on my machine to convert the typed list into a Python list and pass that to pandas. Converting the typed list into a numpy array instead takes almost 2 seconds.

By contrast, constructing a DataFrame from a conventional list or numpy array of the same size takes only about 0.01 seconds.
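The comparison can be reproduced with a rough timing script along these lines. This is a sketch, not the script used for the numbers above: the `time_call` helper and the labels are made up for illustration, the typed list is filled with Python floats rather than float32 values, and the numba part is skipped when numba is not installed.

```python
import time

import numpy as np
import pandas as pd

N = 100_000  # size quoted in the report


def time_call(func):
    """Return the wall-clock seconds taken by a single call to func."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


values = np.arange(N, dtype=np.float32)
as_list = values.tolist()  # plain Python floats

# Baselines: conventional list and ndarray inputs
timings = {
    "DataFrame(list)": time_call(lambda: pd.DataFrame(as_list)),
    "DataFrame(ndarray)": time_call(lambda: pd.DataFrame(values)),
}

try:
    from numba.typed import List  # optional; needs numba >= 0.45

    typed = List()
    for v in as_list:
        typed.append(v)
    # Both conversions iterate over the typed list from the interpreter
    timings["DataFrame(list(typed))"] = time_call(lambda: pd.DataFrame(list(typed)))
    timings["DataFrame(np.array(typed))"] = time_call(lambda: pd.DataFrame(np.array(typed)))
except ImportError:
    pass  # numba not installed; only the baselines are reported

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Absolute numbers will vary with the machine and the numba and pandas versions.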

I wonder whether it is possible to write a more efficient DataFrame constructor that accepts numba typed lists as input.
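One direction, sketched here as an assumption rather than a proposal for pandas internals, is to do the copy into an ndarray inside compiled code and hand the result to the existing constructor. The names `typed_list_to_array` and `frame_from_typed_list` are hypothetical, the element type is assumed to be float64, and whether this is actually faster has not been measured here. The fallback definitions let the snippet run as plain Python without numba.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
    from numba.typed import List
except ImportError:  # fallback so the sketch still runs without numba
    def njit(func):
        return func

    List = list


@njit
def typed_list_to_array(lst):
    # Copy element by element; with numba this loop runs in compiled code
    out = np.empty(len(lst), dtype=np.float64)
    for i in range(len(lst)):
        out[i] = lst[i]
    return out


def frame_from_typed_list(lst):
    """Hypothetical wrapper: convert in compiled code, then use the normal constructor."""
    return pd.DataFrame(typed_list_to_array(lst))


a = List()
a.append(1.0)
a.append(2.5)
a.append(4.0)

df = frame_from_typed_list(a)
print(df)
```

The first call includes numba's compilation time, so any timing of this approach should be taken on a second call.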

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

Labels: Bug, Constructors (Series/DataFrame/Index/pd.array Constructors), numba (numba-accelerated operations)