Redesign DataFrame structure #815

akharche · 2020-04-21T16:47:35Z

Extension for #801

Implements a new DataFrame structure that stores column data in lists grouped by type, instead of a tuple with one entry per column.
Improves df.count() codegen for testing.

Example:

df = pd.DataFrame({'A': [1,2,3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

df._columns:
(['A', 'B', 'C', 'D'],)

df._data (one list per column type; 'A' and 'C' share the int64 list):
([array([1, 2, 3], dtype=int64), array([4, 5, 6], dtype=int64)], [array([0.5, 0.6, 0.7])], [array(['a', 'b', 'c'], dtype=object)])
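The layout above can be sketched in plain Python. This is only an illustration of the grouping idea, not the SDC implementation: the function name build_df_layout is hypothetical, plain lists stand in for NumPy arrays, and the Python type of the first value stands in for the column dtype (so columns are assumed to be non-empty).

```python
def build_df_layout(columns):
    """Group columns into per-type lists and record where each column lands.

    `columns` maps column name -> non-empty list of values. The Python type
    of the first value is used as a stand-in for the column dtype.
    """
    type_ids = {}      # type name -> position of that type's list in `data`
    data = []          # one list of columns per distinct type
    df_structure = {}  # column name -> (type_id, index within that type's list)

    for col_name, values in columns.items():
        col_typ = type(values[0]).__name__
        if col_typ not in type_ids:
            # First column of a new type: open a new per-type list
            type_ids[col_typ] = len(data)
            data.append([])
        type_id = type_ids[col_typ]
        df_structure[col_name] = (type_id, len(data[type_id]))
        data[type_id].append(values)

    # Column names are kept as a 1-tuple holding a single list, as in the example
    return (list(columns),), tuple(data), df_structure


names, data, df_structure = build_df_layout(
    {'A': [1, 2, 3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})
print(names)
print(data)
print(df_structure)
```

Here df_structure maps 'C' to (0, 1): type list 0 (integers), second column in that list. The first column of each type always gets index 0.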

Reproduce:

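import pandas as pd
import sdc  # Intel SDC; assumed to be needed so Numba can compile pandas code
from numba import njit


@njit
def run_df():
    df = pd.DataFrame({'A': [1,2,3], 'B': [.5, .6, .7], 'C': [4, 5, 6], 'D': ['a', 'b', 'c']})

    # Internal fields: column names and the per-type data lists
    print(df._columns)
    print(df._data)

    return df.count()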

AlexanderKalistratov · 2020-04-21T20:00:38Z

sdc/hiframes/pd_dataframe_ext.py

+        if col_typ not in data_typs_map:
+            data_typs_map[col_typ] = (type_id, [col_id])
+            # The first column in each type always has 0 index
+            df_structure[col_name] = (type_id, 0)


Could we use a named tuple here?
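A minimal sketch of what a named tuple could look like for the (type_id, index) pairs stored in df_structure. The name ColumnLoc and its field names are hypothetical and do not come from the PR, which stores plain tuples.

```python
from collections import namedtuple

# Hypothetical name and fields; the PR itself stores plain (type_id, index) tuples.
ColumnLoc = namedtuple('ColumnLoc', ['type_id', 'col_id'])

df_structure = {
    'A': ColumnLoc(type_id=0, col_id=0),
    'C': ColumnLoc(type_id=0, col_id=1),
}

loc = df_structure['C']
print(loc.type_id, loc.col_id)

# A named tuple still unpacks and compares like the plain tuple it replaces.
type_id, col_id = loc
```

Named fields make lookups such as loc.col_id self-describing, while existing code that indexes or unpacks the pair keeps working.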

AlexanderKalistratov · 2020-04-21T20:28:35Z

sdc/hiframes/pd_dataframe_type.py

+        self.df_structure = df_structure
        super(DataFrameType, self).__init__(
-            name="dataframe({}, {}, {}, {})".format(data, index, columns, has_parent))
+            name="dataframe({}, {}, {}, {}, {})".format(data, index, columns, has_parent, df_structure))


Do we really want the structure to be part of the type name?

AlexanderKalistratov · 2020-04-21T20:29:45Z

sdc/hiframes/pd_dataframe_type.py

+            ('data', types.Tuple([types.List(typ) for typ in df_types])),
            ('index', fe_type.index),
-            ('columns', types.UniTuple(string_type, n_cols)),
+            ('columns', types.UniTuple(types.List(string_type), 1)),


Why not just a list?

AlexanderKalistratov · 2020-04-21T20:30:06Z

sdc/hiframes/pd_dataframe_type.py

-            ('columns', types.UniTuple(string_type, n_cols)),
+            ('columns', types.UniTuple(types.List(string_type), 1)),
            ('parent', types.pyobject),
+            ('df_structure', types.pyobject),


Why do we need it here?

akharche · 2020-04-23T14:45:05Z

Duplicate of #817

Redesign DataFrame structure

16570a0

akharche added the Ready for Review label Apr 21, 2020

akharche requested review from AlexanderKalistratov, kozlov-alexey and densmirn April 21, 2020 16:47

AlexanderKalistratov reviewed Apr 21, 2020

akharche closed this Apr 23, 2020

akharche deleted the change_df_structure branch April 23, 2020 14:45
