ENH: add to_records() option to output NumPy string dtypes, not objects #18146

Closed
@jzwinck

Description

DataFrame.to_records() outputs string columns with the object dtype, which can be inefficient (e.g. for short strings of similar length, or when saving with np.save(), which then has to pickle the objects). I wrote the following function to work around this:

def to_records_plain(df):
    """Return a NumPy recarray like df.to_records() but with strings stored as bytes, not objects.
    This gives more compact storage and does not require pickling objects when saving to disk.
    Assumes all object arrays in df are strings.

    >>> df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 0.9], 'c': ['x', 'yyy']})
    >>> to_records_plain(df)
    rec.array([(0, 1,  0.5, b'x'), (1, 2,  0.9, b'yyy')], 
              dtype=[('index', '<i8'), ('a', '<i8'), ('b', '<f8'), ('c', 'S3')])
    """
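    records = df.to_records()
    # Start from the existing field descriptions and swap out the object fields.
    descr = records.dtype.descr
    for ii, (name, dtype) in enumerate(descr):
        if dtype == '|O':
            # The longest string sets the field width. The width is counted in
            # characters, and casting to 'S' only succeeds for ASCII text.
            length = df[name].str.len().max()
            descr[ii] = (name, 'S{}'.format(length))

    # Cast each object field to its fixed-width bytes dtype.
    return records.astype(descr)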
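
To illustrate the storage point, here is a minimal sketch (not part of the original report) that inlines the same conversion for one string column and checks whether np.save() accepts each array with allow_pickle=False. The helper name can_save_without_pickle is made up for this example, and the string column is forced to object dtype so the behaviour does not depend on the pandas version.

```python
import io

import numpy as np
import pandas as pd

# Force object dtype so the string column matches the case described above.
df = pd.DataFrame({"a": [1, 2], "c": pd.Series(["x", "yyy"], dtype=object)})

obj_records = df.to_records(index=False)

# Same idea as to_records_plain, inlined for the single string column "c".
width = int(df["c"].str.len().max())
descr = [
    (name, "S{}".format(width)) if name == "c" else (name, dt)
    for name, dt in obj_records.dtype.descr
]
plain_records = obj_records.astype(descr)


def can_save_without_pickle(arr):
    """Hypothetical helper: True if np.save works with allow_pickle=False."""
    try:
        np.save(io.BytesIO(), arr, allow_pickle=False)
    except ValueError:
        # NumPy refuses to write object arrays when pickling is disabled.
        return False
    return True


print(obj_records.dtype.hasobject, can_save_without_pickle(obj_records))
print(plain_records.dtype.hasobject, can_save_without_pickle(plain_records))
```

The object-dtype records need pickling to be saved, while the fixed-width version is written as plain bytes.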

I suggest exposing something like this as an option in DataFrame.to_records(). An option to convert to Unicode ('U') would be good too, since NumPy's 'S' is effectively bytes in Python 3.
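
As a rough sketch of what such an option might look like, the function below generalises to_records_plain with a kind argument that selects bytes ('S') or Unicode ('U'). The names to_records_fixed and kind are hypothetical and are not pandas API. The sketch also assumes every object column holds only strings with no missing values, and it drops the index to keep the example short.

```python
import numpy as np
import pandas as pd


def to_records_fixed(df, kind="U"):
    """Hypothetical helper: to_records() with fixed-width string fields.

    kind="S" stores bytes (ASCII text only), kind="U" stores Unicode.
    Assumes all object columns contain strings and no missing values.
    """
    if kind not in ("S", "U"):
        raise ValueError("kind must be 'S' or 'U'")
    records = df.to_records(index=False)
    descr = records.dtype.descr
    for i, (name, dtype) in enumerate(descr):
        if dtype == "|O":
            # Field width is the longest string in the column, in characters.
            width = int(df[name].str.len().max())
            descr[i] = (name, "{}{}".format(kind, width))
    return records.astype(descr)


# The string column is created with object dtype to match the case above.
df = pd.DataFrame({"a": [1, 2], "c": pd.Series(["x", "yyy"], dtype=object)})
rec_u = to_records_fixed(df, kind="U")
rec_s = to_records_fixed(df, kind="S")
print(rec_u.dtype)
print(rec_s.dtype)
```

NumPy stores 'U' fields with four bytes per character, so the Unicode form trades space for the ability to hold non-ASCII text. Later pandas releases (0.24 onward) added column_dtypes and index_dtypes parameters to to_records(), which may cover the same need without a helper like this.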
