
UTF8 string dtype #29183

Closed
rth opened this issue Oct 23, 2019 · 3 comments

rth commented Oct 23, 2019

As far as I understand, for strings pandas uses an object dtype with a Unicode representation where each character takes 4 bytes, similar to numpy's U dtype:

>>> import numpy as np
>>> import pandas as pd
>>> x = np.array(["ac", "bd"])
>>> x.nbytes / (len(x) * 2)   # bytes per char
4.0
>>> df = pd.DataFrame({'a': x})
>>> df.dtypes
a    object
dtype: object
>>> df['a'].values.nbytes / (len(x) * 2)  # bytes per char
4.0

Wouldn't it be possible to support UTF-8 encoded strings? That would decrease memory requirements by a factor of 2-4 for a number of languages. For instance, parquet and arrow seem to be doing that [1], [2].
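
A rough back-of-the-envelope comparison with just the stdlib (UTF-32 being what numpy's U dtype stores per code point):

>>> s = "hello world"            # ASCII
>>> len(s.encode("utf-8")), 4 * len(s)
(11, 44)
>>> s = "привет мир"             # Cyrillic: 2 bytes per char in UTF-8
>>> len(s.encode("utf-8")), 4 * len(s)
(19, 40)

So roughly 4x for ASCII text and about 2x for e.g. Cyrillic.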

Since each cell is a variable-length string represented as an object, using UTF-8 might be possible, unlike for fixed-length numpy strings (#5261)?

(Not sure if there were other discussions about this previously; "utf8", "string", and "dtype" are pretty noisy search keywords.)

rth commented Oct 23, 2019

https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/ is probably relevant.

Although, given the output below,

>>> x = np.array(['a very long string'])
>>> x.nbytes
72   # 4 bytes / char
>>> import sys
>>> sys.getsizeof("a very long string")
67   # 49 bytes overhead? + 1 byte / char
>>> df = pd.DataFrame({'a': x})
>>> df.memory_usage()
Index    128
a          8
dtype: int64
>>> df['a'].values.nbytes   # 8 bytes: just the pointer to the Python object
8

and given that it should take more than 8 bytes to store that 18-character string, ndarray.nbytes / DataFrame.memory_usage don't seem to work as expected with the object dtype, or am I missing something? So the memory usage per string I gave above is probably wrong.

To give more context: I'm loading data from parquet with pandas 0.25.2 and pyarrow 0.15, and once loaded it takes significantly more space in memory (mostly str columns) than what I would expect.

TomAugspurger commented

I believe that #8640 covers this issue.

Pandas 1.0 will have a StringDtype(), which currently uses the same implementation: an ndarray of Python strings.
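
For reference, a minimal sketch of that API (assuming pandas >= 1.0; note the dedicated NA scalar):

>>> s = pd.Series(["ac", "bd", None], dtype="string")
>>> s
0      ac
1      bd
2    <NA>
dtype: string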

As soon as possible, we'd like to change that implementation to use a native string implementation (Arrow perhaps).
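
To sketch what an Arrow-backed layout buys (assuming pyarrow is installed; exact nbytes may vary a bit across pyarrow versions and buffer padding):

>>> import pyarrow as pa
>>> arr = pa.array(["a very long string"])
>>> arr.type
DataType(string)
>>> arr.nbytes   # 18 bytes of UTF-8 data + int32 offsets; no per-object overhead
26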

> don't seem to work as expected with object dtype or am I missing something?

You need memory_usage(deep=True) to inspect the size of the Python objects contained within.
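
E.g., continuing the example above, the 8 bytes per cell become 8 bytes for the pointer plus the 67 bytes reported by sys.getsizeof:

>>> df.memory_usage(deep=True)
Index    128
a         75
dtype: int64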

rth commented Oct 23, 2019

Thanks @TomAugspurger, makes sense.

Closing this in favor of #8640 then.

rth closed this as completed Oct 23, 2019