UTF8 string dtype #29183
https://arrow.apache.org/blog/2019/02/05/python-string-memory-0.12/ is probably relevant. Although, given the following, it should take more than 8 bytes to store that 18-char string:

```python
>>> x = np.array(['a very long string'])
>>> x.nbytes
72  # 4 bytes / char
>>> import sys
>>> sys.getsizeof("a very long string")
67  # 49 bytes overhead? + 1 byte / char
>>> df = pd.DataFrame({'a': x})
>>> df.memory_usage()
Index    128
a          8
dtype: int64
>>> df['a'].values.nbytes
8
```

To give more context, I'm loading data from parquet with pandas 0.25.2 and parquet 0.15, and once loaded it takes significantly more space in memory (mostly str columns) than what I would expect.
I believe that #8640 covers this issue. Pandas 1.0 will have a dedicated string dtype. As soon as possible, we'd like to change that implementation to use a native string implementation (Arrow perhaps).
You need
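As a hedged illustration (assuming pandas >= 1.0, not something stated in the thread), opting into that dedicated string dtype looks like this:

```python
import pandas as pd

# Opt into the dedicated string dtype introduced in pandas 1.0
# (still backed by Python str objects; an Arrow-backed implementation
# is the longer-term goal mentioned above).
s = pd.Series(["a very long string", "short", None], dtype="string")

print(s.dtype)  # string
print(s)        # missing values show up as <NA>
```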
Thanks @TomAugspurger, makes sense. Closing this in favor of #8640 then.
As far as I understand, for strings pandas uses an `object` dtype with a Unicode representation where each character takes 4 bytes, similarly to numpy's `U` dtype. Wouldn't it be possible to support UTF-8 encoded strings? That would decrease the memory requirements by a factor of 2-4 for a number of languages. For instance, parquet and arrow seem to be doing that [1], [2].
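To make that factor of 2-4 concrete, here is a small sketch of my own (not from the issue) comparing numpy's fixed-width UCS-4 representation with a UTF-8 encoding of the same text:

```python
import numpy as np

text = "a very long string"      # 18 ASCII characters

# numpy's fixed-width 'U' dtype stores 4 bytes (UCS-4) per character.
arr = np.array([text])
print(arr.dtype, arr.nbytes)     # <U18 72

# The same text encoded as UTF-8 needs 1 byte per ASCII character
# (and up to 4 bytes for other code points).
print(len(text.encode("utf-8"))) # 18
```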
Since each cell is a variable-length string represented as an object, using UTF-8 might be possible, unlike for fixed-length numpy strings (#5261)?
(Not sure if there were other discussions about this previously; "utf8", "string", and "dtype" are pretty noisy search keywords.)
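For comparison, Arrow's variable-length string layout stores UTF-8 data plus int32 offsets; a sketch of my own, assuming pyarrow is installed:

```python
import pyarrow as pa

arr = pa.array(["a very long string", "short"])

print(arr.type)    # string (variable-length, UTF-8 encoded)
# nbytes counts the int32 offsets buffer plus the UTF-8 data buffer,
# roughly 1 byte per ASCII character instead of 4.
print(arr.nbytes)
```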