Code Sample, a copy-pastable example if possible
```python
import pandas as pd

def make_series(categories, repeats):
    categories = [str(i) for i in range(categories)]
    return pd.Series(categories * repeats).astype('category')

def series_info(series):
    true_size = series.nbytes
    n = len(series.cat.categories)
    m = len(series.cat.codes)
    expected_size = n * 8 + m  # Note: not n*m*constant
    print('Number of categories (n): %d, length of data (m): %d, bytes: %d, '
          '8n+m: %d, nm: %d' % (n, m, true_size, expected_size, n * m))

series_info(make_series(123, 456))
# Output:
# Number of categories (n): 123, length of data (m): 56088, bytes: 57072, 8n+m: 57072, nm: 6898824
#
# Note that 8n+m matches the true size, as it should. nm is much larger.
```
Problem description
The documentation says:
> The memory usage of a Categorical is proportional to the number of categories times the length of the data.
If there are n categories and the data has length m, this claims the memory usage is O(nm). As the script above shows, however, the true usage is O(n + m): the n categories are stored once, and each element is represented by a single integer code. The claimed O(nm) would only hold if the implementation stored the categories using one-hot encoding, which would be strange to say the least.
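To make the O(n + m) breakdown concrete, the sketch below splits `Series.nbytes` into its two components: the categories index (stored once, 8 bytes per object pointer on a 64-bit build) and the codes array (one integer per element; pandas picks `int8` here since there are at most 127 categories). The exact byte counts assume a 64-bit platform and default pandas dtype selection:

```python
import pandas as pd

s = pd.Series([str(i) for i in range(123)] * 456).astype('category')

# The categories are stored once, as an object-dtype index:
# 8 bytes per pointer on a 64-bit build -> 123 * 8 = 984 bytes.
categories_bytes = s.cat.categories.nbytes

# Each element is an integer code into that index; with <= 127
# categories pandas uses int8, i.e. 1 byte per element -> 56088 bytes.
codes_bytes = s.cat.codes.nbytes

print(s.cat.codes.dtype)                            # int8
print(categories_bytes + codes_bytes == s.nbytes)   # True
```

So the total is 8n + m bytes in this case, consistent with the script's output, and in general O(n + m) rather than O(nm).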