DOC: Categorical gotchas says memory usage is O(nm), when it is actually O(n+m) #17705

Closed
@RauliRuohonen

Description

Code Sample, a copy-pastable example if possible

import pandas as pd

def make_series(categories, repeats):
    categories = [str(i) for i in range(categories)]
    return pd.Series(categories*repeats).astype('category')

def series_info(series):
    true_size = series.nbytes
    n = len(series.cat.categories)
    m = len(series.cat.codes)
    expected_size = n*8+m  # Note: not n*m*constant
    print('Number of categories (n): %d, length of data (m): %d, bytes: %d, '
          '8n+m: %d, nm: %d' % (n, m, true_size, expected_size, n*m))

series_info(make_series(123, 456))

# Output:
#  Number of categories (n): 123, length of data (m): 56088, bytes: 57072, 8n+m: 57072, nm: 6898824
#
# Note that 8n+m matches true size as it should. nm is much larger.

Problem description

The documentation says:

The memory usage of a Categorical is proportional to the number of categories times the length of the data.

If there are n categories and the length of the data is m, this claims that memory usage is O(nm). As the script above shows, however, the true usage is O(n+m): the categories are stored once (8 bytes per object pointer, hence the 8n term) and each element is an integer code into that list (1 byte each here, hence the m term). The claimed O(nm) would only make sense if the implementation stored the categories using one-hot encoding, which would be strange to say the least.
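A minimal sketch of the actual layout (one codes entry per element plus a single copy of the categories), assuming a 64-bit build of pandas where object pointers are 8 bytes:

```python
import pandas as pd

# Small categorical: 3 distinct categories repeated over 12 elements.
s = pd.Series(list('abc') * 4).astype('category')

# The codes array has one entry per element (m entries), stored as a
# small integer dtype -- int8 here, since there are < 128 categories.
print(s.cat.codes.dtype)      # int8
print(len(s.cat.codes))       # 12 (m)

# The categories themselves are stored only once (n entries), not per row.
print(len(s.cat.categories))  # 3 (n)

# So the total is 8n + m bytes, not n*m.
print(s.nbytes == 8 * 3 + 12)
```

With more than 127 categories the codes dtype widens (int16, int32, ...), so the general bound is O(n + m) with a small constant on m, never O(nm).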
