ENH: make pandas.DataFrame.info() method able to display memory usage of each column #59690

Open
Gregory108 opened this issue Sep 2, 2024 · 8 comments
Labels: Enhancement, Needs Discussion, Output-Formatting

Gregory108 commented Sep 2, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The .info() method describes a DataFrame by each column's dtype and count of non-null values but, IMO, it misses an opportunity to be more valuable by also displaying the memory usage of each column.

Feature Description

I think thousands of hours of human time would be saved if this were a built-in feature with "memory_usage='by_column'" and "memory_usage='by_column_deep'" argument options.
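
For context, the per-column numbers such an option would surface are already available through DataFrame.memory_usage; the sketch below shows today's behaviour, while the by_column / by_column_deep values themselves are only proposed and do not exist in pandas:

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Available today: a Series of per-column byte counts (plus the index);
# deep=True measures the actual size of Python objects such as strings.
print(df.memory_usage(deep=True))

# Proposed (not implemented): fold that information into the .info() table, e.g.
# df.info(memory_usage="by_column")
# df.info(memory_usage="by_column_deep")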

Alternative Solutions

The alternative way to see all "technical" information in by-column form in one table is to create the following "Frankenstein":

import io
import sys

import pandas as pd


def better_info(df: pd.DataFrame) -> None:
    # Rough total size of the DataFrame object (shallow, not deep)
    print(f"{sys.getsizeof(df) / 1024} KB")

    # Capture the plain-text output of .info() and re-parse it into a table
    buffer = io.StringIO()
    df.info(buf=buffer)
    lines = buffer.getvalue().splitlines()

    # lines[3] holds the column headers, lines[5:-2] the per-column rows
    # (fragile: this relies on the current layout of the .info() output)
    info_table = (
        pd.DataFrame([x.split() for x in lines[5:-2]], columns=lines[3].split())
        .drop("Count", axis=1)
        .rename(columns={"Non-Null": "Non-Null Count"})
        .join(
            pd.DataFrame(
                [(col, df[col].memory_usage(deep=True)) for col in df.columns],
                columns=["Column", "Memory Usage (bytes)"],
            ).set_index("Column"),
            on="Column",
        )
        .drop(columns=["#"])
    )
    print(info_table)

Resulting in output that looks like the usual .info() table with an extra "Memory Usage (bytes)" column (screenshot omitted).

Additional Context

I searched for similar suggestions in the repo issues and did not find a duplicate.

Gregory108 added the Enhancement and Needs Triage labels on Sep 2, 2024

msid268 commented Sep 3, 2024

take

rhshadrach (Member) commented:

Thanks for the request. It seems to me there are lots of things one could add to DataFrame.info that may be useful for some users but not others. Instead of adding more and more features to this method over time (which seems unsustainable to me), it is preferable that users combine whatever summary information they want from the various methods using concat, e.g.

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
result = pd.concat(
    [
        df.memory_usage(index=False).rename("Memory Usage"),
        (~df.isnull()).sum().rename("Non-Null Count"),
        df.dtypes.rename("Dtype"),
    ],
    axis=1,
)
print(result)
#    Memory Usage  Non-Null Count  Dtype
# a            24               3  int64
# b            24               3  int64

rhshadrach added the Needs Discussion and Output-Formatting labels and removed the Needs Triage label on Sep 3, 2024
Gregory108 (Author) commented:

@rhshadrach what are examples of the "lots of things" one could add to DataFrame.info that also fall into the scope of this function, i.e. giving technical parameters of variables?

RaghavKhemka commented:

@rhshadrach what are examples of the "lots of things" one could add to DataFrame.info that also fall into the scope of this function, i.e. giving technical parameters of variables?

@Gregory108 @rhshadrach

I have some suggestions for the same:

  1. Null_percentage - This will show the % of null values.
  2. Range - This can show the range of a numeric column (its min and max values).
  3. Unique_counts - This will be most helpful for categorical columns.
  4. Memory usage.
  5. Sample - This will show the first three values (maybe unique) of each column, joined with commas.

I suggest we add another parameter to the info function (maybe something like 'more_details') which would be Boolean and only work when verbose=True. When 'more_details' is enabled, it would calculate and show all the other values along with the existing ones; a rough sketch of how these could be computed follows at the end of this comment.

Please tell me what you think; I am ready to take up the task if we proceed.
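
A rough, purely illustrative sketch of how the suggested columns could be assembled today from public pandas methods (the 'more_details' flag itself is only a proposal; the function and column names below are made up):

import pandas as pd

def extended_info(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative only: one row per column, one column per suggested metric.
    out = pd.DataFrame(index=df.columns)
    out["Dtype"] = df.dtypes
    out["Non-Null Count"] = df.notna().sum()
    out["Null %"] = df.isna().mean() * 100
    out["Min"] = df.min(numeric_only=True)  # range: numeric columns only
    out["Max"] = df.max(numeric_only=True)
    out["Unique"] = df.nunique()
    out["Memory (bytes)"] = df.memory_usage(index=False, deep=True)
    # First three values of each column, joined with commas
    out["Sample"] = [", ".join(map(str, df[c].head(3))) for c in df.columns]
    return out

print(extended_info(pd.DataFrame({"a": [1, 2, None], "b": ["x", "y", "x"]})))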

Gregory108 commented Sep 4, 2024

@RaghavKhemka
In my opinion:

  1. "Null_percentage" is more valuable than "Null count", but
  • adding a second metric about the same property is superficial
  • replacing current "Null count" is not viable, as it undermines compatibility
  • soo, current "Null count" is "good enough" status quo
  1. "Range" is about properties of data stored - and, thus, is beyond the scope of "info"
  2. Same, not technical, but about data content itself
  3. Agree (my proposition=)
  4. Same as 2-3 - out of scope of technical characteristics of the table or its columns

For a description of the data (i.e. points 2, 3, 5) there are other functions: DataFrame.describe() and DataFrame.head(3).

Of those, I think only "memory usage" belongs in .info().

rhshadrach (Member) commented:

that also fall into the scope of this function, i.e. giving technical parameters of variables?

What is the definition of "technical", and where is it documented that DataFrame.info is only to contain "technical" information?

3. Same - not technical, but about the data content itself

How are null counts not about the data content itself?

For a description of data (i.e. points 2,3,5) there are other functions DataFrame.describe() and DataFrame.head(3).

Why is it not the case that DataFrame.memory_usage() belongs in the umbrella of "there are other functions" that users can use?

msid268 commented Sep 5, 2024

Thank you, everyone, for your valuable input. After carefully considering the pros and cons, I think we should proceed with adding the memory usage feature to the .info() method, specifically the "memory_usage='by_column'" and "memory_usage='by_column_deep'" options.

Justification:

  • Technical Alignment: Memory usage is already a part of the .info() method through the memory_usage=True and memory_usage='deep' options (a short illustration follows this list). Enhancing these options to display memory usage by column is a natural extension that remains within the technical scope of .info(). This improvement will allow users to gain a more granular and comprehensive understanding of their DataFrames without deviating from the method's intended purpose.

  • User Benefit: This enhancement will save users significant time by integrating a commonly needed piece of information directly into .info(). It reduces the need for custom functions and makes the user experience more streamlined and efficient, especially when dealing with large datasets.

  • Avoiding Scope Creep: By focusing only on memory usage and not including other data content metrics (like ranges or unique counts), we can avoid bloating the .info() method or blurring its scope.
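
For reference, a minimal illustration of the existing options mentioned under "Technical Alignment": today both memory_usage=True and memory_usage='deep' report only a single aggregate line at the bottom of the output (exact byte counts vary by pandas version and platform):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

df.info(memory_usage="deep")
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 3 entries, 0 to 2
# Data columns (total 2 columns):
#  #   Column  Non-Null Count  Dtype
# ---  ------  --------------  -----
#  0   a       3 non-null      int64
#  1   b       3 non-null      object
# dtypes: int64(1), object(1)
# memory usage: ... bytes   <- one aggregate total, not per column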

Gregory108 commented Sep 5, 2024

Please answer the question about your earlier concern that there might be "lots of other (technical) variables" to add. Maybe we are missing some other technical parameters suitable for .info().

What is the definition of "technical", and where is it documented that DataFrame.info is only to contain "technical" information?

I am not an authority on defining the terminology, yet I'd say that "technical" information is about "how" the data is stored rather than "what" data is stored.

In my opinion, "technicality" is implied by the existence of .describe() (which describes column content) and by the .info() description in the documentation:

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

So I fully expected to get by-column information about memory usage.

How are null counts not about the data content itself?

I agree that it is about content, but it is already there. I can see only a corner-case justification: if all values are null, the user might consider dropping the column to reduce memory usage.
