
Reduce amount of data for DataFrames by sampling #21011

Open
uprokevin opened this issue Jun 10, 2023 · 4 comments

Comments

@uprokevin

uprokevin commented Jun 10, 2023

In the Variable Explorer side panel, clicking on a large dataframe or a large dictionary makes the panel and Spyder freeze.

Suggested workaround:

    nmax = 50000
    # On click, visualize a sample instead of the full dataframe:
    dfbig.sample(n=min(len(dfbig), nmax), replace=False)

Suppose len(dfbig) is 1 million: the visualizer then samples only nmax = 50000 rows, and Spyder does not crash.

Same idea for lists:

    listbig[:nmax]

Reference:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html
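As a minimal sketch of the workaround described above (the helper name `sample_for_view` and the cap value are illustrative, not part of Spyder's API):

```python
import numpy as np
import pandas as pd

NMAX = 50_000  # cap on rows sent to the viewer, per the suggestion above


def sample_for_view(df: pd.DataFrame, nmax: int = NMAX) -> pd.DataFrame:
    """Return at most `nmax` rows, sampled without replacement.

    Small frames are passed through unchanged, so only oversized
    dataframes pay the sampling cost.
    """
    if len(df) <= nmax:
        return df
    return df.sample(n=nmax, replace=False)


# Example: a 1-million-row frame is reduced to 50,000 rows.
big = pd.DataFrame({"x": np.arange(1_000_000)})
view = sample_for_view(big)
print(len(view))  # 50000
```

Sampling without replacement keeps every displayed row unique, so patterns and malformed columns remain visible in the subsample.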

Thanks !

@ccordoba12
Member

Hey @uprokevin, thanks for reporting. Could you post a video or animated gif that shows Spyder freezing after opening a big dataframe or dictionary?

I just tested with a one million row/single column dataframe, and Spyder didn't freeze for me.

@uprokevin
Author

Does it handle visualization of 10 million rows with 560 string columns?

I believe sub-sampling is a simple and efficient way to reduce the visualization load.

@ccordoba12
Member

Does it handle visualization of 10 million rows with 560 string columns?

That depends on the amount of memory available in your computer, not on Spyder. That's because we need to make a copy of the dataframe in the IPython console kernel to send and display it in Spyder (which runs in a different process).

I believe sub-sampling is a simple and efficient way to reduce the visualization load.

Sure, this is a good idea too. Thanks for the suggestion, I didn't know about it. We'll try to implement it in Spyder 6.

@ccordoba12 ccordoba12 changed the title Feature: Variable Explorer : reduce amount of data visualized for DataFrame by sampling Reduce amount of data for DataFrames by sampling Jun 13, 2023
@ccordoba12 ccordoba12 modified the milestones: v6.0.1, v6.0alpha3 Jun 13, 2023
@uprokevin
Author

Thanks for considering it.
I think visualizing a 1-million-row table does not make much sense for a human.
At most 100,000 rows would handle most visualization use cases (i.e. finding patterns or spotting wrong columns)
and reduce the memory footprint a lot.

@ccordoba12 ccordoba12 modified the milestones: v6.0alphaX, v6.1.0 Nov 16, 2023