
Spyder freezes when large MaskedArrays are in memory #2748

Closed
durack1 opened this issue Oct 8, 2015 · 33 comments

@durack1

durack1 commented Oct 8, 2015

I've experienced a long-standing issue with Spyder that dates back to Spyder 2.1.13 (and likely before) and is still present in the current 3.0.0b1. It is likely linked to Spyder's interaction with the cdms2 module, which uses numpy.ma for its backend array manipulation.

After loading a large (~1 GB, shape [50, 50, 300, 360]) masked array into memory (from a netCDF file, using cdms2.open), the console becomes unresponsive, only intermittently registering inputs. This continues until the array is reduced in size, e.g. var = var[0:1,:,:,:]

The issue disappears when using pure numpy, so var = numpy.ma.ones([50,50,300,360]) doesn't yield the same unresponsiveness.

The issue is not related to resource limitations on the machine (RedHat 6.7, 128 GB RAM, Xeon E5-2643 quad-core).

As it's related to Spyder's interaction with the cdms2 module, I'm not sure of the best way to get to the bottom of this issue.. Following #1958 and #1968 I have turned off the auto-refresh feature of the Variable Explorer, but this doesn't appear to solve the problem.

What further information is required to attempt to get to the bottom of this issue? For completeness:
Spyder version: 3.0.0-b1
Python version: 2.7.10
Qt version: 4.8.4
PyQt4 version: 4.11.3
numpy version: 1.9.0
uv-cdat/cdms2 version: 2.4.0-rc1

@dnadeau4 pinging you here..

@Nodd
Contributor

Nodd commented Oct 8, 2015

Did you try completely disabling the Variable Explorer (i.e. closing the pane)? There is nothing special about cdms2 in particular in Spyder.

@durack1
Author

durack1 commented Oct 8, 2015

@Nodd good call.. Closing the Variable Explorer speeds things up considerably.. So my issue really is the Variable Explorer slowing things down.. Is this noted somewhere in another open issue?

It's a pity this issue exists, because I do like having access to a visual of the variables (and their dimensions and types) that are currently in memory..

@ccordoba12
Member

@durack1, thanks for taking the time to open this issue and letting us know about this problem.

This is a very interesting use case to improve Spyder responsiveness with big data. We have done a lot of work lately to better handle big DataFrames and NumPy arrays when they are opened.

However, the problem here could lie in the fact that we're not optimizing how the array is represented in the Variable Explorer. What I mean is that in the Value column we just put the full repr of the array (i.e. what's printed when you call print(array)) instead of something simpler (like its first 10 or 20 elements).
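A minimal sketch of the kind of truncation meant here (the helper name is hypothetical, not Spyder's actual code): build the Value string from only a small prefix of the array rather than its full repr.

```python
import numpy as np

def short_value(arr, max_elems=10):
    """Build a short display string from at most max_elems elements."""
    flat = arr.ravel()[:max_elems]          # only format a small prefix
    suffix = ", ..." if arr.size > max_elems else ""
    return "[" + ", ".join(str(x) for x in flat) + suffix + "]"

a = np.arange(1000000)
print(short_value(a, 5))  # [0, 1, 2, 3, 4, ...]
```

Because only the sliced elements are ever formatted, the cost stays constant no matter how large the array is.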

But to test this hypothesis, I need you to give some sample code (using cdms2) I can test on my side to fix this problem :-)

@durack1
Author

durack1 commented Oct 9, 2015

@ccordoba12 no problem, the easiest example would be to just load a variable from a netCDF file using cdms2 while you have a Variable Explorer pane open in Spyder.

So the code:

import cdms2 as cdm
fileHandle = cdm.open('AusCOM1-0.Salt.Omon.so.00011231-00501231.nc')
var = fileHandle('so')
fileHandle.close()

Should reproduce the lagginess that I've experienced - the variable will require ~1 GB of available memory to load.. You can grab the file above from here (it's 355MB)

I've noticed what appears to be a rogue Python process at 100% CPU while the lagginess is occurring, so it's trying to do something.. but what, I have no idea..

@goanpeca
Member

goanpeca commented Oct 9, 2015

@ccordoba12 we should use that file for future tests! 😉

@durack1
Author

durack1 commented Oct 9, 2015

@goanpeca @ccordoba12 if you would like an even larger test file, I'd be more than happy to provide this!

@durack1
Author

durack1 commented Oct 16, 2015

@ccordoba12 @goanpeca let me know if you have any trouble getting access to cdms2 and installing it along with netcdf4.. From memory, the file above uses deflation, so netCDF will need to be built against the zlib libraries too..

@goanpeca
Member

@durack1 thanks for the heads up :-)

@ccordoba12
Member

@durack1, could you also upload a smaller nc file? I mean one that doesn't cause Spyder to freeze? That would be really helpful too :-)

@durack1
Author

durack1 commented Oct 18, 2015

@ccordoba12 it'd have to be pretty small.. I've been experiencing this freezing issue even with smaller matrices.. I'm currently on travel but will drop a smaller file on the webserver when I'm back in a week or so..

@ccordoba12
Member

Ok, no problem. I'm working on other things right now, but I hope to address this issue for beta2.

@durack1
Author

durack1 commented Oct 19, 2015

@ccordoba12 great - what is the timeline for beta2? I'm not holding things up if I get this data to you next week, am I?

@ccordoba12
Member

Don't worry, we are two or three weeks away from it :-)


@durack1
Author

durack1 commented Oct 31, 2015

@ccordoba12 apologies for the delay.. Here is a much smaller file that doesn't appear to trigger the lagginess issues. In the example below I have re-enabled the Variable Explorer pane in Spyder 3.0.0b1 and it seems to work fine - it loads two variables from a netcdf file using cdms2.

So the code:

import cdms2 as cdm
fileHandle = cdm.open('DurackandWijffels_GlobalOceanSurfaceChanges_1950-2000.nc')
saltChange = fileHandle('salinity_change')
thetaoChange = fileHandle('thetao_change')

You can grab the file above from here (it's 607KB)

@ccordoba12 ccordoba12 modified the milestones: v2.3.8, v3.0 Nov 16, 2015
@ccordoba12
Member

@durack1, please give us a smaller file than the one you uploaded first (called AusCOM1-0.Salt.Omon.so.00011231-00501231.nc).

That file seems to require 10 GB of RAM (not 1 - I tested it on my virtual machines! :-), so I can't use it for testing.

@ccordoba12 ccordoba12 modified the milestones: v3.0, v2.3.8 Nov 22, 2015
@ccordoba12 ccordoba12 changed the title Spyder freezes when large matrices are in memory Spyder freezes when large MaskedArrays are in memory Nov 22, 2015
@durack1
Author

durack1 commented Nov 24, 2015

@ccordoba12 the file should be fine - you could load a smaller subset using:

>>> import resource
>>> import cdms2 as cdm
>>> import numpy as np

>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.155 GB
>>> fileHandle = cdm.open('AusCOM1-0.Salt.Omon.so.00011231-00501231.nc')
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.155 GB
>>> var = fileHandle('so',time=slice(0,1))
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.155 GB
>>> fileHandle.close()
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.155 GB

Obviously, changing the number of timesteps loaded by altering your indices (e.g. var = fileHandle('so',time=slice(0,10))) will then load progressively larger matrices into memory. The single-time-slice example above should need just ~0.2 GB or so.. For the larger matrix, you'll need ~1 GB:

>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.156 GB
>>> fileHandle = cdm.open('AusCOM1-0.Salt.Omon.so.00011231-00501231.nc')
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.156 GB
>>> var = fileHandle('so',time=slice(0,10))
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.706 GB
>>> fileHandle.close()
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.706 GB

For the full matrix I only need ~4GB:

>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.156 GB
>>> fileHandle = cdm.open('AusCOM1-0.Salt.Omon.so.00011231-00501231.nc')
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 0.156 GB
>>> var = fileHandle('so')
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 3.239 GB
>>> fileHandle.close()
>>> print 'Max mem: %05.3f GB' % (np.float32(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)/1.e6)
Max mem: 3.239 GB
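The resource.getrusage readings above measure process RSS; as a cross-check, the raw in-memory footprint of a masked array can be estimated directly from its data and mask buffers. A sketch with numpy.ma standing in for a cdms2 variable (which is backed by numpy.ma), sized like one time slice of the 'so' variable:

```python
import numpy as np
import numpy.ma as ma

# stand-in for one time slice of the 'so' variable: shape (1, 50, 300, 360)
var = ma.ones((1, 50, 300, 360), dtype=np.float32)
var[0, 0, 0, 0] = ma.masked        # force a full boolean mask to be allocated

data_bytes = var.data.nbytes       # 4 bytes per float32 element
mask_bytes = var.mask.nbytes if var.mask is not ma.nomask else 0
print("approx %.3f GB" % ((data_bytes + mask_bytes) / 1e9))  # approx 0.027 GB
```

The raw buffers for a single slice are small (~27 MB); the larger RSS figures above additionally include the interpreter, cdms2's metadata, and any intermediate copies made while decompressing and reading the file.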

@Nodd
Contributor

Nodd commented Nov 24, 2015

I wonder if we should only show the repr of objects smaller than, say, 1 MB, and keep a whitelist of types that we know have no problems (like pandas DataFrames). It would avoid freezing or crashing Spyder because of unknown big objects.

@ccordoba12
Member

@Nodd, it's a good idea. But how do we reliably determine an object's memory footprint?

@ccordoba12
Member

Sorry, the problem isn't its memory footprint but the size of its repr. How could we account for that?

@durack1
Author

durack1 commented Nov 24, 2015

@ccordoba12 @Nodd in the case of a numpy array, why not use numpyArray.shape?

@ccordoba12
Member

@durack1, that just gives its size :-) repr is the string printed when you do something like:

>>> a = np.array([1, 2, 3])
>>> a
array([1, 2, 3])

Most numpy reprs are very efficient, but some (like record arrays and, it seems, masked arrays) are not :-)

@durack1
Author

durack1 commented Nov 24, 2015

@ccordoba12 but such shape info can then be used to query a small subset of the array, using the very efficient indexing syntax. For the example above, where var.shape = (50, 50, 300, 360), you could define a more targeted (and much smaller) temporary matrix on which repr can then be run.
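A sketch of that idea, with a scaled-down numpy.ma array standing in for the cdms2 variable: slice a tiny corner first, then repr only the slice.

```python
import numpy.ma as ma

# scaled-down stand-in for the (50, 50, 300, 360) variable
var = ma.ones((50, 50, 30, 36))
preview = repr(var[:2, :2, :2, :2])   # repr touches only 16 elements
print(preview.splitlines()[0])
```

Slicing returns a view, so building the preview never formats (or copies) the full array.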

@Nodd
Contributor

Nodd commented Nov 24, 2015

We can guess that the bigger an object is, the bigger its repr could be. If we know that a type has a simple repr, we can always show it.

My proposal isn't perfect; it's just a workaround to avoid recurring problems with the Variable Explorer and big objects.

As for the size, Python has a sizeof equivalent (sys.getsizeof). It may fail for some types, but again it's better than nothing.
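Combining the two ideas in a sketch (the names and the 1 MB threshold are illustrative, not Spyder's actual code). Note that sys.getsizeof reports only the container for a numpy array, so the buffer size via nbytes has to be added:

```python
import sys
import numpy as np

def approx_size(obj):
    """Rough in-memory size in bytes: array buffer + object header, or getsizeof."""
    if isinstance(obj, np.ndarray):
        return obj.nbytes + sys.getsizeof(obj)
    return sys.getsizeof(obj)

MAX_REPR_BYTES = 1 << 20               # the ~1 MB threshold proposed above

value = np.ones(500000)                # ~4 MB of float64 data
display = repr(value) if approx_size(value) < MAX_REPR_BYTES else "<large ndarray>"
print(display)  # <large ndarray>
```

A whitelist of known-cheap types could then bypass the threshold check entirely.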

@ccordoba12
Member

I discovered that this problem was generated because the repr of masked arrays is terribly inefficient.

So the fix is to use, as the Value of masked arrays (i.e. what appears in the fourth column of the Variable Explorer), a simple string (just "Masked array") instead of the variable's repr.

And then the problem goes away :-)
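A sketch of what that fix amounts to (the function name is hypothetical, not Spyder's actual code): special-case MaskedArray before falling back to repr.

```python
import numpy as np
import numpy.ma as ma

def value_for_explorer(obj):
    """Display string for the Value column; skip repr for masked arrays."""
    if isinstance(obj, ma.MaskedArray):
        return "Masked array"          # constant-time, regardless of array size
    return repr(obj)

print(value_for_explorer(ma.ones((2, 2))))   # Masked array
print(value_for_explorer(np.array([1, 2])))  # array([1, 2])
```

Since MaskedArray subclasses ndarray, this check has to come before any generic ndarray handling.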

@durack1
Author

durack1 commented Dec 9, 2015

@ccordoba12 great, looking forward to testing this in beta2.. @goanpeca any chance code-folding will also find itself into beta2?

@goanpeca
Member

goanpeca commented Dec 9, 2015

@durack1, not yet sorry :-(
I think it will be for beta3

@durack1
Author

durack1 commented Dec 9, 2015

@goanpeca no problem, looking forward to testing code-folding in beta3 then!

@Nodd
Contributor

Nodd commented Dec 9, 2015

@ccordoba12 Maybe it would be worth opening a bug report for numpy ?

@ccordoba12
Member

@Nodd, sure. Could you do it, please?

@goanpeca
Member

goanpeca commented Dec 9, 2015

@Nodd could you do it, please :-), and add a link to this issue :p

@Nodd
Contributor

Nodd commented Dec 10, 2015

Yeah I'll do it, but first I'll have to check what "the repr of masked arrays is terribly inefficient." means.

@ccordoba12
Member

@Nodd, it means that when you run

repr(ma)

where ma is a masked array, it takes a long time to return the result (at least for the arrays provided by @durack1).
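A quick sketch to measure this; the thread above reports that on numpy 1.9 the masked repr took far longer than the plain one, though timings will vary by numpy version and array contents.

```python
import time
import numpy as np
import numpy.ma as ma

plain = np.ones(1000000)
masked = ma.masked_greater(np.random.rand(1000000), 0.5)

for name, arr in [("plain", plain), ("masked", masked)]:
    t0 = time.time()
    repr(arr)
    print("%s repr took %.4f s" % (name, time.time() - t0))
```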

@durack1
Author

durack1 commented Dec 10, 2015

@Nodd those files are available through the links above, here (large 3D matrices, ~GB) and here (smaller 2D matrices, ~KB)
