Skip to content

Conditionally use indexed_gzip #552

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 33 commits into from
Sep 25, 2017
Merged

Conversation

pauldmccarthy
Copy link
Contributor

Howdy all,

I've been sitting on this one for a while. This PR proposes to conditionally use indexed_gzip instead of gzip to allow for faster random access to .gz files. The default behaviour of nibabel is unchanged by this PR - to take advantage of indexed_gzip, one needs to load an image with a new flag, like so:

img = nib.load('large_gzipped_image.nii.gz', keep_file_open=True)

This PR also proposes changes to ArrayProxy and fileslice.read_segments which makes paired calls to seek and read thread-safe - see here for a discussion surrounding this change.

Changes

  1. The ArrayProxy has been modified so that it will optionally create and use a single ImageOpener for its lifetime, rather than creating/destroying an ImageOpener every time the image data is accessed. This is controlled via the keep_file_open flag.

  2. The fileslice.read_segments function optionally accepts a threading.Lock so that paired calls to the seek/read functions can be protected against changes to the running thread.

  3. The openers._gzip_open function has been changed to conditionally use indexed_gzip when available.

  4. (Unrelated) There was a bug in _gzip_open which was causing the max_read_chunk fix to have no effect. This has been fixed.

Outstanding issues

  • indexed_gzip is not available for windows platforms. I don't have a windows machine to test on, so would need somebody to help out in making it work on windows. See Installation on Windows pauldmccarthy/indexed_gzip#1. Note that this PR in its initial state does not break nibabel on windows, as the use of indexed_gzip is conditional upon its presence.

  • indexed_gzip does not support writing. If a file is opened in write mode, the built-in GzipFile class is used. I am not against adding write-support into indexed_gzip, but to date have not considered it a necessity (and don't know how much time it would take!).

  • The MGHFormat class uses the ArrayProxy, but I have not updated it to use the keep_file_open flag, so it does not benefit from indexed_gzip. Adding this would be trivial, but I thought I'd just start by proposing changes to NIFTI/ANALYZE access.

  • The PARRECImage class implements its own ArrayProxy, so does not benefit from indexed_gzip. The PARRECArrayProxy class would need to be updated similarly, or some refactorings made so that it uses the standard ArrayProxy. I've not looked too closely at the PARRECArrayProxy, but on the surface it seems to be quite similar to the ArrayProxy class, so this should not be too difficult.

  • indexed_gzip is currently available as a binary wheel for macos, but only a source package for other platforms. Hence cython and a C compiler must be available for it to be installed. Is this a concern? I can look into generating linux builds if you think it is needed.

Other notes

  • I have not updated fileslice._simple_fileslice to accept a Lock, because it is not used by anything (apart from unit tests)

  • I decided against using pread for the thread-safe fix (as discussed in the mailing list thread) because the built-in GzipFile class doesn't have a pread method.

  • I have run the nibabel unit tests locally, both with and without indexed_gzip. It may be desirable to build both of these scenarios into the standard nibabel CI test suite.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.257% when pulling 63ff5ef on pauldmccarthy:indexed_gzip into bb8c318 on nipy:master.

@codecov-io
Copy link

codecov-io commented Sep 6, 2017

Codecov Report

Merging #552 into master will increase coverage by 0.04%.
The diff coverage is 98.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #552      +/-   ##
==========================================
+ Coverage   94.29%   94.34%   +0.04%     
==========================================
  Files         177      177              
  Lines       24440    24670     +230     
  Branches     2619     2634      +15     
==========================================
+ Hits        23046    23275     +229     
- Misses        919      920       +1     
  Partials      475      475
Impacted Files Coverage Δ
nibabel/fileslice.py 97.27% <100%> (+0.08%) ⬆️
nibabel/tests/test_arrayproxy.py 100% <100%> (ø) ⬆️
nibabel/tests/test_openers.py 98.58% <100%> (+0.15%) ⬆️
nibabel/freesurfer/mghformat.py 88.5% <100%> (ø) ⬆️
nibabel/analyze.py 97.69% <100%> (ø) ⬆️
nibabel/tests/test_fileslice.py 98.26% <100%> (+0.18%) ⬆️
nibabel/spm99analyze.py 91.04% <100%> (ø) ⬆️
nibabel/openers.py 82.07% <90%> (+2.27%) ⬆️
nibabel/arrayproxy.py 98.11% <94.59%> (-1.89%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 23539e8...790c037. Read the comment docs.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.257% when pulling a279175 on pauldmccarthy:indexed_gzip into bb8c318 on nipy:master.

Copy link
Member

@matthew-brett matthew-brett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also:

  • add a test matrix entry to .travis.yml to test that this works;
  • can you build / upload PyPI wheels for linux and windows, for indexed_gzip? See https://github.com/matthew-brett/multibuild for one way of doing this.

'''If this ``ArrayProxy`` was created with ``keep_file_open=True``,
the open file object is closed if necessary.
'''
if hasattr(self, '_opener') and not self._opener.closed:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if the user passes an open file to the original image creation routine? Does that get closed too?

@@ -155,12 +182,26 @@ def inter(self):
def is_proxy(self):
return True

@contextmanager
def _get_fileobj(self):
'''Create and return a new ``ImageOpener``, or return an existing one.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up docstring for numpy docstring standard?

# Make a dummy lock-like thing to make the code below a bit nicer
if lock is None:

class DummyLock(object):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Push this out of the method to the module top-level, maybe as class _NullLock ?

@pauldmccarthy
Copy link
Contributor Author

Hopefully all of the code comments have been addressed with the recent push - well spotted on the handling of open file objects - the ArrayProxy was actually closing them if keep_file_open=True :) I've added another test case to check for this.

I will set up indexed_gziplinux wheel builds later today. But Windows is a problem - I think that some work is needed for indexed_gzip to compile on windows, but I don't have a test machine, nor do I know anything about windows development. I could do with some help on this one!

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.259% when pulling 0c902d2 on pauldmccarthy:indexed_gzip into bb8c318 on nipy:master.

@effigies
Copy link
Member

effigies commented Sep 6, 2017

Right now, we have two variables (keep_file_open and have_indexed_gzip) that control behavior, and these are independent, one controlled by the environment, and one by user call.

keep_file_open have_indexed_gzip action
True True Correct behavior
True False Filehandles not garbage-collected
False True Gzip index recalculated with each ArrayProxy read
False False Old behavior

So code intended to use indexed_gzip may accumulate filehandles for no purpose on machines that don't have it installed, and that code that doesn't expect it can't take advantage of it. Further, keep_file_open is being applied to unzipped files, which do not need this behavior at all. I would propose the following logic:

keep_file_open have_indexed_gzip gzipped? action
True True True Use indexed gzip
* * False Old behavior
* False * Old behavior
False * * Old behavior

In addition, I think we need to update the default behaviors. I think the simplest approach, conceptually, would be keep_file_open = have_indexed_gzip and fname.endswith('gz').

The big drawback to that logic would be that a user who installs have_indexed_gzip will suddenly start exhausting filehandles if they're opening large numbers of .nii.gz files. I'm unsure how to deal with this. I don't love the idea of proliferating environment variables, but perhaps the simplest approach would be to allow the new behavior to be disabled with NIBABEL_NO_INDEXED_GZIP. Then you have a simple check:

keep_file_open = all(have_indexed_gzip, fname.endswith('gz'), 'NIBABEL_NO_INDEXED_GZIP' not in os.environ)

The minimum update, I think, would be to make keep_file_open have no effect in ArrayProxy, if have_indexed_gzip is False or the file isn't gzipped.

What do y'all think?

@pauldmccarthy
Copy link
Contributor Author

pauldmccarthy commented Sep 6, 2017

Hey @effigies, I guess my line of thinking was that the keep_files_open flag was conceptually independent of the use of indexed_gzip, and would be explicitly selected by the user.

But to be honest I can't think of a specific use case where one would want or need keep_file_open=True, except when using something like indexed_gzip. So in principle I am ok with your proposal.

Except for one thing - couldn't there be a situation where have_indexed_gzip is True, and the user explicitly wantskeep_file_open=False?

Oh, I also meant to clarify. Pretty sure you already know this, but when keep_file_open=True, file handles will persist only for the lifetime of the owning ArrayProxy. They will be closed when the ArrayProxy is gc'd.

@effigies
Copy link
Member

effigies commented Sep 6, 2017

Yes, that's definitely a possibility.

Perhaps what makes sense is a different default to keep_file_open, such as keep_file_open='indexed_gzip', which can then be forced one way or the other with True or False. I don't really like the idea of having the default value change from one install to another... that is pydoc showing True or False, depending on whether another library is installed.

In this case, if we have keep_file_open='indexed_gzip' and we discover that we don't have a .gz file OR we don't have indexed_gzip, then we can set it to False and get standard behavior. If the user specifies False, we also get standard behavior. In the very rare case that the user specifies True, they'll get a bunch of filehandles, but at least they'll be expecting them.

@pauldmccarthy
Copy link
Contributor Author

Right - I had overlooked the fact that you were only talking about changing the default behaviour, rather than hiding the keep_file_open option from the user.

I think your new proposal sounds good - sensible/common usage default, but still allowing the user full control. I'll push some new commits either tonight or tomorrow accordingly.

@@ -17,6 +17,21 @@
SKIP_THRESH = 2 ** 8



Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One too many blank lines, here, causing the style checks to fail.

threading.Lock to protect seek/read pairs. ArrayProxy creates said
lock, and passes it into fileslice calls. Also fixed typo in openers.py
from previous commit
read_segments, and to the model level. Cleaned up ArrayProxy._get_fileobj
docs (and replaced all triple-single quotes with triple-double quotes)
…g - if

file is gz, and indexed_gzip is True, kfo is set to True. ArrayProxy is also
using an RLock instead of a Lock, as I don't think there is a good reason not
to.
@matthew-brett
Copy link
Member

:) - I will need it!

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.262% when pulling 16f191e on pauldmccarthy:indexed_gzip into 23539e8 on nipy:master.

@matthew-brett
Copy link
Member

Hi guys. Sorry, now I have grant deadline - for Monday afternoon UK time.

Can I check my understanding here?

Before this PR, if you open an image with a filename, there's a new file object created and deleted for each read you do.

After this PR, you'll get the same behavior only if indexed_gzip is not installed?

So, with this PR, if could happen that your code is working, opening lots of files without running out of file handles, until you install indexed_gzip at which point you do start running out of file handles?

If that's right - can you think of a way round it? What about having module-level constants to control the behavior?

@pauldmccarthy
Copy link
Contributor Author

pauldmccarthy commented Sep 23, 2017

Hey @matthew-brett,

This is why I originally had the keep_file_open flag default to False, which would preserve the original behaviour of opening a new file handle for each access.

But I think we already have a solution - in the PR's current state, the use of indexed_gzip, and whether file handles are kept open or not, is dictated by the module-level nibabel.openers.HAVE_INDEXED_GZIP flag. So this flag could be overridden by a user who has indexed_gzip installed, but wants the original behaviour of using a new file handle on each access.

So the original behaviour can be achieved in one of two ways:

  • Explicitly set keep_file_open=False when you create an image
  • Set nibabel.openers.HAVE_INDEXED_GZIP = False before creating any images

What do you think?

@matthew-brett
Copy link
Member

I'm worried that the user might hit a 'too many files open' error without knowing why. So I think the default behavior should stay the same, but it can be overridden by the module level default.

Can we just have keep_file_open as the thing that can be changed by default, and we always use indexed_gzip if available, and this flag is set to True, either by the module default, or by the user with the explicit call? Is there a reason not to use indexed_gzip if it is present, and the open command is compatible?

@pauldmccarthy
Copy link
Contributor Author

If I understand you correctly, these changes should do the trick:

  • Add a module-level KEEP_FILE_OPEN flag to nibabel.arrayproxy, which takes one of True, False, 'auto', and is initally set to False
  • Change ArrayProxy.__init__ so the default value for keep_file_open is set to the current value of nibabel.arrayproxy.KEEP_FILE_OPEN

It may be better to default keep_file_open arguments to None, and then have ArrayProxy.__init__ clobber it with the current value of the module-level KEEP_FILE_OPEN. This would avoid problems with other modules (e.g. analyze, mghformat) importing a stale version of KEEP_FILE_OPEN on startup.

How does that sound?

@matthew-brett
Copy link
Member

I'm not sure what you mean by the "stale version", but yes, that sounds like a reasonable way to go.

@pauldmccarthy
Copy link
Contributor Author

Great, I'll make the changes now :)

By 'stale', I just mean avoiding having multiple copies of this module level KEEP_FILE_OPEN flag floating about. So rather than analyze.py doing:

from .arrayproxy import ArrayProxy, KEEP_FILE_OPEN
...
    def from_file_map(..., keep_file_open=KEEP_FILE_OPEN):

it would be better to do this:

    def from_file_map(..., keep_file_open=None):

and then have arrayproxy.py do this:

KEEP_FILE_OPEN_DEFAULT = False
class ArrayProxy(object):
...
    def __init__(..., keep_file_open=None):
        if keep_file_open is None:
            keep_file_open = KEEP_FILE_OPEN_DEFAULT

Because in the former there would be multiple versions of KEEP_FILE_OPEN floating around.

@matthew-brett
Copy link
Member

I see - thanks - yes, that sounds right.

…via a

module-level arrayproxy.KEEP_FILE_OPEN_DEFAULT flag
Copy link
Member

@matthew-brett matthew-brett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these changes - nicely done.

@@ -37,6 +37,24 @@
from .openers import ImageOpener, HAVE_INDEXED_GZIP


KEEP_FILE_OPEN_DEFAULT = False
"""This flag controls whether a new file handle is created every time an image
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer the comment above, though I know it's a horrible nit-pick.

assert not proxy._keep_file_open


def test_keep_file_open_default():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.262% when pulling 9008696 on pauldmccarthy:indexed_gzip into 23539e8 on nipy:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.262% when pulling 0b41c49 on pauldmccarthy:indexed_gzip into 23539e8 on nipy:master.

Copy link
Member

@matthew-brett matthew-brett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more small comments.

Sorry - I didn't check - did you check None as a possible input argument to the various functions that do now allow it (instead of 'auto', True, False).

@@ -950,6 +950,18 @@ def from_file_map(klass, file_map, mmap=True):
`mmap` value of True gives the same behavior as ``mmap='c'``. If
image data file cannot be memory-mapped, ignore `mmap` value and
read array from file.
keep_file_open : { 'auto', True, False }, optional, keyword only
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you now need {None, 'auto', True, False}, and to say that None is the default, and gives the KEEP_FILE_OPEN_DEFAULT value.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise for the other docstrings.

return keep_file_open
if keep_file_open != 'auto':
raise ValueError(
'keep_file_open should be one of {\'auto\', True, False }')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add None here too I think.


The specific behaviour depends on the value of the ``keep_file_open``
flag that was passed to ``__init__``.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra blank line.

@@ -17,6 +17,20 @@
SKIP_THRESH = 2 ** 8


class _NullLock(object):
"""The ``_NullLock`` is an object which can be used in place of a
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single standalone line as first line in docstring. Maybe:

Can be used as no-function dummy object in place of ``threading.lock``

@@ -634,29 +648,41 @@ def read_segments(fileobj, segments, n_bytes):
absolute file offset in bytes and number of bytes to read
n_bytes : int
total number of bytes that will be read
lock : threading.Lock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can actually be any context manager - or at least the NullLock, no?

Maybe:

lock : {None, threading.Lock, lock-a-like}, optional

... None (the default) gives a lock-a-like object that does no locking.   

from ..nifti1 import Nifti1Header

from numpy.testing import assert_array_equal, assert_array_almost_equal
from nose.tools import (assert_true, assert_false, assert_equal,
assert_not_equal, assert_raises)
from nibabel.testing import VIRAL_MEMMAP
import mock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should go above the other less-standard test-related modules like numpy.testing.

# created
class CountingImageOpener(ImageOpener):

numOpeners = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should be num_openers (mixedCase not allowed unless it's already the default naming scheme : https://www.python.org/dev/peps/pep-0008/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A throw-back from many years as a java programmer :)

x , y, z = [int(c) for c in voxels[i, :]]
assert proxy_kfp[x, y, z] == x * 100 + y * 10 + z
assert CountingImageOpener.numOpeners == 1
# Test that the keep_file_open flag has no effect if an open file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comments explaining the tests, thanks.

@@ -20,6 +20,7 @@
from nose.tools import (assert_true, assert_false, assert_equal,
assert_not_equal, assert_raises)
from ..testing import error_warnings
import mock
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mock before other testing imports.

assert not proxy._keep_file_open
# KEEP_FILE_OPEN_DEFAULT=True should cause keep_file_open to be True,
# regardless of whether or not indexed_gzip is present
with mock.patch('nibabel.arrayproxy.KEEP_FILE_OPEN_DEFAULT', True), \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way of making this decorator so you can use it without having to retype it?

@coveralls
Copy link

Coverage Status

Coverage increased (+0.02%) to 96.263% when pulling 67405b0 on pauldmccarthy:indexed_gzip into 23539e8 on nipy:master.

…pen file

handles because, when called by Opener.__init__, it will only ever get passed
file names.
@coveralls
Copy link

Coverage Status

Coverage increased (+0.03%) to 96.271% when pulling 790c037 on pauldmccarthy:indexed_gzip into 23539e8 on nipy:master.

@pauldmccarthy
Copy link
Contributor Author

It's my turn to apologise - I'm on holiday from tomorrow for a week, without laptop. So there'll be a delay on my part.

@matthew-brett
Copy link
Member

I think it's OK to merge now - I'd like to do some tiny docstring cleanup of blank lines and such like, but no need to wait for that.

Chris - OK with you? Wanna push the green button?

@effigies
Copy link
Member

Nice! Thanks for suffering so many rounds of review, @pauldmccarthy.

@effigies effigies merged commit dbe74e1 into nipy:master Sep 25, 2017
@effigies effigies mentioned this pull request Sep 27, 2017
28 tasks
@pauldmccarthy pauldmccarthy deleted the indexed_gzip branch March 26, 2018 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants