-
Notifications
You must be signed in to change notification settings - Fork 262
Conditionally use indexed_gzip #552
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #552 +/- ##
==========================================
+ Coverage 94.29% 94.34% +0.04%
==========================================
Files 177 177
Lines 24440 24670 +230
Branches 2619 2634 +15
==========================================
+ Hits 23046 23275 +229
- Misses 919 920 +1
Partials 475 475
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also:
- add a test matrix entry to .travis.yml to test that this works;
- can you build / upload PyPI wheels for linux and windows, for indexed_gzip? See https://github.com/matthew-brett/multibuild for one way of doing this.
nibabel/arrayproxy.py
Outdated
'''If this ``ArrayProxy`` was created with ``keep_file_open=True``, | ||
the open file object is closed if necessary. | ||
''' | ||
if hasattr(self, '_opener') and not self._opener.closed: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if the user passes an open file to the original image creation routine? Does that get closed too?
nibabel/arrayproxy.py
Outdated
@@ -155,12 +182,26 @@ def inter(self): | |||
def is_proxy(self): | |||
return True | |||
|
|||
@contextmanager | |||
def _get_fileobj(self): | |||
'''Create and return a new ``ImageOpener``, or return an existing one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean up docstring for numpy docstring standard?
nibabel/fileslice.py
Outdated
# Make a dummy lock-like thing to make the code below a bit nicer | ||
if lock is None: | ||
|
||
class DummyLock(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Push this out of the method to the module top-level, maybe as class _NullLock
?
Hopefully all of the code comments have been addressed with the recent push - well spotted on the handling of open file objects - the I will set up |
Right now, we have two variables (
So code intended to use
In addition, I think we need to update the default behaviors. I think the simplest approach, conceptually, would be The big drawback to that logic would be that a user who installs keep_file_open = all(have_indexed_gzip, fname.endswith('gz'), 'NIBABEL_NO_INDEXED_GZIP' not in os.environ) The minimum update, I think, would be to make What do y'all think? |
Hey @effigies, I guess my line of thinking was that the But to be honest I can't think of a specific use case where one would want or need Except for one thing - couldn't there be a situation where Oh, I also meant to clarify. Pretty sure you already know this, but when |
Yes, that's definitely a possibility. Perhaps what makes sense is a different default to In this case, if we have |
Right - I had overlooked the fact that you were only talking about changing the default behaviour, rather than hiding the I think your new proposal sounds good - sensible/common usage default, but still allowing the user full control. I'll push some new commits either tonight or tomorrow accordingly. |
nibabel/fileslice.py
Outdated
@@ -17,6 +17,21 @@ | |||
SKIP_THRESH = 2 ** 8 | |||
|
|||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One too many blank lines, here, causing the style checks to fail.
creating a new on on every file access.
threading.Lock to protect seek/read pairs. ArrayProxy creates said lock, and passes it into fileslice calls. Also fixed typo in openers.py from previous commit
…internal threading.Lock object.
openers._gzip_open function
…s are created appropriately
read_segments, and to the model level. Cleaned up ArrayProxy._get_fileobj docs (and replaced all triple-single quotes with triple-double quotes)
…g - if file is gz, and indexed_gzip is True, kfo is set to True. ArrayProxy is also using an RLock instead of a Lock, as I don't think there is a good reason not to.
…file_open flag. Other minor refactorings
:) - I will need it! |
Hi guys. Sorry, now I have grant deadline - for Monday afternoon UK time. Can I check my understanding here? Before this PR, if you open an image with a filename, there's a new file object created and deleted for each read you do. After this PR, you'll get the same behavior only if So, with this PR, if could happen that your code is working, opening lots of files without running out of file handles, until you install If that's right - can you think of a way round it? What about having module-level constants to control the behavior? |
Hey @matthew-brett, This is why I originally had the But I think we already have a solution - in the PR's current state, the use of So the original behaviour can be achieved in one of two ways:
What do you think? |
I'm worried that the user might hit a 'too many files open' error without knowing why. So I think the default behavior should stay the same, but it can be overridden by the module level default. Can we just have |
If I understand you correctly, these changes should do the trick:
It may be better to default How does that sound? |
I'm not sure what you mean by the "stale version", but yes, that sounds like a reasonable way to go. |
Great, I'll make the changes now :) By 'stale', I just mean avoiding having multiple copies of this module level from .arrayproxy import ArrayProxy, KEEP_FILE_OPEN
...
def from_file_map(..., keep_file_open=KEEP_FILE_OPEN): it would be better to do this: def from_file_map(..., keep_file_open=None): and then have KEEP_FILE_OPEN_DEFAULT = False
class ArrayProxy(object):
...
def __init__(..., keep_file_open=None):
if keep_file_open is None:
keep_file_open = KEEP_FILE_OPEN_DEFAULT Because in the former there would be multiple versions of |
I see - thanks - yes, that sounds right. |
…via a module-level arrayproxy.KEEP_FILE_OPEN_DEFAULT flag
KEEP_FILE_OPEN_DEFAULT attribute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for these changes - nicely done.
nibabel/arrayproxy.py
Outdated
@@ -37,6 +37,24 @@ | |||
from .openers import ImageOpener, HAVE_INDEXED_GZIP | |||
|
|||
|
|||
KEEP_FILE_OPEN_DEFAULT = False | |||
"""This flag controls whether a new file handle is created every time an image |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer the comment above, though I know it's a horrible nit-pick.
nibabel/tests/test_arrayproxy.py
Outdated
assert not proxy._keep_file_open | ||
|
||
|
||
def test_keep_file_open_default(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a few more small comments.
Sorry - I didn't check - did you check None as a possible input argument to the various functions that do now allow it (instead of 'auto', True, False).
nibabel/analyze.py
Outdated
@@ -950,6 +950,18 @@ def from_file_map(klass, file_map, mmap=True): | |||
`mmap` value of True gives the same behavior as ``mmap='c'``. If | |||
image data file cannot be memory-mapped, ignore `mmap` value and | |||
read array from file. | |||
keep_file_open : { 'auto', True, False }, optional, keyword only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you now need {None, 'auto', True, False}
, and to say that None is the default, and gives the KEEP_FILE_OPEN_DEFAULT value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Likewise for the other docstrings.
nibabel/arrayproxy.py
Outdated
return keep_file_open | ||
if keep_file_open != 'auto': | ||
raise ValueError( | ||
'keep_file_open should be one of {\'auto\', True, False }') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add None here too I think.
|
||
The specific behaviour depends on the value of the ``keep_file_open`` | ||
flag that was passed to ``__init__``. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra blank line.
nibabel/fileslice.py
Outdated
@@ -17,6 +17,20 @@ | |||
SKIP_THRESH = 2 ** 8 | |||
|
|||
|
|||
class _NullLock(object): | |||
"""The ``_NullLock`` is an object which can be used in place of a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Single standalone line as first line in docstring. Maybe:
Can be used as no-function dummy object in place of ``threading.lock``
nibabel/fileslice.py
Outdated
@@ -634,29 +648,41 @@ def read_segments(fileobj, segments, n_bytes): | |||
absolute file offset in bytes and number of bytes to read | |||
n_bytes : int | |||
total number of bytes that will be read | |||
lock : threading.Lock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can actually be any context manager - or at least the NullLock, no?
Maybe:
lock : {None, threading.Lock, lock-a-like}, optional
... None (the default) gives a lock-a-like object that does no locking.
nibabel/tests/test_arrayproxy.py
Outdated
from ..nifti1 import Nifti1Header | ||
|
||
from numpy.testing import assert_array_equal, assert_array_almost_equal | ||
from nose.tools import (assert_true, assert_false, assert_equal, | ||
assert_not_equal, assert_raises) | ||
from nibabel.testing import VIRAL_MEMMAP | ||
import mock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should go above the other less-standard test-related modules like numpy.testing.
nibabel/tests/test_arrayproxy.py
Outdated
# created | ||
class CountingImageOpener(ImageOpener): | ||
|
||
numOpeners = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this should be num_openers
(mixedCase not allowed unless it's already the default naming scheme : https://www.python.org/dev/peps/pep-0008/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A throw-back from many years as a java programmer :)
nibabel/tests/test_arrayproxy.py
Outdated
x , y, z = [int(c) for c in voxels[i, :]] | ||
assert proxy_kfp[x, y, z] == x * 100 + y * 10 + z | ||
assert CountingImageOpener.numOpeners == 1 | ||
# Test that the keep_file_open flag has no effect if an open file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice comments explaining the tests, thanks.
nibabel/tests/test_openers.py
Outdated
@@ -20,6 +20,7 @@ | |||
from nose.tools import (assert_true, assert_false, assert_equal, | |||
assert_not_equal, assert_raises) | |||
from ..testing import error_warnings | |||
import mock |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mock before other testing imports.
nibabel/tests/test_arrayproxy.py
Outdated
assert not proxy._keep_file_open | ||
# KEEP_FILE_OPEN_DEFAULT=True should cause keep_file_open to be True, | ||
# regardless of whether or not indexed_gzip is present | ||
with mock.patch('nibabel.arrayproxy.KEEP_FILE_OPEN_DEFAULT', True), \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any way of making this decorator so you can use it without having to retype it?
…lice thread lock. Variable rename in fileslice to avoid collision with built-in.
…pen file handles because, when called by Opener.__init__, it will only ever get passed file names.
It's my turn to apologise - I'm on holiday from tomorrow for a week, without laptop. So there'll be a delay on my part. |
I think it's OK to merge now - I'd like to do some tiny docstring cleanup of blank lines and such like, but no need to wait for that. Chris - OK with you? Wanna push the green button? |
Nice! Thanks for suffering so many rounds of review, @pauldmccarthy. |
Howdy all,
I've been sitting on this one for a while. This PR proposes to conditionally use
indexed_gzip
instead ofgzip
to allow for faster random access to.gz
files. The default behaviour ofnibabel
is unchanged by this PR - to take advantage ofindexed_gzip
, one needs to load an image with a new flag, like so:This PR also proposes changes to
ArrayProxy
andfileslice.read_segments
which makes paired calls toseek
andread
thread-safe - see here for a discussion surrounding this change.Changes
The
ArrayProxy
has been modified so that it will optionally create and use a singleImageOpener
for its lifetime, rather than creating/destroying anImageOpener
every time the image data is accessed. This is controlled via thekeep_file_open
flag.The
fileslice.read_segments
function optionally accepts athreading.Lock
so that paired calls to theseek
/read
functions can be protected against changes to the running thread.The
openers._gzip_open
function has been changed to conditionally useindexed_gzip
when available.(Unrelated) There was a bug in
_gzip_open
which was causing themax_read_chunk
fix to have no effect. This has been fixed.Outstanding issues
indexed_gzip
is not available for windows platforms. I don't have a windows machine to test on, so would need somebody to help out in making it work on windows. See Installation on Windows pauldmccarthy/indexed_gzip#1. Note that this PR in its initial state does not breaknibabel
on windows, as the use ofindexed_gzip
is conditional upon its presence.indexed_gzip
does not support writing. If a file is opened in write mode, the built-inGzipFile
class is used. I am not against adding write-support intoindexed_gzip
, but to date have not considered it a necessity (and don't know how much time it would take!).The
MGHFormat
class uses theArrayProxy
, but I have not updated it to use thekeep_file_open
flag, so it does not benefit fromindexed_gzip
. Adding this would be trivial, but I thought I'd just start by proposing changes to NIFTI/ANALYZE access.The
PARRECImage
class implements its ownArrayProxy
, so does not benefit fromindexed_gzip
. ThePARRECArrayProxy
class would need to be updated similarly, or some refactorings made so that it uses the standardArrayProxy
. I've not looked too closely at thePARRECArrayProxy
, but on the surface it seems to be quite similar to theArrayProxy
class, so this should not be too difficult.indexed_gzip
is currently available as a binary wheel for macos, but only a source package for other platforms. Hence cython and a C compiler must be available for it to be installed. Is this a concern? I can look into generating linux builds if you think it is needed.Other notes
I have not updated
fileslice._simple_fileslice
to accept aLock
, because it is not used by anything (apart from unit tests)I decided against using
pread
for the thread-safe fix (as discussed in the mailing list thread) because the built-inGzipFile
class doesn't have apread
method.I have run the
nibabel
unit tests locally, both with and withoutindexed_gzip
. It may be desirable to build both of these scenarios into the standardnibabel
CI test suite.