- 
                Notifications
    You must be signed in to change notification settings 
- Fork 3.9k
ARROW-2036: [Python] Support standard IOBase methods on NativeFile #1517
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| 
 I believe this also fixes https://issues.apache.org/jira/browse/ARROW-2030, cc @cpcloud. edit: I may have misunderstood ARROW-2030, this may not apply here | 
| @jcrist I don't think this PR fixes that because even if the initialization happens in the parent class it still won't show up in the child instance. If you attach  | 
| 
 Are you sure? I added a bogus attribute (without  | 
        
          
                python/pyarrow/io.pxi
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should own_file be true for most of these? I stuck with the existing behavior, but this seems odd to me. For example MemoryMappedFile opens the file itself, why isn't it responsible for closing it on __dealloc__? Are we relying on a C++ destructor to handle that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this arose due to some wacky GC issues with HDFS file handles, I'd have to look back at the original commit that added this to remember. My prior would be that __dealloc__ / __del__ should close the file if it is open. With MemoryMappedFile, this is a no-op because the file descriptor is owned by the mapped buffer, whose lifetime may exceed the file object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I'll leave that for a follow-up PR if that's ok. I believe this attribute is unnecessary for any of the existing file objects.
|  | 
| If you look at the generated C code, there's no  | 
| That's a completely separate issue than the one described though. Those attributes are C level only, not exposed to python. This has nothing to do with initialization order or subclassing. FWIW this PR adds  | 
| I am aware that those properties are C only. ARROW-2030 was showing up in a user's code as an attribute error (IIRC not because they were explicitly accessing the property) which occurred because something was calling  It's not about making them accessible, it's about the manifestation of a possible bug. It's possible I'm mistaken here, so I'll post in the JIRA with a reproducible test case if I can make one. We can continue the discussion there. | 
ae2f1f5    to
    082acd6      
    Compare
  
    | Sounds good, I suspect we're just talking past each other here. Apologies for sidetracking this PR. In regards to this PR, I'm a bit unsure about the  | 
| I renamed Writeable -> Writable in C++ recently, so probably best to use the more common spelling such that it's not disruptive | 
| 
 Was this supposed to be a complete port over from "writeable" to "writable" (minus libhdfs, which is a dependency)? I found a few classes/methods still using "Writeable" (e.g.  The python code uses writable wherever it makes sense. I think this is ready for review/merge. | 
        
          
                python/pyarrow/io.pxi
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated, but I wonder what happens if two threads call close() concurrently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, maybe the nogil block here should be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, if rd_file.close() or wr_file.close() may do I/O, you want to release the GIL.
One possible solution is something like:
rd_file, self.rd_file = self.rd_file, None
wr_file, self.wr_file = self.wr_file, None
with nogil:
    if self.is_readable:
        check_status(rd_file.get().close())
    else:
        check_status(rd_file.get().close())
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd work, but prevents calling close again in the case of a failure. It's not actually clear to me what should happen if close fails (reset state?, set closed anayway?). Anyway, should probably open a jira about this, as it's unrelated to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A general guideline for close() is that even if it fails somehow (e.g. IO error), it should still put the object in a "closed" state (and release associated resources).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GIL should be released here -- if any Python resources are required in Close then the GIL will be re-acquired
| Test failure is unrelated, ready for review. We're waiting on this to simplify a PR adding arrow's hdfs (really just libhdfs) support to dask. | 
| Looking now. Can you rebase? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine -- just a minor comment about methods that might be better as properties (@cpcloud do you have an opinion?)
        
          
                python/pyarrow/io.pxi
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be properties?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The seekable/readable/writable? No, this is required for the python file interface. See https://docs.python.org/3/library/io.html#io.IOBase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Roger, thanks
        
          
                python/pyarrow/io.pxi
              
                Outdated
          
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GIL should be released here -- if any Python resources are required in Close then the GIL will be re-acquired
Adds support for most common file methods, adding enough to use `io.TextIOWrapper`. Added attribtes/methods: - `closed` attribute - `readable`, `writable`, `seekable` methods - `read1` alias for `read` to support `TextIOWrapper` on python 2 - No-op `flush` method Also refactored the cython internals a bit, adding default settings for `is_readable` and `is_writable`, which makes subclasses not need to set them in all places. Also renamed `is_writeable` to `is_writable` for common spelling with the standard python method `writable`.
082acd6    to
    f3bc954      
    Compare
  
    | 
 Rebased. I'm curious why was that necessary; there were no merge conflicts (none of the files edited have been touched since I submitted the PR). Just to run the CI again? | 
| +1, will merge once the build completes | 
| @jcrist can you enable Appveyor on your fork? Then builds for your PRs will run a little more quickly there | 
| Cool, from https://ci.appveyor.com/project/jcrist/arrow/build/1.0.1 looks like we are good. The Travis CI failure is ARROW-2062 | 
Adds support for most common file methods, adding enough to use
io.TextIOWrapper.Added attribtes/methods:
closedattributereadable,writable,seekablemethodsread1alias forreadto supportTextIOWrapperon python 2flushmethodAlso refactored the cython internals a bit, adding default settings for
is_readableandis_writable, which makes subclasses not need to setthem in all places.
Also renamed
is_writeabletois_writablefor common spelling withthe standard python method
writable.