Skip to content

Conversation

@jcrist
Copy link
Contributor

@jcrist jcrist commented Jan 25, 2018

Adds support for most common file methods, adding enough to use
io.TextIOWrapper.

Added attribtes/methods:

  • closed attribute
  • readable, writable, seekable methods
  • read1 alias for read to support TextIOWrapper on python 2
  • No-op flush method

Also refactored the cython internals a bit, adding default settings for
is_readable and is_writable, which makes subclasses not need to set
them in all places.

Also renamed is_writeable to is_writable for common spelling with
the standard python method writable.

@jcrist
Copy link
Contributor Author

jcrist commented Jan 25, 2018

Also refactored the cython internals a bit, adding default settings for
is_readable and is_writable, which makes subclasses not need to set
them in all places.

I believe this also fixes https://issues.apache.org/jira/browse/ARROW-2030, cc @cpcloud.

edit: I may have misunderstood ARROW-2030, this may not apply here

@cpcloud
Copy link
Contributor

cpcloud commented Jan 25, 2018

@jcrist I don't think this PR fixes that because even if the initialization happens in the parent class it still won't show up in the child instance. If you attach public or readonly then the child sees the initialized values of the parent's parameters. There's some aspect of cdef class initialization that's not 100% clear to me.

@jcrist
Copy link
Contributor Author

jcrist commented Jan 25, 2018

@jcrist I don't think this PR fixes that because even if the initialization happens in the parent class it still won't show up in the child instance.

Are you sure? I added a bogus attribute (without readonly or public), set it only in the __cinit__ of the base class, and added a method in a subclass that accesses it. Everything worked fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should own_file be true for most of these? I stuck with the existing behavior, but this seems odd to me. For example MemoryMappedFile opens the file itself, why isn't it responsible for closing it on __dealloc__? Are we relying on a C++ destructor to handle that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this arose due to some wacky GC issues with HDFS file handles, I'd have to look back at the original commit that added this to remember. My prior would be that __dealloc__ / __del__ should close the file if it is open. With MemoryMappedFile, this is a no-op because the file descriptor is owned by the mapped buffer, whose lifetime may exceed the file object

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll leave that for a follow-up PR if that's ok. I believe this attribute is unnecessary for any of the existing file objects.

@cpcloud
Copy link
Contributor

cpcloud commented Jan 25, 2018

@jcrist

In [12]: import pyarrow as pa

In [13]: hdfsfile = pa.HdfsFile()

In [14]: hasattr(hdfsfile, 'is_readable')
Out[14]: False

@cpcloud
Copy link
Contributor

cpcloud commented Jan 25, 2018

If you look at the generated C code, there's no PyMemberDef array generated to enable attribute lookup in the interpreter. You can access attributes in Cython code because it's not going through Python's attribute access mechanism

@jcrist
Copy link
Contributor Author

jcrist commented Jan 25, 2018

That's a completely separate issue than the one described though. Those attributes are C level only, not exposed to python. This has nothing to do with initialization order or subclassing.

FWIW this PR adds readable/writable methods to mirror the standard python file interface, which are thin wrappers around those attributes. I don't think is_readable/is_writable should be exposed at the Python level, as they're part of the implementation and not standard python file attributes.

@cpcloud
Copy link
Contributor

cpcloud commented Jan 26, 2018

I am aware that those properties are C only.

ARROW-2030 was showing up in a user's code as an attribute error (IIRC not because they were explicitly accessing the property) which occurred because something was calling _assert_readable() and raising an exception.

It's not about making them accessible, it's about the manifestation of a possible bug.

It's possible I'm mistaken here, so I'll post in the JIRA with a reproducible test case if I can make one. We can continue the discussion there.

@jcrist jcrist force-pushed the full-python-file-interface branch from ae2f1f5 to 082acd6 Compare January 26, 2018 14:37
@jcrist
Copy link
Contributor Author

jcrist commented Jan 26, 2018

Sounds good, I suspect we're just talking past each other here. Apologies for sidetracking this PR.

In regards to this PR, I'm a bit unsure about the writeable -> writable change. If you google "writeable or writable" you'll find other projects(e.g. nodejs) that have also gone through this decision - most settled on writable as it's a more common english spelling. I made this change originally to mirror the added writable method (part of the python file interface). However, a few dependencies of pyarrow (libhdfs and numpy) use "writeable" instead, as does the arrow c++ code. I could go either way on this change.

@wesm
Copy link
Member

wesm commented Jan 26, 2018

I renamed Writeable -> Writable in C++ recently, so probably best to use the more common spelling such that it's not disruptive

@jcrist
Copy link
Contributor Author

jcrist commented Jan 29, 2018

I renamed Writeable -> Writable in C++ recently, so probably best to use the more common spelling such that it's not disruptive

Was this supposed to be a complete port over from "writeable" to "writable" (minus libhdfs, which is a dependency)? I found a few classes/methods still using "Writeable" (e.g. WriteableFile or OSFile::OpenWriteable).


The python code uses writable wherever it makes sense. I think this is ready for review/merge.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated, but I wonder what happens if two threads call close() concurrently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, maybe the nogil block here should be removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, if rd_file.close() or wr_file.close() may do I/O, you want to release the GIL.
One possible solution is something like:

rd_file, self.rd_file = self.rd_file, None
wr_file, self.wr_file = self.wr_file, None
with nogil:
    if self.is_readable:
        check_status(rd_file.get().close())
    else:
        check_status(rd_file.get().close())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That'd work, but prevents calling close again in the case of a failure. It's not actually clear to me what should happen if close fails (reset state?, set closed anayway?). Anyway, should probably open a jira about this, as it's unrelated to this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general guideline for close() is that even if it fails somehow (e.g. IO error), it should still put the object in a "closed" state (and release associated resources).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GIL should be released here -- if any Python resources are required in Close then the GIL will be re-acquired

@jcrist
Copy link
Contributor Author

jcrist commented Jan 30, 2018

Test failure is unrelated, ready for review. We're waiting on this to simplify a PR adding arrow's hdfs (really just libhdfs) support to dask.

@wesm
Copy link
Member

wesm commented Jan 30, 2018

Looking now. Can you rebase?

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine -- just a minor comment about methods that might be better as properties (@cpcloud do you have an opinion?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be properties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The seekable/readable/writable? No, this is required for the python file interface. See https://docs.python.org/3/library/io.html#io.IOBase

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger, thanks

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GIL should be released here -- if any Python resources are required in Close then the GIL will be re-acquired

Adds support for most common file methods, adding enough to use
`io.TextIOWrapper`.

Added attribtes/methods:

- `closed` attribute
- `readable`, `writable`, `seekable` methods
- `read1` alias for `read` to support `TextIOWrapper` on python 2
- No-op `flush` method

Also refactored the cython internals a bit, adding default settings for
`is_readable` and `is_writable`, which makes subclasses not need to set
them in all places.

Also renamed `is_writeable` to `is_writable` for common spelling with
the standard python method `writable`.
@jcrist jcrist force-pushed the full-python-file-interface branch from 082acd6 to f3bc954 Compare January 30, 2018 20:22
@jcrist
Copy link
Contributor Author

jcrist commented Jan 30, 2018

Can you rebase?

Rebased. I'm curious why was that necessary; there were no merge conflicts (none of the files edited have been touched since I submitted the PR). Just to run the CI again?

@wesm
Copy link
Member

wesm commented Jan 30, 2018

@jcrist the build was failing due to a Parquet dependency upgrade (Thrift 0.11), so the Python tests wouldn't run without picking up recent patches 0621765

@wesm
Copy link
Member

wesm commented Jan 30, 2018

+1, will merge once the build completes

@wesm
Copy link
Member

wesm commented Jan 30, 2018

@jcrist can you enable Appveyor on your fork? Then builds for your PRs will run a little more quickly there

@jcrist
Copy link
Contributor Author

jcrist commented Jan 30, 2018

@wesm
Copy link
Member

wesm commented Jan 30, 2018

Cool, from https://ci.appveyor.com/project/jcrist/arrow/build/1.0.1 looks like we are good. The Travis CI failure is ARROW-2062

@wesm wesm closed this in 3e63084 Jan 30, 2018
@jcrist jcrist deleted the full-python-file-interface branch January 30, 2018 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants