-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-302: [C++/Python] Implement C++ IO interfaces for interacting with Python file and bytes objects #152
Conversation
Change-Id: I712800f68cd0c9b0de1cca0d511f3f7d297c05b2
Change-Id: I2feee0fd6a6faa5c486ab4699a7cdc8c042d5726
Change-Id: Ie9f6f13d7e1fb55e948da2a7709864978df7306a
Change-Id: If86ff0ce8b25bd28c9028e93624af62ba01f8a60
Change-Id: I9d6e656ea152ef678ac01e22e4e24b3783ccacc0
Change-Id: I131f49d73c73097c5ac250bfc0c60639358c0954
Change-Id: I2ef9faca323f333e7f828d757d8740d20e5e467a
@xhochy there's a small API change here, I will patch Parquet and bump the pinned hash |
Change-Id: Icd397820eb6d7d9aa36d617c3f50900ec7d04030
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM but I hope you're going to fix the part marked with // XXX:
RETURN_NOT_OK(ReadAt(position, nbytes, &bytes_read, buffer->mutable_data())); | ||
|
||
// XXX: heuristic | ||
if (bytes_read < nbytes / 2) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we resize down to bytes_read
? PoolBuffer
itself will only increase its allocated memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do, I left this mostly to be controversial since the realloc-copy may not be worth it when bytes_read is close to nbytes
RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data())); | ||
|
||
// XXX: heuristic | ||
if (bytes_read < nbytes / 2) { RETURN_NOT_OK(buffer->Resize(bytes_read)); } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we resize down to bytes_read? PoolBuffer itself will only increase its allocated memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also how do we communicate the actual bytes read then to the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means we should definitely always exactly resize so that we know the number of bytes read
public: | ||
virtual ~FileInterface() {} | ||
virtual ~FileInterface() = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually implemented interfaces.cc
so it should not be deleted here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this on purpose — I had to do some wrangling with the class hierarchy when I added default virtual implementations (which can be overridden, like they are for HDFS) of ReadAt
class PythonFile { | ||
public: | ||
PythonFile(PyObject* file); | ||
~PythonFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Destructors should be virtual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not here, this class has no subclasses
class PYARROW_EXPORT PyReadableFile : public arrow::io::ReadableFileInterface { | ||
public: | ||
explicit PyReadableFile(PyObject* file); | ||
~PyReadableFile(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Destructors should be virtual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implicitly virtual because of the abstract base class. I think this can be marked override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A quick Google seems to confirm this, but it would be nice to find a more authoritative reference: http://stackoverflow.com/questions/5610135/with-virtual-destructors-do-i-need-to-explicitly-declare-a-virtual-destructor-f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually override doesn't make sense with virtual destructors because they all get invoked, not overridden
class PYARROW_EXPORT PyOutputStream : public arrow::io::OutputStream { | ||
public: | ||
explicit PyOutputStream(PyObject* file); | ||
~PyOutputStream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Destructors should be virtual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader { | ||
public: | ||
explicit PyBytesReader(PyObject* obj); | ||
~PyBytesReader(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Destructors should be virtual.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above
Change-Id: Ib67ce254d8a2aeff3df12e8d57525f3383a3c344
Change-Id: I498fb6e26642065c32c8425e5e6564bfb7b75c78
+1 |
I hope to get some time to setup |
…alues Closes apache#145 (this bug was fixed as part of other recent patches) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#152 from wesm/PARQUET-703 and squashes the following commits: 232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls
…alues Closes apache#145 (this bug was fixed as part of other recent patches) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#152 from wesm/PARQUET-703 and squashes the following commits: 232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
…alues Closes apache#145 (this bug was fixed as part of other recent patches) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#152 from wesm/PARQUET-703 and squashes the following commits: 232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
…alues Closes apache#145 (this bug was fixed as part of other recent patches) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#152 from wesm/PARQUET-703 and squashes the following commits: 232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
…alues Closes apache#145 (this bug was fixed as part of other recent patches) Author: Wes McKinney <wes.mckinney@twosigma.com> Closes apache#152 from wesm/PARQUET-703 and squashes the following commits: 232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
* support invalid charactor in url * check space charactor in url queries * change string to string_view * change the way to check invalid charactors
This will enable code (such as arrow IPC or Parquet) that only knows about Arrow's IO subsystem to interact with Python objects in various ways. In other words, when we have in C++:
then the C++ file handle could be invoking the
read
method of a Python object. Same goes forarrow::io::OutputStream
andwrite
methods. There's data copying in some places overhead because of the rigid memory ownership semantics of thePyBytes
type, but this can't be avoided here.Another nice thing is that if we have some data in a Python bytes object that we want to expose to some other C++ component, we can wrap it in the
PyBytesReader
which provides zero-copy read access to the underlying data.