Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-302: [C++/Python] Implement C++ IO interfaces for interacting with Python file and bytes objects #152

Closed
wants to merge 11 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Oct 2, 2016

This will enable code (such as arrow IPC or Parquet) that only knows about Arrow's IO subsystem to interact with Python objects in various ways. In other words, when we have in C++:

std::shared_ptr<io::ReadableFileInterface> handle = ...;
handle->Read(nbytes, &out);

then the C++ file handle could be invoking the read method of a Python object. Same goes for arrow::io::OutputStream and write methods. There's data copying in some places overhead because of the rigid memory ownership semantics of the PyBytes type, but this can't be avoided here.

Another nice thing is that if we have some data in a Python bytes object that we want to expose to some other C++ component, we can wrap it in the PyBytesReader which provides zero-copy read access to the underlying data.

Change-Id: I712800f68cd0c9b0de1cca0d511f3f7d297c05b2
Change-Id: I2feee0fd6a6faa5c486ab4699a7cdc8c042d5726
Change-Id: Ie9f6f13d7e1fb55e948da2a7709864978df7306a
Change-Id: If86ff0ce8b25bd28c9028e93624af62ba01f8a60
Change-Id: I9d6e656ea152ef678ac01e22e4e24b3783ccacc0
Change-Id: I131f49d73c73097c5ac250bfc0c60639358c0954
Change-Id: I2ef9faca323f333e7f828d757d8740d20e5e467a
@wesm
Copy link
Member Author

wesm commented Oct 3, 2016

@xhochy there's a small API change here, I will patch Parquet and bump the pinned hash

Change-Id: I87b6281d7a42ce7c3d0a9b9ae8744afe530a84d6
Change-Id: Icd397820eb6d7d9aa36d617c3f50900ec7d04030
Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM but I hope you're going to fix the part marked with // XXX:

RETURN_NOT_OK(ReadAt(position, nbytes, &bytes_read, buffer->mutable_data()));

// XXX: heuristic
if (bytes_read < nbytes / 2) { RETURN_NOT_OK(buffer->Resize(bytes_read)); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we resize down to bytes_read? PoolBuffer itself will only increase its allocated memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do, I left this mostly to be controversial since the realloc-copy may not be worth it when bytes_read is close to nbytes

RETURN_NOT_OK(Read(nbytes, &bytes_read, buffer->mutable_data()));

// XXX: heuristic
if (bytes_read < nbytes / 2) { RETURN_NOT_OK(buffer->Resize(bytes_read)); }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we resize down to bytes_read? PoolBuffer itself will only increase its allocated memory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also how do we communicate the actual bytes read then to the caller?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means we should definitely always exactly resize so that we know the number of bytes read

public:
virtual ~FileInterface() {}
virtual ~FileInterface() = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually implemented interfaces.cc so it should not be deleted here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this on purpose — I had to do some wrangling with the class hierarchy when I added default virtual implementations (which can be overridden, like they are for HDFS) of ReadAt

class PythonFile {
public:
PythonFile(PyObject* file);
~PythonFile();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destructors should be virtual.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not here, this class has no subclasses

class PYARROW_EXPORT PyReadableFile : public arrow::io::ReadableFileInterface {
public:
explicit PyReadableFile(PyObject* file);
~PyReadableFile();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destructors should be virtual.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicitly virtual because of the abstract base class. I think this can be marked override

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A quick Google seems to confirm this, but it would be nice to find a more authoritative reference: http://stackoverflow.com/questions/5610135/with-virtual-destructors-do-i-need-to-explicitly-declare-a-virtual-destructor-f

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually override doesn't make sense with virtual destructors because they all get invoked, not overridden

class PYARROW_EXPORT PyOutputStream : public arrow::io::OutputStream {
public:
explicit PyOutputStream(PyObject* file);
~PyOutputStream();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destructors should be virtual.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

class PYARROW_EXPORT PyBytesReader : public arrow::io::BufferReader {
public:
explicit PyBytesReader(PyObject* obj);
~PyBytesReader();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Destructors should be virtual.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above

Change-Id: Ib67ce254d8a2aeff3df12e8d57525f3383a3c344
Change-Id: I498fb6e26642065c32c8425e5e6564bfb7b75c78
@wesm
Copy link
Member Author

wesm commented Oct 4, 2016

+1

@asfgit asfgit closed this in c7e6a07 Oct 4, 2016
@wesm wesm deleted the ARROW-302 branch October 4, 2016 03:15
@xhochy
Copy link
Member

xhochy commented Oct 4, 2016

I hope to get some time to setup clang-static-analyzer for arrow in the future. That should help us in the question about where you need virtual for destructors and also notify us of virtual method calls in constructors.

wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
…alues

Closes apache#145 (this bug was fixed as part of other recent patches)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#152 from wesm/PARQUET-703 and squashes the following commits:

232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls
wesm added a commit to wesm/arrow that referenced this pull request Sep 4, 2018
…alues

Closes apache#145 (this bug was fixed as part of other recent patches)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#152 from wesm/PARQUET-703 and squashes the following commits:

232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls

Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
wesm added a commit to wesm/arrow that referenced this pull request Sep 6, 2018
…alues

Closes apache#145 (this bug was fixed as part of other recent patches)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#152 from wesm/PARQUET-703 and squashes the following commits:

232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls

Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
wesm added a commit to wesm/arrow that referenced this pull request Sep 7, 2018
…alues

Closes apache#145 (this bug was fixed as part of other recent patches)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#152 from wesm/PARQUET-703 and squashes the following commits:

232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls

Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
wesm added a commit to wesm/arrow that referenced this pull request Sep 8, 2018
…alues

Closes apache#145 (this bug was fixed as part of other recent patches)

Author: Wes McKinney <wes.mckinney@twosigma.com>

Closes apache#152 from wesm/PARQUET-703 and squashes the following commits:

232d2f1 [Wes McKinney] Validate that ColumnChunk metadata has the number of values set including nulls

Change-Id: I6f7ac294fe19fb311a4f7457b1216887b4326034
jackylee-ch added a commit to jackylee-ch/arrow that referenced this pull request Sep 10, 2022
* support invalid charactor in url

* check space charactor in url queries

* change string to string_view

* change the way to check invalid charactors
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants