-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-1756: [Python] Fix large file read/write error #1276
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this issue limited to Mac? I thought the bug reporter was on Linux, but either way this is a good thing to fix
cpp/src/arrow/io/file.cc
Outdated
@@ -246,31 +246,55 @@ static inline Status FileRead(int fd, uint8_t* buffer, int64_t nbytes, | |||
} | |||
*bytes_read = static_cast<int64_t>(_read(fd, buffer, static_cast<uint32_t>(nbytes))); | |||
#else | |||
*bytes_read = static_cast<int64_t>(read(fd, buffer, static_cast<size_t>(nbytes))); | |||
int64_t n = 0; | |||
int64_t N = nbytes; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since N
is never changed, just use nbytes
? Maybe make it const int64_t nbytes
in the function arguments for good measure
cpp/src/arrow/io/file.cc
Outdated
|
||
*bytes_read = 0; | ||
|
||
while (ret != -1 && n < N) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use *bytes_read
instead of n
?
cpp/src/arrow/io/file.cc
Outdated
*bytes_read = 0; | ||
|
||
while (ret != -1 && n < N) { | ||
int64_t i = std::min(static_cast<int64_t>(INT32_MAX), N - n); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a more descriptive variable name than i
cpp/src/arrow/io/file.cc
Outdated
while (ret != -1 && n < N) { | ||
int64_t i = std::min(static_cast<int64_t>(INT32_MAX), N - n); | ||
ret = static_cast<int>(write(fd, buffer + n, static_cast<size_t>(i))); | ||
n += i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comments as above.
python/pyarrow/tests/test_feather.py
Outdated
|
||
@pytest.mark.slow | ||
def test_large_dataframe(self): | ||
df = pd.DataFrame(np.random.randint(0, 100, size=(400000000, 1)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any data will do here, so you could do np.arange
here to make this faster rather than generating so much random data
python/pyarrow/tests/test_feather.py
Outdated
@pytest.mark.slow | ||
def test_large_dataframe(self): | ||
df = pd.DataFrame(np.random.randint(0, 100, size=(400000000, 1)), | ||
columns=list('A')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
['A']
Thanks for looking at this |
@wesm |
@wesm I found the reason why. http://man7.org/linux/man-pages/man2/write.2.html
|
OK, thanks. I'll wait on you to address the code comments and then we can merge this |
3b5782a
to
81c1972
Compare
@wesm Fixed. Now it works fine on Linux! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, thanks for taking a deep dive into those internals.
Will merge once the build is green.
@xhochy I want to solve the Windows issue as same. |
@Licht-T yes, please open a JIRA. I would merge this PR once Travis is happy but sadly C++ builds keep timeouting on Travis. I will re-trigger later again and merge on success. |
cpp/src/arrow/io/file.cc
Outdated
#define ARROW_IO_MAX_CONUT INT32_MAX | ||
#else | ||
// see notes on Linux read/write manpage | ||
#define ARROW_IO_MAX_CONUT 0x7ffff000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typos in these defines, I will fix
cpp/src/arrow/io/file.cc
Outdated
*bytes_read = 0; | ||
|
||
while (*bytes_read != -1 && *bytes_read < nbytes) { | ||
int64_t i = std::min(static_cast<int64_t>(ARROW_IO_MAX_CONUT), nbytes - *bytes_read); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use descriptive variable names. I will fix
Change-Id: I14967fb34936b61bc1fd636b702888e95a332765
This is infinite-looping on Linux, debugging |
Change-Id: I00ee07fdc0d25f399610db49b2adfa65088e9d63
+1 |
thanks @Licht-T! |
This is the part of ARROW-1756.