Skip to content

Conversation

@vibhatha
Copy link
Collaborator

@vibhatha vibhatha commented Oct 23, 2023

Rationale for this change

This PR fixes apache/arrow-java#169

What changes are included in this PR?

A minor change for a function parameter which fixes the available space in the destination buffer.

Are these changes tested?

No

Are there any user-facing changes?

No

@vibhatha
Copy link
Collaborator Author

@lidavidm Does this minor change need a test case? though I am not exactly sure how to write one.

@lidavidm
Copy link
Member

I assume you can come up with some data that is incompressible, which would overflow the buffer in the old code, and that should cause an exception when writerIndex is set, or else it would have caused a compress-decompress roundtrip to fail

@shujiewu did you observe an exception in your code? Or just silently corrupted data?

@shujiewu
Copy link

@lidavidm There was no error reported when writing compressed data, but an error occurred while decompressing the data. The error message is "ZSTD compression failed: Destination buffer is too small". This may not be easy to reproduce and only occurs on specific data.

@lidavidm
Copy link
Member

@vibhatha it sounds like we can reproduce this via a roundtrip test then. Thanks @shujiewu!

@vibhatha
Copy link
Collaborator Author

@lidavidm I will evaluate this. Thank you.

@lidavidm lidavidm added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Oct 23, 2023
@pitrou
Copy link
Member

pitrou commented Oct 24, 2023

The error message is "ZSTD compression failed: Destination buffer is too small".

I don't understand how this can occur, since we're passing a buffer size that's larger than the value returned by CompressBound.

@pitrou
Copy link
Member

pitrou commented Oct 24, 2023

For the record, the change in this PR is correct, but it's unclear to me why it would fix the mentioned compression error.

@vibhatha
Copy link
Collaborator Author

@lidavidm I did a few tests, by generating randomized data as follows;

private static void writeRandomIntData(IntVector vector, int valueCount) {
    Random random = new Random();
    for (int i = 0; i < valueCount; i++) {
      vector.setSafe(i, random.nextInt());
    }
    vector.setValueCount(valueCount);
  }

Used a function like this, assuming randomized data would be hard to compress. And I didn't get the expected exception. Also having a very smaller dataset could also make it hard to compress with (less history, specially with Zstd), and it didn't work as well.

I might be wrong with my conclusion or something could be erroneous in my test data, appreciate your feedback here.

@lidavidm
Copy link
Member

Hmm, true. @shujiewu do you have a stack trace?

@shujiewu
Copy link

Sorry, it's a mistake, the error message is "ZSTD decompression failed: Destination buffer is too small". We used the Java library for data compression and then the C++ library for decompression.

@shujiewu
Copy link

Most cases work well, only special data throws exceptions when decompressed, the code location is
https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/compression_zstd.cc#L194

@lidavidm
Copy link
Member

Ah, thanks for the clarification!

@vibhatha you should print out the buffer lengths in the problematic function and confirm that the overflow is actually happening.

@vibhatha
Copy link
Collaborator Author

@lidavidm would it be worth while if we test with the mentioned specific data? Also we should do the test by compressing with Java and decompress with C++ and evaluate this?

@shujiewu
Copy link

I think decompressing in java might throw an exception as well. I have some data that may throw an exception, but sensitive

@pitrou
Copy link
Member

pitrou commented Oct 24, 2023

If you're not able to share the data with us, then I would suggest that you try diagnosing it yourself. Start with printing out the various sizes on the compression front (in Java), then print those values when decompressing (in C++). If you notice any discrepancy, then this might lead you to the underlying cause.

@vibhatha
Copy link
Collaborator Author

vibhatha commented Feb 5, 2024

@pitrou @lidavidm shall we close this?

@vibhatha
Copy link
Collaborator Author

vibhatha commented Apr 8, 2024

Closing this PR since required data is not available to replicate the error.

@vibhatha vibhatha closed this Apr 8, 2024
@github-actions
Copy link

⚠️ GitHub issue #37413 has no components, please add labels for components.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting review Awaiting review Component: Java Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Java] ZSTD compression issue

4 participants