PARQUET-2432: Use ByteBufferAllocator over hardcoded heap allocation #1278
gszadovszky merged 4 commits into apache:master
Conversation
* Updated BytesInput implementations to rely on a ByteBufferAllocator instance for allocating/releasing ByteBuffer objects.
* Extended the usage of a ByteBufferAllocator instead of the hardcoded usage of heap allocation (e.g. byte[], ByteBuffer.allocate, etc.).
* parquet-cli related code parts, including ParquetRewriter and tests, are not changed in this effort.
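The pattern described above can be sketched as follows. This is a simplified, hypothetical illustration (the `Allocator` interface here mirrors the shape of Parquet's `ByteBufferAllocator` but is redefined so the sketch is self-contained; it is not the actual Parquet code):

```java
import java.nio.ByteBuffer;

// Hypothetical, simplified stand-in for org.apache.parquet.bytes.ByteBufferAllocator:
// callers ask the allocator for buffers instead of hardcoding ByteBuffer.allocate.
interface Allocator {
    ByteBuffer allocate(int size);
    void release(ByteBuffer buffer);
}

class HeapAllocator implements Allocator {
    public ByteBuffer allocate(int size) {
        return ByteBuffer.allocate(size); // heap-backed buffer
    }
    public void release(ByteBuffer buffer) {
        // heap buffers are reclaimed by the GC; a direct-memory
        // implementation would free native memory here instead
    }
}

public class AllocatorDemo {
    public static void main(String[] args) {
        Allocator allocator = new HeapAllocator();
        // Instead of hardcoding ByteBuffer.allocate(1024), go through the allocator,
        // so heap vs. direct allocation becomes the caller's choice:
        ByteBuffer buf = allocator.allocate(1024);
        System.out.println(buf.capacity()); // 1024
        allocator.release(buf);
    }
}
```

Swapping `HeapAllocator` for a direct-memory implementation then requires no change in the code that writes into the buffer.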
@wgtmac, if you have some time, could you check this out?

Sure, I will take a look by the end of this week.
wgtmac left a comment:
I didn't check the tests thoroughly, but overall this LGTM.
```java
public abstract void writeAllTo(OutputStream out) throws IOException;

/**
 * For internal use only. It is expected that the buffer is large enough to fit the content of this {@link BytesInput}
```
Should we add a comment for what to expect if the content does not fit into the ByteBuffer?
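For context on what an undocumented overflow would look like: absent an explicit contract, a plain relative `put` of oversized content into a `ByteBuffer` throws `BufferOverflowException`. A minimal, self-contained sketch (deliberately unrelated to the actual `BytesInput` implementation):

```java
import java.nio.BufferOverflowException;
import java.nio.ByteBuffer;

public class OverflowDemo {
    public static void main(String[] args) {
        ByteBuffer small = ByteBuffer.allocate(4); // capacity 4
        byte[] content = new byte[8];              // 8 bytes do not fit
        try {
            small.put(content); // relative bulk put: throws when remaining() < content.length
            System.out.println("fits");
        } catch (BufferOverflowException e) {
            System.out.println("BufferOverflowException");
        }
    }
}
```

Documenting whether the method throws, truncates, or leaves the buffer in an undefined state would spare callers this guesswork.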
```java
 * @return a text representation of the memory usage of this structure
 */
public String memUsageString(String prefix) {
  return format("%s %s %d slabs, %,d bytes", prefix, getClass().getSimpleName(), slabs.size(), size);
```
Suggested change:

```diff
- return format("%s %s %d slabs, %,d bytes", prefix, getClass().getSimpleName(), slabs.size(), size);
+ return format("%s %s %d slabs, %d bytes", prefix, getClass().getSimpleName(), slabs.size(), size);
```
I've just copy-pasted this from ConcatenatingByteArrayCollector, but it seems to be intentional: %,d adds grouping separators to the value representation (e.g. 123,456,789).
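A quick demonstration of the difference between the two format flags (Locale.US is pinned here so the grouping separator is deterministic; the actual Parquet code uses the default locale):

```java
import java.util.Locale;

public class FormatDemo {
    public static void main(String[] args) {
        // %,d inserts locale-specific grouping separators; %d does not
        System.out.println(String.format(Locale.US, "%,d", 123456789)); // 123,456,789
        System.out.println(String.format(Locale.US, "%d", 123456789));  // 123456789
    }
}
```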
```java
import java.nio.ByteBuffer;

/**
 * A special {@link ByteBufferAllocator} implementation that keeps one {@link ByteBuffer} object and reuse it at the
```
Suggested change:

```diff
- * A special {@link ByteBufferAllocator} implementation that keeps one {@link ByteBuffer} object and reuse it at the
+ * A special {@link ByteBufferAllocator} implementation that keeps one {@link ByteBuffer} object and reuses it at the
```
```java
this.allocator = allocator;
this.toRelease = toRelease;
```

```java
void setReleaser(ByteBufferReleaser releaser) {
  this.releaser = releaser;
```
Should we check if the passed releaser is null?
This is internal (both the method and the class are package-private). I wouldn't add additional checks.
```java
byte[] serializedFooter = new byte[combinedFooterLength - footerSignatureLength];
System.arraycopy(footerAndSignature, 0, serializedFooter, 0, serializedFooter.length);
// Resetting to the beginning of the footer
from.reset();
```
Should we check from.markSupported() before calling reset() and mark()?
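One way the suggested guard could look, sketched with a hypothetical `readTwice` helper (invented for illustration; not code from the PR): check `markSupported()` first and, if mark/reset is unavailable, wrap the stream in a `BufferedInputStream`, which always supports it.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MarkResetDemo {
    // Hypothetical helper: reads the stream twice, guarding mark()/reset()
    // behind markSupported() instead of assuming support.
    static byte[] readTwice(InputStream in) throws IOException {
        if (!in.markSupported()) {
            // BufferedInputStream guarantees mark/reset support
            in = new BufferedInputStream(in);
        }
        in.mark(Integer.MAX_VALUE); // remember the current position
        in.readAllBytes();          // first pass
        in.reset();                 // rewind to the marked position
        return in.readAllBytes();   // second pass
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {1, 2, 3};
        byte[] again = readTwice(new ByteArrayInputStream(data));
        System.out.println(again.length); // 3
    }
}
```

Calling `reset()` on a stream that does not support marks throws `IOException`, which is why checking (or wrapping) up front is safer.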
```java
    allocator);
}

@Deprecated
```
The argument list grows longer now. Should we use an options class instead to avoid frequent deprecation?
On the one hand, I completely agree. On the other hand, ParquetFileWriter should be an internal class; it is unfortunate that it is public. I would not create yet another parameters builder for ParquetFileWriter.
I'll think about a solution somewhere in between.
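The "options class" idea raised above could look roughly like this hypothetical sketch (names such as `WriterOptions` are invented for illustration and are not ParquetFileWriter's actual API): new settings get a new builder method instead of a new constructor overload, so nothing needs deprecating as the list of settings grows.

```java
public class WriterOptionsDemo {
    // Hypothetical immutable options class built via a builder.
    static final class WriterOptions {
        final int pageSize;
        final boolean useDirectBuffers;

        private WriterOptions(Builder b) {
            this.pageSize = b.pageSize;
            this.useDirectBuffers = b.useDirectBuffers;
        }

        static final class Builder {
            // Defaults; adding a new option later means adding one
            // field and one method here, not a new constructor overload.
            int pageSize = 1 << 20;
            boolean useDirectBuffers = false;

            Builder pageSize(int v) { pageSize = v; return this; }
            Builder useDirectBuffers(boolean v) { useDirectBuffers = v; return this; }
            WriterOptions build() { return new WriterOptions(this); }
        }
    }

    public static void main(String[] args) {
        WriterOptions opts = new WriterOptions.Builder()
                .pageSize(8192)
                .build();
        System.out.println(opts.pageSize); // 8192
    }
}
```

The trade-off, as noted in the reply, is yet another builder type on a class that arguably should not be public in the first place.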
Thank you, @wgtmac

Hi @gszadovszky @wgtmac, it seems this patch may cause a deadlock. Hadoop code path: https://github.com/apache/hadoop/blob/rel/release-3.3.3/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSOutputStream.java#L631