-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow: Avoid extra dictionary buffer copy #5137
Conversation
156d23d
to
5d497cc
Compare
2c0c7dd
to
4d7e9ca
Compare
4d7e9ca
to
3686ec3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @bryanck for working on this!
Mostly some style nits, plus a question for my own clarification. Thank you!
public UTF8String ofByteBuffer(ByteBuffer byteBuffer) { | ||
if (byteBuffer.hasArray()) { | ||
return UTF8String.fromBytes( | ||
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indentation
public UTF8String ofByteBuffer(ByteBuffer byteBuffer) { | ||
if (byteBuffer.hasArray()) { | ||
return UTF8String.fromBytes( | ||
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: over-indented (should be 4 spaces from the start of return
on the line above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I fixed these. Checkstyle didn't seem to mind...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for letting me know. I’ll see if I can add a checkstyle rule for that or update one to catch it!
public UTF8String ofByteBuffer(ByteBuffer byteBuffer) { | ||
if (byteBuffer.hasArray()) { | ||
return UTF8String.fromBytes( | ||
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indentation
public UTF8String ofByteBuffer(ByteBuffer byteBuffer) { | ||
if (byteBuffer.hasArray()) { | ||
return UTF8String.fromBytes( | ||
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: indentation
.toArray(genericArray(stringFactory.getGenericClass())); | ||
this.dictionary = dictionary; | ||
this.stringFactory = stringFactory; | ||
this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: why are you adding 1 here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To support a 0-based index of getMaxId(), you need a size that is one bigger
public String ofByteBuffer(ByteBuffer byteBuffer) { | ||
if (byteBuffer.hasArray()) { | ||
return new String(byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), | ||
byteBuffer.remaining(), StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: over-indented (should be 4 spaces from the start of return
on the line above).
Looks great. Thanks for fixing this, @bryanck! |
This PR changes the dictionary value accessors in the vectorized parquet reader so that the dictionary values are read from the underlying dictionary directly, rather than copying the values into a new buffer where relevant (this was already being done in the dictionary decimal accessor classes). The underlying parquet dictionary classes already load the values into a buffer, so copying them to a new buffer appears redundant in some cases.
This PR also makes a couple of changes to avoid binary buffer copies when building string values when possible.
In very limited testing, this shows a performance gain of over 20% in vectorized read performance in some scenarios, though more testing would be required to get more accurate metrics.