Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow: Avoid extra dictionary buffer copy #5137

Merged
merged 4 commits into from
Jun 27, 2022

Conversation

bryanck
Copy link
Contributor

@bryanck bryanck commented Jun 27, 2022

This PR changes the dictionary value accessors in the vectorized parquet reader so that the dictionary values are read from the underlying dictionary directly, rather than copying the values into a new buffer where relevant (this was already being done in the dictionary decimal accessor classes). The underlying parquet dictionary classes already load the values into a buffer, so copying them to a new buffer appears redundant in some cases.

This PR also makes a couple of changes to avoid binary buffer copies when building string values when possible.

In very limited testing, this shows a performance gain of over 20% in vectorized read performance in some scenarios, though more testing would be required to get more accurate metrics.

@github-actions github-actions bot added the arrow label Jun 27, 2022
@bryanck bryanck force-pushed the dict-value-decode branch from 156d23d to 5d497cc Compare June 27, 2022 11:53
@github-actions github-actions bot added the spark label Jun 27, 2022
@bryanck bryanck force-pushed the dict-value-decode branch from 2c0c7dd to 4d7e9ca Compare June 27, 2022 13:09
@bryanck bryanck force-pushed the dict-value-decode branch from 4d7e9ca to 3686ec3 Compare June 27, 2022 13:13
Copy link
Contributor

@kbendick kbendick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bryanck for working on this!

Mostly some style nits, plus a question for my own clarification. Thank you!

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: over-indented (should be 4 spaces from the start of return on the line above).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I fixed these. Checkstyle didn't seem to mind...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for letting me know. I’ll see if I can add a checkstyle rule for that or update one to catch it!

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

public UTF8String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return UTF8String.fromBytes(
byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(), byteBuffer.remaining());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: indentation

.toArray(genericArray(stringFactory.getGenericClass()));
this.dictionary = dictionary;
this.stringFactory = stringFactory;
this.cache = genericArray(stringFactory.getGenericClass(), dictionary.getMaxId() + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: why are you adding 1 here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To support a 0-based index of getMaxId(), you need a size that is one bigger

public String ofByteBuffer(ByteBuffer byteBuffer) {
if (byteBuffer.hasArray()) {
return new String(byteBuffer.array(), byteBuffer.arrayOffset() + byteBuffer.position(),
byteBuffer.remaining(), StandardCharsets.UTF_8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: over-indented (should be 4 spaces from the start of return on the line above).

@rdblue
Copy link
Contributor

rdblue commented Jun 27, 2022

Looks great. Thanks for fixing this, @bryanck!

@rdblue rdblue merged commit eff6556 into apache:master Jun 27, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
namrathamyske pushed a commit to namrathamyske/iceberg that referenced this pull request Jul 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants