Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance regression when hive SerDe doesn't prefer Writables #15163

Merged

Conversation

pettyjamesm
Copy link
Contributor

Introduced in #8206 GenericHiveRecordCursor was modified to avoid extra overhead when the SerDe provided a more efficient String handling implementation with Writables. However, when the SerDe does not provide such an implementation and instead already returned String instances directly, this change introduced an extra conversion from bytes to String just to be converted back to bytes.

This change alters the behavior of GenericHiveRecordCursor#parseString to respect the PrimitiveObjectInspector's preference for using writables.

== RELEASE NOTES ==
Hive Changes
* Fix a performance regression for String field handling in GenericHiveRecordCursor when the SerDe does not provide an efficient Writable implementation

Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid DateTimeZone.getDefault() in GenericHiveRecordCursor hot path LGTM

I see too much Slice object creation in these code but they were like that from long time ago. Not to say we have to additionally improve it by avoid creating Slice's and memory allocations in this PR, but it's certainly something to explore in near future.

Introduced in prestodb#8206, GenericHiveRecordCursor was modified to
avoid extra overhead when the SerDe provided a more efficient String
handling implementation with Writables. However, when the SerDe does
not provide such an implementation and instead already returned String
instances directly, this change introduced an extra conversion from
bytes to String just to be converted back to bytes.

This change alters the behavior of GenericHiveRecordCursor parseString
to respect the PrimitiveObjectInspector's preference for using writables.
@pettyjamesm pettyjamesm force-pushed the hive-recordcursor-writable-fix branch from 5c79a2a to 2bad2ed Compare September 15, 2020 12:54
@pettyjamesm
Copy link
Contributor Author

I see too much Slice object creation in these code but they were like that from long time ago. Not to say we have to additionally improve it by avoid creating Slice's and memory allocations in this PR, but it's certainly something to explore in near future.

Which copy are you seeing that we can avoid? The only avoidable one I see is this one https://github.com/prestodb/presto/pull/15163/files#diff-dc300db27cccd3a6d61eb547021170d6R432 to copy the contents if trimming the string down to character limits resulted in a smaller string, but without it you could have large buffers being retained by much smaller strings similar to String#substring problem that was changed in Java 7. Is there something else you see that we can easily address before merging?

@pettyjamesm
Copy link
Contributor Author

@yingsu00 anything else we need to do here to be mergeable?

@yingsu00
Copy link
Contributor

I see too much Slice object creation in these code but they were like that from long time ago. Not to say we have to additionally improve it by avoid creating Slice's and memory allocations in this PR, but it's certainly something to explore in near future.

Which copy are you seeing that we can avoid? The only avoidable one I see is this one https://github.com/prestodb/presto/pull/15163/files#diff-dc300db27cccd3a6d61eb547021170d6R432 to copy the contents if trimming the string down to character limits resulted in a smaller string, but without it you could have large buffers being retained by much smaller strings similar to String#substring problem that was changed in Java 7. Is there something else you see that we can easily address before merging?

I'm looking at removing the usage of Slice at all. The overhead creating Slices's is too high especially when the payload size is small. In our production workload we've seen it causing many GC issues. But it's not the scope of this PR. I'll accept it.

@yingsu00 yingsu00 self-requested a review September 17, 2020 00:57
@yingsu00 yingsu00 requested a review from mbasmanova September 17, 2020 00:58
@yingsu00
Copy link
Contributor

@mbasmanova Do you want to review for a second round of this PR?

@mbasmanova mbasmanova merged commit 600c157 into prestodb:master Sep 17, 2020
@pettyjamesm pettyjamesm deleted the hive-recordcursor-writable-fix branch September 17, 2020 11:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants