Fix performance regression when hive SerDe doesn't prefer Writables #15163

pettyjamesm · 2020-09-11T15:41:49Z

In #8206, GenericHiveRecordCursor was modified to avoid extra overhead when the SerDe provides a more efficient String handling implementation with Writables. However, when the SerDe does not provide such an implementation and instead returns String instances directly, that change introduced an extra conversion from bytes to String, only for the result to be converted back to bytes.

This change alters the behavior of GenericHiveRecordCursor#parseString to respect the PrimitiveObjectInspector's preference for using writables.
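As a rough illustration of the idea (not the actual diff), the sketch below picks the representation the inspector says it prefers instead of always going through the Writable. `TextLike` and `StringInspector` are hypothetical stand-ins defined only for this example. The three method names mirror Hive's `PrimitiveObjectInspector` (`preferWritable`, `getPrimitiveWritableObject`, `getPrimitiveJavaObject`), but the real types and the real `parseString` signature differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: hypothetical stand-ins, not the Presto or Hive classes.
final class WritablePreferenceSketch {
    /** Stand-in for a Text-like writable: a backing buffer plus the number of valid bytes. */
    public static final class TextLike {
        public final byte[] bytes;
        public final int length;

        public TextLike(byte[] bytes, int length) {
            this.bytes = bytes;
            this.length = length;
        }
    }

    /** Stand-in exposing only the three inspector methods relevant here. */
    public interface StringInspector {
        boolean preferWritable();

        TextLike getPrimitiveWritableObject(Object fieldData);

        String getPrimitiveJavaObject(Object fieldData);
    }

    /** Returns the UTF-8 bytes of the field, using whichever form the inspector prefers. */
    public static byte[] parseString(Object fieldData, StringInspector inspector) {
        if (inspector.preferWritable()) {
            // The SerDe already holds bytes: copy only the valid prefix of the buffer.
            TextLike text = inspector.getPrimitiveWritableObject(fieldData);
            return Arrays.copyOf(text.bytes, text.length);
        }
        // The SerDe already holds a String: encode it once, with no detour through a writable.
        return inspector.getPrimitiveJavaObject(fieldData).getBytes(StandardCharsets.UTF_8);
    }

    private WritablePreferenceSketch() {}
}
```

With an inspector whose `preferWritable()` returns false, only `getPrimitiveJavaObject` is called, so the String is encoded to bytes exactly once.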

== RELEASE NOTES ==
Hive Changes
* Fix a performance regression for String field handling in GenericHiveRecordCursor when the SerDe does not provide an efficient Writable implementation

yingsu00

"Avoid DateTimeZone.getDefault() in GenericHiveRecordCursor hot path" LGTM

I see too much Slice object creation in this code, but it has been like that for a long time. I'm not saying we have to improve it in this PR by avoiding the Slice creation and memory allocations, but it's certainly something to explore in the near future.

presto-hive/src/main/java/com/facebook/presto/hive/GenericHiveRecordCursor.java

pettyjamesm · 2020-09-15T12:59:51Z

I see too much Slice object creation in this code, but it has been like that for a long time. I'm not saying we have to improve it in this PR by avoiding the Slice creation and memory allocations, but it's certainly something to explore in the near future.

Which copy do you see that we can avoid? The only avoidable one I see is this one: https://github.com/prestodb/presto/pull/15163/files#diff-dc300db27cccd3a6d61eb547021170d6R432. It copies the contents if trimming the string down to the character limit resulted in a smaller string. Without it, large buffers could be retained by much smaller strings, similar to the String#substring problem that was changed in Java 7. Is there something else you see that we can easily address before merging?
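To make the retention concern concrete, here is a minimal sketch that uses `java.nio.ByteBuffer` as a stand-in for a Slice-like view. It is not the code in the PR: `CompactAfterTruncateSketch`, `truncate`, and `prefixLength` are hypothetical names, and the input is assumed to be valid UTF-8. When trimming actually shortens the value, the bytes are copied into a right-sized array so the small result no longer pins the large backing buffer. When nothing is trimmed, the original view is returned without a copy.

```java
import java.nio.ByteBuffer;

// Sketch only: ByteBuffer stands in for a view over a shared backing buffer.
final class CompactAfterTruncateSketch {
    /** Byte length of the first maxCodePoints code points, assuming valid UTF-8. */
    public static int prefixLength(ByteBuffer value, int maxCodePoints) {
        int start = value.position();
        int codePoints = 0;
        for (int i = start; i < value.limit(); i++) {
            // Every byte that is not a continuation byte (10xxxxxx) starts a new code point.
            boolean startsCodePoint = (value.get(i) & 0xC0) != 0x80;
            if (startsCodePoint && ++codePoints > maxCodePoints) {
                return i - start;
            }
        }
        return value.remaining();
    }

    /** Trims to maxCodePoints; copies only when the result is shorter than the input. */
    public static ByteBuffer truncate(ByteBuffer value, int maxCodePoints) {
        int end = prefixLength(value, maxCodePoints);
        if (end == value.remaining()) {
            return value; // nothing trimmed, so no extra allocation
        }
        byte[] compact = new byte[end];
        value.duplicate().get(compact); // read via a duplicate so the caller's position is untouched
        return ByteBuffer.wrap(compact); // backed by a right-sized array, not the original buffer
    }

    private CompactAfterTruncateSketch() {}
}
```

The trade-off is one extra allocation per trimmed value in exchange for not keeping an arbitrarily large buffer alive behind a short string.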

pettyjamesm · 2020-09-16T21:38:29Z

@yingsu00 anything else we need to do here to be mergeable?

yingsu00 · 2020-09-17T00:57:28Z

I'm looking at removing the usage of Slice entirely. The overhead of creating Slices is too high, especially when the payload size is small. In our production workload we've seen it cause many GC issues. But that's outside the scope of this PR. I'll accept it.

yingsu00 · 2020-09-17T00:59:51Z

@mbasmanova Do you want to do a second round of review on this PR?

pettyjamesm mentioned this pull request Sep 11, 2020

Fix performance regression when hive SerDe doesn't prefer Writables trinodb/trino#5142

Merged

pettyjamesm requested a review from yingsu00 September 11, 2020 23:10

yingsu00 reviewed Sep 15, 2020

presto-hive/src/main/java/com/facebook/presto/hive/GenericHiveRecordCursor.java (outdated)

pettyjamesm added 2 commits September 15, 2020 08:52

Avoid DateTimeZone.getDefault() in GenericHiveRecordCursor hot path

2bad2ed

pettyjamesm force-pushed the hive-recordcursor-writable-fix branch from 5c79a2a to 2bad2ed Compare September 15, 2020 12:54

yingsu00 self-requested a review September 17, 2020 00:57

yingsu00 approved these changes Sep 17, 2020

yingsu00 requested a review from mbasmanova September 17, 2020 00:58

mbasmanova merged commit 600c157 into prestodb:master Sep 17, 2020

pettyjamesm deleted the hive-recordcursor-writable-fix branch September 17, 2020 11:41

This was referenced Oct 6, 2020

Add release notes for 0.242 #15270

Merged

[Test] Add release notes for 0.242 #15291

Closed

[Test-Only] Add release notes for 0.242 #15294

Closed
