-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-30429][SQL] Optimize catalogString and usage in ValidateExternalType.errMsg to avoid OOM #27117
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we really need to output the error message for large expected.catalogString
, it will fail with OOM, right? which is not good. To avoid such kind of issues, it would be nice to use StringConcat
:
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/StringUtils.scala
Lines 102 to 107 in d9b3069
/** | |
* Concatenation of sequence of strings to final string with cheap append method | |
* and one memory allocation for the final string. Can also bound the final size of | |
* the string. | |
*/ | |
class StringConcat(val maxLength: Int = ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) { |
or truncate
expected.catalogString
somehow else.
@viirya Or OOM happens because we create a lot of ValidateExternalType.errMsg
?
Test build #116234 has finished for PR 27117 at commit
|
+1 to truncate |
When we transform on the serializer, we copy expressions there, it creates redundant As we will initiate I think as you and @cloud-fan, we may need to truncate |
StructType.catalogString.
Test build #116251 has finished for PR 27117 at commit
|
sql/catalyst/src/main/scala/org/apache/spark/sql/types/StructType.scala
Outdated
Show resolved
Hide resolved
Test build #116253 has finished for PR 27117 at commit
|
empty array.init isn't empty but throws |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this PR, there are three different themes. Could you update the PR title and description accordingly?
- Using
lazy
: Withlazy
, the OOM is gone at benchmark - Replacing
catalogString
withsimpleString
: This looks independent improvement forerrMsg
. - Optimizing
override def catalogString
: This is not used aterrMsg
due to (2). So, this becomes independent in this PR.
If the PR title and description is correctly updated, we don't need to split this PR since the code is simple. Otherwise, we may want to split this into multiple PRs.
@dongjoon-hyun Thanks! I've updated the PR description and title. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. (Pending Jenkins).
Thank you for updating, @viirya .
Thanks! @dongjoon-hyun @MaxGekk |
Test build #116256 has finished for PR 27117 at commit
|
Test build #116258 has finished for PR 27117 at commit
|
Merged to master. Thank you all! |
What changes were proposed in this pull request?
This patch proposes:
ValidateExternalType.errMsg
lazy variable, i.e. not to initiate it in the constructorerrMsg
: ReplacingcatalogString
withsimpleString
which is truncatedoverride def catalogString
inStructType
: MakecatalogString
more efficient in string generation by usingStringConcat
Why are the changes needed?
In the JIRA, it is found that WideSchemaBenchmark fails with OOM, like:
It is after cb5ea20 commit which refactors
ExpressionEncoder
.The stacktrace shows it fails at
transformUp
onobjSerializer
inExpressionEncoder
. In particular, it fails at initializingValidateExternalType.errMsg
, that interpolatescatalogString
of givenexpected
data type in a string. In WideSchemaBenchmark we have very deeply nested data type. When we transform on the serializer which containsValidateExternalType
, we create redundant big stringerrMsg
. Because we just in transforming it and don't use it yet, it is useless and waste a lot of memory.After make
ValidateExternalType.errMsg
as lazy variable, WideSchemaBenchmark works.Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual test with WideSchemaBenchmark.