HADOOP-18526. Leak of S3AInstrumentation instances via hadoop Metrics references #5144
Conversation
tested, s3 london, parallel @ 8. one failure - created HADOOP-18531 for that
testing ideas
latest test run s3 london, params new test to verify that instrumentation.close() unregisters on the first call and is a no-op on the second...both a unit test for standalone use and an itest for the lifecycle through s3afs. that showed that instrumentation is being unregistered on close(), so if leakage is reported, it's from something creating fs instances but not closing them. the weak refs will ensure that the instrumentation isn't being held on to - but thread pools are still there to consume resources.
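A minimal sketch of the close-once behaviour those tests verify (hypothetical class and method names, not the actual S3AInstrumentation code): the first close() unregisters from the metrics system, every later call is a no-op.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Illustrative sketch of an idempotent close() that unregisters
 * from a metrics system exactly once, however often it is invoked.
 */
public class IdempotentInstrumentation {
  private final AtomicBoolean closed = new AtomicBoolean(false);
  private final AtomicInteger unregisterCalls = new AtomicInteger();

  /** Stand-in for the real metrics-system unregister call. */
  private void unregisterSource() {
    unregisterCalls.incrementAndGet();
  }

  /** First call unregisters; subsequent calls do nothing. */
  public void close() {
    if (closed.compareAndSet(false, true)) {
      unregisterSource();
    }
  }

  /** How many times unregistration actually happened. */
  public int unregisterCount() {
    return unregisterCalls.get();
  }
}
```

The compareAndSet makes the close path safe even when multiple threads race to close the same instance.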
want to add one more feature here: s3a fs initialize() to log stack trace at TRACE; helps to track down what is creating fs instances and not cleaning up properly
next pr logs stack at trace
force-pushed from 9811c0b to 5f34529
tested aws london with s3a fs set to log at TRACE so it prints stack traces on all fs creation...just search for "creation stack" in the output. no problems when this tracing is enabled. saw HADOOP-18351 again, so have just rebased for future runs. think this is ready for review now
build vms playing up: "Resource temporarily unavailable"
… references This has triggered an OOM in a process which was churning through s3a fs instances; the increased memory footprint of IOStatistics amplified what must have been a long standing issue. * Makes sure instrumentation is closed when fs is closed * Uses a weak reference from metrics to instrumentation, so even if the fs wasn't closed (see HADOOP-18478), this ref would not be holding things back. Change-Id: I3324d4bcfd9fb1770691bfca2d28ab0183faa27d
… references Add tests that instrumentation.close() unregisters *once*. reverted the change to close() it in s3afs.close(), because that was already done in stopAllServices(). if something is leaking S3AInstrumentation instances, it is because it is creating fs instances outside the cache, and not closing them. once dereferenced and a GC triggered, the fs instance and most internal classes are GC'd, but not the instrumentation because of the back ref. This weakref link will do that, but will result in the metrics structure being filled up with lots of weakrefs which don't get cleaned up. It's a lot less expensive than strong refs, but still not ideal. Fixing that would be tricky as the base metrics class or a subclass would need to do cleanups (when?). Change-Id: I07fe792301e23ed1d3676fbb8cc033c5a96e55a2
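The weak-reference indirection described in that commit can be sketched like this (illustrative names only; the real WeakRefMetricsSource class in the patch may differ): the metrics registry holds only a weak reference, so an unclosed filesystem's instrumentation can still be garbage collected once the fs itself becomes unreachable.

```java
import java.lang.ref.WeakReference;

/**
 * Sketch of a metrics source that holds its instrumentation
 * only via a weak reference, so registration in a long-lived
 * metrics registry does not pin the instrumentation in memory.
 */
public class WeakMetricsSource {
  private final WeakReference<Object> ref;

  public WeakMetricsSource(Object instrumentation) {
    this.ref = new WeakReference<>(instrumentation);
  }

  /** @return the instrumentation, or null once it has been GC'd. */
  public Object getSource() {
    return ref.get();
  }
}
```

As the commit text notes, the trade-off is that the registry accumulates cleared weak references unless something periodically prunes them.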
off by default...test/log4j.properties has it commented out Change-Id: Ie83b74009ccae8869b2c5386206b913e0f76af2f
force-pushed from 5f34529 to f8cdd3d
* use the weak ref when registering * log state of cache.disabled in trace-level constructor The reason for the logging is that running out of memory due to instrumentation refs is probably a sign of s3a fs instances being created and not closed. HADOOP-18476 should partially address this through FileContext; the trace is there to identify other possible causes. Change-Id: If78d16e8bfc63eea93ec328543ab110f82c93d90
💔 -1 overall
This message was automatically generated.
Looks good to me with one comment.
Makes sure instrumentation is closed when fs is closed
I can't find where instrumentation is getting closed in fs.close()
// for tracking down memory leak issues.
LOG.trace("Filesystem for {} created; fs.s3a.impl.disable.cache = {}",
    name, originalConf.getBoolean("fs.s3a.impl.disable.cache", false),
    new RuntimeException(super.toString()));
Why RTE?
we don't throw it, just trace it. it can be anything. what is your suggestion?
Why not just print it? I mean, I don't understand the reason behind wrapping in RuntimeException.
Also, base FileSystem doesn't implement toString() so there won't be anything. Why not use this.toString()?
- generating that runtime exception creates a stack trace which then gets logged in the trace call, so we can see who is creating the fs instance
- calling super.toString() so we get the minimal instance id rather than the expanded toString with all the stats and stuff. This is initialize(); most of those values are unset so it would only confuse.
it is that stack trace which is the most critical thing as it lets us track down where FS instances are being created and then not closed.
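To illustrate the trick under discussion: an exception object captures the calling stack at construction time, so passing it as the last argument to the logger prints that stack without anything being thrown. A standalone sketch (no logger dependency; the helper name is illustrative):

```java
/**
 * Sketch of the "create but never throw" pattern: building an
 * exception records the caller's stack trace, which a logger can
 * then print. Nothing is ever thrown.
 */
public class CreationStack {
  /** Build a throwable whose stack trace records who called us. */
  public static Throwable creationStack(String label) {
    return new RuntimeException(label);
  }

  // With SLF4J, a Throwable passed as the final argument after the
  // format parameters is printed with its stack trace, e.g.:
  //   LOG.trace("Filesystem for {} created", name, creationStack(id));
}
```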
done in
Looks good along with the tests.
...op-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/TestInstrumentationLifecycle.java
// robust extract and convert to string
LOG.debug("Statistics for {}: {}", uri,
    IOStatisticsLogging.ioStatisticsToPrettyString(getIOStatistics()));
// log IO statistics, including of any file deletion during
typo: of
it means "including iostatistics of any file deletion..." so IMO it's valid
Extended S3AInstrumentation and WeakRefMetricsSource to allow for the test to verify that the link from metrics to instrumentation works and is through the weak reference Change-Id: I4fc156dfa60d7c7a868d8b765777c0f445a83257
🎊 +1 overall
This message was automatically generated.
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java
hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AInstrumentation.java
i am happy with this patch and want to get it in to 3.3.5. please can i have reviewers tell me if they are happy now. thx
i am still happy with this patch and want to get it in to 3.3.5. please can i have reviewers tell me if they are happy now. thx
Final patch looks good to me +1
+1
https://github.com/apache/hadoop/pull/5144/files/48974549308b1eb63cb832e3a04beef3f593018d#r1046453289
@mukund-thakur tried to clarify: the exception is created to build the stack trace, which is what we really want.
+1 LGTM,
Thanks @steveloughran for explaining. Now I understand: the whole intent was to find the caller. Lol, I was being dumb.
… references (#5144) This has triggered an OOM in a process which was churning through s3a fs instances; the increased memory footprint of IOStatistics amplified what must have been a long-standing issue with FS instances being created and not closed() * Makes sure instrumentation is closed when the FS is closed. * Uses a weak reference from metrics to instrumentation, so even if the FS wasn't closed (see HADOOP-18478), this back reference would not cause the S3AInstrumentation reference to be retained. * If S3AFileSystem is configured to log at TRACE it will log the calling stack of initialize(), to help identify where the instance is being created. This should help track down the cause of instance leakage. Contributed by Steve Loughran.
This has triggered an OOM in a process which was churning through s3a fs instances; the increased memory footprint of IOStatistics amplified what must have been a long standing issue.
How was this patch tested?
cloud tests. no new tests...if there are suggestions, that'd be good
For code changes: have the LICENSE, LICENSE-binary and NOTICE-binary files been updated if needed?