[SPARK-21321][Spark Core] Spark very verbose on shutdown #18547
Conversation
I still am not sure it's worth revising these. What's logged at shutdown isn't that big.
@@ -206,6 +206,13 @@ private[netty] class Inbox(
       case NonFatal(e) =>
         try endpoint.onError(e) catch {
-          case NonFatal(ee) => logError(s"Ignoring error", ee)
+          case NonFatal(ee) =>
+            if (stopped) {
+              logDebug(s"Ignoring error", ee)
logInfo, not debug, probably. I can see reducing the level if you know you're stopping, but not in other cases. The code style here would have to be fixed. For example, you don't use string interpolation here.
There is no reason to print out messages that aren't useful to users. Many users see warnings, read them, and think there is a problem with their application or configuration. Most of these are really for advanced users or devs trying to debug what is going on, so changing them to debug makes sense to me. There are also times when shutdown can take a very long time and all it's doing is spewing these log messages, for instance when you have a backlog of events in the event queue. I have seen multiple minutes of it just throwing warnings like: Message

I still need to review in detail, but these should just be in the shutdown case; other times I agree that you may want to see these. We should definitely update the description here and in the jira to have more details though.
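For readers following the thread, below is a minimal, self-contained sketch of the pattern the Inbox diff above is converging on. The names (`InboxSketch`, `EndpointSketch`) and the logging shims are hypothetical stand-ins, not Spark's real `Inbox`/`RpcEndpoint` classes: the level is lowered only when the inbox knows it has been stopped, and the plain strings follow the review note that no string interpolation is needed here.

```scala
import scala.util.control.NonFatal

// Hypothetical stand-in so the sketch compiles on its own.
trait EndpointSketch {
  def onError(cause: Throwable): Unit
}

class InboxSketch(endpoint: EndpointSketch) {
  @volatile private var stopped = false

  def stop(): Unit = { stopped = true }

  // The pattern under discussion: if the endpoint's own error handler also
  // fails, only surface that as an error while the inbox is still running;
  // during shutdown the failure is expected noise, so log it at debug.
  def safelyCall(action: => Unit): Unit = {
    try action catch {
      case NonFatal(e) =>
        try endpoint.onError(e) catch {
          case NonFatal(ee) =>
            if (stopped) {
              logDebug("Ignoring error", ee) // expected while stopping
            } else {
              logError("Ignoring error", ee) // still worth an ERROR otherwise
            }
        }
    }
  }

  // Minimal logging shims so the sketch is runnable without a logging library.
  private def logDebug(msg: String, t: Throwable): Unit = ()
  private def logError(msg: String, t: Throwable): Unit =
    System.err.println(s"ERROR $msg: ${t.getMessage}")
}
```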
Jenkins, test this please
Test build #79340 has finished for PR 18547 at commit
@yoonlee95 please look at the test failures, it might be a race in clearing the events
Jenkins, test this please
1 similar comment
Jenkins, test this please
ok to test
Test build #79591 has finished for PR 18547 at commit
@@ -200,7 +200,7 @@ private[spark] class LiveListenerBus(conf: SparkConf) extends SparkListenerBus {
   private def queueIsEmpty: Boolean = synchronized { eventQueue.isEmpty && !processingEvent }

   /**
-   * Stop the listener bus. It will wait until the queued events have been processed, but drop the
+   * Stop the listener bus. It will clear the queued events for faster shutdown and drop the
remove the comment now
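To make the behavior described in the updated scaladoc concrete, here is a heavily simplified, hypothetical bus (the `ListenerBusSketch` name and its shape are made up, not the real `LiveListenerBus`): stop() clears whatever is still queued instead of draining it, which is what keeps a large event backlog from stretching shutdown out.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Simplified sketch of a listener bus whose stop() discards the backlog.
class ListenerBusSketch[E] {
  private val eventQueue = new ConcurrentLinkedQueue[E]()
  @volatile private var stopped = false

  def post(event: E): Unit = {
    // Drop events that arrive after stop() rather than queueing them.
    if (!stopped) eventQueue.offer(event)
  }

  def stop(): Unit = synchronized {
    stopped = true
    eventQueue.clear() // discard queued events for a faster shutdown
  }
}
```

The trade-off, as the thread notes, is that queued events are dropped on stop rather than delivered.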
Test build #79611 has finished for PR 18547 at commit
Test build #79617 has finished for PR 18547 at commit
+1
## What changes were proposed in this pull request?

The current code is very verbose on shutdown. The change proposed is to lower the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException).

## How was this patch tested?

Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data.

Author: John Lee <jlee2@yahoo-inc.com>

Closes #18547 from yoonlee95/SPARK-21321.

(cherry picked from commit 0e07a29)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
…parkCluster shutdown

### What changes were proposed in this pull request?

Catch the `RpcEnvStoppedException` and log it at debug when stop is called for a `LocalSparkCluster`. This PR also contains two small changes to fix potential issues.

### Why are the changes needed?

Currently, there is always an "RpcEnv already stopped" error if we exit spark-shell with local-cluster mode:

```
20/06/07 14:54:18 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
    at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691)
    at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
```

When we call stop on `StandaloneSchedulerBackend`, the backend first sends `UnregisterApplication` to `Master` and then calls stop on `LocalSparkCluster` immediately. On the other side, `Master` sends messages to `Worker` when it receives `UnregisterApplication`. However, the rpcEnv of the `Worker` has already been stopped by the backend, so the error shows up when the `Worker` tries to handle those messages. It is only an error on shutdown and users do not need to care about it, so we can hide it in the debug log, which is also what we did previously in #18547.

### Does this PR introduce _any_ user-facing change?

Yes, users will not see the error message after this PR.

### How was this patch tested?

Tested manually.

Closes #28746 from Ngone51/fix-spark-31922.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
What changes were proposed in this pull request?
The current code is very verbose on shutdown.
The change I propose is to lower the log level when the driver is shutting down and the RPC connections are closed (RpcEnvStoppedException).
How was this patch tested?
Tested with word count (deploy-mode = cluster, master = yarn, num-executors = 4) with 300GB of data.
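Finally, a rough sketch of the log-level choice the description refers to. Everything here is illustrative: `RpcStoppedSketchException` and the println-based logging are stand-ins, not Spark's real exception or logging APIs.

```scala
import scala.util.control.NonFatal

object ShutdownLogLevelSketch {
  final class RpcStoppedSketchException
    extends IllegalStateException("RpcEnv already stopped.")

  // Pick the level for an RPC failure based on whether the driver is
  // deliberately shutting down and the failure just means a closed RpcEnv.
  def onRpcFailure(e: Throwable, driverStopping: Boolean): Unit = e match {
    case _: RpcStoppedSketchException if driverStopping =>
      // Connections are being closed on purpose, so this is expected noise.
      println(s"DEBUG ${e.getMessage}")
    case NonFatal(_) =>
      // Outside of shutdown the same failure still deserves attention.
      System.err.println(s"ERROR RPC failure: ${e.getMessage}")
    case _ =>
      throw e // rethrow fatal errors untouched
  }
}
```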