-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometimes fail #25673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…s fail. testApplicationReplay fails if the application runs long enough for the driver to send an executor metrics update. This causes stage executor metrics to be written for the driver. However, executor metrics updates are not logged, and thus not replayed. Therefore no stage executor metrics for the driver is logged on replay.
| // For this reason, exclude stage executor metrics for the driver. | ||
| val filteredEvents = originalEvents.filter { e => | ||
| if (e.isInstanceOf[SparkListenerStageExecutorMetrics] && | ||
| e.asInstanceOf[SparkListenerStageExecutorMetrics].execId == "driver") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I am not understanding is that, why are both different? They make JSONs from the same instances. Are they different? Why is it flaky?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EventMonster extends EventLoggingListener which doesn't transparently log events, hence some possible cases could exist where both can be different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to explain briefly in the comment. Let me try to explain in more detail.
When the application is run, executor metrics update events may occur (if the application runs long enough). These events are not written to the event log, so they do not get replayed. That is the root cause of differences.
On executor metrics update, the EventLoggingListener updates a map it uses to track per-stage metrics, but does not log. The listener also receives metrics on task end. Because of this, it always has metrics for executors. On stage completion. the listener logs stage executor metrics, for any of the executors/driver it has metrics for. As explained, sometimes it will have metrics for the driver (if an executor metrics update arrived from the driver), but most of the time not.
To recap, SparkListenerStageExecutorMetrics events are not events received while the application is running, but secondary events derived from metrics received and written to the event log. (One other point not relevant to this bug but may help understanding - on replay, the SparkListenerStageExecutorMetrics events do get replayed but are ignored by the listener. The replay listener logs SparkListenerStageExecutorMetrics events on stage completion using the same internal logic as the original listener.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had written the same explanation in #25659 (comment).
core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala
Outdated
Show resolved
Hide resolved
|
Test build #110092 has finished for PR 25673 at commit
|
|
I'd rather say we may need to find the way to let EventMonster not extending EventLoggingListener and just buffering all events. We don't need to deal with such complicated case in EventLoggingListener, as the purpose of ReplayListenerSuite is to test the fact that "ReplayListener can read all events sequentially and pass to registered listeners correctly". spark/core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala Lines 236 to 253 in df39855
I'm a bit concerned about the javadoc of EventMonster - I expect each event are posted to be same to all listeners. If that's the understanding of us (Spark community), it would end up with inconsistency if we allow the event itself or any fields in event to be modified somewhere (e.g. #25672). |
I'd rather not take on reworking EventMonster here. I think it is sufficient to handle the special case of |
Suggested by Jungtaek Lim.
|
The test is going to be unnecessary complicated, as I commented earlier. We are required to know about detailed behavior of EventLoggingListener, though we are in ReplayListenerSuite. ReplayListener is even not used with EventLoggingListener so it's really only for testing which doesn't make sense to couple with. Even we fix it for this time, the test might be broken again if we tune EventLoggingListener once more. For sure, it's not your fault. You did a great analysis. The code looks to be ancient, so it's not their fault too. We just seem to miss the relevant codes while EventLoggingListener is evolving. But at least now we indicate the unnecessary complexity, it's your bet to just add band-aid or fix it on the right way. I prefer latter, and I'm OK if committers are OK with former and merge the patch as it is. Maybe I'll raise another PR to fix it then. |
|
Test build #110135 has finished for PR 25673 at commit
|
Instead of having EventMonster extend EventLoggingListener, have it extend SparkFirehoseListener and simply buffer all events it receives. This makes it much more suitable for verifying that the ReplyListnerBus replays all the events from event logs. With this new EventMonster, there is no need to have special exception handling for certain types of events.
|
@HeartSaVioR, on second thoughts, after offline discussion with @squito, I agree that the best way to fix the tests is to reimplement EventMonster. Instead of extending EventLoggingListener, EventMonster should simply buffer any and all events it receives. @squito pointed out that EventMonster was added when ReplayListenerSuite was first added, and that SparkFirehoseListener did not exist then. Extending EventLoggingListener and adding a mode where it simply appends events to a buffer (in its logEvent method) was an easy way of implementing EventMonster. However, as EventLoggingListener evolved, it did not call logEvent on every event, and it also wrote events as side effects that depended on its internal state, leading to the current difficulties. |
HeartSaVioR
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Test build #110149 has finished for PR 25673 at commit
|
|
cc. @squito as he commented and discussed the approach of resolution in JIRA issue. |
core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala
Outdated
Show resolved
Hide resolved
|
lgtm. Also I see @bzhaoopenstack confirmed this does the trick on #25659. wing yew, can you do the rename? and update the PR descrpition to not mention EventMonster by name anymore |
|
Test build #110194 has finished for PR 25673 at commit
|
|
merged to master, thanks everyone |
…s fail ### What changes were proposed in this pull request? `ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged. We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus. ### Why are the changes needed? As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing unit tests. Closes apache#25673 from wypoon/SPARK-28770. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>
…s fail `ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged. We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus. As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged. No. Existing unit tests. Closes apache#25673 from wypoon/SPARK-28770. Authored-by: Wing Yew Poon <wypoon@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com> Ref: LIHADOOP-56563 RB=2387940 BUG=LIHADOOP-56563 G=spark-reviewers R=ekrogen A=ekrogen
What changes were proposed in this pull request?
ReplayListenerSuitedepends on a listener class to listen for replayed events. This class was implemented by extendingEventLoggingListener.EventLoggingListenerdoes not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.
Why are the changes needed?
As explained above. Tests sometimes fail due to events being received by the
EventLoggingListenerthat do not get logged (and thus do not get replayed) but influence other events that get logged.Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit tests.