[SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometimes fail #25673

wypoon · 2019-09-04T04:30:44Z

What changes were proposed in this pull request?

ReplayListenerSuite depends on a listener class to listen for replayed events. This class was implemented by extending EventLoggingListener. EventLoggingListener does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.

Why are the changes needed?

As explained above. Tests sometimes fail due to events being received by the EventLoggingListener that do not get logged (and thus do not get replayed) but influence other events that get logged.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

…s fail. testApplicationReplay fails if the application runs long enough for the driver to send an executor metrics update. This causes stage executor metrics to be written for the driver. However, executor metrics updates are not logged, and thus not replayed. Therefore no stage executor metrics for the driver are logged on replay.

HyukjinKwon · 2019-09-04T06:17:53Z

core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala

+    // For this reason, exclude stage executor metrics for the driver.
+    val filteredEvents = originalEvents.filter { e =>
+      if (e.isInstanceOf[SparkListenerStageExecutorMetrics] &&
+        e.asInstanceOf[SparkListenerStageExecutorMetrics].execId == "driver") {


What I don't understand is why the two are different. They make JSONs from the same instances. Are they different? Why is it flaky?

EventMonster extends EventLoggingListener, which doesn't log events transparently, so there can be cases where the two differ.

I tried to explain briefly in the comment. Let me try to explain in more detail.

When the application is run, executor metrics update events may occur (if the application runs long enough). These events are not written to the event log, so they do not get replayed. That is the root cause of differences.
On executor metrics update, the EventLoggingListener updates a map it uses to track per-stage metrics, but does not log. The listener also receives metrics on task end. Because of this, it always has metrics for executors. On stage completion, the listener logs stage executor metrics for any of the executors/driver it has metrics for. As explained, sometimes it will have metrics for the driver (if an executor metrics update arrived from the driver), but most of the time not.
To recap, SparkListenerStageExecutorMetrics events are not events received while the application is running, but secondary events derived from metrics received and written to the event log. (One other point not relevant to this bug but may help understanding - on replay, the SparkListenerStageExecutorMetrics events do get replayed but are ignored by the listener. The replay listener logs SparkListenerStageExecutorMetrics events on stage completion using the same internal logic as the original listener.)

I had written the same explanation in #25659 (comment).
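
To make the mechanism above concrete, here is a minimal, Spark-free sketch. Every name in it (`ReplayDivergence`, `ToyLoggingListener`, the toy event types) is a hypothetical stand-in, not the real EventLoggingListener or Spark's event classes, and writing the derived events just before the stage completion event is an assumption of the model.

```scala
import scala.collection.mutable

object ReplayDivergence {
  // Hypothetical stand-ins for Spark's listener events; not the real classes.
  sealed trait Event
  case class TaskEnd(execId: String) extends Event
  case class ExecutorMetricsUpdate(execId: String) extends Event
  case class StageExecutorMetrics(execId: String) extends Event
  case object StageCompleted extends Event

  // Toy model of the logging listener described above.
  class ToyLoggingListener {
    val logged = mutable.ArrayBuffer.empty[Event]
    // Executors (or the driver) with metrics seen during the current stage.
    private val seen = mutable.LinkedHashSet.empty[String]

    def onEvent(event: Event): Unit = event match {
      case TaskEnd(execId) =>
        // Task end carries metrics: tracked and written to the log.
        seen += execId
        logged += event
      case ExecutorMetricsUpdate(execId) =>
        // Tracked in internal state, but never written to the log.
        seen += execId
      case StageExecutorMetrics(_) =>
        // Ignored when received (e.g. on replay); re-derived below instead.
        ()
      case StageCompleted =>
        // Derived events are written from internal state (ordering is assumed).
        seen.foreach(id => logged += StageExecutorMetrics(id))
        seen.clear()
        logged += event
    }
  }

  // Feed events to a fresh listener and return what it logged.
  def run(events: Seq[Event]): Seq[Event] = {
    val listener = new ToyLoggingListener
    events.foreach(listener.onEvent)
    listener.logged.toSeq
  }

  def main(args: Array[String]): Unit = {
    val live: Seq[Event] =
      Seq(TaskEnd("1"), ExecutorMetricsUpdate("driver"), StageCompleted)
    val original = run(live)
    val replayed = run(original) // replay only sees what was logged
    println(original)
    println(replayed)
    println(original == replayed)
  }
}
```

In this toy model the original log contains a `StageExecutorMetrics("driver")` entry that the replayed log lacks. Without the `ExecutorMetricsUpdate("driver")` in the input, the two logs come out identical, which is consistent with the test failing only when the application runs long enough.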

SparkQA · 2019-09-04T06:54:23Z

Test build #110092 has finished for PR 25673 at commit 0b0adb0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-04T08:23:21Z

I'd rather say we may need to find the way to let EventMonster not extending EventLoggingListener and just buffering all events. We don't need to deal with such complicated case in EventLoggingListener, as the purpose of ReplayListenerSuite is to test the fact that "ReplayListener can read all events sequentially and pass to registered listeners correctly".

spark/core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala

Lines 236 to 253 in df39855

    
    /**
     * A simple listener that buffers all the events it receives.
     *
     * The event buffering functionality must be implemented within EventLoggingListener itself.
     * This is because of the following race condition: the event may be mutated between being
     * processed by one listener and being processed by another. Thus, in order to establish
     * a fair comparison between the original events and the replayed events, both functionalities
     * must be implemented within one listener (i.e. the EventLoggingListener).
     *
     * This child listener inherits only the event buffering functionality, but does not actually
     * log the events.
     */
    private class EventMonster(conf: SparkConf)
      extends EventLoggingListener("test", None, new URI("testdir"), conf) {

      override def start() { }

    }

I'm a bit concerned about the javadoc of EventMonster. I expect each event to be posted identically to all listeners. If that is our (the Spark community's) understanding, then allowing the event itself or any field in the event to be modified somewhere (e.g. #25672) would lead to inconsistency.

wypoon · 2019-09-04T16:22:54Z

> I'd rather say we may need to find the way to let EventMonster not extending EventLoggingListener and just buffering all events. We don't need to deal with such complicated case in EventLoggingListener, as the purpose of ReplayListenerSuite is to test the fact that "ReplayListener can read all events sequentially and pass to registered listeners correctly".

I'd rather not take on reworking EventMonster here. I think it is sufficient to handle the special case of SparkListenerStageExecutorMetrics by filtering.
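
For illustration only, one way such a filter could be written is with a pattern match. The sketch below uses hypothetical stand-in types (`DriverMetricsFilter`, `Event`, `OtherEvent`) rather than Spark's event classes, and it is not necessarily the form used in this patch.

```scala
object DriverMetricsFilter {
  // Hypothetical stand-ins for the event types; only execId matters here.
  sealed trait Event
  case class StageExecutorMetrics(execId: String) extends Event
  case class OtherEvent(name: String) extends Event

  // Drop stage executor metrics that belong to the driver; keep everything else.
  def withoutDriverMetrics(events: Seq[Event]): Seq[Event] =
    events.filter {
      case StageExecutorMetrics("driver") => false
      case _ => true
    }

  def main(args: Array[String]): Unit = {
    val events: Seq[Event] = Seq(
      OtherEvent("stage-completed"),
      StageExecutorMetrics("driver"),
      StageExecutorMetrics("1"))
    println(withoutDriverMetrics(events))
  }
}
```

The drawback raised in the review still applies to this approach: the test has to know which derived events the logging listener may emit.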

HeartSaVioR · 2019-09-04T18:23:57Z

The test is going to be unnecessarily complicated, as I commented earlier. We are required to know the detailed behavior of EventLoggingListener, even though we are in ReplayListenerSuite. ReplayListener is not even used with EventLoggingListener, so the coupling exists only for testing, which doesn't make sense. Even if we fix it this time, the test might break again if we tune EventLoggingListener once more.

For sure, it's not your fault. You did a great analysis. The code looks ancient, so it's not their fault either. We just seem to have missed the relevant code while EventLoggingListener was evolving. But now that we have identified the unnecessary complexity, it's your call whether to just add a band-aid or fix it the right way. I prefer the latter, and I'm OK if committers are OK with the former and merge the patch as it is. Maybe I'll raise another PR to fix it then.

SparkQA · 2019-09-04T19:27:38Z

Test build #110135 has finished for PR 25673 at commit 7ba6b96.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

Instead of having EventMonster extend EventLoggingListener, have it extend SparkFirehoseListener and simply buffer all events it receives. This makes it much more suitable for verifying that the ReplayListenerBus replays all the events from event logs. With this new EventMonster, there is no need to have special exception handling for certain types of events.

wypoon · 2019-09-04T22:30:43Z

@HeartSaVioR, on second thoughts, after offline discussion with @squito, I agree that the best way to fix the tests is to reimplement EventMonster. Instead of extending EventLoggingListener, EventMonster should simply buffer any and all events it receives. @squito pointed out that EventMonster was added when ReplayListenerSuite was first added, and that SparkFirehoseListener did not exist then. Extending EventLoggingListener and adding a mode where it simply appends events to a buffer (in its logEvent method) was an easy way of implementing EventMonster. However, as EventLoggingListener evolved, it did not call logEvent on every event, and it also wrote events as side effects that depended on its internal state, leading to the current difficulties.
I have reimplemented EventMonster (extending SparkFirehoseListener) to simply buffer each and every event it receives. This matches the needs of ReplayListenerSuite better. With this change, testApplicationReplay no longer needs any logic to filter any events.
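
As a rough illustration of the idea, here is a Spark-free sketch of a listener that buffers everything it receives, together with a toy replay bus. `FirehoseListener`, `ToyReplayBus` and `replayAndCollect` are hypothetical names invented for this sketch, events are represented as plain strings, and none of this is the actual test code or Spark's API.

```scala
import scala.collection.mutable

object BufferingReplaySketch {
  // Hypothetical listener interface with a single catch-all callback,
  // in the spirit of a firehose listener; not Spark's API.
  trait FirehoseListener {
    def onEvent(event: String): Unit
  }

  // Buffers every event it receives: no filtering, no derived events.
  class EventBufferingListener extends FirehoseListener {
    val buffered = mutable.ArrayBuffer.empty[String]
    override def onEvent(event: String): Unit = buffered += event
  }

  // Toy replay bus: posts each logged line, in order, to every listener.
  class ToyReplayBus {
    private val listeners = mutable.ArrayBuffer.empty[FirehoseListener]
    def addListener(listener: FirehoseListener): Unit = listeners += listener
    def replay(lines: Iterator[String]): Unit =
      lines.foreach(line => listeners.foreach(_.onEvent(line)))
  }

  // Replays a log and returns what a buffering listener observed.
  def replayAndCollect(log: Seq[String]): Seq[String] = {
    val bus = new ToyReplayBus
    val listener = new EventBufferingListener
    bus.addListener(listener)
    bus.replay(log.iterator)
    listener.buffered.toSeq
  }

  def main(args: Array[String]): Unit = {
    val log = Seq("event-1", "event-2", "event-3")
    println(replayAndCollect(log) == log)
  }
}
```

Because the buffering listener keeps no state beyond the buffer itself, what it records depends only on what the bus delivers, so a comparison against the original events needs no special cases.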

HeartSaVioR

LGTM

SparkQA · 2019-09-05T00:48:14Z

Test build #110149 has finished for PR 25673 at commit 6fab21f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HeartSaVioR · 2019-09-05T01:02:36Z

cc @squito, as he commented on and discussed the resolution approach in the JIRA issue.

squito · 2019-09-05T13:58:57Z

lgtm. Also I see @bzhaoopenstack confirmed this does the trick on #25659.

Wing Yew, can you do the rename, and update the PR description to not mention EventMonster by name anymore?

SparkQA · 2019-09-05T19:06:46Z

Test build #110194 has finished for PR 25673 at commit a0287a8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

squito · 2019-09-05T20:56:20Z

merged to master, thanks everyone

…s fail

### What changes were proposed in this pull request?

`ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.

### Why are the changes needed?

As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes apache#25673 from wypoon/SPARK-28770.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>

…s fail

`ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.

As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged.

No.

Existing unit tests.

Closes apache#25673 from wypoon/SPARK-28770.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>

Ref: LIHADOOP-56563

RB=2387940
BUG=LIHADOOP-56563
G=spark-reviewers
R=ekrogen
A=ekrogen

wypoon mentioned this pull request Sep 4, 2019

[SPARK-28770][CORE][TESTS]Ignore SparkListenerStageExecutorMetrics in testApplicationReplay test #25659

Closed

HyukjinKwon reviewed Sep 4, 2019

HeartSaVioR reviewed Sep 4, 2019

[SPARK-28770][CORE][TEST] Simplify the filter.

7ba6b96

Suggested by Jungtaek Lim.

HeartSaVioR approved these changes Sep 4, 2019

srowen reviewed Sep 5, 2019

[SPARK-28770][CORE][TEST] Rename EventMonster to EventBufferingListener.

a0287a8

dongjoon-hyun added the SPARK CORE label Sep 5, 2019

wypoon changed the title ~~[SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometime…~~ [SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometimes fail Sep 5, 2019

asfgit closed this in 151b954 Sep 5, 2019
