Skip to content

Conversation

@wypoon
Copy link
Contributor

@wypoon wypoon commented Sep 4, 2019

What changes were proposed in this pull request?

ReplayListenerSuite depends on a listener class to listen for replayed events. This class was implemented by extending EventLoggingListener. EventLoggingListener does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.

Why are the changes needed?

As explained above. Tests sometimes fail due to events being received by the EventLoggingListener that do not get logged (and thus do not get replayed) but influence other events that get logged.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests.

…s fail.

testApplicationReplay fails if the application runs long enough for the driver
to send an executor metrics update. This causes stage executor metrics to be
written for the driver. However, executor metrics updates are not logged, and
thus not replayed. Therefore no stage executor metrics for the driver is logged
on replay.
// For this reason, exclude stage executor metrics for the driver.
val filteredEvents = originalEvents.filter { e =>
if (e.isInstanceOf[SparkListenerStageExecutorMetrics] &&
e.asInstanceOf[SparkListenerStageExecutorMetrics].execId == "driver") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I am not understanding is that, why are both different? They make JSONs from the same instances. Are they different? Why is it flaky?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

EventMonster extends EventLoggingListener which doesn't transparently log events, hence some possible cases could exist where both can be different.

Copy link
Contributor Author

@wypoon wypoon Sep 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to explain briefly in the comment. Let me try to explain in more detail.

When the application is run, executor metrics update events may occur (if the application runs long enough). These events are not written to the event log, so they do not get replayed. That is the root cause of differences.
On executor metrics update, the EventLoggingListener updates a map it uses to track per-stage metrics, but does not log. The listener also receives metrics on task end. Because of this, it always has metrics for executors. On stage completion. the listener logs stage executor metrics, for any of the executors/driver it has metrics for. As explained, sometimes it will have metrics for the driver (if an executor metrics update arrived from the driver), but most of the time not.
To recap, SparkListenerStageExecutorMetrics events are not events received while the application is running, but secondary events derived from metrics received and written to the event log. (One other point not relevant to this bug but may help understanding - on replay, the SparkListenerStageExecutorMetrics events do get replayed but are ignored by the listener. The replay listener logs SparkListenerStageExecutorMetrics events on stage completion using the same internal logic as the original listener.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had written the same explanation in #25659 (comment).

@SparkQA
Copy link

SparkQA commented Sep 4, 2019

Test build #110092 has finished for PR 25673 at commit 0b0adb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

HeartSaVioR commented Sep 4, 2019

I'd rather say we may need to find the way to let EventMonster not extending EventLoggingListener and just buffering all events. We don't need to deal with such complicated case in EventLoggingListener, as the purpose of ReplayListenerSuite is to test the fact that "ReplayListener can read all events sequentially and pass to registered listeners correctly".

/**
* A simple listener that buffers all the events it receives.
*
* The event buffering functionality must be implemented within EventLoggingListener itself.
* This is because of the following race condition: the event may be mutated between being
* processed by one listener and being processed by another. Thus, in order to establish
* a fair comparison between the original events and the replayed events, both functionalities
* must be implemented within one listener (i.e. the EventLoggingListener).
*
* This child listener inherits only the event buffering functionality, but does not actually
* log the events.
*/
private class EventMonster(conf: SparkConf)
extends EventLoggingListener("test", None, new URI("testdir"), conf) {
override def start() { }
}

I'm a bit concerned about the javadoc of EventMonster - I expect each event are posted to be same to all listeners. If that's the understanding of us (Spark community), it would end up with inconsistency if we allow the event itself or any fields in event to be modified somewhere (e.g. #25672).

@wypoon
Copy link
Contributor Author

wypoon commented Sep 4, 2019

I'd rather say we may need to find the way to let EventMonster not extending EventLoggingListener and just buffering all events. We don't need to deal with such complicated case in EventLoggingListener, as the purpose of ReplayListenerSuite is to test the fact that "ReplayListener can read all events sequentially and pass to registered listeners correctly".

I'd rather not take on reworking EventMonster here. I think it is sufficient to handle the special case of SparkListenerStageExecutorMetrics by filtering.

@HeartSaVioR
Copy link
Contributor

The test is going to be unnecessary complicated, as I commented earlier. We are required to know about detailed behavior of EventLoggingListener, though we are in ReplayListenerSuite. ReplayListener is even not used with EventLoggingListener so it's really only for testing which doesn't make sense to couple with. Even we fix it for this time, the test might be broken again if we tune EventLoggingListener once more.

For sure, it's not your fault. You did a great analysis. The code looks to be ancient, so it's not their fault too. We just seem to miss the relevant codes while EventLoggingListener is evolving. But at least now we indicate the unnecessary complexity, it's your bet to just add band-aid or fix it on the right way. I prefer latter, and I'm OK if committers are OK with former and merge the patch as it is. Maybe I'll raise another PR to fix it then.

@SparkQA
Copy link

SparkQA commented Sep 4, 2019

Test build #110135 has finished for PR 25673 at commit 7ba6b96.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Instead of having EventMonster extend EventLoggingListener, have it extend
SparkFirehoseListener and simply buffer all events it receives. This makes
it much more suitable for verifying that the ReplyListnerBus replays all
the events from event logs.
With this new EventMonster, there is no need to have special exception
handling for certain types of events.
@wypoon
Copy link
Contributor Author

wypoon commented Sep 4, 2019

@HeartSaVioR, on second thoughts, after offline discussion with @squito, I agree that the best way to fix the tests is to reimplement EventMonster. Instead of extending EventLoggingListener, EventMonster should simply buffer any and all events it receives. @squito pointed out that EventMonster was added when ReplayListenerSuite was first added, and that SparkFirehoseListener did not exist then. Extending EventLoggingListener and adding a mode where it simply appends events to a buffer (in its logEvent method) was an easy way of implementing EventMonster. However, as EventLoggingListener evolved, it did not call logEvent on every event, and it also wrote events as side effects that depended on its internal state, leading to the current difficulties.
I have reimplemented EventMonster (extending SparkFirehoseListener) to simply buffer each and every event it receives. This matches the needs of ReplayListenerSuite better. With this change, testApplicationReplay no longer needs any logic to filter any events.

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@SparkQA
Copy link

SparkQA commented Sep 5, 2019

Test build #110149 has finished for PR 25673 at commit 6fab21f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HeartSaVioR
Copy link
Contributor

cc. @squito as he commented and discussed the approach of resolution in JIRA issue.

@squito
Copy link
Contributor

squito commented Sep 5, 2019

lgtm. Also I see @bzhaoopenstack confirmed this does the trick on #25659.

wing yew, can you do the rename? and update the PR descrpition to not mention EventMonster by name anymore

@SparkQA
Copy link

SparkQA commented Sep 5, 2019

Test build #110194 has finished for PR 25673 at commit a0287a8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wypoon wypoon changed the title [SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometime… [SPARK-28770][CORE][TEST] Fix ReplayListenerSuite tests that sometimes fail Sep 5, 2019
@asfgit asfgit closed this in 151b954 Sep 5, 2019
@squito
Copy link
Contributor

squito commented Sep 5, 2019

merged to master, thanks everyone

PavithraRamachandran pushed a commit to PavithraRamachandran/spark that referenced this pull request Sep 15, 2019
…s fail

### What changes were proposed in this pull request?

`ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.

### Why are the changes needed?

As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes apache#25673 from wypoon/SPARK-28770.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
…s fail

`ReplayListenerSuite` depends on a listener class to listen for replayed events. This class was implemented by extending `EventLoggingListener`. `EventLoggingListener` does not log executor metrics update events, but uses them to update internal state; on a stage completion event, it then logs stage executor metrics events using this internal state. As executor metrics update events do not get written to the event log, they do not get replayed. The internal state of the replay listener can therefore be different from the original listener, leading to different stage completion events being logged.

We reimplement the replay listener to simply buffer each and every event it receives. This makes it a simpler yet better tool for verifying the events that get sent through the ReplayListenerBus.

As explained above. Tests sometimes fail due to events being received by the `EventLoggingListener` that do not get logged (and thus do not get replayed) but influence other events that get logged.

No.

Existing unit tests.

Closes apache#25673 from wypoon/SPARK-28770.

Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>

Ref: LIHADOOP-56563

RB=2387940
BUG=LIHADOOP-56563
G=spark-reviewers
R=ekrogen
A=ekrogen
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants