
[Spark-14230][STREAMING] Config the start time (jitter) for streaming… #12026


Closed
wants to merge 1 commit

Conversation

liyintang

What changes were proposed in this pull request?

Currently, RecurringTimer normalizes the start time. For instance, if the batch duration is 1 min, all jobs will start exactly at the 1 min boundary.
This adds a burden to the streaming source. Assuming the source is Kafka and there is a list of streaming jobs with a 1 min batch duration, then during the first few seconds of each minute, high network traffic will be observed in Kafka. This makes Kafka capacity planning tricky.
It would be great to have an option in the streaming context to set the job start time. That way, users can add a jitter to the start time of each job, making the Kafka fetch_request traffic much smoother across the duration window.
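To illustrate the idea, here is a minimal sketch in Scala (not the actual patch; `jitteredStartTime` and its parameters are hypothetical). It mirrors how RecurringTimer rounds the current time up to the next period boundary, then shifts that boundary by a per-job offset so jobs sharing a batch duration do not all fire at once:

```scala
object StartTimeJitter {
  // Round `nowMs` up to the next period boundary, then add a fixed
  // per-job jitter. The jitter must stay inside one batch interval so
  // the batch cadence itself is unchanged.
  def jitteredStartTime(nowMs: Long, periodMs: Long, jitterMs: Long): Long = {
    require(jitterMs >= 0 && jitterMs < periodMs,
      "jitter must stay inside one batch interval")
    val nextBoundary = (nowMs / periodMs + 1) * periodMs
    nextBoundary + jitterMs
  }

  def main(args: Array[String]): Unit = {
    val periodMs = 60000L // 1 min batch duration
    val now = System.currentTimeMillis()
    println(jitteredStartTime(now, periodMs, 0L))     // fires on the boundary
    println(jitteredStartTime(now, periodMs, 15000L)) // fires 15 s past it
  }
}
```

Two jobs configured with jitters of 0 and 15000 ms would spread their Kafka fetches 15 seconds apart within the same 1 min window.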

How was this patch tested?

Unit test: a test case was added.
Integration test.

@jerryshao
Contributor

Would it be better to handle this with back-pressure or some kind of flow control mechanism? I'm just wondering whether this jitter will break the internal semantics of Spark Streaming.

@liyintang
Author

I thought back pressure/flow control handles how many messages to fetch, not when job generation starts. IMHO, adding jitter to the start time is more deterministic than adding jitter in the flow control.

@AmplabJenkins

Can one of the admins verify this patch?

@tdas
Contributor

tdas commented Oct 24, 2016

Hi @liyintang, thanks for this PR. I apologize for not providing feedback on this earlier. This is indeed a practical problem in production, but I am not sure how adding the jitter would affect downstream behavior: rate calculation, flow control, etc. Spark Streaming was not designed with such things in mind. In the new Structured Streaming, this would not be a problem, because the batch interval (called the trigger interval there) is optional; if it is not specified, the next batch starts when the previous batch finishes. That means multiple jobs won't synchronize. I suggest you try out Structured Streaming.
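For readers unfamiliar with the behavior described above, here is a minimal hedged sketch of a Structured Streaming query with no trigger interval (the socket source, host, and port are illustrative). Omitting `.trigger(...)` means each micro-batch starts as soon as the previous one completes, so concurrent queries do not line up on a shared batch boundary:

```scala
import org.apache.spark.sql.SparkSession

object NoTriggerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("no-trigger-sketch").getOrCreate()

    // Illustrative source: read lines from a local socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    val query = lines.writeStream
      .format("console")
      // No .trigger(...) call: by default, micro-batches run back to back,
      // so there is no fixed boundary for multiple jobs to synchronize on.
      .start()

    query.awaitTermination()
  }
}
```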

srowen added a commit to srowen/spark that referenced this pull request Oct 31, 2016
asfgit closed this in 26b07f1 Oct 31, 2016