[SPARK-18044][STREAMING] FileStreamSource should not infer partitions in every batch #15581

cloud-fan · 2016-10-21T08:36:13Z

What changes were proposed in this pull request?

In FileStreamSource.getBatch, we create a DataSource with the specified schema, so that the schema is not inferred again for every batch. However, we don't pass the partition columns, so the partitions are still inferred again in every batch.

This PR fixes it by keeping the partition columns in FileStreamSource, like the schema.
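As a rough illustration of the idea (not the actual patch), the sketch below contrasts a source that re-infers partition columns on every batch with one that receives them once and reuses them. Every name here (`PartitionInferenceSketch`, `inferPartitionColumns`, `NaiveSource`, `CachingSource`, `inferenceCalls`) is hypothetical and is not Spark's API; partition discovery is reduced to parsing Hive-style `key=value` path segments.

```scala
// Hypothetical, simplified model of the problem; none of these names exist in Spark.
object PartitionInferenceSketch {
  // Counts how many times the (assumed expensive) inference runs.
  var inferenceCalls = 0

  // Stand-in for partition discovery: collects column names from
  // path segments of the form "key=value".
  def inferPartitionColumns(paths: Seq[String]): Seq[String] = {
    inferenceCalls += 1
    paths
      .flatMap(_.split("/").toSeq)
      .filter(_.contains("="))
      .map(_.takeWhile(_ != '='))
      .distinct
  }

  // Before: nothing is remembered, so every batch infers the partitions again.
  class NaiveSource(paths: Seq[String]) {
    def getBatch(): Seq[String] = inferPartitionColumns(paths)
  }

  // After: the partition columns are handed in once (like the schema)
  // and reused for every batch.
  class CachingSource(partitionColumns: Seq[String]) {
    def getBatch(): Seq[String] = partitionColumns
  }

  def main(args: Array[String]): Unit = {
    val paths = Seq(
      "/data/year=2016/month=10/a.json",
      "/data/year=2016/month=11/b.json")

    val naive = new NaiveSource(paths)
    (1 to 3).foreach(_ => naive.getBatch())
    println(s"naive inference calls: $inferenceCalls")

    inferenceCalls = 0
    val caching = new CachingSource(inferPartitionColumns(paths))
    (1 to 3).foreach(_ => caching.getBatch())
    println(s"caching inference calls: $inferenceCalls")
  }
}
```

In this toy model the naive source pays for inference once per batch, while the caching source pays for it only when it is constructed.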

How was this patch tested?

N/A

cloud-fan · 2016-10-21T08:39:30Z

CC @zsxwing @brkyvz @yhuai

SparkQA · 2016-10-21T10:38:52Z

Test build #67331 has finished for PR 15581 at commit ad7ef81.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

zsxwing · 2016-10-21T22:28:03Z

LGTM. Merging to master ~~and 2.0~~. Thanks!

zsxwing · 2016-10-21T22:29:16Z

@cloud-fan there are conflicts with 2.0. Could you submit another PR for that?

cloud-fan · 2016-10-22T01:44:26Z

yea of course, I'll do it soon

cloud-fan · 2016-10-24T05:45:28Z

@zsxwing shall we backport this first? Seems in 2.0 we don't support partitioned file source.

zsxwing · 2016-10-24T17:52:11Z

> @zsxwing shall we backport this first? Seems in 2.0 we don't support partitioned file source.

Done. I also merged this one into branch 2.0.

… in every batch

## What changes were proposed in this pull request?

In `FileStreamSource.getBatch`, we will create a `DataSource` with specified schema, to avoid inferring the schema again and again. However, we don't pass the partition columns, and will infer the partition again and again.

This PR fixes it by keeping the partition columns in `FileStreamSource`, like schema.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15581 from cloud-fan/stream.


FileStreamSource should not infer partitions in every batch

ad7ef81

asfgit closed this in 1405702 Oct 21, 2016
