[SPARK-18044][STREAMING] FileStreamSource should not infer partitions in every batch #15581

Closed
wants to merge 1 commit into from

Conversation

cloud-fan
Contributor

What changes were proposed in this pull request?

In `FileStreamSource.getBatch`, we create a `DataSource` with the specified schema to avoid inferring the schema again and again. However, we don't pass the partition columns, so the partitioning is still inferred on every batch.

This PR fixes it by keeping the partition columns in `FileStreamSource`, as is already done for the schema.
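To illustrate the idea, here is a minimal, Spark-free sketch of the difference between inferring partition columns on every batch and keeping them on the source. Every name in it (`PartitionInferrer`, `SketchSource`, `BatchPlan`, `knownPartitionColumns`) is hypothetical and chosen for illustration only; it is not the code in this patch and does not use Spark's actual `FileStreamSource` or `DataSource` APIs.

```scala
object PartitionInferenceSketch {

  // Hypothetical stand-in for partition inference. It counts how often it
  // runs so that the cost of re-inferring on every batch becomes visible.
  final class PartitionInferrer {
    var calls: Int = 0

    // Derives column names from Hive-style path segments such as "year=2016".
    def infer(paths: Seq[String]): Seq[String] = {
      calls += 1
      paths.headOption.toSeq
        .flatMap(_.split("/").toSeq)
        .filter(_.contains("="))
        .map(_.takeWhile(_ != '='))
    }
  }

  // Hypothetical description of what a batch would read.
  final case class BatchPlan(files: Seq[String], partitionColumns: Seq[String])

  // Hypothetical source. If the partition columns are already known they are
  // reused for every batch; otherwise each batch triggers inference again.
  final class SketchSource(
      inferrer: PartitionInferrer,
      knownPartitionColumns: Option[Seq[String]]) {

    def getBatch(files: Seq[String]): BatchPlan = {
      val columns = knownPartitionColumns.getOrElse(inferrer.infer(files))
      BatchPlan(files, columns)
    }
  }
}
```

In this sketch, a source constructed with `Some(columns)` never calls `infer` from `getBatch`, while a source constructed with `None` calls it once per batch, which is the overhead the description above refers to.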

How was this patch tested?

N/A

@cloud-fan
Contributor Author

CC @zsxwing @brkyvz @yhuai

@SparkQA

SparkQA commented Oct 21, 2016

Test build #67331 has finished for PR 15581 at commit ad7ef81.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class SourceInfo(name: String, schema: StructType, partitionColumns: Seq[String])

@zsxwing
Member

zsxwing commented Oct 21, 2016

LGTM. Merging to master and 2.0. Thanks!

@zsxwing
Member

zsxwing commented Oct 21, 2016

@cloud-fan there are conflicts with 2.0. Could you submit another PR for that?

@asfgit asfgit closed this in 1405702 Oct 21, 2016
@cloud-fan
Contributor Author

yea of course, I'll do it soon

@cloud-fan
Contributor Author

@zsxwing shall we backport this first? Seems in 2.0 we don't support partitioned file source.

@zsxwing
Member

zsxwing commented Oct 24, 2016

> @zsxwing shall we backport this first? Seems in 2.0 we don't support partitioned file source.

Done. I also merged this one into branch 2.0.

asfgit pushed a commit that referenced this pull request Oct 24, 2016
… in every batch

## What changes were proposed in this pull request?

In `FileStreamSource.getBatch`, we will create a `DataSource` with specified schema, to avoid inferring the schema again and again. However, we don't pass the partition columns, and will infer the partition again and again.

This PR fixes it by keeping the partition columns in `FileStreamSource`, like schema.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #15581 from cloud-fan/stream.
robert3005 pushed a commit to palantir/spark that referenced this pull request Nov 1, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017