SPARK-12619 Combine small files in a hadoop directory into single split #10572

Closed
navis wants to merge 1 commit

Conversation

@navis (Contributor) commented Jan 4, 2016

When a directory contains too many (small) files, the whole Spark cluster can be exhausted scheduling the tasks created for each file. A custom input format can handle that, but if you're using the Hive metastore, that is hardly an option.
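
For comparison, Hadoop itself ships `CombineTextInputFormat`, which packs many small files into each split. Below is a minimal sketch of using it from the RDD API as an alternative to a custom combiner; the path and the 128 MB cap are illustrative assumptions, not part of this PR:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

    // Cap each combined split at 128 MB; CombineTextInputFormat then packs
    // as many small files as fit under that cap into a single split.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

    // "/data/small-files" is a hypothetical directory full of small files.
    val lines = sc.newAPIHadoopFile(
        "/data/small-files",
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      // Hadoop reuses Text objects, so materialize them immediately.
      .map { case (_, text) => text.toString }

    println(s"partitions: ${lines.getNumPartitions}, lines: ${lines.count()}")
    sc.stop()
  }
}
```

This only helps RDD-based reads, though; as noted above, a table registered in the Hive metastore won't pick up this input format.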

@SparkQA commented Jan 4, 2016

Test build #48661 has finished for PR 10572 at commit 055f613.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class SimpleCombiner<K, V> implements InputFormat<K, V>
    • public static class InputSplits implements InputSplit, Configurable

@SparkQA commented Jan 6, 2016

Test build #48804 has finished for PR 10572 at commit e056332.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class SimpleCombiner<K, V> implements InputFormat<K, V>
    • public static class InputSplits implements InputSplit, Configurable

@HyukjinKwon (Member)

Maybe we should correct the title to match the others, [SPARK-XXXX][SQL] (this is described in https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark).

@davies (Contributor) commented Jun 6, 2016

This is fixed in 2.0; could you close this PR?

@cerisier

@davies do you have the commit that fixes this in 2.0?

@HyukjinKwon (Member)

Is that #12095?

@jinxing64

@HyukjinKwon To merge small files, should I tune spark.sql.files.maxPartitionBytes? But IIUC it only works for FileSourceScanExec, so when I select from a Hive table it doesn't work.
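
For reference, here is a hedged sketch of how those settings fit together in Spark 2.x. `spark.sql.files.maxPartitionBytes` and `spark.sql.files.openCostInBytes` only apply on the FileSourceScanExec path, as noted above; for Parquet-backed Hive tables, `spark.sql.hive.convertMetastoreParquet` routes the scan through that path so the settings take effect. The table name and byte values are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object TuneFilePartitions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("tune-file-partitions")
      // Pack up to 128 MB of input files into each read partition.
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      // Estimated cost of opening a file; a larger value packs more
      // small files into one partition.
      .config("spark.sql.files.openCostInBytes", 8L * 1024 * 1024)
      // Read Parquet-backed Hive tables with Spark's native reader so the
      // scan goes through FileSourceScanExec and the settings above apply.
      .config("spark.sql.hive.convertMetastoreParquet", "true")
      .enableHiveSupport()
      .getOrCreate()

    // `my_db.small_files_table` is a hypothetical Hive table name.
    val df = spark.table("my_db.small_files_table")
    println(s"read partitions: ${df.rdd.getNumPartitions}")
    spark.stop()
  }
}
```

Whether a given Hive table benefits depends on its storage format and on the convertMetastore* conversion actually applying; tables in other formats still go through the Hive reader path.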
