
[Parquet Vectorized Reads] Fix reading of files with mix of dictionary and non-dictionary encoded row groups #1388

Merged: 2 commits into apache:master on Aug 28, 2020

Conversation

samarthjain (Collaborator)

Assert.assertTrue("Delete should succeed", mixedFile.delete());
OutputFile outputFile = Files.localOutput(mixedFile);
int rowGroupSize = Integer.parseInt(PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT);
ParquetFileWriter writer = new ParquetFileWriter(
Contributor:

What about adding a Parquet.concat util method? I don't think it is a good idea to make ParquetIO public just for this test case. But it would be nice to have a concat method somewhere that could concatenate Parquet files.
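A `Parquet.concat` helper along the lines suggested here could be built on parquet-mr's `ParquetFileWriter.appendFile`, which copies each input's row groups into the output verbatim, preserving their original (dictionary or non-dictionary) encodings. The sketch below is hypothetical: the class name `ParquetConcat` and the exact parameter list are assumptions, not the API that was actually merged.

```java
import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class ParquetConcat {
  /**
   * Concatenates Parquet files by appending their row groups unchanged.
   * A minimal sketch, assuming local-FS inputs and a parquet-mr MessageType schema.
   */
  public static void concat(Iterable<File> inputFiles, File outputFile, int rowGroupSize,
                            MessageType schema, Map<String, String> metadata) throws IOException {
    Configuration conf = new Configuration();
    ParquetFileWriter writer = new ParquetFileWriter(
        conf, schema, new Path(outputFile.toURI()),
        ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
    writer.start();
    for (File inputFile : inputFiles) {
      // appendFile copies the input's row groups directly, without re-encoding.
      writer.appendFile(HadoopInputFile.fromPath(new Path(inputFile.toURI()), conf));
    }
    writer.end(metadata);
  }
}
```

Because the row groups are appended rather than rewritten, the output file naturally mixes dictionary- and non-dictionary-encoded row groups, which is exactly what this test needs to exercise.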

rdblue (Contributor) commented Aug 28, 2020

+1 overall. I'd prefer not to expose ParquetIO, but if you think that building a concat helper is too much work for this PR, then we can do it in a follow-up.

OutputFile file = Files.localOutput(outputFile);
ParquetFileWriter writer = new ParquetFileWriter(
ParquetIO.file(file), ParquetSchemaUtil.convert(schema, "table"),
ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
Contributor:

We can use the default row group size from table properties here. It will be ignored when appending files because row groups are appended directly and not rewritten.

* @param schema the schema of the data
* @param metadata extra metadata to write in the footer of the output file
*/
public static void concat(Iterable<File> inputFiles, File outputFile, int rowGroupSize, Schema schema,
Contributor:

I think the input files and output file should use InputFile and OutputFile. That way this isn't limited to just the local FS.
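Following that suggestion, the signature could accept Iceberg's `InputFile`/`OutputFile` abstractions instead of `java.io.File`, so the helper works against any FileIO implementation rather than only the local filesystem. This is a hedged sketch, assuming `ParquetIO.file(...)` is accessible from the helper's package; the actual merged method may differ.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.iceberg.Schema;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.parquet.hadoop.ParquetFileWriter;

// Sketch: same concat logic, but parameterized over Iceberg file abstractions.
public static void concat(Iterable<InputFile> inputFiles, OutputFile outputFile, int rowGroupSize,
                          Schema schema, Map<String, String> metadata) throws IOException {
  ParquetFileWriter writer = new ParquetFileWriter(
      ParquetIO.file(outputFile), ParquetSchemaUtil.convert(schema, "table"),
      ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
  writer.start();
  for (InputFile inputFile : inputFiles) {
    // Wrap each Iceberg InputFile as a parquet-mr InputFile and append its row groups.
    writer.appendFile(ParquetIO.file(inputFile));
  }
  writer.end(metadata);
}
```

Keeping `ParquetIO` package-private and exposing only this helper addresses the earlier concern about widening `ParquetIO`'s visibility just for a test.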

@rdblue rdblue merged commit e815318 into apache:master Aug 28, 2020
rdblue (Contributor) commented Aug 28, 2020

Thanks for the quick fix, @samarthjain! Nice work.
