-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Parquet Vectorized Reads] Fix reading of files with mix of dictionary and non-dictionary encoded row groups #1388
Conversation
…y and non-dictionary encoded row groups
82cc088
to
fedca8f
Compare
Assert.assertTrue("Delete should succeed", mixedFile.delete()); | ||
OutputFile outputFile = Files.localOutput(mixedFile); | ||
int rowGroupSize = Integer.parseInt(PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT); | ||
ParquetFileWriter writer = new ParquetFileWriter( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about adding a Parquet.concat
util method? I don't think it is a good idea to make ParquetIO
public just for this test case. But it would be nice to have a concat
method somewhere that could concatenate Parquet files.
+1 overall. I'd prefer not to expose |
OutputFile file = Files.localOutput(outputFile); | ||
ParquetFileWriter writer = new ParquetFileWriter( | ||
ParquetIO.file(file), ParquetSchemaUtil.convert(schema, "table"), | ||
ParquetFileWriter.Mode.CREATE, rowGroupSize, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use the default row group size from table properties here. It will be ignored when appending files because row groups are appended directly and not rewritten.
* @param schema the schema of the data | ||
* @param metadata extraMetadata to write at the footer of the @param outputFile | ||
*/ | ||
public static void concat(Iterable<File> inputFiles, File outputFile, int rowGroupSize, Schema schema, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the input files and output file should use InputFile
and OutputFile
. That way this isn't limited to just the local FS.
Thanks for the quick fix, @samarthjain! Nice work. |
…