
[Parquet Vectorized Reads] Fix reading of files with mix of dictionary and non-dictionary encoded row groups #1388

Merged: 2 commits into apache:master on Aug 28, 2020

Conversation

samarthjain (Collaborator)

Assert.assertTrue("Delete should succeed", mixedFile.delete());
OutputFile outputFile = Files.localOutput(mixedFile);
int rowGroupSize = Integer.parseInt(PARQUET_ROW_GROUP_SIZE_BYTES_DEFAULT);
ParquetFileWriter writer = new ParquetFileWriter(
Contributor:

What about adding a Parquet.concat util method? I don't think it is a good idea to make ParquetIO public just for this test case. But it would be nice to have a concat method somewhere that could concatenate Parquet files.
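A `Parquet.concat` helper along the lines suggested here could be built on parquet-mr's `ParquetFileWriter.appendFile`, which copies each input's row groups into the output verbatim, preserving their original (dictionary or non-dictionary) encodings. The sketch below is hypothetical: the class name `ParquetConcat` and the exact parameter list are assumptions, not the API that was actually merged.

```java
import java.io.File;
import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class ParquetConcat {
  /**
   * Concatenates Parquet files by appending their row groups unchanged.
   * A minimal sketch, assuming local-FS inputs and a parquet-mr MessageType schema.
   */
  public static void concat(Iterable<File> inputFiles, File outputFile, int rowGroupSize,
                            MessageType schema, Map<String, String> metadata) throws IOException {
    Configuration conf = new Configuration();
    ParquetFileWriter writer = new ParquetFileWriter(
        conf, schema, new Path(outputFile.toURI()),
        ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
    writer.start();
    for (File inputFile : inputFiles) {
      // appendFile copies the input's row groups directly, without re-encoding.
      writer.appendFile(HadoopInputFile.fromPath(new Path(inputFile.toURI()), conf));
    }
    writer.end(metadata);
  }
}
```

Because the row groups are appended rather than rewritten, the output file naturally mixes dictionary- and non-dictionary-encoded row groups, which is exactly what this test needs to exercise.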

rdblue (Contributor) commented Aug 28, 2020

+1 overall. I'd prefer not to expose ParquetIO, but if you think that building a concat helper is too much work for this PR, then we can do it in a follow-up.

OutputFile file = Files.localOutput(outputFile);
ParquetFileWriter writer = new ParquetFileWriter(
ParquetIO.file(file), ParquetSchemaUtil.convert(schema, "table"),
ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
Contributor:

We can use the default row group size from table properties here. It will be ignored when appending files because row groups are appended directly and not rewritten.

* @param schema the schema of the data
* @param metadata extra metadata to write in the footer of the output file
*/
public static void concat(Iterable<File> inputFiles, File outputFile, int rowGroupSize, Schema schema,
Contributor:

I think the input files and output file should use InputFile and OutputFile. That way this isn't limited to just the local FS.
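Following that suggestion, the signature could accept Iceberg's `InputFile`/`OutputFile` abstractions instead of `java.io.File`, so the helper works against any FileIO implementation rather than only the local filesystem. This is a hedged sketch, assuming `ParquetIO.file(...)` is accessible from the helper's package; the actual merged method may differ.

```java
import java.io.IOException;
import java.util.Map;

import org.apache.iceberg.Schema;
import org.apache.iceberg.io.InputFile;
import org.apache.iceberg.io.OutputFile;
import org.apache.iceberg.parquet.ParquetSchemaUtil;
import org.apache.parquet.hadoop.ParquetFileWriter;

// Sketch: same concat logic, but parameterized over Iceberg file abstractions.
public static void concat(Iterable<InputFile> inputFiles, OutputFile outputFile, int rowGroupSize,
                          Schema schema, Map<String, String> metadata) throws IOException {
  ParquetFileWriter writer = new ParquetFileWriter(
      ParquetIO.file(outputFile), ParquetSchemaUtil.convert(schema, "table"),
      ParquetFileWriter.Mode.CREATE, rowGroupSize, 0);
  writer.start();
  for (InputFile inputFile : inputFiles) {
    // Wrap each Iceberg InputFile as a parquet-mr InputFile and append its row groups.
    writer.appendFile(ParquetIO.file(inputFile));
  }
  writer.end(metadata);
}
```

Keeping `ParquetIO` package-private and exposing only this helper addresses the earlier concern about widening `ParquetIO`'s visibility just for a test.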

@rdblue rdblue merged commit e815318 into apache:master Aug 28, 2020
rdblue (Contributor) commented Aug 28, 2020

Thanks for the quick fix, @samarthjain! Nice work.
