Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error: missing required field ColumnIndex.null_pages when loading page indexes #6464

Open
alamb opened this issue Sep 26, 2024 · 4 comments
Open
Labels

Comments

@alamb
Copy link
Contributor

alamb commented Sep 26, 2024

Describe the bug
If the ParquetMetadataReader tries to read metadata written by ParquetMetaDataWriter without first loading the page indexes, you get an error like "missing required field ColumnIndex.null_pages"

Nite this depends on #6463

To Reproduce
The full reproducer is in #6463. Here is the relevant piece

        let parquet_bytes = create_parquet_file();

        // read the metadata from the file WITHOUT the page index structures
        let original_metadata = ParquetMetaDataReader::new()
            .parse_and_finish(&parquet_bytes)
            .unwrap();

        // read metadata back from the serialized bytes requesting to read the offsets
        let metadata_bytes = metadata_to_bytes(&original_metadata);
        let roundtrip_metadata = ParquetMetaDataReader::new()
            .with_page_indexes(true) // there are no page indexes in the metadata
            .parse_and_finish(&metadata_bytes)
            .unwrap(); // <******* This fails

Expected behavior
The reader should not error

I am not sure if the right fix is to

  1. change the ParquetMetadataWriter to clear the index offset fields befor writing them
  2. change the ParquetMetadataReader to ignore bogus offsets
  3. SOmething else

Additional context
@etseidl has added the APIs in #6431

@alamb
Copy link
Contributor Author

alamb commented Sep 26, 2024

@etseidl predicted this error in https://github.com/apache/arrow-rs/pull/6081/files#r1774020124

I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData. Then again, I could see wanting to preserve the page index offsets of the original file if you only want to save the footer metadata externally...perhaps an option on the metadata writer to preserve page index offsets if desired?

@alamb alamb changed the title Error: missing required field ColumnIndex.null_pages when Error: missing required field ColumnIndex.null_pages when loading page indexes Sep 28, 2024
@etseidl
Copy link
Contributor

etseidl commented Oct 2, 2024

The issue with this test is that the second parse_and_finish call does not allow the reader to know the true size of the original file, so it cannot know that the page index bounds are outside the bounds of metadata_bytes. For instance,
passing the original file size to try_parse_sized errors as expected (it asks for a larger slice of the original file)

        let mut reader = ParquetMetaDataReader::new().with_page_indexes(true);
        let err = reader.try_parse_sized(&metadata_bytes, parquet_bytes.len()).err().unwrap();
        assert_eq!(err.to_string(), "Index 386 out of bound: 468");

It's also possible to split the requests between data sources.

        let mut reader = ParquetMetaDataReader::new();
        // get metadata from serialized bytes
        reader.try_parse(&metadata_bytes).unwrap();
        // get page indexes from original file
        reader.read_page_indexes(&parquet_bytes).unwrap();
        let roundtrip_metadata = reader.finish().unwrap();
        assert_eq!(original_metadata, roundtrip_metadata);

So part of this issue is a documentation thing. We probably have to be clearer about the intended use of the read and write APIs and explain the pitfalls.

That said, I do think an option to have ParquetMetaDataWriter reset the page index offsets if page indexes are not present would be helpful. ParquetMetaDataReader could be modified some to detect the range mismatch, at least in the parse_and_finish case, if it kept the metadata size as part of its state. Calling ParquetMetaDataReader::read_page_indexes directly could get dicey, however.

@alamb
Copy link
Contributor Author

alamb commented Oct 3, 2024

That said, I do think an option to have ParquetMetaDataWriter reset the page index offsets if page indexes are not present would be helpful.

I agree -- I would even argue that the default should be for the writer not to write page index offsets if there are no page indexes in memory (in fact, maybe the rust structs shouldn't even have the offsets 🤔 )

ParquetMetaDataReader could be modified some to detect the range mismatch, at least in the parse_and_finish case, if it kept the metadata size as part of its state.

I think that ParquetMetaDataReader::try_parsed_sized should fail if

Specifically I would expect this to fail

  1. It was instructed to load the page indexes,
  2. The page index offsets were populated to not None
  3. The needed offset was not present

The rationale is that in this case something is incorrect / mismatched between the serialized metadata and what is provided to the reader.

@etseidl
Copy link
Contributor

etseidl commented Oct 3, 2024

Added check that will return EOF error for this case. #6507

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants