Commit 7a9ba61

shyam authored and wesm committed
PARQUET-1402: [C++] Parquet files with dictionary page offset as 0 is not readable

pyarrow needs to handle dictionary page offset = 0 as a special case to be compatible with java parquet reader.

Author: shyam <shyam@dremio.com>

Closes #4359 from shyambits2004/5322 and squashes the following commits:

f47762a <shyam> Parquet files with dictionary page offset as 0 is not readable
1 parent 828d18e commit 7a9ba61

3 files changed: +9 -2 lines changed

cpp/src/parquet/arrow/arrow-reader-writer-test.cc

Lines changed: 6 additions & 0 deletions
@@ -2354,6 +2354,12 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
   TryReadDataFile(path, ::arrow::StatusCode::IOError);
 }

+TEST(TestArrowReaderAdHoc, HandleDictPageOffsetZero) {
+  // PARQUET-1402: parquet-mr writes files this way which tripped up
+  // some business logic
+  TryReadDataFile(test::get_data_file("dict-page-offset-zero.parquet"));
+}
+
 class TestArrowReaderAdHocSparkAndHvr
     : public ::testing::TestWithParam<
           std::tuple<std::string, std::shared_ptr<::DataType>>> {};

cpp/src/parquet/file_reader.cc

Lines changed: 2 additions & 1 deletion
@@ -94,7 +94,8 @@ class SerializedRowGroup : public RowGroupReader::Contents {
     auto col = row_group_metadata_->ColumnChunk(i);

     int64_t col_start = col->data_page_offset();
-    if (col->has_dictionary_page() && col_start > col->dictionary_page_offset()) {
+    if (col->has_dictionary_page() && col->dictionary_page_offset() > 0 &&
+        col_start > col->dictionary_page_offset()) {
       col_start = col->dictionary_page_offset();
     }
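
For illustration only, the patched condition can be pulled out into a small standalone function. This is a sketch, not code from the commit: the `ColumnChunkOffsets` struct and the `ColumnChunkStart` name are hypothetical stand-ins for the metadata accessors used in `file_reader.cc`. It shows why the extra check matters: under the old condition, a dictionary page offset of 0 is smaller than any positive data page offset, so the reader would move the chunk start to byte 0 of the file.

```cpp
#include <cstdint>

// Hypothetical stand-in for the three metadata values the diff consults.
struct ColumnChunkOffsets {
  bool has_dictionary_page;
  int64_t dictionary_page_offset;
  int64_t data_page_offset;
};

// Returns the file offset at which reading the column chunk should begin.
// Mirrors the patched condition: a dictionary page offset of 0 is treated
// as "not usable" and the data page offset is kept.
inline int64_t ColumnChunkStart(const ColumnChunkOffsets& col) {
  int64_t col_start = col.data_page_offset;
  if (col.has_dictionary_page && col.dictionary_page_offset > 0 &&
      col_start > col.dictionary_page_offset) {
    col_start = col.dictionary_page_offset;
  }
  return col_start;
}
```

With made-up offsets, `ColumnChunkStart({true, 0, 4})` stays at 4, while `ColumnChunkStart({true, 4, 100})` moves the start back to the dictionary page at 4.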
