GH-47628: [C++][Parquet] Implement basic parquet file rewriter (#47775)
HuaHuaY wants to merge 1 commit into apache:main
Conversation
Thanks for opening a pull request! If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose Opening GitHub issues ahead of time contributes to the openness of the Apache Arrow project. Then could you also rename the pull request title in the following format? See also:
@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there any efficient way to merge two parquet files' schemas?
mapleFU
left a comment
Emm, I'm thinking that just reusing the current code is an OK way to go, since this logic in the current impl would be a bit hacky with the current interface...
wgtmac
left a comment
I haven't reviewed all the changes yet and will progressively post my comments.
wgtmac
left a comment
The general workflow of the rewriter looks good to me. However, I don't believe we should directly manipulate the thrift objects.
```cpp
RowGroupRewriter(std::shared_ptr<ArrowInputFile> source,
                 std::shared_ptr<ArrowOutputStream> sink,
                 const RewriterProperties* props,
                 std::shared_ptr<RowGroupReader> row_group_reader,
```
Perhaps introduce a RowGroupContext to hold all row group xxx readers?

I think it doesn't bring much benefit and requires one more step to unwrap the wrapper class.
Pull request overview
Draft implementation of a C++ Parquet “file rewriter” that can rewrite/concatenate/join Parquet files by copying encoded bytes and re-emitting metadata (optionally including page indexes and bloom filters), avoiding full decode/re-encode.
Changes:
- Add `ParquetFileRewriter` + `RewriterProperties` and implement basic concat (vertically stacking row groups) + join (horizontally combining columns) rewriting.
- Add metadata copying helpers (`to_thrift`, new builder entrypoints, new `ToThrift` overloads) to support fast metadata reconstruction.
- Add Arrow-based tests for roundtrip rewriting scenarios (simple, concat, join, concat+join) and wire them into CMake.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.
| File | Description |
|---|---|
| cpp/src/parquet/thrift_internal.h | Add ToThrift helpers for page-index-related structs. |
| cpp/src/parquet/properties.h | Make ReaderProperties::GetStream const; add RewriterProperties + default factory decl. |
| cpp/src/parquet/properties.cc | Update ReaderProperties::GetStream definition to const. |
| cpp/src/parquet/page_index.h | Extend PageIndexBuilder API to allow setting pre-built indexes. |
| cpp/src/parquet/page_index.cc | Implement new PageIndexBuilder setters and mixed builder/prebuilt serialization. |
| cpp/src/parquet/metadata.h | Add start_offset(), expose to_thrift(), extend RowGroupMetaDataBuilder. |
| cpp/src/parquet/metadata.cc | Implement new metadata accessors and new row-group column-chunk injection path. |
| cpp/src/parquet/file_rewriter.h | New public rewriter API (ParquetFileRewriter). |
| cpp/src/parquet/file_rewriter.cc | Core rewriting implementation (copy streams, concat/join logic, index/bloom handling). |
| cpp/src/parquet/bloom_filter_writer.h | Extend BloomFilterBuilder with InsertBloomFilter. |
| cpp/src/parquet/bloom_filter_writer.cc | Implement InsertBloomFilter. |
| cpp/src/parquet/arrow/test_util.h | Add helper to write a table into a Parquet buffer for tests. |
| cpp/src/parquet/arrow/arrow_rewriter_test.cc | Add Arrow-level rewriter roundtrip tests. |
| cpp/src/parquet/CMakeLists.txt | Add new source (file_rewriter.cc) and new test target. |
```cpp
if (page_index_reader_ != nullptr && page_index_builder != nullptr) {
  auto column_index = page_index_reader_->GetColumnIndex(column_ordinal_);
  auto offset_index = page_index_reader_->GetOffsetIndex(column_ordinal_);
  if (column_index != nullptr) {
    page_index_builder->SetColumnIndex(column_ordinal_, column_index);
  }
  if (offset_index != nullptr) {
    page_index_builder->SetOffsetIndex(column_ordinal_, offset_index, shift);
  }
}

if (bloom_filter_reader_ != nullptr && bloom_filter_builder != nullptr) {
  auto bloom_filter = bloom_filter_reader_->GetColumnBloomFilter(column_ordinal_);
  if (bloom_filter != nullptr) {
    bloom_filter_builder->InsertBloomFilter(column_ordinal_,
                                            std::move(bloom_filter));
  }
}
```
The new rewriter code path copies page indexes (SetColumnIndex/SetOffsetIndex) and bloom filters (InsertBloomFilter) when enabled, but the added Arrow-level roundtrip tests only validate the decoded table contents. Consider extending tests to assert that rewritten files actually contain the expected page index / bloom filter structures (e.g., via ParquetFileReader::GetPageIndexReader / GetBloomFilterReader and checking non-null indexes/locations), so regressions in metadata copying/offset shifting are caught.
```diff
 std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
-                                            int64_t start, int64_t num_bytes);
+                                            int64_t start, int64_t num_bytes) const;
```
Changing ReaderProperties::GetStream to be const changes the mangled symbol and will break ABI for downstream code compiled against earlier versions of the library. If ABI compatibility matters here, consider keeping the old non-const overload (forwarding to the const implementation) instead of changing the existing signature in-place.
Suggested change:

```cpp
// Non-const overload kept for ABI compatibility. It forwards to the const
// implementation introduced in a later version.
std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
                                            int64_t start, int64_t num_bytes) {
  return static_cast<const ReaderProperties*>(this)->GetStream(source, start, num_bytes);
}
```
cpp/src/parquet/metadata.h (outdated)

```cpp
const void* to_thrift() const;
```
ColumnChunkMetaData::to_thrift() exposes an untyped pointer to an internal Thrift struct as part of the public metadata API, which couples consumers to the internal representation and is easy to misuse (wrong casts or lifetime assumptions leading to UB). Prefer an internal-only accessor, return a typed reference/pointer to the concrete thrift type in an internal header, or provide a dedicated cloning/copy helper on the builder to avoid exposing raw thrift at all.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.
```cpp
class PARQUET_EXPORT ParquetFileRewriter {
 public:
  struct PARQUET_EXPORT Contents {
    virtual ~Contents() = default;
    virtual void Close() = 0;
    virtual void Rewrite() = 0;
  };

  ParquetFileRewriter();
  ~ParquetFileRewriter();

  static std::unique_ptr<ParquetFileRewriter> Open(
      std::vector<std::vector<std::shared_ptr<ArrowInputFile>>> sources,
      std::shared_ptr<ArrowOutputStream> sink,
      std::vector<std::vector<std::shared_ptr<FileMetaData>>> sources_metadata,
      std::shared_ptr<const ::arrow::KeyValueMetadata> sink_metadata = NULLPTR,
      std::shared_ptr<RewriterProperties> props = default_rewriter_properties());

  void Open(std::unique_ptr<Contents> contents);
  void Close();

  void Rewrite();
```
The ParquetFileRewriter class and its public methods (Open, Close, Rewrite) lack documentation comments. Since this is a new public API (marked with PARQUET_EXPORT), these methods should have documentation describing their purpose, parameters, return values, and any exceptions that may be thrown. This is especially important for the Open method with its complex nested vector parameters.
```diff
@@ -485,6 +488,7 @@ class PARQUET_EXPORT RowGroupMetaDataBuilder {
   ~RowGroupMetaDataBuilder();

   ColumnChunkMetaDataBuilder* NextColumnChunk();
```
The new overload of NextColumnChunk that takes ColumnChunkMetaData and a shift value lacks documentation. This method allows adding column metadata without creating a builder, which is a key feature for the rewriter's fast-copy optimization. Documentation should explain the parameters (especially shift for offset adjustment) and when to use this method versus the builder-based approach.
Suggested change:

```cpp
ColumnChunkMetaDataBuilder* NextColumnChunk();
// Add an existing column chunk metadata object to the row group without
// constructing it through a ColumnChunkMetaDataBuilder.
//
// This overload is intended for fast-copy / rewriting scenarios where
// column chunks are reused from another file and their metadata has
// already been finalized.
//
// Parameters:
// - cc_metadata: Ownership of the provided ColumnChunkMetaData is
//   transferred to the RowGroupMetaDataBuilder.
// - shift: Byte offset delta to apply to all file-relative offsets in
//   cc_metadata (for example, when appending row groups at a different
//   position in the target file). Use 0 if no adjustment is needed.
//
// For column chunks produced by this writer, prefer the builder-based
// NextColumnChunk() API above.
```
```cpp
class PARQUET_EXPORT RewriterProperties {
 public:
  class Builder {
   public:
    Builder()
        : pool_(::arrow::default_memory_pool()),
          writer_properties_(default_writer_properties()),
          reader_properties_(default_reader_properties()) {}

    explicit Builder(const RewriterProperties& properties)
        : pool_(properties.memory_pool()),
          writer_properties_(properties.writer_properties()),
          reader_properties_(properties.reader_properties()) {}

    virtual ~Builder() = default;

    /// Specify the memory pool for the rewriter. Default default_memory_pool.
    Builder* memory_pool(MemoryPool* pool) {
      pool_ = pool;
      return this;
    }

    /// Set the writer properties.
    Builder* writer_properties(std::shared_ptr<WriterProperties> properties) {
      writer_properties_ = std::move(properties);
      return this;
    }

    /// Set the reader properties.
    Builder* reader_properties(ReaderProperties properties) {
      reader_properties_ = std::move(properties);
      return this;
    }

    /// Build the RewriterProperties with the builder parameters.
    std::shared_ptr<RewriterProperties> build() {
      return std::shared_ptr<RewriterProperties>(new RewriterProperties(
          pool_, std::move(writer_properties_), std::move(reader_properties_)));
    }

   private:
    MemoryPool* pool_;
    std::shared_ptr<WriterProperties> writer_properties_;
    ReaderProperties reader_properties_;
  };

  MemoryPool* memory_pool() const { return pool_; }

  const std::shared_ptr<WriterProperties>& writer_properties() const {
    return writer_properties_;
  }

  const ReaderProperties& reader_properties() const { return reader_properties_; }

 private:
  explicit RewriterProperties(MemoryPool* pool,
                              std::shared_ptr<WriterProperties> writer_properties,
                              ReaderProperties reader_properties)
      : pool_(pool),
        writer_properties_(std::move(writer_properties)),
        reader_properties_(std::move(reader_properties)) {}

  MemoryPool* pool_;
  std::shared_ptr<WriterProperties> writer_properties_;
  ReaderProperties reader_properties_;
};
```
The RewriterProperties class lacks documentation. As a new public API class (marked with PARQUET_EXPORT), it should have comprehensive documentation explaining its purpose, how it relates to WriterProperties and ReaderProperties, and what configuration options it provides for the rewriter.
```cpp
virtual void SetColumnIndex(int32_t i,
                            const std::shared_ptr<ColumnIndex>& column_index) = 0;
```
The new methods SetColumnIndex and SetOffsetIndex lack documentation comments. These are part of the public PageIndexBuilder interface and should include documentation explaining their purpose, parameters (especially the shift parameter in SetOffsetIndex), and when they should be used instead of the corresponding builder methods.
Suggested change:

```cpp
/// \brief Set a fully constructed ColumnIndex for a column.
///
/// This method can be used instead of GetColumnIndexBuilder() when the caller
/// already has a materialized ColumnIndex instance (for example, when reusing
/// or transforming an existing index) and does not need to build it through
/// the ColumnIndexBuilder interface.
///
/// \param i Column ordinal.
/// \param column_index The ColumnIndex to associate with the given column.
virtual void SetColumnIndex(int32_t i,
                            const std::shared_ptr<ColumnIndex>& column_index) = 0;

/// \brief Set a fully constructed OffsetIndex for a column.
///
/// This method can be used instead of GetOffsetIndexBuilder() when the caller
/// already has a materialized OffsetIndex instance.
///
/// The \p shift parameter is applied to all page offsets contained in the
/// provided OffsetIndex. It allows reusing an index whose offsets are relative
/// to a different file position (for example, when concatenating data or
/// writing the index at a different location) by shifting all stored offsets
/// by a constant amount.
///
/// \param i Column ordinal.
/// \param offset_index The OffsetIndex to associate with the given column.
/// \param shift A byte offset added to each page offset stored in \p offset_index.
```
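As a minimal illustration of the `shift` semantics described above, the following self-contained sketch models an offset index as a plain vector and applies a constant byte delta to every page offset. `PageLocation` here mirrors the three fields of the real Parquet struct, but `ShiftOffsets` is a hypothetical helper, not the library's implementation.

```cpp
#include <cstdint>
#include <vector>

// Simplified model of an offset index entry; the real parquet::PageLocation
// carries the same three fields.
struct PageLocation {
  int64_t offset;                // file-relative position of the page
  int32_t compressed_page_size;  // page size in bytes (unchanged by a shift)
  int64_t first_row_index;       // row-group-relative row index (unchanged)
};

// Apply a constant byte shift to every page offset, as the `shift` parameter
// of SetOffsetIndex is described to do when a copied column chunk lands at a
// different byte position in the target file.
std::vector<PageLocation> ShiftOffsets(std::vector<PageLocation> locations,
                                       int64_t shift) {
  for (auto& loc : locations) {
    loc.offset += shift;  // only file-relative offsets move
  }
  return locations;
}
```

Note that sizes and row indexes stay untouched: only quantities relative to the start of the file need rebasing when data is relocated.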
Pull request overview
Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.
```cpp
std::shared_ptr<ResizableBuffer> buffer =
    AllocateBuffer(pool, kDefaultOutputStreamSize);
while (bytes_copied < size) {
  PARQUET_ASSIGN_OR_THROW(
      auto read_size,
      from->Read(std::min(size - bytes_copied, buffer->size()), &buffer));
  if (read_size == 0) {
    throw ParquetException("Unexpected end of stream at ", bytes_copied);
  }
  PARQUET_THROW_NOT_OK(to->Write(buffer->data(), read_size));
  bytes_copied += read_size;
}
```
CopyStream allocates a 1KB buffer (kDefaultOutputStreamSize) for the non-zero-copy path. For large column chunks this will cause excessive small reads/writes and significantly degrade rewrite performance. Consider using a substantially larger chunk size (e.g., 64KB+ or an Arrow IO default) for the copy buffer.
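To make the buffer-size point concrete, here is a self-contained sketch of the same chunked-copy pattern written against `std::iostream` instead of Arrow's I/O interfaces. `kCopyBufferSize` and this `CopyStream` are illustrative stand-ins, not the PR's actual symbols; the point is that the chunk size bounds the number of read/write round trips, so a 64 KiB buffer needs roughly 64x fewer iterations than a 1 KiB one for the same column chunk.

```cpp
#include <algorithm>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical chunk size: 64 KiB rather than a 1 KiB default.
constexpr int64_t kCopyBufferSize = 64 * 1024;

// Copy exactly `size` bytes from `from` to `to` through a bounded buffer,
// throwing if the input ends early. Returns the number of bytes copied.
int64_t CopyStream(std::istream& from, std::ostream& to, int64_t size) {
  std::vector<char> buffer(kCopyBufferSize);
  int64_t bytes_copied = 0;
  while (bytes_copied < size) {
    const int64_t want = std::min<int64_t>(size - bytes_copied,
                                           static_cast<int64_t>(buffer.size()));
    from.read(buffer.data(), want);
    const int64_t read_size = from.gcount();
    if (read_size == 0) {
      throw std::runtime_error("Unexpected end of stream at " +
                               std::to_string(bytes_copied));
    }
    to.write(buffer.data(), read_size);
    bytes_copied += read_size;
  }
  return bytes_copied;
}
```

For in-memory sources Arrow can avoid this loop entirely via zero-copy reads; the buffer size only matters on the fallback path the reviewer is discussing.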
PR #47775 Review: GH-47628: [C++][Parquet] Implement basic parquet file rewriter
Summary

This PR adds a …

Critical Issues 🔴

1. [BLOCKING] While Arrow targets C++20 (confirmed in …) … Suggested fix: Replace …
2. [BLOCKING] … Suggested fix: Instead of exposing …
3. [BLOCKING] … Suggested fix: For the row-group-level fast copy, compute the byte range from individual column chunk offsets (min start_offset → max end offset). Consider falling back to column-by-column copy when row-group-level metadata is unreliable.
4. [BLOCKING] … Suggested fix: Document this limitation clearly or account for it. At minimum, add a comment noting this only works for unencrypted columns.

Major Suggestions 🟡

5. Missing Doxygen documentation on all new public APIs. Similarly, …
6. …

   ```cpp
   if (column_builders_[i] == NULLPTR) {
     total_compressed_size += row_group_->columns[i].meta_data.total_compressed_size;
     continue;
   }
   ```

   This is fragile — any future code that iterates … Suggestion: Consider using …
7. …
8. …
9. … Suggested fix: Mark it …
10. …

    ```cpp
    from->Read(std::min(size - bytes_copied, buffer->size()), &buffer)
    ```

    The … Suggestion: Use …

Minor Suggestions 🟢

11. [NIT] …
12. [NIT] …
13. [NIT] Missing test for encryption rejection. The code throws …
14. [NIT] Missing test for page index and bloom filter round-trip. While …

AI Code Smell 🤖

None confidently detected. The code has a consistent style, uses domain-appropriate terminology, and the abstractions (PagesRewriter → ColumnChunkRewriter → RowGroupRewriter → SingleFileRewriter → ConcatRewriter → JoinRewriter) mirror the Parquet file structure naturally. The TODO comments reference specific usernames/futures, which is genuine.

Positive Notes 👍
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.
```cpp
auto first_row_index =
    original_offset_index_
        ? std::optional{original_offset_index_->page_locations()[page_no]
                            .first_row_index}
        : std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
    original_offset_index_ &&
            !original_offset_index_->unencoded_byte_array_data_bytes().empty()
        ? std::optional{original_offset_index_
                            ->unencoded_byte_array_data_bytes()[page_no]}
        : std::nullopt;
```
There is no bounds checking when accessing page_locations()[page_no] and unencoded_byte_array_data_bytes()[page_no]. If the original_offset_index has fewer entries than the number of data pages (due to file corruption or malformed data), this will result in out-of-bounds access. Consider adding validation like: if (page_no >= original_offset_index_->page_locations().size()) before accessing the arrays.
Suggested change:

```cpp
std::optional<int64_t> first_row_index;
if (original_offset_index_) {
  const auto& locations = original_offset_index_->page_locations();
  if (page_no < locations.size()) {
    first_row_index = locations[page_no].first_row_index;
  }
}
SizeStatistics size_statistics;
if (original_offset_index_) {
  const auto& unencoded_bytes =
      original_offset_index_->unencoded_byte_array_data_bytes();
  if (!unencoded_bytes.empty() && page_no < unencoded_bytes.size()) {
    size_statistics.unencoded_byte_array_data_bytes =
        std::optional{unencoded_bytes[page_no]};
  }
}
```
```cpp
original_offset_index_
    ? std::optional{original_offset_index_->page_locations()[page_no]
                        .first_row_index}
    : std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
    original_offset_index_ &&
    !original_offset_index_->unencoded_byte_array_data_bytes().empty()
```
There is no bounds checking when accessing page_locations()[page_no] and unencoded_byte_array_data_bytes()[page_no]. If the original_offset_index has fewer entries than the number of data pages (due to file corruption or malformed data), this will result in out-of-bounds access. Consider adding validation like: if (page_no >= original_offset_index_->page_locations().size()) before accessing the arrays.
Suggested change:

```cpp
original_offset_index_ &&
        page_no < original_offset_index_->page_locations().size()
    ? std::optional{original_offset_index_->page_locations()[page_no]
                        .first_row_index}
    : std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
    original_offset_index_ &&
    !original_offset_index_->unencoded_byte_array_data_bytes().empty() &&
    page_no < original_offset_index_->unencoded_byte_array_data_bytes().size()
```
I think this PR has implemented basic functions. Please take a look @pitrou @adamreeve
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
```cpp
static std::unique_ptr<ParquetFileRewriter> Open(
    std::vector<std::vector<std::shared_ptr<ArrowInputFile>>> sources,
    std::shared_ptr<ArrowOutputStream> sink,
    std::vector<std::vector<std::shared_ptr<FileMetaData>>> sources_metadata,
    std::shared_ptr<const ::arrow::KeyValueMetadata> sink_metadata = NULLPTR,
    std::shared_ptr<RewriterProperties> props = default_rewriter_properties());
```
The Open method parameters would benefit from documentation explaining the 2D vector structure. The outer vector represents groups of files to join (horizontally combine columns), and the inner vector represents files to concatenate (vertically stack row groups). Consider adding parameter documentation to clarify this structure and usage.
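A toy, self-contained model of that 2D layout may help visualize it. `FileShape` and `CombinedShape` below are hypothetical illustrations, not the real Arrow/Parquet types: each inner vector is a concat chain whose row groups stack vertically, and the outer vector's chains are joined side by side, contributing their columns.

```cpp
#include <vector>

// Toy stand-in for one source Parquet file: just its column count and
// row-group count. (The real API takes ArrowInputFile handles.)
struct FileShape {
  int num_columns;
  int num_row_groups;
};

// Shape of the rewritten file: columns come from every join group (outer
// dimension), row groups from concatenating each chain (inner dimension).
// Assumes every file in a chain has the same columns, and every chain
// yields the same total rows, matching the constraints Open implies.
FileShape CombinedShape(const std::vector<std::vector<FileShape>>& sources) {
  FileShape out{0, 0};
  for (const auto& concat_chain : sources) {
    int row_groups = 0;
    for (const auto& file : concat_chain) row_groups += file.num_row_groups;
    out.num_columns += concat_chain.front().num_columns;  // joined horizontally
    out.num_row_groups = row_groups;  // each chain stacks vertically
  }
  return out;
}
```

For example, `{{A1, A2}, {B1}}` would concatenate A1 and A2's row groups, then join that result with B1's columns.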
Rationale for this change
Allow rewriting Parquet files at the binary (encoded-bytes) level instead of reading and decoding all values and writing them back.
What changes are included in this PR?
- Add `ParquetFileRewriter` and `RewriterProperties`.
- Add `to_thrift` and `SetXXX` methods to help me copy the metadata.
- Add `CopyStream` methods to call `memcpy` between `ArrowInputStream` and `ArrowOutputStream`.
- Add `RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift)`, which allows adding column metadata without creating a `ColumnChunkMetaDataBuilder`.
Yes
Are there any user-facing changes?
- `ReaderProperties::GetStream` is changed to a const method. Only the signature has been changed; its original implementation allows it to be declared as a const method.