
GH-47628: [C++][Parquet] Implement basic parquet file rewriter#47775

Open
HuaHuaY wants to merge 1 commit into apache:main from HuaHuaY:fix_issue_47664

Conversation

@HuaHuaY
Contributor

@HuaHuaY HuaHuaY commented Oct 10, 2025

Rationale for this change

Allow rewriting Parquet files at the binary data level instead of reading and decoding all values and writing them back out.

What changes are included in this PR?

  1. Add classes ParquetFileRewriter and RewriterProperties.
  2. Add some to_thrift and SetXXX methods to help copy the metadata.
  3. Add CopyStream methods that memcpy data between ArrowInputStream and ArrowOutputStream.
  4. Add RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift), which allows adding column metadata without creating a ColumnChunkMetaDataBuilder.

Are these changes tested?

Yes

Are there any user-facing changes?

  1. Add some new classes and methods mentioned above.
  2. ReaderProperties::GetStream is now a const method. Only the signature has changed; the original implementation was already safe to declare const.

@github-actions

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@HuaHuaY HuaHuaY changed the title [C++][Parquet] Implement basic parquet file rewriter [C++][Parquet] GH-47628: Implement basic parquet file rewriter Oct 10, 2025
@HuaHuaY
Contributor Author

HuaHuaY commented Oct 13, 2025

@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there any efficient way to merge two Parquet files' schemas?

Member

@mapleFU mapleFU left a comment


Emm, I'm thinking that just reusing the current code is an OK way to go, since this logic in the current impl would be a bit hacky with the current interface...

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 13, 2025
Member

@wgtmac wgtmac left a comment


I haven't reviewed all the changes yet and will progressively post my comments.

Member

@wgtmac wgtmac left a comment


The general workflow of the rewriter looks good to me. However, I don't believe we should directly manipulate the thrift objects.

RowGroupRewriter(std::shared_ptr<ArrowInputFile> source,
std::shared_ptr<ArrowOutputStream> sink,
const RewriterProperties* props,
std::shared_ptr<RowGroupReader> row_group_reader,
Member


Perhaps introduce a RowGroupContext to hold all row group xxx readers?

Contributor Author


I think it doesn't bring much benefit and requires one more step to unwrap the wrapper class.

@HuaHuaY HuaHuaY force-pushed the fix_issue_47664 branch 2 times, most recently from b70917f to 439103e on January 15, 2026 04:36
@HuaHuaY HuaHuaY marked this pull request as ready for review February 9, 2026 12:36
Copilot AI review requested due to automatic review settings February 9, 2026 12:36

Copilot AI left a comment


Pull request overview

Draft implementation of a C++ Parquet “file rewriter” that can rewrite/concatenate/join Parquet files by copying encoded bytes and re-emitting metadata (optionally including page indexes and bloom filters), avoiding full decode/re-encode.

Changes:

  • Add ParquetFileRewriter + RewriterProperties and implement basic concat (horizontal) + join (vertical) rewriting.
  • Add metadata copying helpers (to_thrift, new builder entrypoints, new ToThrift overloads) to support fast metadata reconstruction.
  • Add Arrow-based tests for roundtrip rewriting scenarios (simple, concat, join, concat+join) and wire into CMake.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
cpp/src/parquet/thrift_internal.h Add ToThrift helpers for page-index-related structs.
cpp/src/parquet/properties.h Make ReaderProperties::GetStream const; add RewriterProperties + default factory decl.
cpp/src/parquet/properties.cc Update ReaderProperties::GetStream definition to const.
cpp/src/parquet/page_index.h Extend PageIndexBuilder API to allow setting pre-built indexes.
cpp/src/parquet/page_index.cc Implement new PageIndexBuilder setters and mixed builder/prebuilt serialization.
cpp/src/parquet/metadata.h Add start_offset(), expose to_thrift(), extend RowGroupMetaDataBuilder.
cpp/src/parquet/metadata.cc Implement new metadata accessors and new row-group column-chunk injection path.
cpp/src/parquet/file_rewriter.h New public rewriter API (ParquetFileRewriter).
cpp/src/parquet/file_rewriter.cc Core rewriting implementation (copy streams, concat/join logic, index/bloom handling).
cpp/src/parquet/bloom_filter_writer.h Extend BloomFilterBuilder with InsertBloomFilter.
cpp/src/parquet/bloom_filter_writer.cc Implement InsertBloomFilter.
cpp/src/parquet/arrow/test_util.h Add helper to write a table into a Parquet buffer for tests.
cpp/src/parquet/arrow/arrow_rewriter_test.cc Add Arrow-level rewriter roundtrip tests.
cpp/src/parquet/CMakeLists.txt Add new source (file_rewriter.cc) and new test target.


Comment on lines 118 to 135
if (page_index_reader_ != nullptr && page_index_builder != nullptr) {
auto column_index = page_index_reader_->GetColumnIndex(column_ordinal_);
auto offset_index = page_index_reader_->GetOffsetIndex(column_ordinal_);
if (column_index != nullptr) {
page_index_builder->SetColumnIndex(column_ordinal_, column_index);
}
if (offset_index != nullptr) {
page_index_builder->SetOffsetIndex(column_ordinal_, offset_index, shift);
}
}

if (bloom_filter_reader_ != nullptr && bloom_filter_builder != nullptr) {
auto bloom_filter = bloom_filter_reader_->GetColumnBloomFilter(column_ordinal_);
if (bloom_filter != nullptr) {
bloom_filter_builder->InsertBloomFilter(column_ordinal_,
std::move(bloom_filter));
}
}

Copilot AI Feb 9, 2026


The new rewriter code path copies page indexes (SetColumnIndex/SetOffsetIndex) and bloom filters (InsertBloomFilter) when enabled, but the added Arrow-level roundtrip tests only validate the decoded table contents. Consider extending tests to assert that rewritten files actually contain the expected page index / bloom filter structures (e.g., via ParquetFileReader::GetPageIndexReader / GetBloomFilterReader and checking non-null indexes/locations), so regressions in metadata copying/offset shifting are caught.

std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
int64_t start, int64_t num_bytes);
int64_t start, int64_t num_bytes) const;


Copilot AI Feb 9, 2026


Changing ReaderProperties::GetStream to be const changes the mangled symbol and will break ABI for downstream code compiled against earlier versions of the library. If ABI compatibility matters here, consider keeping the old non-const overload (forwarding to the const implementation) instead of changing the existing signature in-place.

Suggested change
// Non-const overload kept for ABI compatibility. It forwards to the const
// implementation introduced in a later version.
std::shared_ptr<ArrowInputStream> GetStream(std::shared_ptr<ArrowInputFile> source,
int64_t start, int64_t num_bytes) {
return static_cast<const ReaderProperties*>(this)->GetStream(source, start, num_bytes);
}

Comment on lines 164 to 165
const void* to_thrift() const;


Copilot AI Feb 9, 2026


ColumnChunkMetaData::to_thrift() exposes an untyped pointer to an internal Thrift struct as part of the public metadata API, which couples consumers to the internal representation and is easy to misuse (wrong casts or lifetime assumptions leading to UB). Prefer an internal-only accessor, or return a typed reference/pointer to the concrete thrift type in an internal header, or provide a dedicated cloning/copy helper on the builder to avoid exposing raw thrift at all.
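The friend-based alternative suggested above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical stand-in types (ThriftColumnMetaData, CopyCodecFrom are invented here, not the real Parquet classes): the typed accessor stays private, and only the builder is granted access, so no raw pointer ever leaks through a public header.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for the Thrift-generated metadata struct.
struct ThriftColumnMetaData {
  std::string codec;
};

class RowGroupMetaDataBuilder;  // forward declaration for the friend grant

class ColumnChunkMetaData {
 public:
  explicit ColumnChunkMetaData(std::string codec) : thrift_{std::move(codec)} {}

 private:
  // Typed accessor visible only to the builder, instead of a public void*.
  const ThriftColumnMetaData& thrift() const { return thrift_; }
  friend class RowGroupMetaDataBuilder;

  ThriftColumnMetaData thrift_;
};

class RowGroupMetaDataBuilder {
 public:
  // Copies the thrift payload internally; callers never see the raw struct.
  std::string CopyCodecFrom(const ColumnChunkMetaData& cc) {
    return cc.thrift().codec;
  }
};
```

With this shape, external code can still trigger a metadata copy, but the Thrift layout remains an implementation detail of the library.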


Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.



@HuaHuaY HuaHuaY changed the title [C++][Parquet] GH-47628: Implement basic parquet file rewriter GH-47628: [C++][Parquet] Implement basic parquet file rewriter Feb 10, 2026
@github-actions

⚠️ GitHub issue #47628 has been automatically assigned in GitHub to PR creator.


Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 4 comments.




Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 9 comments.



Comment on lines 29 to 50
class PARQUET_EXPORT ParquetFileRewriter {
public:
struct PARQUET_EXPORT Contents {
virtual ~Contents() = default;
virtual void Close() = 0;
virtual void Rewrite() = 0;
};

ParquetFileRewriter();
~ParquetFileRewriter();

static std::unique_ptr<ParquetFileRewriter> Open(
std::vector<std::vector<std::shared_ptr<ArrowInputFile>>> sources,
std::shared_ptr<ArrowOutputStream> sink,
std::vector<std::vector<std::shared_ptr<FileMetaData>>> sources_metadata,
std::shared_ptr<const ::arrow::KeyValueMetadata> sink_metadata = NULLPTR,
std::shared_ptr<RewriterProperties> props = default_rewriter_properties());

void Open(std::unique_ptr<Contents> contents);
void Close();

void Rewrite();

Copilot AI Feb 13, 2026


The ParquetFileRewriter class and its public methods (Open, Close, Rewrite) lack documentation comments. Since this is a new public API (marked with PARQUET_EXPORT), these methods should have documentation describing their purpose, parameters, return values, and any exceptions that may be thrown. This is especially important for the Open method with its complex nested vector parameters.

@@ -485,6 +488,7 @@ class PARQUET_EXPORT RowGroupMetaDataBuilder {
~RowGroupMetaDataBuilder();

ColumnChunkMetaDataBuilder* NextColumnChunk();

Copilot AI Feb 13, 2026


The new overload of NextColumnChunk that takes ColumnChunkMetaData and a shift value lacks documentation. This method allows adding column metadata without creating a builder, which is a key feature for the rewriter's fast-copy optimization. Documentation should explain the parameters (especially shift for offset adjustment) and when to use this method versus the builder-based approach.

Suggested change
ColumnChunkMetaDataBuilder* NextColumnChunk();
ColumnChunkMetaDataBuilder* NextColumnChunk();
// Add an existing column chunk metadata object to the row group without
// constructing it through a ColumnChunkMetaDataBuilder.
//
// This overload is intended for fast-copy / rewriting scenarios where
// column chunks are reused from another file and their metadata has
// already been finalized.
//
// Parameters:
// - cc_metadata: Ownership of the provided ColumnChunkMetaData is
// transferred to the RowGroupMetaDataBuilder.
// - shift: Byte offset delta to apply to all file-relative offsets in
// cc_metadata (for example, when appending row groups at a different
// position in the target file). Use 0 if no adjustment is needed.
//
// For column chunks produced by this writer, prefer the builder-based
// NextColumnChunk() API above.

Comment on lines 1529 to 1594
class PARQUET_EXPORT RewriterProperties {
public:
class Builder {
public:
Builder()
: pool_(::arrow::default_memory_pool()),
writer_properties_(default_writer_properties()),
reader_properties_(default_reader_properties()) {}

explicit Builder(const RewriterProperties& properties)
: pool_(properties.memory_pool()),
writer_properties_(properties.writer_properties()),
reader_properties_(properties.reader_properties()) {}

virtual ~Builder() = default;

/// Specify the memory pool for the rewriter. Default default_memory_pool.
Builder* memory_pool(MemoryPool* pool) {
pool_ = pool;
return this;
}

/// Set the writer properties.
Builder* writer_properties(std::shared_ptr<WriterProperties> properties) {
writer_properties_ = std::move(properties);
return this;
}

/// Set the reader properties.
Builder* reader_properties(ReaderProperties properties) {
reader_properties_ = std::move(properties);
return this;
}

/// Build the RewriterProperties with the builder parameters.
std::shared_ptr<RewriterProperties> build() {
return std::shared_ptr<RewriterProperties>(new RewriterProperties(
pool_, std::move(writer_properties_), std::move(reader_properties_)));
}

private:
MemoryPool* pool_;
std::shared_ptr<WriterProperties> writer_properties_;
ReaderProperties reader_properties_;
};

MemoryPool* memory_pool() const { return pool_; }

const std::shared_ptr<WriterProperties>& writer_properties() const {
return writer_properties_;
}

const ReaderProperties& reader_properties() const { return reader_properties_; }

private:
explicit RewriterProperties(MemoryPool* pool,
std::shared_ptr<WriterProperties> writer_properties,
ReaderProperties reader_properties)
: pool_(pool),
writer_properties_(std::move(writer_properties)),
reader_properties_(std::move(reader_properties)) {}

MemoryPool* pool_;
std::shared_ptr<WriterProperties> writer_properties_;
ReaderProperties reader_properties_;
};

Copilot AI Feb 13, 2026


The RewriterProperties class lacks documentation. As a new public API class (marked with PARQUET_EXPORT), it should have comprehensive documentation explaining its purpose, how it relates to WriterProperties and ReaderProperties, and what configuration options it provides for the rewriter.

Comment on lines 373 to +376

virtual void SetColumnIndex(int32_t i,
const std::shared_ptr<ColumnIndex>& column_index) = 0;


Copilot AI Feb 13, 2026


The new methods SetColumnIndex and SetOffsetIndex lack documentation comments. These are part of the public PageIndexBuilder interface and should include documentation explaining their purpose, parameters (especially the shift parameter in SetOffsetIndex), and when they should be used instead of the corresponding builder methods.

Suggested change
virtual void SetColumnIndex(int32_t i,
const std::shared_ptr<ColumnIndex>& column_index) = 0;
/// \brief Set a fully constructed ColumnIndex for a column.
///
/// This method can be used instead of GetColumnIndexBuilder() when the caller
/// already has a materialized ColumnIndex instance (for example, when reusing
/// or transforming an existing index) and does not need to build it through
/// the ColumnIndexBuilder interface.
///
/// \param i Column ordinal.
/// \param column_index The ColumnIndex to associate with the given column.
virtual void SetColumnIndex(int32_t i,
const std::shared_ptr<ColumnIndex>& column_index) = 0;
/// \brief Set a fully constructed OffsetIndex for a column.
///
/// This method can be used instead of GetOffsetIndexBuilder() when the caller
/// already has a materialized OffsetIndex instance.
///
/// The \p shift parameter is applied to all page offsets contained in the
/// provided OffsetIndex. It allows reusing an index whose offsets are relative
/// to a different file position (for example, when concatenating data or
/// writing the index at a different location) by shifting all stored offsets
/// by a constant amount.
///
/// \param i Column ordinal.
/// \param offset_index The OffsetIndex to associate with the given column.
/// \param shift A byte offset added to each page offset stored in \p offset_index.


Copilot AI left a comment


Pull request overview

Copilot reviewed 15 out of 15 changed files in this pull request and generated 7 comments.



Comment on lines 66 to 77
std::shared_ptr<ResizableBuffer> buffer =
AllocateBuffer(pool, kDefaultOutputStreamSize);
while (bytes_copied < size) {
PARQUET_ASSIGN_OR_THROW(
auto read_size,
from->Read(std::min(size - bytes_copied, buffer->size()), &buffer));
if (read_size == 0) {
throw ParquetException("Unexpected end of stream at ", bytes_copied);
}
PARQUET_THROW_NOT_OK(to->Write(buffer->data(), read_size));
bytes_copied += read_size;
}

Copilot AI Feb 25, 2026


CopyStream allocates a 1KB buffer (kDefaultOutputStreamSize) for the non-zero-copy path. For large column chunks this will cause excessive small reads/writes and significantly degrade rewrite performance. Consider using a substantially larger chunk size (e.g., 64KB+ or an Arrow IO default) for the copy buffer.
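The effect the comment describes can be seen in a self-contained sketch (illustrative names, not the Arrow stream API): the number of read/write round trips is proportional to the payload size divided by the chunk size, so a 1 KiB buffer on a 1 MiB chunk costs about a thousand calls where a 64 KiB buffer costs sixteen.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Chunked copy between byte vectors, counting how many copy calls a given
// chunk size requires. This models call overhead only, not actual I/O.
std::size_t ChunkedCopy(const std::vector<unsigned char>& from,
                        std::vector<unsigned char>* to,
                        std::size_t chunk_size) {
  std::size_t calls = 0;
  std::size_t copied = 0;
  while (copied < from.size()) {
    std::size_t n = std::min(chunk_size, from.size() - copied);
    to->insert(to->end(), from.begin() + copied, from.begin() + copied + n);
    copied += n;
    ++calls;  // each iteration stands in for one Read + one Write call
  }
  return calls;
}
```

For a 1 MiB source, ChunkedCopy with a 1024-byte chunk makes 1024 calls versus 16 calls with a 64 KiB chunk, which is the performance gap the review comment is pointing at.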

@HuaHuaY HuaHuaY force-pushed the fix_issue_47664 branch 2 times, most recently from 57df653 to a7b8576 on February 25, 2026 03:51
@wgtmac
Member

wgtmac commented Feb 26, 2026

PR #47775 Review: GH-47628: [C++][Parquet] Implement basic parquet file rewriter

🤖 This review was generated by an AI agent (GitHub Copilot, Claude Opus 4.6). Findings should be validated by a human reviewer.

Summary

This PR adds a ParquetFileRewriter class that can rewrite Parquet files at the binary level without decoding row data. It supports concat (appending row groups from multiple files with the same schema), join (side-by-side column merge from files with different columns but matching row counts), and optionally re-compressing column data. The implementation is substantial (~810 lines in file_rewriter.cc) with supporting changes to metadata builders, page index builders, bloom filter builders, thrift serialization, and properties.


Critical Issues 🔴

1. [BLOCKING] <ranges> header portability
cpp/src/parquet/file_rewriter.cc#L23: #include <ranges> and usage of std::ranges::all_of, std::ranges::any_of, std::ranges::for_each, std::views::drop, std::views::iota throughout the file.

While Arrow targets C++20 (confirmed in SetupCxxFlags.cmake), <ranges> is not used anywhere else in the Arrow or Parquet codebase. The <ranges> header has historically had inconsistent support across compilers (especially older GCC/Clang versions in CI). The only precedent is std::ranges::equal in cpp/src/arrow/sparse_tensor.cc.

Suggested fix: Replace std::ranges::* / std::views::* with standard algorithms (std::all_of, std::any_of, std::for_each) and simple loops/iterators. This is safer for CI portability.
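The proposed replacements are mechanical; a short self-contained sketch (generic examples, not the PR's actual code) shows the iterator-pair equivalents of the ranges calls named above, plus a plain index loop standing in for std::views::iota / std::views::drop:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// std::ranges::all_of(v, pred)  ->  std::all_of(v.begin(), v.end(), pred)
bool AllPositive(const std::vector<int>& v) {
  return std::all_of(v.begin(), v.end(), [](int x) { return x > 0; });
}

// std::ranges::any_of(v, pred)  ->  std::any_of(v.begin(), v.end(), pred)
bool AnyNegative(const std::vector<int>& v) {
  return std::any_of(v.begin(), v.end(), [](int x) { return x < 0; });
}

// Iterating std::views::iota(1, n) (or v | std::views::drop(1)) becomes a
// plain loop starting at index 1.
int SumFromSecond(const std::vector<int>& v) {
  int sum = 0;
  for (std::size_t i = 1; i < v.size(); ++i) sum += v[i];
  return sum;
}
```

The classic forms compile on every toolchain Arrow's CI covers, since they only require `<algorithm>`.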


2. [BLOCKING] to_thrift() returns const void* — leaks Thrift internals through public API
cpp/src/parquet/metadata.h adds const void* to_thrift() const; to ColumnChunkMetaData. This:

  • Exposes a raw void* in a PARQUET_EXPORT-ed public header, violating type safety
  • Requires the caller to static_cast to format::ColumnMetaData*, which is a Thrift-generated internal type
  • The Thrift types are internal to the Parquet library and should never be exposed in public headers

Suggested fix: Instead of exposing to_thrift() publicly, make RowGroupMetaDataBuilder::NextColumnChunk(unique_ptr<ColumnChunkMetaData>, int64_t) a friend or use an internal-only accessor. Consider a CopyMetadataFrom(const ColumnChunkMetaData&, int64_t shift) method on RowGroupMetaDataBuilder that does the thrift copy internally.


3. [BLOCKING] RowGroupRewriter fast-copy path relies on RowGroupMetaData::file_offset() which is unreliable
In file_rewriter.cc (the RowGroupRewriter::WriteRowGroupData method), fast-copy of the entire row group is predicated on metadata_->file_offset() != 0. However:

  • file_offset is a deprecated optional Thrift field — it may be 0 in older files or files written by other libraries
  • Even when set, total_compressed_size at the row group level may be 0 (it's also optional per the spec)
  • Thus CopyStream(stream, sink_, metadata_->total_compressed_size(), ...) could copy 0 bytes or the wrong amount

Suggested fix: For the row-group-level fast copy, compute the byte range from individual column chunk offsets (min start_offset → max end offset). Consider falling back to column-by-column copy when row group level metadata is unreliable.
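The suggested fix can be sketched in a few lines of self-contained C++ (hypothetical ChunkRange / RowGroupByteRange names; the real code would read these values from ColumnChunkMetaData): derive the row group's byte span from the per-column start offsets and compressed sizes, which are required fields, instead of the optional row-group-level fields.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical per-column view: start offset of the first page and the
// total compressed size of the chunk.
struct ChunkRange {
  int64_t start;
  int64_t compressed_size;
};

// Compute the contiguous byte range covering all column chunks of a row
// group: [min start, max(start + compressed_size)).
ChunkRange RowGroupByteRange(const std::vector<ChunkRange>& chunks) {
  if (chunks.empty()) throw std::runtime_error("row group has no columns");
  int64_t begin = chunks.front().start;
  int64_t end = chunks.front().start + chunks.front().compressed_size;
  for (const auto& c : chunks) {
    begin = std::min(begin, c.start);
    end = std::max(end, c.start + c.compressed_size);
  }
  return {begin, end - begin};
}
```

A fast copy of `compressed_size` bytes starting at `start` then works even when the deprecated file_offset field is absent or zero.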


4. [BLOCKING] ColumnChunkMetaData::start_offset() does not account for crypto_metadata offset
cpp/src/parquet/metadata.cc — the new start_offset() returns has_dictionary_page() ? dictionary_page_offset() : data_page_offset(). But for encrypted columns, ColumnChunk.crypto_metadata can precede the pages. If fast-copy is used on encrypted data in the future, this offset would be wrong.

Suggested fix: Document this limitation clearly or account for it. At minimum, add a comment noting this only works for unencrypted columns.


Major Suggestions 🟡

5. Missing Doxygen documentation on all new public APIs
cpp/src/parquet/file_rewriter.h — ParquetFileRewriter, Open(), Rewrite(), Close() have minimal or no documentation. The 2D vector parameter semantics (sources[i][j] means "block i, file j for concatenation") are non-obvious and must be documented.

Similarly, RewriterProperties in properties.h needs class-level documentation explaining its purpose.

metadata.h — start_offset(), to_thrift(), and NextColumnChunk(unique_ptr, int64_t) need Doxygen /// comments.


6. RowGroupMetaDataBuilder::Finish null-checks column_builders_[i] — fragile design
In metadata.cc, the new NextColumnChunk(cc_metadata, shift) pushes NULLPTR into column_builders_. Then Finish must check for NULLPTR:

if (column_builders_[i] == NULLPTR) {
    total_compressed_size += row_group_->columns[i].meta_data.total_compressed_size;
    continue;
}

This is fragile — any future code that iterates column_builders_ without null-checking will crash.

Suggestion: Consider using std::optional or a variant type to make the two code paths explicit, or split into two builder modes.
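A self-contained sketch of the std::optional variant (ColumnSlot and the field names are invented here for illustration, not the PR's actual members): the "no builder" case is a disengaged optional rather than a null pointer, so iteration code cannot forget the check without failing to compile against the value type.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Each slot either came through a builder (builder_id engaged) or holds a
// pre-built chunk whose compressed size was copied from source metadata.
struct ColumnSlot {
  std::optional<int> builder_id;     // engaged: built via a builder
  int64_t prebuilt_compressed_size;  // meaningful only when builder_id is empty
};

// Analogous to the Finish() accumulation above, but the two code paths are
// explicit in the type rather than signaled by a NULLPTR entry.
int64_t TotalPrebuiltSize(const std::vector<ColumnSlot>& slots) {
  int64_t total = 0;
  for (const auto& s : slots) {
    if (!s.builder_id.has_value()) total += s.prebuilt_compressed_size;
  }
  return total;
}
```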


7. PageIndexBuilder::SetColumnIndex / SetOffsetIndex lack documentation
cpp/src/parquet/page_index.h — The new virtual methods SetColumnIndex and SetOffsetIndex have no Doxygen comments. They should document:

  • What shift means for SetOffsetIndex
  • That these are for copying existing indices (vs building new ones)
  • Thread safety guarantees
  • Interaction with existing GetColumnIndexBuilder/GetOffsetIndexBuilder
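The shift semantics the documentation should spell out amount to re-basing every stored page offset by a constant. A minimal self-contained sketch (PageLocation here is a hand-rolled stand-in for the real offset-index entry):

```cpp
#include <cstdint>
#include <vector>

// Stand-in for an offset-index entry: one per data page.
struct PageLocation {
  int64_t offset;           // absolute file offset of the page
  int32_t compressed_size;  // page size on disk (unchanged by a shift)
  int64_t first_row_index;  // row-based, so also unchanged by a shift
};

// Re-base all page offsets, e.g. when the column chunk's bytes were copied
// to a different position in the target file.
void ShiftPageLocations(std::vector<PageLocation>* pages, int64_t shift) {
  for (auto& p : *pages) p.offset += shift;
}
```

Only the byte offsets move; sizes and row indices are position-independent, which is why a single shift value is sufficient when copying an otherwise unmodified chunk.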

8. BloomFilterBuilder::InsertBloomFilter silently returns when opts has no value
cpp/src/parquet/bloom_filter_writer.cc — InsertBloomFilter checks if bloom filter options are enabled for the column. If not, it silently drops the bloom filter. This might be surprising to a caller who expects the bloom filter to always be inserted when passed in. At minimum, the doc comment in the header should note this behavior.


9. WriteFile helper in test_util.h is not marked inline — potential ODR violation
cpp/src/parquet/arrow/test_util.h — The new WriteFile() function is defined in a header but is not marked inline. If this header is included in multiple translation units, it will cause linker errors (multiple definitions).

Suggested fix: Mark it inline, or move it to a .cc file shared by tests.


10. CopyStream uses from->Read() with &buffer overload which may replace the buffer
cpp/src/parquet/file_rewriter.cc — In the non-zero-copy path:

from->Read(std::min(size - bytes_copied, buffer->size()), &buffer)

The InputStream::Read(int64_t, std::shared_ptr<Buffer>*) overload may return a buffer of different size or replace the pointer entirely. This is fragile.

Suggestion: Use Read(int64_t nbytes, void* out) overload for the non-zero-copy path to write directly into the pre-allocated buffer.


Minor Suggestions 🟢

11. [NIT] RewriterProperties::Builder has virtual ~Builder() = default; — unnecessary. There's no reason for the builder to be polymorphic. No other Builder in Arrow/Parquet does this. Remove virtual.

12. [NIT] vecToString lambda in JoinRewriter — consider using arrow::internal::JoinStrings or similar existing utility instead of ad-hoc formatting.

13. [NIT] Missing test for encryption rejection. The code throws ParquetException::NYI for encrypted files. There should be a test verifying this error path.

14. [NIT] Missing test for page index and bloom filter round-trip. While enable_write_page_index is set in some tests, there's no explicit verification that page indices and bloom filters survive the rewrite. Consider adding assertions on metadata post-rewrite to verify index locations are present.


AI Code Smell 🤖

None confidently detected. The code has a consistent style, uses domain-appropriate terminology, and the abstractions (PagesRewriter → ColumnChunkRewriter → RowGroupRewriter → SingleFileRewriter → ConcatRewriter → JoinRewriter) mirror the Parquet file structure naturally. The TODO comments reference specific usernames/futures which is genuine.


Positive Notes 👍

  • Well-structured layered architecture: The hierarchy of rewriters (Pages → ColumnChunk → RowGroup → SingleFile → Concat → Join) cleanly mirrors the Parquet format hierarchy.
  • Smart fast-copy optimization: Detecting when compression matches to skip re-coding and directly memcpy binary data is the right approach.
  • Good error messages: Validation in JoinRewriter includes row count vectors in the error message, helpful for debugging.
  • ReaderProperties::GetStream const-correctness fix: Making it const is a correct improvement — the implementation was already const-safe.
  • Comprehensive test coverage for the main paths: Simple, concat, join, concat+join, mismatched row counts, invalid dimensions, add/change compression.

@HuaHuaY HuaHuaY force-pushed the fix_issue_47664 branch 3 times, most recently from d5876fc to ab74095 on February 26, 2026 09:14
@HuaHuaY HuaHuaY requested a review from Copilot February 26, 2026 09:15

Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.



Comment on lines +154 to +165
auto first_row_index =
original_offset_index_
? std::optional{original_offset_index_->page_locations()[page_no]
.first_row_index}
: std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
original_offset_index_ &&
!original_offset_index_->unencoded_byte_array_data_bytes().empty()
? std::optional{original_offset_index_
->unencoded_byte_array_data_bytes()[page_no]}
: std::nullopt;

Copilot AI Feb 26, 2026


There is no bounds checking when accessing page_locations()[page_no] and unencoded_byte_array_data_bytes()[page_no]. If the original_offset_index has fewer entries than the number of data pages (due to file corruption or malformed data), this will result in out-of-bounds access. Consider adding validation like: if (page_no >= original_offset_index_->page_locations().size()) before accessing the arrays.

Suggested change
auto first_row_index =
original_offset_index_
? std::optional{original_offset_index_->page_locations()[page_no]
.first_row_index}
: std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
original_offset_index_ &&
!original_offset_index_->unencoded_byte_array_data_bytes().empty()
? std::optional{original_offset_index_
->unencoded_byte_array_data_bytes()[page_no]}
: std::nullopt;
std::optional<int64_t> first_row_index;
if (original_offset_index_) {
const auto& locations = original_offset_index_->page_locations();
if (page_no < locations.size()) {
first_row_index = locations[page_no].first_row_index;
}
}
SizeStatistics size_statistics;
if (original_offset_index_) {
const auto& unencoded_bytes =
original_offset_index_->unencoded_byte_array_data_bytes();
if (!unencoded_bytes.empty() && page_no < unencoded_bytes.size()) {
size_statistics.unencoded_byte_array_data_bytes =
std::optional{unencoded_bytes[page_no]};
}
}

Comment on lines +201 to +208
original_offset_index_
? std::optional{original_offset_index_->page_locations()[page_no]
.first_row_index}
: std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
original_offset_index_ &&
!original_offset_index_->unencoded_byte_array_data_bytes().empty()

Copilot AI Feb 26, 2026


There is no bounds checking when accessing page_locations()[page_no] and unencoded_byte_array_data_bytes()[page_no]. If the original_offset_index has fewer entries than the number of data pages (due to file corruption or malformed data), this will result in out-of-bounds access. Consider adding validation like: if (page_no >= original_offset_index_->page_locations().size()) before accessing the arrays.

Suggested change
original_offset_index_
? std::optional{original_offset_index_->page_locations()[page_no]
.first_row_index}
: std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
original_offset_index_ &&
!original_offset_index_->unencoded_byte_array_data_bytes().empty()
original_offset_index_ &&
page_no < original_offset_index_->page_locations().size()
? std::optional{original_offset_index_->page_locations()[page_no]
.first_row_index}
: std::nullopt;
SizeStatistics size_statistics;
size_statistics.unencoded_byte_array_data_bytes =
original_offset_index_ &&
!original_offset_index_->unencoded_byte_array_data_bytes().empty() &&
page_no < original_offset_index_->unencoded_byte_array_data_bytes().size()

@HuaHuaY
Contributor Author

HuaHuaY commented Feb 26, 2026

I think this PR has implemented basic functions. Please take a look @pitrou @adamreeve


Copilot AI left a comment


Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.



Comment on lines +40 to +45
static std::unique_ptr<ParquetFileRewriter> Open(
std::vector<std::vector<std::shared_ptr<ArrowInputFile>>> sources,
std::shared_ptr<ArrowOutputStream> sink,
std::vector<std::vector<std::shared_ptr<FileMetaData>>> sources_metadata,
std::shared_ptr<const ::arrow::KeyValueMetadata> sink_metadata = NULLPTR,
std::shared_ptr<RewriterProperties> props = default_rewriter_properties());

Copilot AI Feb 26, 2026


The Open method parameters would benefit from documentation explaining the 2D vector structure. The outer vector represents groups of files to join (horizontally combine columns), and the inner vector represents files to concatenate (vertically stack row groups). Consider adding parameter documentation to clarify this structure and usage.
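The nested structure described above can be demonstrated with a self-contained sketch (files reduced to integer labels, and ValidSourceShape is a hypothetical check, not the PR's actual validation): the outer vector enumerates join groups whose columns are combined side by side, and each inner vector lists files whose row groups are concatenated vertically.

```cpp
#include <vector>

// sources[i][j]: outer index i = join group (horizontal column merge),
// inner index j = file within the group (vertical row-group concat).
// A plausible minimal validation: no empty dimension is allowed.
bool ValidSourceShape(const std::vector<std::vector<int>>& sources) {
  if (sources.empty()) return false;
  for (const auto& group : sources) {
    if (group.empty()) return false;
  }
  return true;
}
```

For example, `{{a, b}, {c}}` would mean: concatenate files a and b into one column block, then join that block side by side with file c.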

