GH-38865 [C++][Parquet] support passing a RowRange to RecordBatchReader #39608

binmahone · 2024-01-15T13:07:52Z

Rationale for this change

This is the same change as #38867, targeting the main branch.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Closes: [C++][Parquet] support passing a RowRange to RecordBatchReader #38865


wgtmac · 2024-01-18T01:42:43Z

cpp/src/parquet/column_reader.h

@@ -22,6 +22,7 @@
 #include <utility>
 #include <vector>

+#include "page_index.h"


Is this required? Could we use forward declaration instead?
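A minimal sketch of the forward-declaration alternative, assuming the header only needs the page-index type through pointers or smart pointers. `PageIndexInfo` and `ColumnReaderSketch` are hypothetical names, not types from the PR; the two halves would normally live in a header and a .cc file.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// ---- header side ----
// Hypothetical stand-in for a type defined in "page_index.h".
class PageIndexInfo;  // forward declaration: the header needs no #include

class ColumnReaderSketch {
 public:
  explicit ColumnReaderSketch(std::shared_ptr<PageIndexInfo> index);
  int64_t first_row() const;

 private:
  // A shared_ptr member is allowed to point to an incomplete type.
  std::shared_ptr<PageIndexInfo> index_;
};

// ---- .cc side: the full definition is only needed here ----
class PageIndexInfo {
 public:
  explicit PageIndexInfo(int64_t first_row) : first_row_(first_row) {}
  int64_t first_row() const { return first_row_; }

 private:
  int64_t first_row_;
};

ColumnReaderSketch::ColumnReaderSketch(std::shared_ptr<PageIndexInfo> index)
    : index_(std::move(index)) {}

int64_t ColumnReaderSketch::first_row() const { return index_->first_row(); }
```

This only works if the header never needs the size or members of the type; if it stores the type by value or calls its methods inline, the include is still required.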

wgtmac · 2024-01-18T01:43:23Z

cpp/src/parquet/column_reader.h

@@ -302,8 +303,150 @@ class TypedColumnReader : public ColumnReader {
                                          int32_t* dict_len) = 0;
 };

+// Represent a range to read. The range is inclusive on both ends.
+struct IntervalRange {


It would be good to move all row range stuff to a separate parquet/arrow/row_range.h

wgtmac · 2024-01-18T01:56:42Z

cpp/src/parquet/column_reader.h

+ public:
+  RowRanges() = default;
+  virtual ~RowRanges() = default;
+  virtual size_t RowCount() const = 0;


Suggested change

-  virtual size_t RowCount() const = 0;
+  /// \brief Total number of rows in the row ranges.
+  virtual size_t num_rows() const = 0;

Trivial getter functions like this should use snake_case, and we need to add docstrings to user-facing public APIs.

The same applies to similar APIs below.

wgtmac · 2024-01-18T02:27:46Z

cpp/src/parquet/column_reader.h

+  RowRanges() = default;
+  virtual ~RowRanges() = default;
+  virtual size_t RowCount() const = 0;
+  virtual int64_t LastRow() const = 0;


Suggested change

-  virtual int64_t LastRow() const = 0;
+  virtual int64_t last_row() const = 0;

For completeness, should we also provide first_row()?

wgtmac · 2024-01-18T02:29:41Z

cpp/src/parquet/column_reader.h

+  virtual ~RowRanges() = default;
+  virtual size_t RowCount() const = 0;
+  virtual int64_t LastRow() const = 0;
+  virtual bool IsValid() const = 0;


Do we actually need IsValid()? Is it possible to prohibit constructing invalid row ranges from the constructor?

wgtmac · 2024-01-18T03:14:58Z

cpp/src/parquet/column_reader.h

@@ -302,8 +303,150 @@ class TypedColumnReader : public ColumnReader {
                                          int32_t* dict_len) = 0;
 };

+// Represent a range to read. The range is inclusive on both ends.
+struct IntervalRange {
+  static IntervalRange Intersection(const IntervalRange& left,


My personal preference is to simply define it as below

struct IntervalRange { int64_t start; int64_t end; };

Then move all operations to a separate IntervalRangeUtil class. Users do not care about these operations.
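A rough sketch of that split, assuming the range stays inclusive on both ends. Returning `std::optional` from `Intersection` and the `Count` helper are assumptions of this sketch; the PR's own `Intersection` returns an `IntervalRange` directly.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Plain data: users only see the two bounds (both inclusive).
struct IntervalRange {
  int64_t start;
  int64_t end;
};

// Operations live in a separate utility class, away from the public struct.
struct IntervalRangeUtil {
  // Returns the overlap of the two ranges, or std::nullopt if they are disjoint.
  static std::optional<IntervalRange> Intersection(const IntervalRange& left,
                                                   const IntervalRange& right) {
    const int64_t start = std::max(left.start, right.start);
    const int64_t end = std::min(left.end, right.end);
    if (start > end) return std::nullopt;
    return IntervalRange{start, end};
  }

  // Number of rows covered by an inclusive range.
  static int64_t Count(const IntervalRange& range) {
    return range.end - range.start + 1;
  }
};
```

Keeping the struct as a bare aggregate lets callers write `IntervalRange{0, 9}` without learning any of the helper operations.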

wgtmac · 2024-01-18T03:15:44Z

cpp/src/parquet/column_reader.h

+// Represent a set of ranges to read. The ranges are sorted and non-overlapping.
+class RowRanges {
+ public:
+  RowRanges() = default;


Remove the default ctor?

BTW, we need some utility function to make it easy for users to create row ranges in the common case.

wgtmac · 2024-01-18T03:17:20Z

cpp/src/parquet/column_reader.h

+
+};
+
+class IntervalRanges : public RowRanges {


What about adding a separate row_range_internal.h to hold this class and its friends?

To me, IntervalRanges is not very "internal". Clients need to initialize their own IntervalRanges with classes like IntervalRange to pass into the API

Clients are expected to pass RowRanges (not IntervalRanges) and we should support API like below to facilitate creating RowRanges:

std::unique_ptr<RowRanges> RowRanges::Make(const std::vector<IntervalRange>& ranges);
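One way the proposed factory could look, as a sketch only. The concrete class name `IntervalRowRanges`, the validation rules, and reporting errors with `std::invalid_argument` are assumptions made here; the thread does not say how the real API should report malformed input. Validating in `Make` also speaks to the earlier question of whether `IsValid()` is needed at all.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Plain inclusive interval, as proposed earlier in the review.
struct IntervalRange {
  int64_t start;
  int64_t end;
};

// Public abstract interface: callers only ever hold a RowRanges.
class RowRanges {
 public:
  virtual ~RowRanges() = default;

  /// \brief Total number of rows in the row ranges.
  virtual size_t num_rows() const = 0;

  /// \brief Build row ranges from sorted, non-overlapping intervals.
  /// Throws std::invalid_argument on malformed input (assumption of this sketch).
  static std::unique_ptr<RowRanges> Make(const std::vector<IntervalRange>& ranges);
};

namespace {

// Concrete implementation, hidden from callers.
class IntervalRowRanges : public RowRanges {
 public:
  explicit IntervalRowRanges(std::vector<IntervalRange> ranges)
      : ranges_(std::move(ranges)) {}

  size_t num_rows() const override {
    size_t total = 0;
    for (const auto& r : ranges_) total += static_cast<size_t>(r.end - r.start + 1);
    return total;
  }

 private:
  std::vector<IntervalRange> ranges_;
};

}  // namespace

std::unique_ptr<RowRanges> RowRanges::Make(const std::vector<IntervalRange>& ranges) {
  int64_t prev_end = -1;
  for (const auto& r : ranges) {
    // Reject negative, inverted, unsorted, or overlapping intervals up front,
    // so an invalid RowRanges can never be constructed.
    if (r.start < 0 || r.start > r.end || r.start <= prev_end) {
      throw std::invalid_argument("row ranges must be valid, sorted and non-overlapping");
    }
    prev_end = r.end;
  }
  return std::make_unique<IntervalRowRanges>(ranges);
}
```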

wgtmac · 2024-01-18T03:19:07Z

cpp/src/parquet/column_reader.h

@@ -424,6 +567,10 @@ class PARQUET_EXPORT RecordReader {
  /// \brief True if reading dense for nullable columns.
  bool read_dense_for_nullable() const { return read_dense_for_nullable_; }

+  void reset_current_rg_processed_records() { current_rg_processed_records_ = 0; }
+
+  void set_record_skipper(const std::shared_ptr<RecordSkipper>& skipper) { skipper_ = skipper; }


Suggested change

-  void set_record_skipper(const std::shared_ptr<RecordSkipper>& skipper) { skipper_ = skipper; }
+  void set_record_skipper(std::shared_ptr<RecordSkipper> skipper) { skipper_ = std::move(skipper); }
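A small self-contained illustration of why the by-value form is preferred for a setter that stores its argument. `RecordSkipper` is reduced to an empty placeholder and `RecordReaderSketch` is a hypothetical stand-in for the real reader.

```cpp
#include <memory>
#include <utility>

struct RecordSkipper {};  // placeholder for the real class

class RecordReaderSketch {
 public:
  // Take by value, then move into the member: an lvalue argument costs one
  // copy, an rvalue argument costs no reference-count increment at all.
  void set_record_skipper(std::shared_ptr<RecordSkipper> skipper) {
    skipper_ = std::move(skipper);
  }

  const std::shared_ptr<RecordSkipper>& record_skipper() const { return skipper_; }

 private:
  std::shared_ptr<RecordSkipper> skipper_;
};
```

A caller that no longer needs its own handle can write `reader.set_record_skipper(std::move(skipper))` and hand over ownership without touching the reference count; the `const&` version always copies.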

wgtmac · 2024-01-18T03:19:28Z

cpp/src/parquet/column_reader.h

 namespace internal {

+// A RecordSkipper is used to skip unnecessary rows within each page.
+class PARQUET_EXPORT RecordSkipper {


Seems we can use forward declaration here and move it to the cpp file?

I tried this; it causes "invalid application of 'sizeof' to an incomplete type" (https://zhuanlan.zhihu.com/p/321947743).

You may want to apply method III by moving the definition of ~RecordReader() into column_reader.cc.
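A sketch of the out-of-line destructor approach described in the reply, assuming the skipper is held through `std::unique_ptr`, which is one common source of this error; the member type in the PR may differ. `RecordReaderSketch` is a hypothetical stand-in, and the two halves would normally be column_reader.h and column_reader.cc.

```cpp
#include <memory>

// ---- header side ----
class RecordSkipper;  // forward declaration only

class RecordReaderSketch {
 public:
  RecordReaderSketch();
  // Declared here, defined in the .cc file. An implicit or inline destructor
  // would instantiate unique_ptr's deleter while RecordSkipper is still
  // incomplete, which triggers the 'sizeof' error.
  ~RecordReaderSketch();

  bool has_skipper() const;

 private:
  std::unique_ptr<RecordSkipper> skipper_;
};

// ---- .cc side: RecordSkipper is complete from here on ----
class RecordSkipper {
 public:
  int skipped = 0;
};

RecordReaderSketch::RecordReaderSketch()
    : skipper_(std::make_unique<RecordSkipper>()) {}

// The deleter is instantiated here, where sizeof(RecordSkipper) is known.
RecordReaderSketch::~RecordReaderSketch() = default;

bool RecordReaderSketch::has_skipper() const { return skipper_ != nullptr; }
```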

wgtmac

Sorry for the delay. I was too busy this week.

wgtmac · 2024-01-26T14:19:07Z

cpp/src/parquet/CMakeLists.txt

@@ -162,6 +162,7 @@ set(PARQUET_SRCS
    arrow/writer.cc
    bloom_filter.cc
    bloom_filter_reader.cc
+    row_range.cc


Please sort it in alphabetical order.

wgtmac · 2024-01-26T14:25:05Z

cpp/src/parquet/column_reader.h

 namespace internal {

+// A RecordSkipper is used to skip unnecessary rows within each page.
+class PARQUET_EXPORT RecordSkipper {


You may want to apply method III by moving the definition of ~RecordReader() into column_reader.cc.

wgtmac · 2024-01-27T12:18:55Z

cpp/src/parquet/row_range.h

+#pragma once
+#include <variant>


Suggested change

-#pragma once
-#include <variant>
+#pragma once
+
+#include <variant>

We need to leave a blank line here.

wgtmac · 2024-01-27T12:51:18Z

cpp/src/parquet/row_range.h

+
+namespace parquet {
+
+// Represent a range to read. The range is inclusive on both ends.


Suggested change

-// Represent a range to read. The range is inclusive on both ends.
+// Represent an interval row range, which is inclusive on both ends.

wgtmac · 2024-01-27T12:57:36Z

cpp/src/parquet/row_range.h

+  }
+
+  // inclusive
+  int64_t start = -1;


Do you mean [-1,-1] is an invalid range? It looks a little bit weird to define an invalid range by default.

What about marking an invalid range simply by checking if start >= end? Or we can define a special invalid range like constexpr IntervalRange kInvalidIntervalRange = {-1, -1}; and do not allow creating any other invalid range via the constructor.
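The sentinel idea could be sketched as follows. Because the range is documented as inclusive on both ends, this sketch treats `start == end` as a valid one-row range and only `start > end` (or a negative start) as invalid; that is an interpretation made here, not something the thread settles. `IsValid` is a hypothetical free function.

```cpp
#include <cstdint>

// Plain aggregate with no default values, so nothing is "invalid by default".
struct IntervalRange {
  int64_t start;
  int64_t end;
};

// The single named sentinel for "no range".
constexpr IntervalRange kInvalidIntervalRange = {-1, -1};

// Assumption: valid means non-negative and not inverted (inclusive bounds).
constexpr bool IsValid(const IntervalRange& range) {
  return range.start >= 0 && range.start <= range.end;
}

static_assert(!IsValid(kInvalidIntervalRange), "the sentinel must be invalid");
```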

wgtmac · 2024-01-27T15:15:12Z

cpp/src/parquet/column_reader.cc

+
+  AdjustRanges(skip_pages, orig_row_ranges, row_ranges_);
+  range_iter_ = row_ranges_->NewIterator();
+  current_range_variant = range_iter_->NextRange();


Suggested change

-  current_range_variant = range_iter_->NextRange();
+  current_range_ = range_iter_->NextRange();

wgtmac · 2024-01-27T15:15:43Z

cpp/src/parquet/column_reader.cc

+  const auto ret = current_range.end - current_rg_processed + 1;
+  return ret;


Suggested change

-  const auto ret = current_range.end - current_rg_processed + 1;
-  return ret;
+  return current_range.end - current_rg_processed + 1;

wgtmac · 2024-01-27T15:17:19Z

cpp/src/parquet/arrow/reader.cc

@@ -325,19 +331,61 @@ class FileReaderImpl : public FileReader {
    return ReadRowGroup(i, Iota(reader_->metadata()->num_columns()), table);
  }

+  // This is an internal API owned by FileReaderImpl, not exposed in FileReader


Please move it into an anonymous namespace.

wgtmac · 2024-01-27T15:27:10Z

cpp/src/parquet/arrow/reader.cc

+    }
+    // We'll assign a RowRanges for each RG, even if it's not required to return any rows
+    std::vector<std::unique_ptr<RowRanges>> row_ranges_per_rg =
+        rows_to_return.SplitByRowRange(rows_per_rg);


Is it too early to split the RowRanges into row groups? We can probably do this lazily for each row group. For example, we can start with row group 0 and test if it falls into the range. If true, compute its overlapping range and move all remaining ranges for the next round (row group 1, 2, etc.)

In this way, we can make GetFieldReaders much simpler and less confusing.
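The lazy approach described above might look roughly like this. It assumes the file-level ranges are sorted, non-overlapping, and inclusive, and that row groups are visited in file order. `LazyRowGroupSplitter` and its `Next` method are hypothetical names, not part of the PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct IntervalRange {
  int64_t start;  // inclusive
  int64_t end;    // inclusive
};

// Hands out, one row group at a time, the part of the file-level ranges that
// falls inside that row group, translated to row-group-local row numbers.
class LazyRowGroupSplitter {
 public:
  explicit LazyRowGroupSplitter(std::vector<IntervalRange> file_ranges)
      : ranges_(std::move(file_ranges)) {}

  // Call once per row group, in file order, with that row group's row count.
  std::vector<IntervalRange> Next(int64_t num_rows) {
    const int64_t rg_start = next_rg_start_;
    const int64_t rg_end = rg_start + num_rows - 1;
    next_rg_start_ += num_rows;

    std::vector<IntervalRange> local;
    while (pos_ < ranges_.size() && ranges_[pos_].start <= rg_end) {
      const IntervalRange& r = ranges_[pos_];
      const int64_t start = std::max(r.start, rg_start);
      const int64_t end = std::min(r.end, rg_end);
      if (start <= end) local.push_back({start - rg_start, end - rg_start});
      if (r.end > rg_end) break;  // the rest of r belongs to later row groups
      ++pos_;                     // r is fully consumed
    }
    return local;
  }

 private:
  std::vector<IntervalRange> ranges_;
  std::size_t pos_ = 0;        // first range not yet fully consumed
  int64_t next_rg_start_ = 0;  // file-level row number of the next row group
};
```

An empty result from `Next` means the row group contributes no rows, so no per-row-group `RowRanges` has to be materialized for it in advance.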

wgtmac · 2024-01-27T15:30:13Z

cpp/src/parquet/arrow/reader.cc

-                        const std::shared_ptr<std::unordered_set<int>>& included_leaves,
-                        const std::vector<int>& row_groups,
-                        std::unique_ptr<ColumnReaderImpl>* out) {
+  Status GetFieldReader(


This function signature now looks strange to me since it contains conflicting parameters const std::vector<int>& row_groups and const std::shared_ptr<std::vector<std::unique_ptr<RowRanges>>>& row_ranges_per_rg.

What about splitting file-based row ranges lazily? Then we can re-define this as

// RowGroups can be either std::vector<int> or RowRanges
template <typename RowGroups>
Status GetFieldReader(int i,
                      const std::shared_ptr<std::unordered_set<int>>& included_leaves,
                      const RowGroups& row_groups,
                      std::unique_ptr<ColumnReaderImpl>* out);

Or if you still want to split file-based row ranges eagerly, we can do this:

// RowGroups can be either std::vector<int> or std::map<int, RowRanges>
template <typename RowGroups>
Status GetFieldReader(int i,
                      const std::shared_ptr<std::unordered_set<int>>& included_leaves,
                      const RowGroups& row_groups,
                      std::unique_ptr<ColumnReaderImpl>* out);

Then we can limit the scope of the refactoring work to overloading SomeRowGroupsFactory below at line 221:

ctx->iterator_factory = SomeRowGroupsFactory(row_groups);
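To illustrate the shape of that refactoring, here is a sketch in which a templated entry point stays generic and only an overloaded helper knows about the two selection kinds. `SelectedRowGroups`, `FirstSelectedRowGroup`, and `RowRangesSketch` are hypothetical stand-ins, not the real `SomeRowGroupsFactory` or `GetFieldReader`.

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct IntervalRange {
  int64_t start;
  int64_t end;
};

// Hypothetical stand-in for a per-row-group RowRanges.
using RowRangesSketch = std::vector<IntervalRange>;

// Overload 1: a plain list of row-group indices.
inline std::vector<int> SelectedRowGroups(const std::vector<int>& row_groups) {
  return row_groups;
}

// Overload 2: row-group index mapped to the ranges to read within it.
inline std::vector<int> SelectedRowGroups(
    const std::map<int, RowRangesSketch>& row_groups) {
  std::vector<int> ids;
  for (const auto& entry : row_groups) ids.push_back(entry.first);
  return ids;
}

// The templated caller does not branch on the selection kind; overload
// resolution picks the matching helper.
template <typename RowGroups>
int FirstSelectedRowGroup(const RowGroups& row_groups) {
  const std::vector<int> ids = SelectedRowGroups(row_groups);
  return ids.empty() ? -1 : ids.front();
}
```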

apacheGH-38865 [C++][Parquet] support passing a RowRange to RecordBat…

c98cfb7

…chReader

binmahone requested a review from wgtmac as a code owner January 15, 2024 13:07

github-actions bot added Component: Parquet Component: C++ awaiting review Awaiting review labels Jan 15, 2024

wgtmac requested changes Jan 18, 2024


github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jan 18, 2024

fix comments

60db7df

binmahone mentioned this pull request Jan 22, 2024

GH-38865 [C++][Parquet] support passing a RowRange to RecordBatchReader #39731

Open

wgtmac requested changes Jan 27, 2024

