GH-32723: [C++][Parquet] Add option to use LARGE* variants of binary types #35825
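For orientation, here is a minimal sketch of how the new reader option is expected to be enabled, pieced together from the APIs this diff touches (set_use_large_binary_variants, FileReaderBuilder). The `buffer` variable is an assumed stand-in for an open Parquet source; this mirrors the ReaderFromSink test helper below rather than prescribing the one true usage:

  // Sketch only: `buffer` is assumed to be a std::shared_ptr<arrow::Buffer>
  // holding Parquet data, as in the tests below.
  parquet::ArrowReaderProperties props = parquet::default_arrow_reader_properties();
  props.set_use_large_binary_variants(true);  // BYTE_ARRAY columns map to large_utf8()/large_binary()

  parquet::arrow::FileReaderBuilder builder;
  PARQUET_THROW_NOT_OK(builder.Open(std::make_shared<::arrow::io::BufferReader>(buffer)));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  PARQUET_THROW_NOT_OK(builder.properties(props)
                           ->memory_pool(::arrow::default_memory_pool())
                           ->Build(&reader));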

Open. Wants to merge 69 commits into base: main.

Commits (69)
e5e96ec
able to read the file
arthurpassos May 26, 2023
b9b48f8
remove diff out
arthurpassos May 26, 2023
ae62954
intermediate stage, not working properly anymore..
arthurpassos May 29, 2023
34917d5
still not working
arthurpassos May 29, 2023
835b07d
able to read the file again
arthurpassos May 29, 2023
50427c6
move use_binary_large_variants to arrowreaderproperties
arthurpassos May 30, 2023
df65ce7
cleanup a bit
arthurpassos May 30, 2023
e826b8e
back fromByteArray string & binary with setting
arthurpassos May 30, 2023
764ef98
some more adjustments
arthurpassos May 30, 2023
c6244ea
revert some stuff
arthurpassos May 30, 2023
5a4bbb0
revert some stuff
arthurpassos May 30, 2023
2d84e57
improvement
arthurpassos May 30, 2023
90f14df
remove dictionary64
arthurpassos May 30, 2023
b88b024
use 64bit on largebytearray class and initialize binary_large_variant…
arthurpassos May 31, 2023
0b53b05
add chunked string map test
arthurpassos May 31, 2023
f574e2e
add boolean comment
arthurpassos May 31, 2023
295e062
Make ChunkedRecordReader generic by using templates
arthurpassos May 31, 2023
25d7815
Make ByteArrayDictionaryReader generic with the use of templates
arthurpassos May 31, 2023
fe8d67b
make arrowbinaryhelper generic
arthurpassos May 31, 2023
35e5835
Make PlainByteArrayDecoder generic
arthurpassos May 31, 2023
9aff2f3
remove use_binary_large_variant from parquet reader properties
arthurpassos Jun 1, 2023
eb850c4
removed parquet::type::large_Byte_array
arthurpassos Jun 5, 2023
c2aab63
small adjustment
arthurpassos Jun 5, 2023
837ed6c
remove largebytearray class
arthurpassos Jun 6, 2023
35cdb99
simplify largebytearraytype a bit
arthurpassos Jun 6, 2023
a5000e1
simplify dictbytearraydecoderimpl a bit
arthurpassos Jun 6, 2023
eb71c17
remove one default argument
arthurpassos Jun 6, 2023
686a3f7
remove junk code
arthurpassos Jun 6, 2023
a61fc32
move use_binary_large_variant check inside frombytearray
arthurpassos Jun 6, 2023
e2600d0
simplify chunkedrecordreader a bit
arthurpassos Jun 6, 2023
3b86e23
simplify DictionaryRecordReaderImpl and fix DebugPrintState
arthurpassos Jun 6, 2023
cc027b7
simplify PlainByteArrayDecoderBase
arthurpassos Jun 6, 2023
177db7a
remove some todos
arthurpassos Jun 7, 2023
66223ee
Add comment explaining why struct LargeByteArrayType instead of alias
arthurpassos Jun 7, 2023
5cd39d8
address some pr comments
arthurpassos Jun 8, 2023
1089010
address a few more comments
arthurpassos Jun 8, 2023
a6c42ee
remove arrow-type include & move binarylimit trait
arthurpassos Jun 8, 2023
15be2a2
consolidate setdict
arthurpassos Jun 8, 2023
8d5ba3d
apply clangformat
arthurpassos Jun 8, 2023
fd8f979
removed todos
arthurpassos Jun 9, 2023
a5736d5
a bit more renaming
arthurpassos Jun 9, 2023
b4ecd0d
address one mor comment
arthurpassos Jun 9, 2023
9e9dff9
add overflow check in dict
arthurpassos Jun 9, 2023
ae1db20
address a few comments
arthurpassos Jun 12, 2023
09a9eaf
use int32_t explicitly
arthurpassos Jun 14, 2023
1664983
use template directly
arthurpassos Jun 14, 2023
322319e
use offset_type
arthurpassos Jun 15, 2023
1775a7a
address comments
arthurpassos Jun 15, 2023
7f6e2bf
address a few minor comments
arthurpassos Jun 16, 2023
75fb615
fix DictDecoderImpl
arthurpassos Jun 16, 2023
0801267
add non overflow test
arthurpassos Jun 16, 2023
7f09a16
string test
arthurpassos Jun 19, 2023
a8d20a4
address minor comments
arthurpassos Jun 20, 2023
5fcf4e1
use raw filereaderbuilder instead of adding a new openfile function
arthurpassos Jun 21, 2023
8901cbc
rename test
arthurpassos Jun 21, 2023
dff017a
update test file name
arthurpassos Jun 21, 2023
232e01f
update submodule?
arthurpassos Jun 21, 2023
d7d76c6
aply clang-format
arthurpassos Jun 21, 2023
90ceb07
address minor comments
arthurpassos Jun 22, 2023
0394963
delta & delta length for large*
arthurpassos Jun 22, 2023
a8df2e7
fix wrong if statements
arthurpassos Jun 22, 2023
2bb3b14
Template member variable as well
arthurpassos Jun 23, 2023
c114d44
add docstring
arthurpassos Jun 23, 2023
d1d5798
add LargeStringDictionary32Builder
arthurpassos Jun 23, 2023
0eaa60f
address a few comments
arthurpassos Jun 26, 2023
1e642fa
clang format
arthurpassos Jun 26, 2023
b299497
add binarypacked test for largebinaryvariant
arthurpassos Jun 26, 2023
2c23dd7
Revert "add binarypacked test for largebinaryvariant"
arthurpassos Jun 27, 2023
eca9d6f
only run largebinary tests if system is 64bit
arthurpassos Jul 6, 2023
Changes from all commits
2 changes: 2 additions & 0 deletions cpp/src/arrow/array/builder_dict.h
@@ -724,6 +724,8 @@ using BinaryDictionaryBuilder = DictionaryBuilder<BinaryType>;
using StringDictionaryBuilder = DictionaryBuilder<StringType>;
using BinaryDictionary32Builder = Dictionary32Builder<BinaryType>;
using StringDictionary32Builder = Dictionary32Builder<StringType>;
using LargeBinaryDictionary32Builder = Dictionary32Builder<LargeBinaryType>;
using LargeStringDictionary32Builder = Dictionary32Builder<LargeStringType>;

/// @}

2 changes: 2 additions & 0 deletions cpp/src/arrow/type.h
@@ -678,6 +678,8 @@ class ARROW_EXPORT BaseBinaryType : public DataType {

constexpr int64_t kBinaryMemoryLimit = std::numeric_limits<int32_t>::max() - 1;

constexpr int64_t kLargeBinaryMemoryLimit = std::numeric_limits<int64_t>::max() - 1;

/// \addtogroup binary-datatypes
///
/// @{
111 changes: 97 additions & 14 deletions cpp/src/parquet/arrow/arrow_reader_writer_test.cc
@@ -438,11 +438,11 @@ void CheckConfiguredRoundtrip(
void DoSimpleRoundtrip(const std::shared_ptr<Table>& table, bool use_threads,
int64_t row_group_size, const std::vector<int>& column_subset,
std::shared_ptr<Table>* out,
const std::shared_ptr<ArrowWriterProperties>& arrow_properties =
default_arrow_writer_properties()) {
const std::shared_ptr<ArrowWriterProperties>&
arrow_writer_properties = default_arrow_writer_properties()) {
std::shared_ptr<Buffer> buffer;
ASSERT_NO_FATAL_FAILURE(
WriteTableToBuffer(table, row_group_size, arrow_properties, &buffer));
WriteTableToBuffer(table, row_group_size, arrow_writer_properties, &buffer));

std::unique_ptr<FileReader> reader;
ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
@@ -610,9 +610,18 @@ class ParquetIOTestBase : public ::testing::Test {
}

void ReaderFromSink(std::unique_ptr<FileReader>* out) {
return ReaderFromSink(out, default_arrow_reader_properties());
}

void ReaderFromSink(std::unique_ptr<FileReader>* out,
const ArrowReaderProperties& arrow_reader_properties) {
ASSERT_OK_AND_ASSIGN(auto buffer, sink_->Finish());
ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
::arrow::default_memory_pool(), out));

FileReaderBuilder builder;
ASSERT_OK_NO_THROW(builder.Open(std::make_shared<BufferReader>(buffer)));
ASSERT_OK_NO_THROW(builder.properties(arrow_reader_properties)
->memory_pool(::arrow::default_memory_pool())
->Build(out));
}

void ReadSingleColumnFile(std::unique_ptr<FileReader> file_reader,
@@ -660,18 +669,20 @@ class ParquetIOTestBase : public ::testing::Test {

void RoundTripSingleColumn(
const std::shared_ptr<Array>& values, const std::shared_ptr<Array>& expected,
const std::shared_ptr<::parquet::ArrowWriterProperties>& arrow_properties,
const std::shared_ptr<::parquet::ArrowWriterProperties>& arrow_writer_properties,
const ArrowReaderProperties& arrow_reader_properties =
default_arrow_reader_properties(),
bool nullable = true) {
std::shared_ptr<Table> table = MakeSimpleTable(values, nullable);
this->ResetSink();
ASSERT_OK_NO_THROW(WriteTable(*table, ::arrow::default_memory_pool(), this->sink_,
values->length(), default_writer_properties(),
arrow_properties));
arrow_writer_properties));

std::shared_ptr<Table> out;
std::unique_ptr<FileReader> reader;
ASSERT_NO_FATAL_FAILURE(this->ReaderFromSink(&reader));
const bool expect_metadata = arrow_properties->store_schema();
ASSERT_NO_FATAL_FAILURE(this->ReaderFromSink(&reader, arrow_reader_properties));
const bool expect_metadata = arrow_writer_properties->store_schema();
ASSERT_NO_FATAL_FAILURE(
this->ReadTableFromFile(std::move(reader), expect_metadata, &out));
ASSERT_EQ(1, out->num_columns());
@@ -1342,6 +1353,23 @@ TEST_F(TestUInt32ParquetIO, Parquet_1_0_Compatibility) {

using TestStringParquetIO = TestParquetIO<::arrow::StringType>;

#if defined(_WIN64) || defined(__LP64__)
Review comment (Member): I don't understand this condition. Which platforms is it excluding and why?

Large binary data is supposed to work on every platform, so there should be no reason to skip some platforms here.

TEST_F(TestStringParquetIO, SmallStringWithLargeBinaryVariantSetting) {
auto values = ArrayFromJSON(::arrow::utf8(), R"(["foo", "", null, "bar"])");

this->RoundTripSingleColumn(values, values, default_arrow_writer_properties());

ArrowReaderProperties arrow_reader_properties;
arrow_reader_properties.set_use_large_binary_variants(true);

ASSERT_OK_AND_ASSIGN(std::shared_ptr<Array> casted,
::arrow::compute::Cast(*values, ::arrow::large_utf8()));

this->RoundTripSingleColumn(values, casted, default_arrow_writer_properties(),
arrow_reader_properties);
}
#endif

TEST_F(TestStringParquetIO, EmptyStringColumnRequiredWrite) {
std::shared_ptr<Array> values;
::arrow::StringBuilder builder;
Expand Down Expand Up @@ -1369,6 +1397,7 @@ TEST_F(TestStringParquetIO, EmptyStringColumnRequiredWrite) {

using TestLargeBinaryParquetIO = TestParquetIO<::arrow::LargeBinaryType>;

#if defined(_WIN64) || defined(__LP64__)
Review comment (Member): Again, it does not seem right that you are restricting tests that used to work on every platform (and that have no obvious reason to fail on some platforms).

TEST_F(TestLargeBinaryParquetIO, Basics) {
const char* json = "[\"foo\", \"\", null, \"\xff\"]";

@@ -1388,6 +1417,13 @@ TEST_F(TestLargeBinaryParquetIO, Basics) {
const auto arrow_properties =
::parquet::ArrowWriterProperties::Builder().store_schema()->build();
this->RoundTripSingleColumn(large_array, large_array, arrow_properties);

ArrowReaderProperties arrow_reader_properties;
arrow_reader_properties.set_use_large_binary_variants(true);
// Input is narrow array, but expected output is large array, opposite of the above
// tests. This validates narrow arrays can be read as large arrays.
this->RoundTripSingleColumn(narrow_array, large_array,
default_arrow_writer_properties(), arrow_reader_properties);
}

using TestLargeStringParquetIO = TestParquetIO<::arrow::LargeStringType>;
Expand All @@ -1412,6 +1448,7 @@ TEST_F(TestLargeStringParquetIO, Basics) {
::parquet::ArrowWriterProperties::Builder().store_schema()->build();
this->RoundTripSingleColumn(large_array, large_array, arrow_properties);
}
#endif

using TestNullParquetIO = TestParquetIO<::arrow::NullType>;

@@ -3834,13 +3871,14 @@ TEST(TestImpalaConversion, ArrowTimestampToImpalaTimestamp) {
ASSERT_EQ(expected, calculated);
}

void TryReadDataFile(const std::string& path,
::arrow::StatusCode expected_code = ::arrow::StatusCode::OK) {
void TryReadDataFileWithProperties(
const std::string& path, const ArrowReaderProperties& properties,
::arrow::StatusCode expected_code = ::arrow::StatusCode::OK) {
auto pool = ::arrow::default_memory_pool();

std::unique_ptr<FileReader> arrow_reader;
Status s =
FileReader::Make(pool, ParquetFileReader::OpenFile(path, false), &arrow_reader);
Status s = FileReader::Make(pool, ParquetFileReader::OpenFile(path, false), properties,
&arrow_reader);
if (s.ok()) {
std::shared_ptr<::arrow::Table> table;
s = arrow_reader->ReadTable(&table);
Expand All @@ -3851,6 +3889,11 @@ void TryReadDataFile(const std::string& path,
<< ", but got " << s.ToString();
}

void TryReadDataFile(const std::string& path,
::arrow::StatusCode expected_code = ::arrow::StatusCode::OK) {
TryReadDataFileWithProperties(path, default_arrow_reader_properties(), expected_code);
}

TEST(TestArrowReaderAdHoc, Int96BadMemoryAccess) {
// PARQUET-995
TryReadDataFile(test::get_data_file("alltypes_plain.parquet"));
Expand All @@ -3862,6 +3905,19 @@ TEST(TestArrowReaderAdHoc, CorruptedSchema) {
TryReadDataFile(path, ::arrow::StatusCode::IOError);
}

#if defined(ARROW_WITH_BROTLI) && defined(__LP64__)
Review comment (Member): I still don't understand what __LP64__ is doing here. If you really want to single out 64-bit platforms, you could instead do something like:

  if (sizeof(void*) < 8) {
    GTEST_SKIP() << "Test only runs on 64-bit platforms as it allocates more than 2GB RAM";
  }

Review comment (Member): I also see that this test takes 18 seconds in debug mode. This seems a bit excessive :-/

TEST(TestArrowParquet, LargeByteArray) {
auto path = test::get_data_file("large_string_map.brotli.parquet");
TryReadDataFile(path, ::arrow::StatusCode::NotImplemented);
ArrowReaderProperties reader_properties;
reader_properties.set_use_large_binary_variants(true);
reader_properties.set_read_dictionary(0, false);
TryReadDataFileWithProperties(path, reader_properties);
reader_properties.set_read_dictionary(0, true);
TryReadDataFileWithProperties(path, reader_properties);
}
#endif

TEST(TestArrowReaderAdHoc, LARGE_MEMORY_TEST(LargeStringColumn)) {
// ARROW-3762
::arrow::StringBuilder builder;
@@ -4548,16 +4604,22 @@ TEST(TestArrowWriteDictionaries, NestedSubfield) {
class TestArrowReadDeltaEncoding : public ::testing::Test {
public:
void ReadTableFromParquetFile(const std::string& file_name,
const ArrowReaderProperties& properties,
std::shared_ptr<Table>* out) {
auto file = test::get_data_file(file_name);
auto pool = ::arrow::default_memory_pool();
std::unique_ptr<FileReader> parquet_reader;
ASSERT_OK(FileReader::Make(pool, ParquetFileReader::OpenFile(file, false),
ASSERT_OK(FileReader::Make(pool, ParquetFileReader::OpenFile(file, false), properties,
&parquet_reader));
ASSERT_OK(parquet_reader->ReadTable(out));
ASSERT_OK((*out)->ValidateFull());
}

void ReadTableFromParquetFile(const std::string& file_name,
std::shared_ptr<Table>* out) {
return ReadTableFromParquetFile(file_name, default_arrow_reader_properties(), out);
}

void ReadTableFromCSVFile(const std::string& file_name,
const ::arrow::csv::ConvertOptions& convert_options,
std::shared_ptr<Table>* out) {
Expand Down Expand Up @@ -4605,6 +4667,27 @@ TEST_F(TestArrowReadDeltaEncoding, DeltaByteArray) {
::arrow::AssertTablesEqual(*actual_table, *expect_table, false);
}

TEST_F(TestArrowReadDeltaEncoding, DeltaByteArrayWithLargeBinaryVariant) {
std::shared_ptr<::arrow::Table> actual_table, expect_table;
ArrowReaderProperties properties;
properties.set_use_large_binary_variants(true);

ReadTableFromParquetFile("delta_byte_array.parquet", properties, &actual_table);

auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
std::vector<std::string> column_names = {
"c_customer_id", "c_salutation", "c_first_name",
"c_last_name", "c_preferred_cust_flag", "c_birth_country",
"c_login", "c_email_address", "c_last_review_date"};
for (auto name : column_names) {
convert_options.column_types[name] = ::arrow::large_utf8();
}
convert_options.strings_can_be_null = true;
ReadTableFromCSVFile("delta_byte_array_expect.csv", convert_options, &expect_table);
Review comment (Member), on lines +4677 to +4686: Looks like you could factor this out in the test fixture.


::arrow::AssertTablesEqual(*actual_table, *expect_table, false);
}
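
As a rough illustration of the reviewer's suggestion above, the expected-table setup could be factored into the test fixture. This is a hypothetical sketch, not part of the diff: the helper name ReadExpectedCSVTable and its use_large_variants flag are invented for illustration.

  // Hypothetical fixture helper: builds the expected table from the CSV file,
  // optionally widening the string columns to large_utf8().
  void ReadExpectedCSVTable(bool use_large_variants, std::shared_ptr<Table>* out) {
    auto convert_options = ::arrow::csv::ConvertOptions::Defaults();
    const std::vector<std::string> column_names = {
        "c_customer_id", "c_salutation",          "c_first_name",
        "c_last_name",   "c_preferred_cust_flag", "c_birth_country",
        "c_login",       "c_email_address",       "c_last_review_date"};
    for (const auto& name : column_names) {
      convert_options.column_types[name] =
          use_large_variants ? ::arrow::large_utf8() : ::arrow::utf8();
    }
    convert_options.strings_can_be_null = true;
    ReadTableFromCSVFile("delta_byte_array_expect.csv", convert_options, out);
  }

Both DeltaByteArray tests could then call this helper with the appropriate flag.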

TEST_F(TestArrowReadDeltaEncoding, IncrementalDecodeDeltaByteArray) {
auto file = test::get_data_file("delta_byte_array.parquet");
auto pool = ::arrow::default_memory_pool();
5 changes: 4 additions & 1 deletion cpp/src/parquet/arrow/reader.cc
@@ -219,6 +219,7 @@ class FileReaderImpl : public FileReader {
ctx->iterator_factory = SomeRowGroupsFactory(row_groups);
ctx->filter_leaves = true;
ctx->included_leaves = included_leaves;
ctx->use_large_binary_variants = reader_properties_.use_large_binary_variants();
return GetReader(manifest_.schema_fields[i], ctx, out);
}

@@ -462,7 +463,8 @@ class LeafReader : public ColumnReaderImpl {
input_(std::move(input)),
descr_(input_->descr()) {
record_reader_ = RecordReader::Make(
descr_, leaf_info, ctx_->pool, field_->type()->id() == ::arrow::Type::DICTIONARY);
descr_, leaf_info, ctx_->pool, field_->type()->id() == ::arrow::Type::DICTIONARY,
/*read_dense_for_nullable*/ false, ctx_->use_large_binary_variants);
NextRowGroup();
}

@@ -1218,6 +1220,7 @@ Status FileReaderImpl::GetColumn(int i, FileColumnIteratorFactory iterator_facto
ctx->pool = pool_;
ctx->iterator_factory = iterator_factory;
ctx->filter_leaves = false;
ctx->use_large_binary_variants = reader_properties_.use_large_binary_variants();
std::unique_ptr<ColumnReaderImpl> result;
RETURN_NOT_OK(GetReader(manifest_.schema_fields[i], ctx, &result));
*out = std::move(result);
5 changes: 3 additions & 2 deletions cpp/src/parquet/arrow/reader_internal.cc
@@ -487,8 +487,9 @@ Status TransferBinary(RecordReader* reader, MemoryPool* pool,
auto chunks = binary_reader->GetBuilderChunks();
for (auto& chunk : chunks) {
if (!chunk->type()->Equals(*logical_type_field->type())) {
// XXX: if a LargeBinary chunk is larger than 2GB, the MSBs of offsets
// will be lost because they are first created as int32 and then cast to int64.
// If a LargeBinary chunk is larger than 2GB and use_large_binary_variants
Review comment (Member): I would keep the XXX because it is a gotcha.

Review comment (Member): Good to know something new :)

// is not set, the MSBs of offsets will be lost because they are first created
// as int32 and then cast to int64.
ARROW_ASSIGN_OR_RAISE(
chunk,
::arrow::compute::Cast(*chunk, logical_type_field->type(), cast_options, &ctx));
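
Since the gotcha in the comment above trips people up, here is a standalone illustration (not code from this diff) of why widening after the fact cannot recover the lost high bits:

  // Sketch: 3 GiB does not fit in int32, so a narrow offset has already
  // wrapped before the widening cast happens; the cast cannot undo that.
  int64_t real_offset = int64_t{3} * 1024 * 1024 * 1024;  // 3 GiB
  int32_t narrow = static_cast<int32_t>(real_offset);     // wraps to -1073741824
  int64_t widened = static_cast<int64_t>(narrow);         // still -1073741824; MSBs gone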
1 change: 1 addition & 0 deletions cpp/src/parquet/arrow/reader_internal.h
@@ -109,6 +109,7 @@ struct ReaderContext {
FileColumnIteratorFactory iterator_factory;
bool filter_leaves;
std::shared_ptr<std::unordered_set<int>> included_leaves;
bool use_large_binary_variants = false;

bool IncludesLeaf(int leaf_index) const {
if (this->filter_leaves) {
7 changes: 5 additions & 2 deletions cpp/src/parquet/arrow/schema.cc
@@ -462,7 +462,9 @@ struct SchemaTreeContext {

bool IsDictionaryReadSupported(const ArrowType& type) {
// Only supported currently for BYTE_ARRAY types
return type.id() == ::arrow::Type::BINARY || type.id() == ::arrow::Type::STRING;
return type.id() == ::arrow::Type::BINARY || type.id() == ::arrow::Type::STRING ||
type.id() == ::arrow::Type::LARGE_BINARY ||
type.id() == ::arrow::Type::LARGE_STRING;
}

// ----------------------------------------------------------------------
@@ -473,7 +475,8 @@ ::arrow::Result<std::shared_ptr<ArrowType>> GetTypeForNode(
SchemaTreeContext* ctx) {
ASSIGN_OR_RAISE(
std::shared_ptr<ArrowType> storage_type,
GetArrowType(primitive_node, ctx->properties.coerce_int96_timestamp_unit()));
GetArrowType(primitive_node, ctx->properties.coerce_int96_timestamp_unit(),
ctx->properties.use_large_binary_variants()));
if (ctx->properties.read_dictionary(column_index) &&
IsDictionaryReadSupported(*storage_type)) {
return ::arrow::dictionary(::arrow::int32(), storage_type);
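
Taken together with the IsDictionaryReadSupported change above, a dictionary-read BYTE_ARRAY column keeps its dictionary encoding under the new option. A small illustrative check, inferred from this hunk rather than copied from the diff:

  // With use_large_binary_variants on and read_dictionary(i, true), the
  // storage type is large_utf8(), so the resulting column type should be:
  auto expected_type = ::arrow::dictionary(::arrow::int32(), ::arrow::large_utf8());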
16 changes: 9 additions & 7 deletions cpp/src/parquet/arrow/schema_internal.cc
@@ -110,17 +110,18 @@ Result<std::shared_ptr<ArrowType>> MakeArrowTimestamp(const LogicalType& logical
}
}

Result<std::shared_ptr<ArrowType>> FromByteArray(const LogicalType& logical_type) {
Result<std::shared_ptr<ArrowType>> FromByteArray(const LogicalType& logical_type,
bool use_large_binary_variants) {
switch (logical_type.type()) {
case LogicalType::Type::STRING:
return ::arrow::utf8();
return use_large_binary_variants ? ::arrow::large_utf8() : ::arrow::utf8();
case LogicalType::Type::DECIMAL:
return MakeArrowDecimal(logical_type);
case LogicalType::Type::NONE:
case LogicalType::Type::ENUM:
case LogicalType::Type::JSON:
case LogicalType::Type::BSON:
return ::arrow::binary();
return use_large_binary_variants ? ::arrow::large_binary() : ::arrow::binary();
default:
return Status::NotImplemented("Unhandled logical logical_type ",
logical_type.ToString(), " for binary array");
@@ -181,7 +182,7 @@ Result<std::shared_ptr<ArrowType>> FromInt64(const LogicalType& logical_type) {

Result<std::shared_ptr<ArrowType>> GetArrowType(
Type::type physical_type, const LogicalType& logical_type, int type_length,
const ::arrow::TimeUnit::type int96_arrow_time_unit) {
const ::arrow::TimeUnit::type int96_arrow_time_unit, bool use_large_binary_variants) {
if (logical_type.is_invalid() || logical_type.is_null()) {
return ::arrow::null();
}
Expand All @@ -200,7 +201,7 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(
case ParquetType::DOUBLE:
return ::arrow::float64();
case ParquetType::BYTE_ARRAY:
return FromByteArray(logical_type);
return FromByteArray(logical_type, use_large_binary_variants);
case ParquetType::FIXED_LEN_BYTE_ARRAY:
return FromFLBA(logical_type, type_length);
default: {
@@ -213,9 +214,10 @@ Result<std::shared_ptr<ArrowType>> GetArrowType(

Result<std::shared_ptr<ArrowType>> GetArrowType(
const schema::PrimitiveNode& primitive,
const ::arrow::TimeUnit::type int96_arrow_time_unit) {
const ::arrow::TimeUnit::type int96_arrow_time_unit, bool use_large_binary_variants) {
return GetArrowType(primitive.physical_type(), *primitive.logical_type(),
primitive.type_length(), int96_arrow_time_unit);
primitive.type_length(), int96_arrow_time_unit,
use_large_binary_variants);
}

} // namespace arrow
13 changes: 9 additions & 4 deletions cpp/src/parquet/arrow/schema_internal.h
@@ -29,23 +29,28 @@ namespace arrow {

using ::arrow::Result;

Result<std::shared_ptr<::arrow::DataType>> FromByteArray(const LogicalType& logical_type);
Result<std::shared_ptr<::arrow::DataType>> FromByteArray(const LogicalType& logical_type,
bool use_large_binary_variants);

Result<std::shared_ptr<::arrow::DataType>> FromFLBA(const LogicalType& logical_type,
int32_t physical_length);
Result<std::shared_ptr<::arrow::DataType>> FromInt32(const LogicalType& logical_type);
Result<std::shared_ptr<::arrow::DataType>> FromInt64(const LogicalType& logical_type);

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(Type::type physical_type,
const LogicalType& logical_type,
int type_length);
int type_length,
bool use_large_binary_variants);

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
Type::type physical_type, const LogicalType& logical_type, int type_length,
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO);
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO,
bool use_large_binary_variants = false);

Result<std::shared_ptr<::arrow::DataType>> GetArrowType(
const schema::PrimitiveNode& primitive,
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO);
::arrow::TimeUnit::type int96_arrow_time_unit = ::arrow::TimeUnit::NANO,
bool use_large_binary_variants = false);

} // namespace arrow
} // namespace parquet