-
Notifications
You must be signed in to change notification settings - Fork 64
[DWIO] refactor the reader of dwrf/orc #261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Do we plan to open a PR to Velox upstream for this change? |
| } | ||
| } | ||
|
|
||
| void DwrfData::initDwrf(StripeStreams& stripe) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about we use a little template magic to reduce code duplication?
void DwrfData::init(StripeStreams& stripe) {
auto format = stripe.format();
if (format == DwrfFormat::kDwrf) {
init<DwrfFormat::kDwrf>(stripe);
} else {
VELOX_CHECK(format == DwrfFormat::kOrc);
init<DwrfFormat::kOrc>(stripe);
}
}
template <DwrfFormat kFormat>
void DwrfData::init(StripeStreams& stripe) {
EncodingKey encodingKey{nodeType_->id, flatMapContext_.sequence};
auto kind = kFormat == DwrfFormat::kDwrf ? proto::Stream_Kind_PRESENT : proto::orc::Stream_Kind_PRESENT;
std::unique_ptr<dwio::common::SeekableInputStream> stream = stripe.getStream(
encodingKey.forKind(kind), false);
if (stream) {
notNullDecoder_ = createBooleanRleDecoder(std::move(stream), encodingKey);
}
// We always initialize indexStream_ because indices are needed as
// soon as there is a single filter that can trigger row group skips
// anywhere in the reader tree. This is not known at construct time
// because the first filter can come from a hash join or other run
// time pushdown.
kind = kFormat == DwrfFormat::kDwrf ? proto::Stream_Kind_ROW_INDEX : proto::orc::Stream_Kind_ROW_INDEX;
indexStream_ = stripe.getStream(
encodingKey.forKind(proto::orc::Stream_Kind_ROW_INDEX), false);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the operator of condition ? a : b requires that type a is the same as type b, while proto::Stream_Kind_PRESENT is a different type from proto::orc::Stream_Kind_PRESENT, so it can't be compiled successfully.
let's keep it this way, a method for dwrf and another method for orc.
It's clean although it seems poor skills.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should avoid these duplications as it would increase maintenance burden. We can do it like this instead:
template <DwrfFormat kFormat>
void DwrfData::init(StripeStreams& stripe) {
EncodingKey encodingKey{nodeType_->id, flatMapContext_.sequence};
auto id = kFormat == DwrfFormat::kDwrf ? encodingKey.forKind(proto::Stream_Kind_PRESENT) : encodingKey.forKind(proto::orc::Stream_Kind_PRESENT);
std::unique_ptr<dwio::common::SeekableInputStream> stream = stripe.getStream(id, false);
if (stream) {
notNullDecoder_ = createBooleanRleDecoder(std::move(stream), encodingKey);
}
// We always initialize indexStream_ because indices are needed as
// soon as there is a single filter that can trigger row group skips
// anywhere in the reader tree. This is not known at construct time
// because the first filter can come from a hash join or other run
// time pushdown.
id = kFormat == DwrfFormat::kDwrf ? encodingKey.forKind(proto::Stream_Kind_ROW_INDEX) : encodingKey.forKind(proto::orc::Stream_Kind_ROW_INDEX);
indexStream_ = stripe.getStream(
encodingKey.forKind(proto::orc::Stream_Kind_ROW_INDEX), false);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, let‘s reduce the duplications in a new PR, I will ask my partner to do this for me in the following days.
He is a new member of our team, so it's a good start for him to make such changes.
Yes |
|
if(NOT DEFINED CMAKE_INSTALL_LIBDIR) |
velox/dwio/dwrf/common/Common.h
Outdated
| */ | ||
| std::string writerVersionToString(WriterVersion kind); | ||
|
|
||
| // Stream Kind of Dwrf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment starts with capitalized letter and ends with .. Same for below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
Co-authored-by: zuochunwei <zuochunwei@meituan.com>
…tpch workload (oap-project#261) * add the workload for tpch * Add the readme to tpch workload * refine readme Signed-off-by: Yuan Zhou <yuan.zhou@intel.com> Co-authored-by: Yuan Zhou <yuan.zhou@intel.com>
Why I do this? please refer to discussion
We use the
proto::Stream_Kind_xxxboth fordwrfandorc, and this usage is dangerous, so I refactor this part by the following modifications:proto::orc::StreamforDwrfStreamIdentifierforKind(const proto::orc::Stream_Kind kind)forEncodingKeyproto::orc::StripeFooter* footerOrc_for orcStripeStreams::getEncodingOrc()fororc, the response toStripeStreams::getEncoding()fordwrfinitfor sub-classes ofSelectiveColumnReader, check the format ininit(), callinitDwrffor dwrf and callinitOrcfor orc