
feat: Spark-4.0 widening type support #604

Merged
12 commits merged into apache:main on Jul 18, 2024

Conversation

kazuyukitanimura
Contributor

Which issue does this PR close?

Part of #372 and #551

Rationale for this change

To be ready for Spark 4.0

What changes are included in this PR?

This PR adds support for the type widening feature introduced in Spark 4.0.

How are these changes tested?

Enabled ParquetTypeWideningSuite

Contributor

@parthchandra parthchandra left a comment


lgtm

@kazuyukitanimura kazuyukitanimura marked this pull request as ready for review July 12, 2024 23:11
@kazuyukitanimura
Contributor Author

This is ready for review @andygrove @comphead @huaxingao @viirya
@parthchandra already approved it.

if dst_scale > src_scale {
    let mul = (10 as $dst_type).pow(dst_scale - src_scale);
    for i in 0..num {
        unsafe {
Member


Could you add SAFETY comments for the unsafe blocks that explain how/why they are actually safe?
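For illustration, a SAFETY comment in the requested style might look like the following. This is a hypothetical widening kernel, not the PR's actual code; the function name and shapes are illustrative:

```rust
// Hypothetical sketch: each `unsafe` block carries a SAFETY comment stating
// the invariant that makes it sound.
fn widen_i32_to_i64(src: &[i32], dst: &mut [i64]) {
    assert!(dst.len() >= src.len(), "Destination slice is too small");
    for i in 0..src.len() {
        // SAFETY: `i < src.len()` by the loop bound, and `dst.len() >=
        // src.len()` was checked above, so both accesses are in bounds.
        unsafe {
            *dst.get_unchecked_mut(i) = *src.get_unchecked(i) as i64;
        }
    }
}

fn main() {
    let src = [1i32, -2, 3];
    let mut dst = [0i64; 3];
    widen_i32_to_i64(&src, &mut dst);
    assert_eq!(dst, [1i64, -2, 3]);
}
```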

Contributor Author


updated
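As a self-contained sketch, the rescaling hunk quoted at the top of this thread roughly does the following (a safe-Rust approximation over i64 values; names and signature are illustrative, not the PR's macro-generated code):

```rust
// When the destination decimal scale is larger than the source scale, each
// value is multiplied by 10^(dst_scale - src_scale) to preserve its magnitude.
fn rescale_decimal64(values: &mut [i64], src_scale: u32, dst_scale: u32) {
    if dst_scale > src_scale {
        let mul = 10i64.pow(dst_scale - src_scale);
        for v in values.iter_mut() {
            *v *= mul;
        }
    }
}

fn main() {
    // e.g. 1.23 (scale 2) widened to scale 4 becomes 1.2300
    let mut vals = [123i64, -45];
    rescale_decimal64(&mut vals, 2, 4);
    assert_eq!(vals, [12300, -4500]);
}
```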

Member

@andygrove andygrove left a comment


I'm not an expert on this area but the changes seem reasonable to me. I think it would be good practice for us to document any unsafe code that we are adding.

{
    let mut offset = src.offset;
    for _ in 0..num {
        let v = &src_data[offset..offset + byte_width] as *const [u8] as *const u8
Member


Is there a reason why we have to use as *const [u8] as *const u8 as *const i32 rather than .as_ptr() as *const i32?
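For comparison, the .as_ptr() spelling reads like this (an illustrative sketch, not the PR's code; the helper and its inputs are hypothetical):

```rust
// Reading an i32 out of a byte slice: `.as_ptr()` yields *const u8 directly,
// so only one cast to *const i32 is needed instead of a chain of `as` casts.
fn first_as_i32(bytes: &[u8]) -> i32 {
    let byte_width = std::mem::size_of::<i32>();
    let slice = &bytes[0..byte_width];
    let p = slice.as_ptr() as *const i32;
    // SAFETY: `slice` holds at least 4 bytes, and `read_unaligned` tolerates
    // any alignment.
    unsafe { p.read_unaligned() }
}

fn main() {
    let bytes = 7i32.to_ne_bytes();
    assert_eq!(first_as_i32(&bytes), 7);
}
```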

Contributor Author


updated

@parthchandra
Contributor

btw, this is really a feat not a fix :)

@@ -54,7 +54,7 @@ fn bench(c: &mut Criterion) {
     );
     b.iter(|| {
         let cd = ColumnDescriptor::new(t.clone(), 0, 0, ColumnPath::from(Vec::new()));
-        let promition_info = TypePromotionInfo::new(PhysicalType::INT32, -1, -1);
+        let promition_info = TypePromotionInfo::new(PhysicalType::INT32, -1, -1, 32);
Contributor


Suggested change:
-let promition_info = TypePromotionInfo::new(PhysicalType::INT32, -1, -1, 32);
+let promotion_info = TypePromotionInfo::new(PhysicalType::INT32, -1, -1, 32);

Contributor Author


updated

@@ -48,6 +52,7 @@ make_type!(FLBADecimal32Type);
make_type!(FLBADecimal64Type);
make_type!(FLBAType);
make_type!(Int32DateType);
make_type!(Int32TimestampMicrosType);
Contributor


should we have Int32TimestampMillisType as well?

Contributor Author


We don't have a usage for millis for now, I think.

make_int_variant_dict_impl!(Int32To64Type, i32, i64);
make_int_variant_dict_impl!(Int32ToDecimal64Type, i32, i64);
make_int_variant_dict_impl!(Int32ToDoubleType, i32, f64);
make_int_variant_dict_impl!(Int32TimestampMicrosType, i32, i64);
Contributor


Int32 Millis?

Contributor Author


We don't have a usage for millis for now, I think.


// TODO: optimize this further as checking value one by one is not very efficient
unsafe {
    if unlikely(v.read_unaligned() < JULIAN_GREGORIAN_SWITCH_OFF_DAY) {
Contributor


Should we use unlikely, as it's a Rust nightly feature?
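For reference, unlikely is a nightly intrinsic (std::intrinsics::unlikely); on stable Rust a common approximation is a #[cold] marker function. The names below are illustrative:

```rust
// Stable-Rust approximation of the nightly `unlikely` intrinsic: calling a
// #[cold] function on the rare path hints the optimizer to lay out that
// branch out of line. Semantically it is an identity function on bool.
#[cold]
fn cold_path() {}

#[inline]
fn unlikely(b: bool) -> bool {
    if b {
        cold_path();
    }
    b
}

fn main() {
    assert!(unlikely(true));
    assert!(!unlikely(false));
}
```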

Contributor Author


if unlikely(v.read_unaligned() < JULIAN_GREGORIAN_SWITCH_OFF_DAY) {
    panic!(
        "Encountered timestamp value {}, which is before 1582-10-15 (counting \
        backwards from Unix eopch date 1970-01-01), and could be ambigous \
Contributor


Suggested change:
-backwards from Unix eopch date 1970-01-01), and could be ambigous \
+backwards from Unix epoch date 1970-01-01), and could be ambiguous \

Contributor Author


updated

generate_cast_to_signed!(copy_f32_to_f64, f32, f64);

fn copy_i64_to_i64(src: &[u8], dst: &mut [u8], num: usize) {
    debug_assert!(
Contributor


Wondering, can it be a single check?

Contributor Author


This is a debug_assert!, so it should be fine to check every time (it is compiled out in release builds).

        "Destination slice is too small"
    );

    bit::memcpy_value(src, 8 * num, dst);
Contributor


8 here is std::mem::size_of::&lt;i64&gt;()?
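A sketch of what tying the byte count to the element type could look like (a safe-Rust approximation that uses copy_from_slice in place of bit::memcpy_value; the function mirrors the copy_i64_to_i64 shape above but is illustrative):

```rust
use std::mem::size_of;

// Replacing the literal 8 with size_of::<i64>() ties the byte count to the
// element type, so a future type change cannot silently desynchronize them.
fn copy_i64_bytes(src: &[u8], dst: &mut [u8], num: usize) {
    let n = size_of::<i64>() * num;
    debug_assert!(src.len() >= n, "Source slice is too small");
    debug_assert!(dst.len() >= n, "Destination slice is too small");
    dst[..n].copy_from_slice(&src[..n]);
}

fn main() {
    let src: Vec<u8> = (0u8..16).collect();
    let mut dst = vec![0u8; 16];
    copy_i64_bytes(&src, &mut dst, 2); // copies 2 * 8 = 16 bytes
    assert_eq!(dst, src);
}
```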

Contributor Author


updated

Contributor

@comphead comphead left a comment


Thanks @kazuyukitanimura. I'm wondering, should we have tests to read those types?

@kazuyukitanimura kazuyukitanimura changed the title fix: Spark-4.0 widening type support feat: Spark-4.0 widening type support Jul 17, 2024
@kazuyukitanimura
Contributor Author

btw, this is really a feat not a fix :)

updated

@kazuyukitanimura
Contributor Author

Thanks @kazuyukitanimura. I'm wondering, should we have tests to read those types?

Thanks @comphead
Spark's ParquetTypeWideningSuite has good coverage.

@kazuyukitanimura
Contributor Author

@parthchandra @andygrove
After your approvals, I added a relevant change (3659f0a) to fix one more test.
Do you mind taking another look?

Contributor

@parthchandra parthchandra left a comment


LGTM. Except one comment (mostly for my understanding)

@kazuyukitanimura kazuyukitanimura merged commit 64b5f3c into apache:main Jul 18, 2024
75 checks passed
@kazuyukitanimura
Contributor Author

Merged, thank you @parthchandra @andygrove @comphead

himadripal pushed a commit to himadripal/datafusion-comet that referenced this pull request Sep 7, 2024