Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

range end index 294912 out of range for slice of length 147456 #540

Closed
andygrove opened this issue Jun 7, 2024 · 11 comments · Fixed by #584
Closed

range end index 294912 out of range for slice of length 147456 #540

andygrove opened this issue Jun 7, 2024 · 11 comments · Fixed by #584
Assignees
Labels
bug Something isn't working
Milestone

Comments

@andygrove
Copy link
Member

Describe the bug

I just saw this when running benchmarks with latest from main and with xxhash64 disabled.

Caused by: org.apache.comet.CometNativeException: General execution error with reason org.apache.comet.CometNativeException: range end index 294912 out of range for slice of length 147456
        at comet::errors::init::{{closure}}(__internal__:0)
        at std::panicking::rust_panic_with_hook(__internal__:0)
        at std::panicking::begin_panic_handler::{{closure}}(__internal__:0)
        at std::sys_common::backtrace::__rust_end_short_backtrace(__internal__:0)
        at rust_begin_unwind(__internal__:0)
        at core::panicking::panic_fmt(__internal__:0)
        at core::slice::index::slice_end_index_len_fail(__internal__:0)
        at arrow_data::transform::variable_size::build_extend::{{closure}}(__internal__:0)
        at arrow_data::transform::MutableArrayData::extend(__internal__:0)
        at comet::execution::operators::copy_array(__internal__:0)
        at comet::execution::operators::copy_or_cast_array(__internal__:0)
        at <comet::execution::operators::copy::CopyStream as futures_core::stream::Stream>::poll_next(__internal__:0)
        at <datafusion_physical_plan::stream::RecordBatchStreamAdapter<S> as futures_core::stream::Stream>::poll_next(__internal__:0)
        at Java_org_apache_comet_Native_executePlan(__internal__:0)
        at <unknown>(__internal__:0).
	at org.apache.comet.Native.executePlan(Native Method)
	at org.apache.comet.CometExecIterator.getNextBatch(CometExecIterator.scala:102)
	at org.apache.comet.CometExecIterator.hasNext(CometExecIterator.scala:125)
	at org.apache.spark.sql.comet.execution.shuffle.CometShuffleWriteProcessor.write(CometShuffleExchangeExec.scala:492)

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

@andygrove andygrove added the bug Something isn't working label Jun 7, 2024
@andygrove andygrove added this to the 0.1.0 milestone Jun 7, 2024
@viirya
Copy link
Member

viirya commented Jun 7, 2024

@andygrove Is there any reproducible example?

@viirya
Copy link
Member

viirya commented Jun 7, 2024

Btw, it doesn't fail in shuffle writer.

From the stack trace, it looks like when the writer to pull next batch from its upstream, the error happened.

@andygrove
Copy link
Member Author

@andygrove Is there any reproducible example?

I was running TPC-H @ 100gb

@andygrove andygrove changed the title shuffle write fails with out of range slice range end index 294912 out of range for slice of length 147456 Jun 8, 2024
@andygrove
Copy link
Member Author

It failed on q10. I am adding debug logging to see if I can track this down.

@andygrove
Copy link
Member Author

The error is happening in unsafe code in arrow-rs.

Here is some debug output showing the calls leading up the the error:

copy_or_cast_array() len=8192, type=Utf8
copy_array(typeUtf8, len=8192) before mutable.extend(0, 0, 8192) data len = 8192
getNextBatch id=2284, plan=135761185342032 failed during native execution: range end index 294912 out of range for slice of length 147456

Note that there are many earlier calls that look identical and do not fail.

The error happens in arrow_data::transform::variable_size::build_extend:

        at core::panicking::panic_fmt(__internal__:0)
        at core::slice::index::slice_end_index_len_fail(__internal__:0)
        at arrow_data::transform::variable_size::build_extend::{{closure}}(__internal__:0)
        at arrow_data::transform::MutableArrayData::extend(__internal__:0)
        at comet::execution::operators::copy_array(__internal__:0)
        at comet::execution::operators::copy_or_cast_array(__internal__:0)

This function calls get_last_offset to get the last offset and there are some documented assumptions. I wonder if we are violating any of these?

pub(super) unsafe fn get_last_offset<T: ArrowNativeType>(offset_buffer: &MutableBuffer) -> T {
    // JUSTIFICATION
    //  Benefit
    //      20% performance improvement extend of variable sized arrays (see bench `mutable_array`)
    //  Soundness
    //      * offset buffer is always extended in slices of T and aligned accordingly.
    //      * Buffer[0] is initialized with one element, 0, and thus `mutable_offsets.len() - 1` is always valid.
    let (prefix, offsets, suffix) = offset_buffer.as_slice().align_to::<T>();
    debug_assert!(prefix.is_empty() && suffix.is_empty());
    *offsets.get_unchecked(offsets.len() - 1)
}

@andygrove
Copy link
Member Author

I ran with a debug build and got a more detailed stack trace:

        at core::panicking::panic_fmt(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/panicking.rs:72)
        at core::slice::index::slice_end_index_len_fail_rt(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:76)
        at core::slice::index::slice_end_index_len_fail(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:68)
        at <core::ops::range::Range<usize> as core::slice::index::SliceIndex<[T]>>::index(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:394)
        at core::slice::index::<impl core::ops::index::Index<I> for [T]>::index(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/core/src/slice/index.rs:18)
        at arrow_data::transform::variable_size::extend_offset_values(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/variable_size.rs:38)
        at arrow_data::transform::variable_size::build_extend::{{closure}}(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/variable_size.rs:57)
        at <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call(/rustc/25ef9e3d85d934b27d9dada2f9dd52b1dc63bb04/library/alloc/src/boxed.rs:2029)
        at arrow_data::transform::MutableArrayData::extend(/home/andy/.cargo/registry/src/index.crates.io-6f17d22bba15001f/arrow-data-52.0.0/src/transform/mod.rs:649)
        at comet::execution::operators::copy_array(/home/andy/git/apache/datafusion-comet/core/src/execution/operators/mod.rs:70)

The error happens in extend_offset_values on this line:

let new_values = &values[start_values..end_values];

@andygrove
Copy link
Member Author

Here is some more debug info, showing the size of the buffers in the array data and the value in the first element of each buffer:

copy_array(typeInt64, len=8192) after mutable.extend(0, 0, 8192) data len = 8192
copy_or_cast_array() len=8192, type=Utf8
copy_array num buffers=2
BUF[0] len=32772 cap=32772
buf[0][0] = 0
BUF[1] len=147456 cap=147456
buf[1][0] = 67
getNextBatch id=2284, plan=126232496596768 failed during native execution: range end index 294912 out of range for slice of length 147456

@viirya
Copy link
Member

viirya commented Jun 9, 2024

Ok. I tried TPC-H @ 100gb and TPC-H @ 1gb. Only TPC-H @ 100gb can reproduce it. I will look into this.

@viirya viirya self-assigned this Jun 9, 2024
@viirya
Copy link
Member

viirya commented Jun 11, 2024

I got some hint (I know which operator causes it) after debugging it for a while, though I don't get the root cause yet.

@viirya
Copy link
Member

viirya commented Jun 13, 2024

I've just figured out where the root cause is. I will go to propose a fix to arrow-rs.

EDIT: actually the issue is happened in Java Arrow instead of arrow-rs.

@viirya
Copy link
Member

viirya commented Jun 15, 2024

I opened an issue at Java Arrow repo and described the root cause: apache/arrow#42156. Fixing it there might wait for a longer release period. I'm thinking to copy BufferImportTypeVisitor into Comet and fix in Comet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants