
Duplicated project schema will cause index out of bounds exception in orc_exec #722

@harveyyue

Description

Describe the bug
Selecting a single column from an ORC table through the native OrcExec panics with an Arrow schema error: project index 2 out of bounds, max field 1. In the plan below, projection=Some([2]) (the index of l in the full table schema) is paired with a file schema that has already been pruned to the single column l, so applying the projection again goes out of bounds.

Table:
CREATE TABLE test_orc(
id BIGINT COMMENT 'pk',
m MAP<STRING,STRING> COMMENT 'test read map type',
l ARRAY<STRING> COMMENT 'test read list type',
s STRING COMMENT 'string type'
) using orc

SQL statement:
select l from test_orc

Executing this SQL produces the exception below:
```
24/12/26 15:08:13 INFO BlazeCallNativeWrapper: Start executing native plan
(+398.133s) [INFO] (stage: 5, partition: 0) - start executing plan:
ProjectExec [cast(#2@0 AS Utf8) AS #65], schema=[#65:Utf8;N]
RenameColumnsExec: ["#2"], schema=[#2:List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} });N]
OrcExec: file_group=[PartitionedFile { object_meta: ObjectMeta { location: Path { raw: "ZmlsZTovLy9Vc2Vycy9zaDAwNzA0bWwvRG93bmxvYWRzL2Nvcy9wYXJ0LTAwMDAwLTFiNzE4YzI4LWFlYjgtNDM2My04NjFkLTg1YmUwNTlkYTM1MC1jMDAwLnNuYXBweS5vcmM" }, last_modified: 1970-01-01T00:00:00Z, size: 804, e_tag: None, version: None }, partition_values: [], range: Some(FileRange { start: 0, end: 804 }), statistics: None, extensions: None }], limit=None, projection=Some([2]), schema=[l:List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} });N]

thread 'blaze-native-stage-5-part-0' panicked at native-engine/datafusion-ext-plans/src/common/execution_context.rs:285:21:
output_with_sender[OrcScan]: output() returns error: Arrow error: Schema error: project index 2 out of bounds, max field 1
thread 'blaze-native-stage-5-part-0' panicked at native-engine/datafusion-ext-plans/src/common/execution_context.rs:308:21:
output_with_sender[OrcScan] error: Execution error: output_with_sender[OrcScan]: output() returns error: Arrow error: Schema error: project index 2 out of bounds, max field 1
thread 'blaze-native-stage-5-part-0' panicked at native-engine/datafusion-ext-plans/src/common/execution_context.rs:308:21:
output_with_sender[Project] error: Execution error: output_with_sender[OrcScan] error: Execution error: output_with_sender[OrcScan]: output() returns error: Arrow error: Schema error: project index 2 out of bounds, max field 1
(+398.215s) [ERROR] (stage: 5, partition: 0) - native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[OrcScan] error: Execution error: output_with_sender[OrcScan]: output() returns error: Arrow error: Schema error: project index 2 out of bounds, max field 1
(+398.215s) [INFO] (stage: 5, partition: 0) - task exited abnormally.
(+398.218s) [INFO] (stage: 0, partition: 0) - (partition=0) native execution finalizing
(+398.227s) [INFO] (stage: 0, partition: 0) - (partition=0) native execution finalized
24/12/26 15:08:13 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[OrcScan] error: Execution error: output_with_sender[OrcScan]: output() returns error: Arrow error: Schema error: project index 2 out of bounds, max field 1
at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
24/12/26 15:08:13 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5) (192.168.132.23 executor driver): java.lang.RuntimeException: poll record batch error: Execution error: native execution panics: Execution error: Execution error: output_with_sender[Project] error: Execution error: output_with_sender[OrcScan] error: Execution error: output_with_sender[OrcScan]: output() returns error: Arrow error: Schema error: project index 2 out of bounds, max field 1
at org.apache.spark.sql.blaze.JniBridge.nextBatch(Native Method)
at org.apache.spark.sql.blaze.BlazeCallNativeWrapper$$anon$1.hasNext(BlazeCallNativeWrapper.scala:80)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
```
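
For context on what the Arrow error itself means, here is a minimal standalone sketch (not Blaze code), assuming the panic ultimately comes from arrow's `Schema::project`. It mirrors what the plan above suggests: a projection index that is valid against the full four-column table schema is applied again to a schema that has already been pruned to the single projected column (`projection=Some([2])` next to a one-field `schema=[l:...]`). Column types are simplified placeholders; only the indices matter.

```rust
use arrow_schema::{ArrowError, DataType, Field, Schema};

fn main() -> Result<(), ArrowError> {
    // Full table schema of test_orc (types simplified; only the indices
    // matter for the error). Column index 2 is the projected column `l`.
    let table_schema = Schema::new(vec![
        Field::new("id", DataType::Int64, true),
        Field::new("m", DataType::Utf8, true), // stand-in for MAP<STRING,STRING>
        Field::new("l", DataType::Utf8, true), // stand-in for ARRAY<STRING>
        Field::new("s", DataType::Utf8, true),
    ]);

    // Applying projection [2] to the full schema is fine: it yields a
    // one-field schema containing only `l`.
    let pruned = table_schema.project(&[2])?;
    assert_eq!(pruned.fields().len(), 1);

    // Applying the same table-relative index [2] a second time, now against
    // the already-pruned one-field schema, fails with exactly the message
    // seen in the log: "project index 2 out of bounds, max field 1".
    let err = pruned.project(&[2]).unwrap_err();
    println!("{err}");
    Ok(())
}
```

If that reading is right, the duplicated projection would need to be applied only once, or remapped to indices relative to the already-pruned file schema, inside orc_exec.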

To Reproduce
Steps to reproduce the behavior:

  1. Create the test_orc table above (stored as ORC) and write at least one row to it.
  2. Run select l from test_orc with Blaze native execution enabled, so the scan is executed by the native OrcExec.
  3. The native task panics with the Arrow schema error shown above and the Spark task fails.

Expected behavior
The query should return the values of the l column without any native execution error.

