

@westhide
Contributor

@westhide westhide commented Mar 19, 2025

Which issue does this PR close?

Rationale for this change

Fix datafusion-ballista: Unsupported NdJsonExec plan and extension codec Exception

What changes are included in this PR?

Support serde for JsonSource PhysicalPlan

Are these changes tested?

Unit Test pass✅.

Ballista integration Test✅

   Compiling ballista v45.0.0 (/home/westhide/Code/apache/datafusion-ballista/ballista/client)
warning: `ballista-scheduler` (lib) generated 12 warnings
   Compiling ballista-examples v45.0.0 (/home/westhide/Code/apache/datafusion-ballista/examples)
    Finished `release` profile [optimized] target(s) in 26.98s
     Running `/home/westhide/Code/apache/datafusion-ballista/target/release/examples/remote-sql`
+-----+------------------+---------------+------+
| a   | b                | c             | d    |
+-----+------------------+---------------+------+
| 1   | [2.0, 1.3, -6.1] | [false, true] | 4    |
| -10 | [2.0, 1.3, -6.1] | [true, true]  | 4    |
| 2   | [2.0, , -6.1]    | [false, ]     | text |
|     |                  |               |      |
+-----+------------------+---------------+------+

Are there any user-facing changes?

No

@github-actions github-actions bot added the proto Related to proto crate label Mar 19, 2025
@westhide
Contributor Author

take

Contributor

@alamb alamb left a comment


Thank you @westhide

Once CI passes I think this PR looks good to me, but I do think we should consider serializing the other field too

.with_file_compression_type(FileCompressionType::UNCOMPRESSED);
Ok(conf.build())
}
PhysicalPlanType::JsonScan(scan) => {
Contributor


It seems like there is one more relevant field for JsonSource, batch_size

/// JsonSource holds the extra configuration that is necessary for [`JsonOpener`]
#[derive(Clone, Default)]
pub struct JsonSource {
batch_size: Option<usize>,
metrics: ExecutionPlanMetricsSet,
projected_statistics: Option<Statistics>,
}

Perhaps we can add that field to the serialization as well (or file a ticket to add it?)
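As a rough illustration of what carrying `batch_size` through serialization could look like, here is a minimal sketch. `JsonSourceSketch`, `JsonScanMessageSketch`, `encode`, and `decode` are hypothetical stand-ins invented for this example; they are not the DataFusion or datafusion-proto types, and the real message layout may differ. The sketch assumes the wire format stores the value as an optional `u64`, so that "not set" survives the round trip.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in for the source struct quoted above; it keeps only
/// the optional `batch_size` field discussed in this thread.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct JsonSourceSketch {
    pub batch_size: Option<usize>,
}

/// Hypothetical wire message (assumption: the value is stored as an
/// optional u64 because `usize` has no fixed width on the wire).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct JsonScanMessageSketch {
    pub batch_size: Option<u64>,
}

/// Copy the optional batch size into the wire message.
pub fn encode(source: &JsonSourceSketch) -> JsonScanMessageSketch {
    JsonScanMessageSketch {
        batch_size: source.batch_size.map(|n| n as u64),
    }
}

/// Rebuild the source from the wire message, rejecting values that do not
/// fit in `usize` on the decoding machine.
pub fn decode(msg: &JsonScanMessageSketch) -> Result<JsonSourceSketch, String> {
    let batch_size = msg
        .batch_size
        .map(|n| usize::try_from(n).map_err(|e| format!("batch_size out of range: {}", e)))
        .transpose()?;
    Ok(JsonSourceSketch { batch_size })
}
```

Keeping the field optional on the wire means a plan that never set `batch_size` decodes to `None` rather than to a made-up default.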

Contributor Author

@westhide westhide Mar 19, 2025


Currently, batch_size is set from context.session_config().batch_size() when the DataSource calls open(..). Serializing batch_size would make the JsonSource use the client config instead of the Executor's session_config in Ballista. Should we change this behavior for all file formats (csv, parquet, avro, ...)?

.with_batch_size(context.session_config().batch_size())
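To make the behavior described here concrete, the following is a small sketch of the open-time override, modeled on the quoted `.with_batch_size(context.session_config().batch_size())` line. `SessionConfigSketch`, `FileSourceSketch`, and `open` are hypothetical names for illustration only; the real DataFusion types and the real `open(..)` signature are different.

```rust
/// Hypothetical stand-in for the session configuration held by whichever
/// process opens the file.
#[derive(Clone, Debug, PartialEq)]
pub struct SessionConfigSketch {
    pub batch_size: usize,
}

/// Hypothetical stand-in for a file source that may carry a batch size
/// from the (de)serialized plan.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct FileSourceSketch {
    pub batch_size: Option<usize>,
}

impl FileSourceSketch {
    /// Builder-style setter, mirroring the shape of the quoted call.
    pub fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }
}

/// Models the open step: the session's batch size overwrites whatever
/// value the source carried in from the plan.
pub fn open(source: FileSourceSketch, session: &SessionConfigSketch) -> FileSourceSketch {
    source.with_batch_size(session.batch_size)
}
```

Under this model, a value carried in the plan matters only if the open step stops overwriting it, which is why the question applies to every file format and not only JSON.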

Contributor


Client config should be propagated from the client to the executor (task context), so it should be the same.

Contributor Author


Got it, I read the code about TaskDefinition in ballista. Thx~

let ctx = SessionContext::new();
ctx.register_json("t1", "../core/tests/data/1.json", Default::default())
.await?;
let plan = ctx.table("t1").await?.create_physical_plan().await?;
Contributor


👍

@alamb alamb requested a review from milenkovicm March 19, 2025 19:17
@milenkovicm
Contributor

Thanks @westhide, PR makes sense to me.

I left one comment in the original ballista issue, apache/datafusion-ballista#1209 (comment)

If you have time, please add a test to the ballista unsupported features; it will help us track this issue until we get to datafusion 47. You can use the same one you used in the issue. Otherwise we can add it as a follow-up.

Contributor

@milenkovicm milenkovicm left a comment


I agree with @alamb, adding batch_size should be straightforward

@alamb alamb merged commit 722ccb9 into apache:main Mar 20, 2025
32 of 33 checks passed
@alamb
Contributor

alamb commented Mar 20, 2025

Thanks @westhide

For anyone following along:


Labels

proto Related to proto crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unsupported NdJsonExec plan and extension codec

3 participants