
[Umbrella] Collect observation metrics at all stages of Spark SQL. #6021

@wForget

Description


Code of Conduct

Search before asking

  • I have searched in the issues and found no similar issues.

Describe the proposal

When upgrading the Spark version or introducing new optimizations, we usually double-run SQL in multiple environments to ensure data consistency, then compare the results to detect inconsistent behavior. But when the results are inconsistent, it is difficult to quickly find the stage where the inconsistency first occurred.

Spark provides Observation, which allows inserting observers into DataFrames to define and collect observation metrics.
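For context, a minimal sketch of how the Observation API is used (Observation is available since Spark 3.3; the DataFrame contents here are made up for illustration):

```scala
import org.apache.spark.sql.{Observation, SparkSession}
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")

// Attach an observer; its metrics are computed as a side effect of the next action.
val observation = Observation("my_metrics")
val observed = df.observe(observation, count(lit(1)).as("rows"), max($"id").as("max_id"))

observed.collect()            // run an action so the metrics are populated
val metrics = observation.get // blocks until the action finishes, returns a Map
```

Because the metrics are computed alongside the action that produces the result, observing adds no extra pass over the data.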

I want to insert CRC checksum metric observers at all stages of SQL execution so that the stage where an inconsistency first appears can be found quickly.

I made a simple implementation, along these lines:

(screenshot of the prototype implementation)
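The screenshot of the prototype is not legible in this extract; the idea might be sketched roughly like this (observeChecksum is an illustrative name, not the author's actual code):

```scala
import org.apache.spark.sql.{DataFrame, Observation}
import org.apache.spark.sql.functions._

// Hypothetical helper: attach a CRC32 checksum observer to a DataFrame
// representing one stage of the query. Summing the per-row crc32 of the
// JSON-serialized row yields an order-insensitive fingerprint that can be
// compared across environments.
def observeChecksum(df: DataFrame, stage: String): (DataFrame, Observation) = {
  val obs = Observation(s"checksum_$stage")
  val rowChecksum = crc32(to_json(struct(df.columns.map(col): _*)))
  (df.observe(obs, sum(rowChecksum).as("checksum"), count(lit(1)).as("rows")), obs)
}
```

With each intermediate DataFrame wrapped this way, running the same query in both environments and comparing the observed checksum maps stage by stage should point directly at the first stage whose output diverges.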

Task list

Are you willing to submit PR?

  • Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
  • No. I cannot submit a PR at this time.
