
[Bug] When Spark writes data to the Paimon table, data is lost due to some task retries #4831

Closed
@xyk0930

Description

Search before asking

  • I searched in the issues and found nothing similar.

Paimon version

0.9

Compute Engine

Spark 3.5.1

Minimal reproduce step

  1. Save data to the Paimon table (a minimal standalone sketch follows this list):
    dataset.write().mode(mode).format("paimon").save(path);
  2. When the job reaches the stage that runs collect at PaimonSparkWriter.scala:195, some executor nodes are lost and the stage's tasks are retried (see attached screenshots).
  3. The total number of rows written across the two attempts does not match the count returned by the final query (see attached screenshots):
    9314203 + 6211188 = 15525391 rows were written in total,
    but querying the Paimon table returns only 15476552 rows.
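
For reference, a minimal self-contained sketch of the reproduction. The warehouse location, table path, and synthetic dataset below are hypothetical stand-ins, and the catalog settings follow the Paimon Spark documentation; adjust them for your environment:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class PaimonWriteRepro {
        public static void main(String[] args) {
            // Hypothetical locations; adjust to your environment.
            String warehouse = "hdfs:///tmp/paimon-warehouse";
            String path = warehouse + "/default.db/repro_table";

            SparkSession spark = SparkSession.builder()
                    .appName("paimon-write-repro")
                    // Paimon Spark catalog, as documented for Paimon 0.9.
                    .config("spark.sql.catalog.paimon", "org.apache.paimon.spark.SparkCatalog")
                    .config("spark.sql.catalog.paimon.warehouse", warehouse)
                    .getOrCreate();

            // Any large enough dataset will do; a synthetic range is used here for illustration.
            Dataset<Row> dataset = spark.range(15_000_000L).toDF("id");

            // Step 1 from the report: write to the Paimon table by path.
            dataset.write().mode(SaveMode.Overwrite).format("paimon").save(path);

            // Read the table back and compare the row count with what was written.
            long written = dataset.count();
            long queried = spark.read().format("paimon").load(path).count();
            System.out.println("written=" + written + ", queried=" + queried);

            spark.stop();
        }
    }

If the two counts printed at the end differ, the write lost data, which is what step 3 shows.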

What doesn't meet your expectations?

When I increased the executor memory, no task was retried and 15,525,244 rows ended up being written. My guess is that a retried task overwrites the files written by the first attempt, though there may be another cause.
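
One way to check that guess, sketched below under the assumption of the default layout for an unpartitioned table (data files under a bucket-0 directory at the hypothetical path used earlier), is to list the data files and their sizes after the job finishes; if a retried task reuses a file name from the first attempt, the earlier output would be silently replaced rather than added to:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListPaimonDataFiles {
        public static void main(String[] args) throws Exception {
            // Hypothetical table location; layout assumes an unpartitioned table with a single bucket.
            Path bucketDir = new Path("hdfs:///tmp/paimon-warehouse/default.db/repro_table/bucket-0");

            FileSystem fs = bucketDir.getFileSystem(new Configuration());

            // Print each data file with its size so two runs (with and without retries) can be compared.
            FileStatus[] files = fs.listStatus(bucketDir);
            Arrays.stream(files).forEach(f ->
                    System.out.println(f.getPath().getName() + "\t" + f.getLen()));
        }
    }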

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
