
variable length record optimization  #521

@sree018

Description

Background [Optional]

We have a 76 GB variable-length (BDW+RDW) multi-segment file (47 segments) containing 470,000,000 records with 700 columns. I am trying to convert it to Parquet, but only a single index entry (a single partition) is created. The file itself parses fine, and the data looks correct with df.show().
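
For reference, the read looks roughly like this (a sketch only, assuming an existing SparkSession `spark`; the paths and the segment-field name are placeholders, not the real job's values):

```scala
// Minimal sketch of the read described above.
val df = spark.read
  .format("cobol")                              // Cobrix data source
  .option("copybook", "/path/to/copybook.cpy")  // copybook with the 700 columns
  .option("record_format", "VB")                // variable-length records, BDW + RDW
  .option("segment_field", "SEG_ID")            // hypothetical segment id field
  .load("/path/to/data.dat")

df.show()  // parsing works; the problem is the single partition
```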

Question

How do I parallelize the job across executors?

df.write uses only a single thread (one task) while writing the Parquet file.
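
Concretely (the output path is a placeholder), the write below runs as one task. Repartitioning first would spread the Parquet write across executors, but the file would still be read by a single upstream task, so it only masks the problem:

```scala
// Current write: one task, because df has a single partition.
df.write.mode("overwrite").parquet("/path/to/output")

// Workaround sketch: the shuffle parallelizes the write stage only;
// the read still scans the whole file in one task. 200 is an arbitrary example.
df.repartition(200)
  .write
  .mode("overwrite")
  .parquet("/path/to/output")
```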

Options used (passed as shown in the sketch below):

  1. `input_split_records`
  2. `input_split_size_mb`
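
This is roughly how they were passed (example values only; the option names are as I understand them from the Cobrix README, so please verify them against the Cobrix version in use):

```scala
// Sketch of the read with the split options applied.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("record_format", "VB")
  .option("input_split_records", "100000")  // target number of records per split
  // .option("input_split_size_mb", "100")  // alternative: target split size in MB
  .load("/path/to/data.dat")
```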
