IndexBuilder step running in single thread for ASCII variable length files #543

@saikumare-a

Description

Background

Platform: Azure Databricks, with the input file on Azure Storage
Cluster: 1 driver (16 cores) + 3 workers (16 cores each)
Spark API: PySpark

Cobrix: spark-cobol_2.12

Code:
df = (
    spark.read.format("cobol")
    .option("copybook", copybook_path)
    .option("encoding", "ascii")
    .option("record_format", "D2")
    .load(src_file_path)
)

df.write.parquet("output_path")
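
As a sanity check (standard Spark API; my assumption is that each sparse-index entry becomes one input partition of the loaded DataFrame):

# A low number here means little read parallelism; verify in the Spark UI.
print(df.rdd.getNumPartitions())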

Scenario:
We have a 300+ GB variable-width, variable-length, multi-segment file with more than 2,000 columns and 13+ million records. The input is a text file (records are separated by CRLF/LF).

I am trying to write it out as a Parquet file.

  1. The indexBuilder stage runs in a single partition (1 core of a single worker node) and takes more than 2 hours (index granularity options are sketched after this list).
  2. After the index is built, writing the Parquet file completes in about 30 minutes using multiple partitions and all worker cores/threads.
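
For reference, here is the same read with the index granularity option documented in the Cobrix README (a sketch; the value is illustrative). As far as I can tell, this controls how many splits the index produces for the read stage, not how the index itself is built:

df = (
    spark.read.format("cobol")
    .option("copybook", copybook_path)
    .option("encoding", "ascii")
    .option("record_format", "D2")
    # Documented Cobrix option for index split granularity; 256 MB is illustrative.
    .option("input_split_size_mb", "256")
    .load(src_file_path)
)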

I am able to see the data correctly in the Parquet file.

Question

How do I parallelize the job across executors for the indexBuilder step as well?
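
One experiment I may try in the meantime (a sketch; not a confirmed fix): the README also documents record_format "D" for ASCII text files, which uses a different record extractor than "D2". Whether it avoids the single-threaded index build would need to be verified:

df = (
    spark.read.format("cobol")
    .option("copybook", copybook_path)
    .option("encoding", "ascii")
    # "D" is the other documented ASCII text record format; switching to it
    # is an assumption to test, not a confirmed workaround.
    .option("record_format", "D")
    .load(src_file_path)
)

df.write.parquet("output_path")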

Metadata

Labels: accepted (Accepted for implementation), bug (Something isn't working)
