Labels
accepted (Accepted for implementation), bug (Something isn't working)
Description
Background [Optional]
Platform: Azure Databricks, with the input file on Azure Storage
Cluster: 1 driver (16 cores) + 3 workers (16 cores each)
Spark API: PySpark
Cobrix: spark-cobol_2.12
Code:
df = spark.read.format("cobol") \
    .option("copybook", copybook_path) \
    .option("encoding", "ascii") \
    .option("record_format", "D2") \
    .load(src_file_path)
df.write.parquet("output_path")
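For reference, a minimal sketch of the same read with the sparse-index granularity options documented in the Cobrix README (input_split_records and input_split_size_mb are assumed to be supported by the installed spark-cobol version; they control roughly how many records or megabytes each index split covers, i.e. the partitioning used once the index exists, and may not by themselves parallelize the index build reported below):

df = (
    spark.read.format("cobol")
    .option("copybook", copybook_path)
    .option("encoding", "ascii")
    .option("record_format", "D2")
    # Sparse-index granularity options from the Cobrix README (assumed to be
    # available in the installed spark-cobol version); typically only one of
    # the two is set.
    .option("input_split_records", 50000)   # target records per index split
    .option("input_split_size_mb", 100)     # target MB per index split
    .load(src_file_path)
)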
Scenario:
We have a 300+ GB variable-width, variable-length, multi-segment file with more than 2000 columns. It contains 13+ million records and is a text file (CRLF/LF line endings delimit records). I am trying to write it out as Parquet.
1) The indexBuilder stage runs in a single partition (1 core of a single worker node) and takes more than 2 hours.
2) After the index is built, writing the Parquet file completes in about 30 minutes using multiple partitions and all worker cores/threads (the partition count can be checked as sketched below).
The data in the resulting Parquet file is correct.
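For reference, the partitioning the Cobrix source exposes can be inspected with a standard Spark API (nothing Cobrix-specific is assumed here):

# Number of partitions the Cobrix source exposes; for variable-length records
# computing this may itself trigger the sparse index build.
print(df.rdd.getNumPartitions())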
Question
How do I parallelize the job across executors for the indexBuilder step as well?