Description
Background [Optional]
We have a fixed-length EBCDIC file whose structure is a header record, followed by body records, and a trailer record. We use the Cobrix library (za.co.absa.cobrix:spark-cobol_2.12:2.6.0) in PySpark to read the file and create a DataFrame.
Currently, to process the header, body, and trailer, we create three different DataFrames (each with its own copybook).
By doing this we end up reading the file multiple times, and when fetching the body records we have to skip the first and last records (header and trailer) using the Record_id sequence, sorting the DataFrame to find the first and last ids and filter them out. This approach takes a long time because the data needs to be shuffled before the filter. A sketch of the current approach is shown below.
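For illustration, a minimal sketch of the current three-DataFrame approach (the copybook and input paths are placeholders; the Record_Id column comes from Cobrix's generate_record_id option):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

DATA_PATH = "/data/input/fixed_length.dat"  # placeholder input path

def read_with_copybook(copybook_path):
    # Each call scans the same EBCDIC file again.
    return (spark.read.format("cobol")
            .option("copybook", copybook_path)     # one copybook per layout
            .option("generate_record_id", "true")  # adds File_Id, Record_Id
            .load(DATA_PATH))

header_df  = read_with_copybook("/copybooks/header.cpy")
body_df    = read_with_copybook("/copybooks/body.cpy")
trailer_df = read_with_copybook("/copybooks/trailer.cpy")

# Dropping the header/trailer rows from the body requires knowing the
# first and last Record_Id, which forces a full pass over the data and
# a shuffle before the filter can run.
bounds = body_df.agg(F.min("Record_Id").alias("first_id"),
                     F.max("Record_Id").alias("last_id")).first()
body_only = body_df.filter(
    (F.col("Record_Id") != bounds["first_id"]) &
    (F.col("Record_Id") != bounds["last_id"]))
```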
Question
While trying to optimize this, I found the file read options below, but observed that file_end_offset skips the last record from each partition instead of a single record from the end of the file. For instance, my file is processed in 100 partitions, so the last record of each of those 100 partitions gets removed.
.option("file_start_offset", 11000)
.option("file_end_offset", 11000)
Kindly suggest a way/approach to efficiently skip the header and trailer, or to fetch the header/trailer alone.
Many Thanks
Manoj