
Fixed length Cobol EBCDIC file with header, body and trailer records - How to skip header and trailer #556

@ManojKolisetty-git


Background [Optional]

We have a fixed-length EBCDIC file whose structure is a header record, followed by body records, followed by a trailer record. We use the Cobrix library (za.co.absa.cobrix:spark-cobol_2.12:2.6.0) in PySpark to read the file and create a DataFrame.

Currently, to process the header, body, and trailer, we create three different DataFrames (each with its own copybook).
As a result, we end up reading the file multiple times. When fetching body records, we also have to skip the first and last records (header and trailer) by filtering on the Record_id sequence, which means sorting the DataFrame to get the first and last IDs. This approach takes a long time because the data has to be shuffled before the filter.

Question

While trying to optimize this, I found the file read options below, but observed that file_end_offset skips the last record of each partition instead of a single record at the end of the file. For instance, when my file is processed in 100 partitions, the last record of every partition is removed, 100 records in total.
.option("file_start_offset", 11000)
.option("file_end_offset", 11000)
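
To make the difference concrete, here is a minimal plain-Python simulation of the behaviour described above. It is not Cobrix code and does not reflect how Cobrix splits files internally; the record labels, the partition count of 4, and the helper names are all hypothetical, and it assumes the start offset is applied only once (the post reports the problem only for file_end_offset).

```python
def split_into_partitions(records, num_partitions):
    """Split records into contiguous chunks, similar to splitting a file by byte range."""
    size = -(-len(records) // num_partitions)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def trim_once_per_file(records):
    """Expected behaviour: drop only the header (first) and trailer (last) record."""
    return records[1:-1]


def trim_end_per_partition(records, num_partitions):
    """Observed behaviour: the end offset drops the last record of every partition."""
    parts = split_into_partitions(records, num_partitions)
    parts[0] = parts[0][1:]  # assumption: the header is skipped once, at the file start
    return [rec for part in parts for rec in part[:-1]]


# Hypothetical file: 1 header, 10 body records, 1 trailer.
records = ["HDR"] + [f"BODY{i}" for i in range(10)] + ["TRL"]

expected = trim_once_per_file(records)
observed = trim_end_per_partition(records, num_partitions=4)
lost = [rec for rec in expected if rec not in observed]

print(len(expected), len(observed), lost)
```

With 4 partitions, the trailer is removed as intended, but the last body record of each of the other 3 partitions is lost as well. Scaled to 100 partitions, that would be 99 body records missing from the result.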

Kindly suggest an efficient way to skip the header and trailer, or to fetch only the header/trailer.
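
For the "fetch only the header/trailer" part, one possible workaround outside Spark is to read just the first and last bytes of the file directly. The sketch below assumes the file is reachable through a local or mounted filesystem path, that the header and trailer have known fixed lengths (11000 bytes each would match the offsets above, which is itself an assumption), and that the fields of interest are plain text decodable with Python's `cp037` EBCDIC codec. Packed or binary fields would still need the copybook. The function name and the tiny 16-byte demo records are hypothetical.

```python
import os
import tempfile


def read_header_and_trailer(path, header_len, trailer_len):
    """Return the raw header and trailer bytes without scanning the body."""
    size = os.path.getsize(path)
    if size < header_len + trailer_len:
        raise ValueError("file is smaller than header + trailer")
    with open(path, "rb") as f:
        header = f.read(header_len)      # first header_len bytes
        f.seek(size - trailer_len)       # jump straight to the trailer
        trailer = f.read(trailer_len)
    return header, trailer


# Demo with a small hypothetical file of 16-byte EBCDIC records.
REC_LEN = 16


def make_record(text):
    """Pad text to the record length and encode it as EBCDIC (code page 037)."""
    return text.ljust(REC_LEN).encode("cp037")


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(make_record("HDR20240101"))
    for i in range(3):
        tmp.write(make_record(f"BODY{i}"))
    tmp.write(make_record("TRL0000003"))
    demo_path = tmp.name

header, trailer = read_header_and_trailer(demo_path, REC_LEN, REC_LEN)
header_text = header.decode("cp037").rstrip()
trailer_text = trailer.decode("cp037").rstrip()
os.remove(demo_path)

print(header_text, trailer_text)
```

This only touches two small byte ranges, so it avoids a full read and any shuffle, but it does not solve skipping the trailer when reading the body through Cobrix.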

Many Thanks
Manoj

Labels: accepted (Accepted for implementation), bug (Something isn't working)
