Add an ability to provide a custom record extractor #338

Closed · yruslan opened this issue Nov 24, 2020 · 2 comments
Labels: enhancement (New feature or request)

yruslan (Collaborator) commented Nov 24, 2020

Background

Custom record header parsers have several limitations:

  1. They can only be used to parse records that have record headers at the beginning of the record.
  2. They don't account for record footers.
  3. They cannot be used to parse files having record separators instead of headers (e.g. text files where records are separated by LF).
  4. They cannot be used to parse files for which the record length depends on the record number (see "What is the current state of BDW+RDW?" #336 for files having RDW+BDW).

A new way of handling variable-length records has been introduced to parse ASCII files with variable-length lines and binary files with variable-length OCCURS that don't have RDWs.

The interface of a raw record extractor is very simple: it is an iterator over the byte arrays of the records in a file.

trait RawRecordExtractor extends Iterator[Array[Byte]]

The array of bytes for each record is parsed according to the copybook. The record extractor has the freedom to extract records however it wants.
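To illustrate, a minimal extractor for a hypothetical format where each record starts with a 2-byte big-endian length field could look like the sketch below. The SimpleStream methods used here (next, isEndOfStream) are assumptions made for the example, not a confirmed Cobrix API:

class LengthPrefixedRecordExtractor(data: SimpleStream) extends RawRecordExtractor {
  override def hasNext: Boolean = !data.isEndOfStream

  override def next(): Array[Byte] = {
    // Each record is assumed to start with a 2-byte big-endian length field (hypothetical layout)
    val header = data.next(2)
    val recordLength = ((header(0) & 0xFF) << 8) | (header(1) & 0xFF)
    // Return the raw record bytes; Cobrix parses them according to the copybook
    data.next(recordLength)
  }
}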

Feature

Raw record extractors can be adapted to make record extraction customizable, which fixes limitations 1-4 above.

For example, a custom raw record extractor can look like this:

class CustomRawRecordExtractor(startingRecordIndex: Int, data: SimpleStream, copybook: Copybook) extends RawRecordExtractor

The starting record index is needed because, after a sparse index is generated, the input file will be read in parallel starting from different offsets.
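As an illustration of why the starting record index matters, here is a sketch of a hypothetical extractor where the record length depends on the record number (limitation 4 above). The alternating 100/50-byte layout and the SimpleStream calls are assumptions made for the example:

class AlternatingLengthRecordExtractor(startingRecordIndex: Int,
                                       data: SimpleStream,
                                       copybook: Copybook) extends RawRecordExtractor {
  // Hypothetical layout: even-numbered records are 100 bytes, odd-numbered records are 50 bytes
  private var recordIndex = startingRecordIndex

  override def hasNext: Boolean = !data.isEndOfStream

  override def next(): Array[Byte] = {
    val recordLength = if (recordIndex % 2 == 0) 100 else 50
    recordIndex += 1
    // The starting record index passed to the constructor keeps the even/odd logic
    // correct when the file is read in parallel from a sparse index offset
    data.next(recordLength)
  }
}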

@yruslan yruslan added the enhancement New feature or request label Nov 24, 2020
@yruslan yruslan self-assigned this Nov 24, 2020
yruslan added a commit that referenced this issue Nov 25, 2020
* Unify the record parser interface and constructor signature.
yruslan added a commit that referenced this issue Dec 3, 2020
* Unify the record parser interface and constructor signature.
mark-weghorst commented

Thanks @yruslan for looking into this for me. It looks like you are fairly far along on this already, but I wanted to point you to a resource that I think encapsulates the use cases I believe Cobrix should cover:

https://www.ibm.com/support/knowledgecenter/zosbasics/com.ibm.zos.zconcepts/zconcepts_159.htm

  • Fixed
  • Fixed Blocked
  • Variable
  • Variable Blocked

For my own work I am unconcerned with the undefined format, but those 4 would cover all of my use cases.

Currently the fixed and variable options are working correctly, so I think it's just a matter of adding support for the blocked versions. If someone needs the undefined format, they could create a custom extractor for their use case.

yruslan (Collaborator, Author) commented Dec 4, 2020

Thanks, @mark-weghorst, this is very helpful. It looks like all 4 types are perfectly doable with record extractors. It would still be nice to have an example of blocked files so we are sure we are not missing anything; RDWs can be big-endian or little-endian, for instance.
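To make the blocked case concrete, a sketch of how a variable blocked (BDW + RDW) file could be handled as a raw record extractor is shown below. The descriptor layout (a 2-byte length that includes the 4-byte descriptor itself, followed by 2 reserved bytes) and the SimpleStream methods are assumptions for this example rather than a verified implementation:

class VariableBlockedRecordExtractor(data: SimpleStream,
                                     isBdwBigEndian: Boolean,
                                     isRdwBigEndian: Boolean) extends RawRecordExtractor {
  // Number of payload bytes still remaining in the current block
  private var bytesLeftInBlock = 0

  // Assumed descriptor layout: 2-byte length (including the 4-byte descriptor) + 2 reserved bytes
  private def readLength(bytes: Array[Byte], bigEndian: Boolean): Int =
    if (bigEndian) ((bytes(0) & 0xFF) << 8) | (bytes(1) & 0xFF)
    else ((bytes(1) & 0xFF) << 8) | (bytes(0) & 0xFF)

  override def hasNext: Boolean = !data.isEndOfStream

  override def next(): Array[Byte] = {
    if (bytesLeftInBlock <= 0) {
      // Read the block descriptor word (BDW) at the start of each block
      val bdw = data.next(4)
      bytesLeftInBlock = readLength(bdw, isBdwBigEndian) - 4
    }
    // Read the record descriptor word (RDW) preceding each record
    val rdw = data.next(4)
    val recordLength = readLength(rdw, isRdwBigEndian) - 4
    bytesLeftInBlock -= recordLength + 4
    // Return the record payload without its RDW; Cobrix parses it using the copybook
    data.next(recordLength)
  }
}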
