
Spark DataFrame Writer for Cobol datafiles #415

Open
mark-weghorst opened this issue Aug 23, 2021 · 10 comments

Labels
enhancement New feature or request

@mark-weghorst

Background

I work for a credit card company in the retail sector, and we are currently using Cobrix to acquire data from our credit card transaction processor and produce business events to Kafka for our event-driven architecture and analytics platform.

Thanks to @yruslan and his work on #338, Cobrix is now fully functional for our data ingest use case; however, our electronic data interchange with this business partner is bidirectional.

For example, we receive mainframe data transmissions for things like customer purchases and account status, but we also have to transmit monetary data to our mainframe-based partner for things like credits and adjustments, and non-monetary data for account configuration changes, including but not limited to change of address.

Additionally, we believe such a feature could simplify the process of creating test data for our system.

Feature

Implement a Spark DataFrame writer for Cobol data. The feature should:

  • Derive a default copybook layout from the Spark schema
  • Support configurable endianness
  • Support configurable code page output
  • Support writing Cobol output data files in the F, FB, V, and VB record formats from https://www.ibm.com/docs/en/zos-basic-skills?topic=set-data-record-formats
  • Support writing a copybook file that matches the output schema as written
  • Provide a declarative configuration option to override individual DataFrame schema -> copybook transformation decisions at the field level (see the hypothetical usage sketch after this list), including:
      • specify the width of PIC X(n) fields
      • specify scale and precision for PIC 9 fields, such as S9(11)V99
      • specify binary packing options for individual fields, such as COMP-3
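To make these requirements concrete, here is a purely hypothetical usage sketch. None of these writer options exist in Cobrix today; the `cobol` format name is the one Cobrix uses for reading, but every option name below is invented for illustration:

```scala
// Hypothetical writer API: Cobrix currently has no "cobol" write support,
// and all option names below are illustrative only.
// Assumes df is an existing org.apache.spark.sql.DataFrame.
df.write
  .format("cobol")
  .option("record_format", "FB")                  // F, FB, V or VB
  .option("ebcdic_code_page", "cp037")            // configurable code page output
  .option("is_big_endian", "true")                // configurable endianness
  .option("generate_copybook", "/out/schema.cpy") // copybook matching output as written
  // Field-level overrides of the default schema -> copybook mapping:
  .option("field.CUSTOMER_NAME.pic", "X(40)")
  .option("field.BALANCE.pic", "S9(11)V99")
  .option("field.BALANCE.usage", "COMP-3")
  .save("/out/transactions.dat")
```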

Proposed Solution [Optional]

We could contribute development labor to the implementation of this feature; however, we would need assistance with the high-level design should such a feature be accepted. At this point I would like to open a discussion about how the feature might be implemented.

@mark-weghorst mark-weghorst added the enhancement New feature or request label Aug 23, 2021
@yruslan
Collaborator

yruslan commented Aug 25, 2021

This sounds great. The demand for the feature seems to exist already, but the feature requires a lot of effort, so this could be a good collaboration. As soon as the implementation of VBVR is finished (probably by the end of next week), I can prepare a design document for a Cobol file writer. We can discuss the features the writer can support and prioritize the ones required for your use case. Features that are useful but not immediately required for you can be implemented later on our side.
I think the work can be divided into independent tasks, and with your help the feature can be implemented much faster.

@mark-weghorst
Author

mark-weghorst commented Aug 26, 2021

I had a meeting to discuss the first draft of these requirements, and one of my peers suggested that while dynamically creating a copybook from a Spark schema and declarative configuration is a nice feature, it might be complex to implement and isn't really necessary for an MVP.

My colleague suggested that a better approach would be to require that a copybook layout be passed into the DataFrame writer, since we would have to set static field sizes for every column in the DataFrame anyway.

Of course we would have to verify that the DF schema can be mapped to the copybook schema, but that may be an easier lift than programmatically generating a copybook.

In our use case the copybook is defined by our business partner, and we would have to ensure that the DF we generate can map to the service contract (copybook) that they are expecting.
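A minimal sketch of that verification step, assuming simple name-based matching and using a naive regex in place of a real copybook parser (Cobrix has a proper parser internally; a real implementation would also check types, lengths, and ordering):

```scala
import org.apache.spark.sql.DataFrame

// Naive illustration: check that every DataFrame column has a copybook
// counterpart by name. Cobrix maps COBOL dashes to underscores in Spark
// column names, so the same normalization is applied here.
def verifySchemaMatchesCopybook(df: DataFrame, copybookText: String): Unit = {
  // Matches elementary items such as "05  ACCOUNT-ID   PIC X(10)."
  val fieldPattern = """(?m)^\s*\d+\s+([A-Z0-9-]+)\s+PIC\b""".r
  val copybookFields = fieldPattern
    .findAllMatchIn(copybookText)
    .map(_.group(1).replace('-', '_').toUpperCase)
    .toSet
  val missing = df.schema.fieldNames
    .filterNot(name => copybookFields.contains(name.toUpperCase))
  require(
    missing.isEmpty,
    s"DataFrame columns with no copybook counterpart: ${missing.mkString(", ")}"
  )
}
```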

Also, on the subject of narrowing the MVP features: our use case only requires a single code page (I believe it is cp037, but I will verify with the business partner) and only big-endian data.

All of our data ingest code uses CodePageCommon, which has been working adequately so far.
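For reference, the read side already exposes the code page as an option; something like the following, with option names as documented in the Cobrix README (worth verifying against the version in use):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ingest").getOrCreate()

// Cobrix read side: "ebcdic_code_page" defaults to "common" (CodePageCommon);
// "cp037" selects the EBCDIC US/Canada code page discussed above.
val df = spark.read
  .format("cobol")
  .option("copybook", "/path/to/copybook.cpy")
  .option("ebcdic_code_page", "cp037")
  .load("/path/to/transactions.dat")
```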

@yruslan
Collaborator

yruslan commented Aug 27, 2021

Good. We can start looking into the requirements in about 2 weeks.

Actually, generating our own copybook from a Spark DataFrame is easier, since we can choose the output data types. Conforming to an existing copybook would require supporting the plethora of formats that COBOL supports (picture, usage, etc.). But conforming to an existing copybook is usually what is required, so it's something we should implement at some point anyway. And since it matches your use case, we can look into that first.
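To illustrate why generating a copybook is the easier direction: the writer gets to pick one target PIC clause per Spark type. Here is a sketch of one possible mapping; the concrete choices (DISPLAY vs. binary usage, default string width) are open design decisions, not anything Cobrix currently implements:

```scala
import org.apache.spark.sql.types._

// Illustrative only: one possible default mapping from Spark types to PIC
// clauses when generating a copybook from a DataFrame schema.
def toPicClause(dataType: DataType, defaultStringWidth: Int = 50): String =
  dataType match {
    case StringType                      => s"PIC X($defaultStringWidth)"
    case IntegerType                     => "PIC S9(9)"
    case LongType                        => "PIC S9(18)"
    case dt: DecimalType if dt.scale > 0 => s"PIC S9(${dt.precision - dt.scale})V9(${dt.scale})"
    case dt: DecimalType                 => s"PIC S9(${dt.precision})"
    case other =>
      throw new IllegalArgumentException(s"No default copybook mapping for $other")
  }
```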

Supporting only cp037, or basic + cp037, is good as well.

@yruslan
Collaborator

yruslan commented Aug 27, 2021

What about data formats? Do you need support for F, V, and VB (RDW, no RDW, BDW+RDW), or can we just start with basic V (RDW)?
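For context on the taxonomy above: in the V and VB formats each record is prefixed with a 4-byte record descriptor word (RDW), and VB blocks additionally carry a block descriptor word (BDW). A minimal sketch of emitting an RDW, assuming the classic z/OS convention in which the big-endian length includes the 4-byte RDW itself (conventions vary; some producers write the length excluding the RDW):

```scala
// Prefix one record payload with an RDW: bytes 0-1 carry the big-endian
// record length including the 4-byte RDW itself; bytes 2-3 are zero for
// unsegmented records.
def withRdw(payload: Array[Byte]): Array[Byte] = {
  val len = payload.length + 4
  require(len <= 32767, s"Record too long for a 2-byte RDW length: $len")
  val out = new Array[Byte](len)
  out(0) = ((len >> 8) & 0xFF).toByte
  out(1) = (len & 0xFF).toByte
  // out(2) and out(3) remain 0x00
  System.arraycopy(payload, 0, out, 4, payload.length)
  out
}
```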

@mark-weghorst
Author

I have a colleague researching this now, but the preliminary answer is that we need the FB and VB formats. In a day or two I'll have a final answer and copybooks for you to review.

@milehighhokie

Mark is leaving Nordstrom, and I will be taking over as the contact for Nordstrom.

@mark-weghorst
Author

@yruslan, as @milehighhokie indicated, I have accepted a new position at another company, and Bill will be taking over this issue for my former employer. We had a turnover meeting this morning, and I reminded him that you are still waiting on copybook examples for the outbound data transfer use case that I outlined in this issue.

I want to extend my thanks for the excellent support I have received while using Cobrix, and in particular I appreciate the opportunity to collaborate with you on adding the new record format readers.

@yruslan
Collaborator

yruslan commented Dec 14, 2021

Thanks for the kind words, Mark! Enjoy the holiday season and the best of luck at the new role!

@milehighhokie , looking forward to future collaboration.

@joeyang001

Hi @yruslan, we have a similar requirement for a copybook writer. You have closed this issue. Did you make any progress on the Spark DataFrame writer for copybook data files?

@yruslan
Collaborator

yruslan commented Oct 6, 2022

Hi, sorry, the writer would require a lot of effort, and we have neither the capacity nor the internal demand for it at the moment.
But it is in the long-term plans to do it.
