Extended EBCDIC Encoder - added support for cp1144 #790
Conversation
Walkthrough
The PR refactors EBCDIC encoding by migrating encoding logic from EncoderSelector to CodePage classes; the removed encodeEbcdicString helper is replaced by per-code-page convert implementations.
Changes
Sequence Diagram(s)
sequenceDiagram
actor Client
participant EncoderSelector
participant CodePage
rect rgb(220, 240, 255)
Note over Client,CodePage: Before: Encoding handled in EncoderSelector
Client->>EncoderSelector: selectEncoder(EBCDIC)
EncoderSelector->>EncoderSelector: encodeEbcdicString()
EncoderSelector-->>Client: Array[Byte]
end
rect rgb(240, 255, 240)
Note over Client,CodePage: After: Encoding delegated to CodePage
Client->>EncoderSelector: selectEncoder(EBCDIC)
EncoderSelector->>CodePage: supportsEncoding?
CodePage-->>EncoderSelector: true/false
alt Supports Encoding
EncoderSelector->>CodePage: convert(string, length)
CodePage-->>EncoderSelector: Array[Byte]
else No Encoding Support
EncoderSelector-->>Client: no encoder
end
EncoderSelector-->>Client: Optional[encoder]
end
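To make the "after" flow concrete, here is a minimal sketch of the delegation. The method names supportsEncoding and convert follow this PR; the selector signature is simplified and illustrative, not the actual Cobrix API.

// Minimal sketch of the post-refactor flow (simplified signatures).
trait CodePage {
  def supportsEncoding: Boolean
  def convert(string: String, length: Int): Array[Byte]
}

object EncoderSelector {
  def getEncoder(codePage: CodePage, fieldLength: Int): Option[Any => Array[Byte]] =
    if (codePage.supportsEncoding)
      Some((a: Any) => codePage.convert(a.toString, fieldLength)) // delegate to the code page
    else
      None // no encoding support -> no encoder
}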
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
The changes involve moderate complexity: multiple files updated with consistent patterns (adding encoding methods across the CodePage hierarchy), significant API changes (method removal and introduction of new encoding contracts), and mixed edit types (constructor signature updates, new method implementations, mapping function additions). However, changes are well-structured and follow a clear migration pattern, reducing per-file cognitive load.
Possibly related PRs
Actionable comments posted: 3
🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (1)
47-53: Scaladoc: correct grammar and return description.
Update to “an ASCII” and describe returned bytes, not a string.
- /**
-  * An encoder from a ASCII basic string to an EBCDIC byte array
+ /**
+  * An encoder from an ASCII string to an EBCDIC byte array of fixed length.
   *
   * @param string An input string
   * @param length The length of the output (in bytes)
-  * @return A string representation of the binary data
+  * @return EBCDIC bytes, space-padded (0x40) to the requested length
   */
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
82-124: Test addition for cp1144 looks good.
Covers key glyphs and fixed-length behavior; mirrors existing suite patterns.
Suggestion: factor the “read first part file into bytes” boilerplate into a small helper to reduce duplication across tests.
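One possible shape for that helper (illustrative; the name and file-discovery details are assumptions, not part of the PR):

// Hypothetical test helper: find the first Spark "part-" file in an
// output directory and return its raw bytes.
def readFirstPartFile(outputDir: String): Array[Byte] = {
  val part = new java.io.File(outputDir)
    .listFiles()
    .filter(f => f.isFile && f.getName.startsWith("part-"))
    .sortBy(_.getName)
    .headOption
    .getOrElse(throw new IllegalStateException(s"No part files found in $outputDir"))
  java.nio.file.Files.readAllBytes(part.toPath)
}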
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (3)
21-22: Remove unused import.
import java.util isn't used here.
- import java.util
37-46: Scaladoc: fix return description and article.
Describe returned bytes and use “an ASCII”.
- /**
-  * An encoder from a ASCII basic string to an EBCDIC byte array
+ /**
+  * An encoder from an ASCII string to an EBCDIC byte array
@@
-  * @return A string representation of the binary data
+  * @return EBCDIC bytes of the given length (space-padded 0x40)
   */
47-53: Scaladoc: tighten supportsEncoding wording.
Minor: clarify what “able to encode” means.
- /**
-  * Returns whether the CodePage is able to encode strings to ebcdic
-  *
-  * @return
-  */
+ /**
+  * Indicates whether this CodePage supports ASCII→EBCDIC encoding.
+  */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1 hunk)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (1 hunk)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (6)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)
  spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2)
    write (74-78)
    close (79-81)
  spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolBuilder.scala (1)
    copybookContents (23-25)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/Encoding.scala (1)
    EBCDIC (23-23)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
    supportsEncoding (73-73)
    convert (37-45)
    convert (54-71)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
    supportsEncoding (98-98)
    convert (35-92)
    convert (94-96)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (2)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
    asciiToEbcdicMapping (80-82)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
    convert (35-92)
    convert (94-96)
    supportsEncoding (98-98)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (1)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
    asciiToEbcdicMapping (80-82)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (2)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
    convert (37-45)
    convert (54-71)
    supportsEncoding (73-73)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
    convert (35-92)
    convert (94-96)
    supportsEncoding (98-98)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (1)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
    convert (37-45)
    convert (54-71)
    supportsEncoding (73-73)
🔇 Additional comments (6)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (1)
73-73: Encoding support flag looks good.
supportsEncoding correctly mirrors the presence of the mapping.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
24-26: LGTM overall; clarify semantics of 0x00 entries.
Wiring Some(asciiToEbcdicMapping) is correct. Please confirm that 0x00 values in data are intended to produce NUL bytes (overriding the 0x40 prefill) rather than “unmapped ⇒ space”. If the latter, we may need to treat 0x00 as “unmapped” except for code point 0.
Also applies to: 59-76, 80-82
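If “unmapped ⇒ space” turns out to be the intent, one possible guard, sketched with the local names of the SingleByteCodePage.convert loop quoted later in this review (illustrative only):

// Illustrative only: treat a 0x00 mapping result as "unmapped" (keep the
// 0x40 space prefill) unless the input code point is actually NUL (0).
val mapped = asciiToEbcdicMapping.get(unicodeCodePoint)
if (mapped != 0x00.toByte || unicodeCodePoint == 0) {
  buf(i) = mapped
}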
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (4)
25-25: Constructor change enables bidirectional encoding support.
The addition of Some(CodePage1144.asciiToEbcdicMapping) as the second parameter correctly enables ASCII-to-EBCDIC encoding for this code page, aligning with the PR's goal of supporting cp1144 encoding.
99-99: Line ending handling is well-documented.
The comment explains the conscious decision to map Unicode LF (line feed, code point 10) to EBCDIC 0x25. This handles the complexity where EBCDIC has separate NL (New Line) and LF (Line Feed) control codes, while Unicode typically uses a single LF character. The choice to always map to LF is reasonable and maintains consistency.
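A quick illustrative check of that choice (assuming the mapping function is accessible as described in this review):

// Assumed accessor: Unicode LF (code point 10) maps to EBCDIC LF 0x25, not NL 0x15.
assert(CodePage1144.asciiToEbcdicMapping(10) == 0x25.toByte)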
318-318: Reasonable default for unmapped characters.
Defaulting to 0x40 (EBCDIC space) for unmapped Unicode code points is a sensible fallback strategy. This provides graceful handling of characters outside the cp1144 character set rather than throwing exceptions, which is consistent with typical encoding practices where not all Unicode characters have representations in legacy encodings.
62-319: No issues found. Mapping is comprehensive and bidirectionally consistent.
Verification confirms:
- ✓ No duplicate case statements in asciiToEbcdicMapping
- ✓ All common ASCII characters (0-9, A-Z, a-z, space, punctuation) are properly mapped
- ✓ Both encode and decode mappings exist and are bidirectionally consistent:
- ASCII 32 (space) ↔ EBCDIC 0x40 ✓
- ASCII 65 (A) ↔ EBCDIC 0xc1 (193) ✓
- ASCII 97 (a) ↔ EBCDIC 0x81 (129) ✓
- ASCII 48 (0) ↔ EBCDIC 0xf0 (240) ✓
- ✓ Different implementation strategies are intentional and appropriate:
  - asciiToEbcdicMapping: Int → Byte function (pattern matching with 255 cases + default fallback to space)
  - ebcdicToAsciiMapping: Array[Char] (indexed direct lookup, 256 complete entries)
The mapping table is complete, correct, and well-designed with reasonable defaults.
def convert(string: String, length: Int): Array[Byte] = {
  require(length >= 0, s"Field length cannot be negative, got $length")
  require(asciiToEbcdicMapping.isDefined, s"Cannot encode strings for Code Page without ASCII to EBCDIC " +
    s"mapping ${this.getClass.getSimpleName}")

  var i = 0
  val buf = new Array[Byte](length)

  // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
  util.Arrays.fill(buf, 0x40.toByte)

  while (i < string.length && i < length) {
    val unicodeCodePoint: Int = string.codePointAt(i)
    buf(i) = asciiToEbcdicMapping.get(unicodeCodePoint)
    i = i + 1
  }
  buf
}
Fix code point iteration (surrogate pairs) and decouple input/output indices.
Using codePointAt(i) while incrementing i by 1 corrupts iteration for non-BMP characters and misaligns the output. Keep separate indices and advance by Character.charCount(cp).
Apply:
- def convert(string: String, length: Int): Array[Byte] = {
+ def convert(string: String, length: Int): Array[Byte] = {
require(length >= 0, s"Field length cannot be negative, got $length")
require(asciiToEbcdicMapping.isDefined, s"Cannot encode strings for Code Page without ASCII to EBCDIC " +
s"mapping ${this.getClass.getSimpleName}")
- var i = 0
- val buf = new Array[Byte](length)
-
- // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
- util.Arrays.fill(buf, 0x40.toByte)
-
- while (i < string.length && i < length) {
- val unicodeCodePoint: Int = string.codePointAt(i)
- buf(i) = asciiToEbcdicMapping.get(unicodeCodePoint)
- i = i + 1
- }
- buf
+ val mapper = asciiToEbcdicMapping.get
+ val buf = new Array[Byte](length)
+ // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
+ util.Arrays.fill(buf, 0x40.toByte)
+ var in = 0
+ var out = 0
+ while (in < string.length && out < length) {
+ val cp = string.codePointAt(in)
+ buf(out) = mapper(cp)
+ in += java.lang.Character.charCount(cp)
+ out += 1
+ }
+ buf
 }
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala
around lines 54 to 71, the loop uses string.codePointAt(i) but increments i by 1
which breaks surrogate-pair handling and couples input and output indices;
change to two indices (inputIndex and outIndex), read val cp =
string.codePointAt(inputIndex), write the mapped byte to buf(outIndex) (use
asciiToEbcdicMapping.getOrElse(cp, 0x40.toByte) or similar fallback), then
advance inputIndex by Character.charCount(cp) and increment outIndex by 1, and
loop while inputIndex < string.length && outIndex < length so non-BMP chars are
iterated correctly and output stays aligned.
override def convert(string: String, length: Int): Array[Byte] = throw new IllegalStateException(s"Cannot encode " +
  s"strings for Code Page without ASCII to EBCDIC mapping ${this.getClass.getSimpleName}, did you forget to call" +
  s"supportsEncoding?")

override def supportsEncoding: Boolean = false
}
Fix error message spacing; consider UnsupportedOperationException.
There’s a missing space before “supportsEncoding”. Using UnsupportedOperationException better matches intent.
- override def convert(string: String, length: Int): Array[Byte] = throw new IllegalStateException(s"Cannot encode " +
- s"strings for Code Page without ASCII to EBCDIC mapping ${this.getClass.getSimpleName}, did you forget to call" +
- s"supportsEncoding?")
+ override def convert(string: String, length: Int): Array[Byte] =
+ throw new UnsupportedOperationException(
+ s"Cannot encode strings for Code Page ${this.getClass.getSimpleName}: ASCII→EBCDIC mapping is not provided. " +
+ s"Did you forget to check supportsEncoding()?"
+     )
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala
around lines 94 to 99, the thrown IllegalStateException message is missing a
space before "supportsEncoding" and the exception type should be
UnsupportedOperationException to better express the operation isn't supported;
update the exception to throw new UnsupportedOperationException and fix the
message spacing so it reads "...did you forget to call supportsEncoding?"
(ensure spaces around concatenated strings or use a single interpolated string).
  case EBCDIC if ebcdicCodePage.supportsEncoding =>
    val encoder = (a: Any) => {
-     encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
+     ebcdicCodePage.convert(a.toString, fieldLength)
    }
🧩 Analysis chain
Handle nulls to avoid NPE and preserve null semantics.
Calling toString on null will throw at runtime. For PIC X nulls, previous behavior wrote zero bytes; preserve that.
- case EBCDIC if ebcdicCodePage.supportsEncoding =>
- val encoder = (a: Any) => {
- ebcdicCodePage.convert(a.toString, fieldLength)
- }
+ case EBCDIC if ebcdicCodePage.supportsEncoding =>
+ val encoder = (a: Any) => {
+ if (a == null) new Array[Byte](fieldLength) // nulls -> zero-filled
+ else ebcdicCodePage.convert(a.toString, fieldLength)
+ }
  Option(encoder)
Please also add a unit test covering a null alphanumeric field with an EBCDIC code page (e.g., cp1144) to lock this in.
---
🏁 Script executed:
#!/bin/bash
# Examine the current encoder implementations to understand null handling patterns
fd --type f "EncoderSelector" | head -5
Length of output: 89
🏁 Script executed:
#!/bin/bash
# Check the current state of the EBCDIC encoder block and surrounding context
fd --type f -e scala "EncoderSelector" -x cat {} \; | grep -A 20 -B 5 "case EBCDIC"
Length of output: 934
🏁 Script executed:
#!/bin/bash
# Search for other encoder case branches to see null handling patterns
fd --type f -e scala "EncoderSelector" -x cat {} \; | grep -A 10 "case.*=>" | head -60
Length of output: 4151
🏁 Script executed:
#!/bin/bash
# Search for null-related tests in the cobol writer/encoder test suites
fd --type f -e scala "Test" -o -e scala "Suite" | xargs grep -l "null.*encod\|encod.*null" 2>/dev/null | head -10
Length of output: 116
🏁 Script executed:
#!/bin/bash
# Look for existing null value handling in encoder tests
fd --type f -e scala -path "*/test/*" | xargs grep -l "EBCDIC\|Encoder" 2>/dev/null | head -5
Length of output: 168
🏁 Script executed:
#!/bin/bash
# Correct search for test files related to encoding/writer
find . -type f -name "*Test.scala" -o -name "*Suite.scala" | grep -i "encod\|writer" | head -10
Length of output: 392
🏁 Script executed:
#!/bin/bash
# Find getDisplayEncoder implementation to see if it handles nulls
fd -e scala "EncoderSelector" -x cat {} \; | grep -A 15 "def getDisplayEncoder"
Length of output: 885
🏁 Script executed:
#!/bin/bash
# Search for CodePage.convert method signature and implementation
fd -e scala "CodePage" | head -5
Length of output: 476
🏁 Script executed:
#!/bin/bash
# Look for null handling in existing string encoders or similar patterns
fd -e scala "EncoderSelector" -x cat {} \; | grep -B 5 -A 5 "case null"
Length of output: 1782
🏁 Script executed:
#!/bin/bash
# Search test directory for encoding tests with null
find . -path "*/test/*" -name "*.scala" | xargs grep -l "Encoder\|encod" 2>/dev/null | head -8
Length of output: 792
🏁 Script executed:
#!/bin/bash
# Check CodePage.convert method signature and what it does with null
cat cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala | head -50
Length of output: 1657
🏁 Script executed:
#!/bin/bash
# Look at DisplayEncodersSuite to see null handling test patterns
cat ./cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/DisplayEncodersSuite.scala | grep -B 3 -A 3 "null"
Length of output: 2214
🏁 Script executed:
#!/bin/bash
# Check FixedLengthEbcdicWriterSuite for null encoding behavior
cat ./spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala | grep -B 5 -A 5 "null\|0x00\|zero"
Length of output: 4890
🏁 Script executed:
#!/bin/bash
# Look for how encoders handle null output - check if downstream expects Array[Byte] or null
fd -e scala "EncoderSelector" -x cat {} \; | tail -50
Length of output: 2233
🏁 Script executed:
#!/bin/bash
# Check if there are any string encoder tests or EBCDIC tests with null
find . -path "*/test/*" -name "*.scala" -exec grep -l "EBCDIC.*null\|null.*EBCDIC\|string.*encod.*null" {} \;
Length of output: 0
Handle nulls in EBCDIC string encoder to avoid NPE and maintain null semantics.
Calling toString on null throws NPE at runtime. Other encoders in this class (display, binary, BCD) explicitly handle null using case null => null, then convert downstream. For EBCDIC string fields (PIC X), tests confirm nulls encode to zero bytes.
case EBCDIC if ebcdicCodePage.supportsEncoding =>
val encoder = (a: Any) => {
+ if (a == null) new Array[Byte](fieldLength)
+ else ebcdicCodePage.convert(a.toString, fieldLength)
- ebcdicCodePage.convert(a.toString, fieldLength)
}
  Option(encoder)
Add a unit test covering a null alphanumeric field with an EBCDIC code page (e.g., cp1144).
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
around lines 68 to 71, the EBCDIC string encoder calls a.toString which throws
NPE for nulls; change the encoder to pattern-match null (case null => null) and
only call ebcdicCodePage.convert for non-null values so null semantics are
preserved like other encoders, and add a unit test that encodes a null
alphanumeric (PIC X) field using an EBCDIC code page (e.g., cp1144) and asserts
the resulting bytes are zero-filled for the field length.
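A sketch of such a test (ScalaTest style; the selector call below uses a hypothetical signature and should be adapted to the actual EncoderSelector API in this PR):

import org.scalatest.funsuite.AnyFunSuite

// Sketch only: getEncoder's name and parameters are assumptions.
class NullAlphanumericEncodingSuite extends AnyFunSuite {
  test("null PIC X values encode as zero-filled bytes under cp1144") {
    val fieldLength = 5
    val encoder = EncoderSelector
      .getEncoder(EBCDIC, new CodePage1144, fieldLength) // hypothetical signature
      .getOrElse(fail("expected cp1144 to support encoding"))

    assert(encoder(null).sameElements(new Array[Byte](fieldLength))) // zero-filled
  }
}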
Hi @Il-Pela , this is a fantastic PR! Thank you so much for your contribution! Going to review and merge very soon.
Hi @yruslan, may I ask in which version, and when, you think the implementation will be released? Also, I'd like to ask you a couple of additional questions related to this PR.
Thank you
Hi @Il-Pela , I was thinking about adding support for EBCDIC code page tables that are generated on the fly from decoder pages, using lazy computation of these tables in a way similar to the Python script that you used. But I want to test what would perform better: a pre-built table based on match cases, or arrays generated on the fly. I'll let you know the outcome of the observation in a week or so.
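One possible shape for that idea, sketched under the assumption that decoder tables are exposed as a 256-entry Array[Char] (as ebcdicToAsciiMapping is described in the review above); names here are illustrative:

// Derive an ASCII->EBCDIC encode table lazily from an existing
// EBCDIC->ASCII decode table.
def lazyEncodeTable(ebcdicToAsciiMapping: Array[Char]): Int => Byte = {
  lazy val table: Map[Int, Byte] =
    ebcdicToAsciiMapping.zipWithIndex.map { case (ch, ebcdicByte) =>
      ch.toInt -> ebcdicByte.toByte
    }.toMap
  cp => table.getOrElse(cp, 0x40.toByte) // unmapped -> EBCDIC space
}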
Context
Hi, I'm a Data Engineer working for a consulting company. One of our customers is from the banking world, and in some use cases we used your library to parse and offload positional binary files via Spark.
Recently we also worked on projects that needed to produce positional binary files back after Spark transformations, and with the recent release of the feature that adapts the COBOL source so it can also write files, we thought we were covered. However, we soon found that our requirements went beyond the common code page encoding, so we decided to contribute to the library by providing an implementation for the code page of our interest (cp1144) and, if the process described in this PR is validated, a replicable process for implementing encoders for other code pages as well.
I want to mention that I'm not much of an expert in Scala programming, so please forgive me if I miss some conventions. Also, this is my first contribution to open-source software, so please keep that in mind too.
Co-author of this PR: Andrea Fonti
Let's get to the sauce.
What have we done?
In the last few weeks we looked at the contents of the latest commits for the EBCDIC writer feature.
We saw you were focusing only on implementing the common code page.
We started from there, and we followed the same approach to implement an encoder for cp1144.
We first refactored the EncoderSelector class to delegate the convert method.
We then added one parameter to the SingleByteCodePage class signature:
asciiToEbcdicMapping: Option[Int => Byte] = None
We focused mainly on SingleByteCodePage, but we also provided a default implementation of these methods for TwoByteCodePage to preserve compatibility (supportsEncoding = false).
We then used this parameter to implement the convert function, which converts an ASCII string into an EBCDIC byte array of fixed length.
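For illustration, a minimal usage sketch of the new API (the call site is hypothetical; padding behavior follows the implementation quoted in the review above):

// Hypothetical usage: supportsEncoding is checked first, then convert
// produces a fixed-length buffer, space-padded with EBCDIC 0x40.
val cp1144 = new CodePage1144
if (cp1144.supportsEncoding) {
  val bytes: Array[Byte] = cp1144.convert("CIAO", 6) // 4 chars + two 0x40 pad bytes
}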
Step-by-step:
To generate the mapping for code page 1144, we used this Python script.
What this script does:
Creates a conversion array - Initializes a 256-element list to store UTF-32 codepoints corresponding to each EBCDIC byte value (0-255)
Builds the lookup table - For each possible EBCDIC byte (0-255):
Handles errors - If a character cannot be encoded to UTF-32, it defaults to 0x40 (the space character in EBCDIC)
Generates Scala code - Outputs a series of case statements mapping UTF-32 codepoints to their corresponding EBCDIC byte values, formatted as hexadecimal
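For reference, a rough Scala equivalent of that generator, assuming the JVM's built-in "IBM01144" charset corresponds to cp1144 (an assumption worth verifying against the script's output):

import java.nio.charset.Charset

// Decodes each EBCDIC byte 0-255 and emits a Scala `case` line mapping
// the Unicode code point back to that byte.
object GenerateCp1144Mapping extends App {
  val cs = Charset.forName("IBM01144")
  (0 to 255).foreach { b =>
    val decoded = new String(Array(b.toByte), cs)
    // Fall back to EBCDIC space (0x40) if nothing decodable comes back.
    val codePoint = if (decoded.nonEmpty) decoded.codePointAt(0) else 0x40
    println(f"      case $codePoint => 0x$b%02x.toByte")
  }
}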
Final Notes
The asciiToEbcdicMapping function has been refactored.
Tests have been added to FixedLengthEbcdicWriterSuite for the new functionality.
What to do next?
We are open to further communication and collaboration, and we look forward to reading your feedback.
Summary by CodeRabbit
New Features
- Support for encoding strings to EBCDIC code page 1144 (cp1144) when writing files.
Tests
- New tests covering cp1144 encoding in FixedLengthEbcdicWriterSuite.