Extended EBCDIC Encoder - added support for cp1144 #790
Conversation
Walkthrough
The PR refactors EBCDIC encoding by migrating encoding logic from EncoderSelector to CodePage classes; the removed encodeEbcdicString helper is replaced by per-code-page convert implementations.
Changes
Sequence Diagram(s)
sequenceDiagram
actor Client
participant EncoderSelector
participant CodePage
rect rgb(220, 240, 255)
Note over Client,CodePage: Before: Encoding handled in EncoderSelector
Client->>EncoderSelector: selectEncoder(EBCDIC)
EncoderSelector->>EncoderSelector: encodeEbcdicString()
EncoderSelector-->>Client: Array[Byte]
end
rect rgb(240, 255, 240)
Note over Client,CodePage: After: Encoding delegated to CodePage
Client->>EncoderSelector: selectEncoder(EBCDIC)
EncoderSelector->>CodePage: supportsEncoding?
CodePage-->>EncoderSelector: true/false
alt Supports Encoding
EncoderSelector->>CodePage: convert(string, length)
CodePage-->>EncoderSelector: Array[Byte]
else No Encoding Support
EncoderSelector-->>Client: no encoder
end
EncoderSelector-->>Client: Optional[encoder]
end
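To make the "after" flow concrete, here is a minimal sketch of the delegation. The method names supportsEncoding and convert follow this PR; the selector signature is simplified and illustrative, not the actual Cobrix API.

// Minimal sketch of the post-refactor flow (simplified signatures).
trait CodePage {
  def supportsEncoding: Boolean
  def convert(string: String, length: Int): Array[Byte]
}

object EncoderSelector {
  def getEncoder(codePage: CodePage, fieldLength: Int): Option[Any => Array[Byte]] =
    if (codePage.supportsEncoding)
      Some((a: Any) => codePage.convert(a.toString, fieldLength)) // delegate to the code page
    else
      None // no encoding support -> no encoder
}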
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
The changes involve moderate complexity: multiple files updated with consistent patterns (adding encoding methods across the CodePage hierarchy), significant API changes (method removal and introduction of new encoding contracts), and mixed edit types (constructor signature updates, new method implementations, mapping function additions). However, changes are well-structured and follow a clear migration pattern, reducing per-file cognitive load.
Possibly related PRs
Actionable comments posted: 3
🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (1)
47-53: Scaladoc: correct grammar and return description.
Update to “an ASCII” and describe returned bytes, not a string.
- /**
-  * An encoder from a ASCII basic string to an EBCDIC byte array
+ /**
+  * An encoder from an ASCII string to an EBCDIC byte array of fixed length.
   *
   * @param string An input string
   * @param length The length of the output (in bytes)
-  * @return A string representation of the binary data
+  * @return EBCDIC bytes, space-padded (0x40) to the requested length
   */
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)
82-124: Test addition for cp1144 looks good.
Covers key glyphs and fixed-length behavior; mirrors existing suite patterns.
Suggestion: factor the “read first part file into bytes” boilerplate into a small helper to reduce duplication across tests.
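One possible shape for that helper (illustrative; the name and file-discovery details are assumptions, not part of the PR):

// Hypothetical test helper: find the first Spark "part-" file in an
// output directory and return its raw bytes.
def readFirstPartFile(outputDir: String): Array[Byte] = {
  val part = new java.io.File(outputDir)
    .listFiles()
    .filter(f => f.isFile && f.getName.startsWith("part-"))
    .sortBy(_.getName)
    .headOption
    .getOrElse(throw new IllegalStateException(s"No part files found in $outputDir"))
  java.nio.file.Files.readAllBytes(part.toPath)
}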
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (3)
21-22: Remove unused import.
import java.util isn't used here.
- import java.util
37-46: Scaladoc: fix return description and article.
Describe returned bytes and use “an ASCII”.
- /**
-  * An encoder from a ASCII basic string to an EBCDIC byte array
+ /**
+  * An encoder from an ASCII string to an EBCDIC byte array
@@
-  * @return A string representation of the binary data
+  * @return EBCDIC bytes of the given length (space-padded 0x40)
   */
47-53: Scaladoc: tighten supportsEncoding wording.
Minor: clarify what “able to encode” means.
- /**
-  * Returns whether the CodePage is able to encode strings to ebcdic
-  *
-  * @return
-  */
+ /**
+  * Indicates whether this CodePage supports ASCII→EBCDIC encoding.
+  */
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1 hunk)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (2 hunks)
- cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (1 hunk)
- spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (6)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)
  spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2)
    write (74-78)
    close (79-81)
  spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolBuilder.scala (1)
    copybookContents (23-25)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/Encoding.scala (1)
    EBCDIC (23-23)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
    supportsEncoding (73-73)
    convert (37-45)
    convert (54-71)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
    supportsEncoding (98-98)
    convert (35-92)
    convert (94-96)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (2)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
    asciiToEbcdicMapping (80-82)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
    convert (35-92)
    convert (94-96)
    supportsEncoding (98-98)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (1)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
    asciiToEbcdicMapping (80-82)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (2)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
    convert (37-45)
    convert (54-71)
    supportsEncoding (73-73)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
    convert (35-92)
    convert (94-96)
    supportsEncoding (98-98)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (1)
  cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
    convert (37-45)
    convert (54-71)
    supportsEncoding (73-73)
🔇 Additional comments (6)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (1)
73-73: Encoding support flag looks good.
supportsEncoding correctly mirrors the presence of the mapping.
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
24-26: LGTM overall; clarify semantics of 0x00 entries.
Wiring Some(asciiToEbcdicMapping) is correct. Please confirm that 0x00 values in data are intended to produce NUL bytes (overriding the 0x40 prefill) rather than “unmapped ⇒ space”. If the latter, we may need to treat 0x00 as “unmapped” except for code point 0.
Also applies to: 59-76, 80-82
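If “unmapped ⇒ space” turns out to be the intent, one possible guard, sketched with the local names of the SingleByteCodePage.convert loop quoted later in this review (illustrative only):

// Illustrative only: treat a 0x00 mapping result as "unmapped" (keep the
// 0x40 space prefill) unless the input code point is actually NUL (0).
val mapped = asciiToEbcdicMapping.get(unicodeCodePoint)
if (mapped != 0x00.toByte || unicodeCodePoint == 0) {
  buf(i) = mapped
}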
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (4)
25-25: Constructor change enables bidirectional encoding support.
The addition of Some(CodePage1144.asciiToEbcdicMapping) as the second parameter correctly enables ASCII-to-EBCDIC encoding for this code page, aligning with the PR's goal of supporting cp1144 encoding.
99-99: Line ending handling is well-documented.
The comment explains the conscious decision to map Unicode LF (line feed, code point 10) to EBCDIC 0x25. This handles the complexity where EBCDIC has separate NL (New Line) and LF (Line Feed) control codes, while Unicode typically uses a single LF character. The choice to always map to LF is reasonable and maintains consistency.
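A quick illustrative check of that choice (assuming the mapping function is accessible as described in this review):

// Assumed accessor: Unicode LF (code point 10) maps to EBCDIC LF 0x25, not NL 0x15.
assert(CodePage1144.asciiToEbcdicMapping(10) == 0x25.toByte)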
318-318: Reasonable default for unmapped characters.
Defaulting to 0x40 (EBCDIC space) for unmapped Unicode code points is a sensible fallback strategy. This provides graceful handling of characters outside the cp1144 character set rather than throwing exceptions, which is consistent with typical encoding practices where not all Unicode characters have representations in legacy encodings.
62-319: No issues found. Mapping is comprehensive and bidirectionally consistent.
Verification confirms:
- ✓ No duplicate case statements in asciiToEbcdicMapping
- ✓ All common ASCII characters (0-9, A-Z, a-z, space, punctuation) are properly mapped
- ✓ Both encode and decode mappings exist and are bidirectionally consistent:
- ASCII 32 (space) ↔ EBCDIC 0x40 ✓
- ASCII 65 (A) ↔ EBCDIC 0xc1 (193) ✓
- ASCII 97 (a) ↔ EBCDIC 0x81 (129) ✓
- ASCII 48 (0) ↔ EBCDIC 0xf0 (240) ✓
- ✓ Different implementation strategies are intentional and appropriate:
  - asciiToEbcdicMapping: Int → Byte function (pattern matching with 255 cases + default fallback to space)
  - ebcdicToAsciiMapping: Array[Char] (indexed direct lookup, 256 complete entries)
The mapping table is complete, correct, and well-designed with reasonable defaults.
def convert(string: String, length: Int): Array[Byte] = {
  require(length >= 0, s"Field length cannot be negative, got $length")
  require(asciiToEbcdicMapping.isDefined, s"Cannot encode strings for Code Page without ASCII to EBCDIC " +
    s"mapping ${this.getClass.getSimpleName}")

  var i = 0
  val buf = new Array[Byte](length)

  // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
  util.Arrays.fill(buf, 0x40.toByte)

  while (i < string.length && i < length) {
    val unicodeCodePoint: Int = string.codePointAt(i)
    buf(i) = asciiToEbcdicMapping.get(unicodeCodePoint)
    i = i + 1
  }
  buf
}
Fix code point iteration (surrogate pairs) and decouple input/output indices.
Using codePointAt(i) while incrementing i by 1 corrupts iteration for non-BMP characters and misaligns the output. Keep separate indices and advance by Character.charCount(cp).
Apply:
- def convert(string: String, length: Int): Array[Byte] = {
+ def convert(string: String, length: Int): Array[Byte] = {
require(length >= 0, s"Field length cannot be negative, got $length")
require(asciiToEbcdicMapping.isDefined, s"Cannot encode strings for Code Page without ASCII to EBCDIC " +
s"mapping ${this.getClass.getSimpleName}")
- var i = 0
- val buf = new Array[Byte](length)
-
- // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
- util.Arrays.fill(buf, 0x40.toByte)
-
- while (i < string.length && i < length) {
- val unicodeCodePoint: Int = string.codePointAt(i)
- buf(i) = asciiToEbcdicMapping.get(unicodeCodePoint)
- i = i + 1
- }
- buf
+ val mapper = asciiToEbcdicMapping.get
+ val buf = new Array[Byte](length)
+ // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
+ util.Arrays.fill(buf, 0x40.toByte)
+ var in = 0
+ var out = 0
+ while (in < string.length && out < length) {
+ val cp = string.codePointAt(in)
+ buf(out) = mapper(cp)
+ in += java.lang.Character.charCount(cp)
+ out += 1
+ }
+ buf
 }
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala
around lines 54 to 71, the loop uses string.codePointAt(i) but increments i by 1
which breaks surrogate-pair handling and couples input and output indices;
change to two indices (inputIndex and outIndex), read val cp =
string.codePointAt(inputIndex), write the mapped byte to buf(outIndex) (use
asciiToEbcdicMapping.getOrElse(cp, 0x40.toByte) or similar fallback), then
advance inputIndex by Character.charCount(cp) and increment outIndex by 1, and
loop while inputIndex < string.length && outIndex < length so non-BMP chars are
iterated correctly and output stays aligned.
override def convert(string: String, length: Int): Array[Byte] = throw new IllegalStateException(s"Cannot encode " +
  s"strings for Code Page without ASCII to EBCDIC mapping ${this.getClass.getSimpleName}, did you forget to call" +
  s"supportsEncoding?")

override def supportsEncoding: Boolean = false
}
Fix error message spacing; consider UnsupportedOperationException.
There’s a missing space before “supportsEncoding”. Using UnsupportedOperationException better matches intent.
- override def convert(string: String, length: Int): Array[Byte] = throw new IllegalStateException(s"Cannot encode " +
- s"strings for Code Page without ASCII to EBCDIC mapping ${this.getClass.getSimpleName}, did you forget to call" +
- s"supportsEncoding?")
+ override def convert(string: String, length: Int): Array[Byte] =
+ throw new UnsupportedOperationException(
+ s"Cannot encode strings for Code Page ${this.getClass.getSimpleName}: ASCII→EBCDIC mapping is not provided. " +
+ s"Did you forget to check supportsEncoding()?"
+     )
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala
around lines 94 to 99, the thrown IllegalStateException message is missing a
space before "supportsEncoding" and the exception type should be
UnsupportedOperationException to better express the operation isn't supported;
update the exception to throw new UnsupportedOperationException and fix the
message spacing so it reads "...did you forget to call supportsEncoding?"
(ensure spaces around concatenated strings or use a single interpolated string).
  case EBCDIC if ebcdicCodePage.supportsEncoding =>
    val encoder = (a: Any) => {
-     encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
+     ebcdicCodePage.convert(a.toString, fieldLength)
    }
🧩 Analysis chain
Handle nulls to avoid NPE and preserve null semantics.
Calling toString on null will throw at runtime. For PIC X nulls, previous behavior wrote zero bytes; preserve that.
- case EBCDIC if ebcdicCodePage.supportsEncoding =>
- val encoder = (a: Any) => {
- ebcdicCodePage.convert(a.toString, fieldLength)
- }
+ case EBCDIC if ebcdicCodePage.supportsEncoding =>
+ val encoder = (a: Any) => {
+ if (a == null) new Array[Byte](fieldLength) // nulls -> zero-filled
+ else ebcdicCodePage.convert(a.toString, fieldLength)
+ }
  Option(encoder)
Please also add a unit test covering a null alphanumeric field with an EBCDIC code page (e.g., cp1144) to lock this in.
---
🏁 Script executed:
#!/bin/bash
# Examine the current encoder implementations to understand null handling patterns
fd --type f "EncoderSelector" | head -5
Length of output: 89
🏁 Script executed:
#!/bin/bash
# Check the current state of the EBCDIC encoder block and surrounding context
fd --type f -e scala "EncoderSelector" -x cat {} \; | grep -A 20 -B 5 "case EBCDIC"
Length of output: 934
🏁 Script executed:
#!/bin/bash
# Search for other encoder case branches to see null handling patterns
fd --type f -e scala "EncoderSelector" -x cat {} \; | grep -A 10 "case.*=>" | head -60
Length of output: 4151
🏁 Script executed:
#!/bin/bash
# Search for null-related tests in the cobol writer/encoder test suites
fd --type f -e scala "Test" -o -e scala "Suite" | xargs grep -l "null.*encod\|encod.*null" 2>/dev/null | head -10
Length of output: 116
🏁 Script executed:
#!/bin/bash
# Look for existing null value handling in encoder tests
fd --type f -e scala -path "*/test/*" | xargs grep -l "EBCDIC\|Encoder" 2>/dev/null | head -5
Length of output: 168
🏁 Script executed:
#!/bin/bash
# Correct search for test files related to encoding/writer
find . -type f -name "*Test.scala" -o -name "*Suite.scala" | grep -i "encod\|writer" | head -10
Length of output: 392
🏁 Script executed:
#!/bin/bash
# Find getDisplayEncoder implementation to see if it handles nulls
fd -e scala "EncoderSelector" -x cat {} \; | grep -A 15 "def getDisplayEncoder"
Length of output: 885
🏁 Script executed:
#!/bin/bash
# Search for CodePage.convert method signature and implementation
fd -e scala "CodePage" | head -5
Length of output: 476
🏁 Script executed:
#!/bin/bash
# Look for null handling in existing string encoders or similar patterns
fd -e scala "EncoderSelector" -x cat {} \; | grep -B 5 -A 5 "case null"
Length of output: 1782
🏁 Script executed:
#!/bin/bash
# Search test directory for encoding tests with null
find . -path "*/test/*" -name "*.scala" | xargs grep -l "Encoder\|encod" 2>/dev/null | head -8
Length of output: 792
🏁 Script executed:
#!/bin/bash
# Check CodePage.convert method signature and what it does with null
cat cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala | head -50
Length of output: 1657
🏁 Script executed:
#!/bin/bash
# Look at DisplayEncodersSuite to see null handling test patterns
cat ./cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/parser/encoding/DisplayEncodersSuite.scala | grep -B 3 -A 3 "null"
Length of output: 2214
🏁 Script executed:
#!/bin/bash
# Check FixedLengthEbcdicWriterSuite for null encoding behavior
cat ./spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala | grep -B 5 -A 5 "null\|0x00\|zero"
Length of output: 4890
🏁 Script executed:
#!/bin/bash
# Look for how encoders handle null output - check if downstream expects Array[Byte] or null
fd -e scala "EncoderSelector" -x cat {} \; | tail -50
Length of output: 2233
🏁 Script executed:
#!/bin/bash
# Check if there are any string encoder tests or EBCDIC tests with null
find . -path "*/test/*" -name "*.scala" -exec grep -l "EBCDIC.*null\|null.*EBCDIC\|string.*encod.*null" {} \;
Length of output: 0
Handle nulls in EBCDIC string encoder to avoid NPE and maintain null semantics.
Calling toString on null throws NPE at runtime. Other encoders in this class (display, binary, BCD) explicitly handle null using case null => null, then convert downstream. For EBCDIC string fields (PIC X), tests confirm nulls encode to zero bytes.
case EBCDIC if ebcdicCodePage.supportsEncoding =>
val encoder = (a: Any) => {
+ if (a == null) new Array[Byte](fieldLength)
+ else ebcdicCodePage.convert(a.toString, fieldLength)
- ebcdicCodePage.convert(a.toString, fieldLength)
}
  Option(encoder)
Add a unit test covering a null alphanumeric field with an EBCDIC code page (e.g., cp1144).
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
around lines 68 to 71, the EBCDIC string encoder calls a.toString which throws
NPE for nulls; change the encoder to pattern-match null (case null => null) and
only call ebcdicCodePage.convert for non-null values so null semantics are
preserved like other encoders, and add a unit test that encodes a null
alphanumeric (PIC X) field using an EBCDIC code page (e.g., cp1144) and asserts
the resulting bytes are zero-filled for the field length.
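A sketch of such a test (ScalaTest style; the selector call below uses a hypothetical signature and should be adapted to the actual EncoderSelector API in this PR):

import org.scalatest.funsuite.AnyFunSuite

// Sketch only: getEncoder's name and parameters are assumptions.
class NullAlphanumericEncodingSuite extends AnyFunSuite {
  test("null PIC X values encode as zero-filled bytes under cp1144") {
    val fieldLength = 5
    val encoder = EncoderSelector
      .getEncoder(EBCDIC, new CodePage1144, fieldLength) // hypothetical signature
      .getOrElse(fail("expected cp1144 to support encoding"))

    assert(encoder(null).sameElements(new Array[Byte](fieldLength))) // zero-filled
  }
}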
Hi @Il-Pela , this is a fantastic PR! Thank you so much for your contribution! Going to review and merge very soon.
Hi @yruslan, may I ask in which version, and when, you think the implementation will be released? Also, I'd like to ask you a couple of additional questions related to this PR.
Thank you
Hi @Il-Pela , I was thinking about adding support for EBCDIC code page tables that are generated on the fly from decoder pages, using lazy computation of these tables in a way similar to the Python script that you used. But I want to test what would perform better: a pre-built table based on match cases, or arrays generated on the fly. I'll let you know the outcome of the observation in a week or so.
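One possible shape for that idea, sketched under the assumption that decoder tables are exposed as a 256-entry Array[Char] (as ebcdicToAsciiMapping is described in the review above); names here are illustrative:

// Derive an ASCII->EBCDIC encode table lazily from an existing
// EBCDIC->ASCII decode table.
def lazyEncodeTable(ebcdicToAsciiMapping: Array[Char]): Int => Byte = {
  lazy val table: Map[Int, Byte] =
    ebcdicToAsciiMapping.zipWithIndex.map { case (ch, ebcdicByte) =>
      ch.toInt -> ebcdicByte.toByte
    }.toMap
  cp => table.getOrElse(cp, 0x40.toByte) // unmapped -> EBCDIC space
}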
Context
Hi, I'm a Data Engineer working for a consulting company. One of our customers is from the banking world, and in some use cases we used your library to parse and offload positional binary files via Spark.
Recently we also worked on projects that needed to produce positional binary files back after Spark transformations, and with the recent release of the feature that adapts the COBOL source so it can also write files, we thought we were covered. However, we soon found that our requirements went beyond the common code page encoding, so we decided to contribute to the library by providing an implementation for the code page of our interest (cp1144) and, if the process described in this PR is validated, a replicable process for implementing encoders for other code pages as well.
I want to mention that I'm not much of an expert in Scala programming, so please forgive me if I miss some conventions. Also, this is my first contribution to open-source software, so please keep that in mind too.
Co-author of this PR: Andrea Fonti
Let's get to the sauce.
What have we done?
In the last few weeks we looked at the contents of the latest commits for the EBCDIC writer feature.
We saw you were focusing only on implementing the common code page.
We started from there, and we followed the same approach to implement an encoder for cp1144.
We first refactored the EncoderSelector class to delegate the convert method.
We then added one parameter to the SingleByteCodePage class signature:
asciiToEbcdicMapping: Option[Int => Byte] = None
We focused mainly on SingleByteCodePage, but we also provided a default implementation of these methods for TwoByteCodePage to preserve compatibility (supportsEncoding = false).
We then used this parameter to implement the convert function, which converts an ASCII string into an EBCDIC byte array of fixed length.
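For illustration, a minimal usage sketch of the new API (the call site is hypothetical; padding behavior follows the implementation quoted in the review above):

// Hypothetical usage: supportsEncoding is checked first, then convert
// produces a fixed-length buffer, space-padded with EBCDIC 0x40.
val cp1144 = new CodePage1144
if (cp1144.supportsEncoding) {
  val bytes: Array[Byte] = cp1144.convert("CIAO", 6) // 4 chars + two 0x40 pad bytes
}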
Step-by-step:
To generate the mapping for code page 1144, we used this Python script.
What this script does:
Creates a conversion array - Initializes a 256-element list to store UTF-32 codepoints corresponding to each EBCDIC byte value (0-255)
Builds the lookup table - For each possible EBCDIC byte (0-255):
Handles errors - If a character cannot be encoded to UTF-32, it defaults to 0x40 (the space character in EBCDIC)
Generates Scala code - Outputs a series of case statements mapping UTF-32 codepoints to their corresponding EBCDIC byte values, formatted as hexadecimal
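For reference, a rough Scala equivalent of that generator, assuming the JVM's built-in "IBM01144" charset corresponds to cp1144 (an assumption worth verifying against the script's output):

import java.nio.charset.Charset

// Decodes each EBCDIC byte 0-255 and emits a Scala `case` line mapping
// the Unicode code point back to that byte.
object GenerateCp1144Mapping extends App {
  val cs = Charset.forName("IBM01144")
  (0 to 255).foreach { b =>
    val decoded = new String(Array(b.toByte), cs)
    // Fall back to EBCDIC space (0x40) if nothing decodable comes back.
    val codePoint = if (decoded.nonEmpty) decoded.codePointAt(0) else 0x40
    println(f"      case $codePoint => 0x$b%02x.toByte")
  }
}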
Final Notes
The asciiToEbcdicMapping function has been refactored.
Tests have been added to FixedLengthEbcdicWriterSuite for the new functionality.
What to do next?
We are open to further communication and collaboration, and we look forward to reading your feedback.
Summary by CodeRabbit
New Features
- Support for encoding strings to EBCDIC code page 1144 (cp1144) when writing files.
Tests
- New tests covering cp1144 encoding in FixedLengthEbcdicWriterSuite.