
Conversation

@Il-Pela commented Oct 24, 2025

Context

Hi, I'm a Data Engineer working for a consulting company. One of our customers is from the banking world, and in some
use cases we have used your library to parse positional binary files and offload them via Spark.
Recently we have also worked on projects that needed to produce positional binary files back after Spark
transformations, and with the recent release of the feature that adapts the Cobol source so that files can also be
written, we thought we were covered. However, we soon found that our requirements for file production went beyond the
simple common code page encoding, so we decided to contribute to the library by providing an implementation for the
code page of our interest (cp1144) and, if the process described in this PR is validated, a replicable process for
implementing encoders for other code pages as well.

I want to mention that I'm not much of an expert in Scala, so please forgive me if I miss some conventions. Also, this
is my first contribution to open source software, so please keep that in mind too.

Co-author of this PR: Andrea Fonti

Let's get to the sauce.

What we have done

Over the last few weeks we followed the recent commits for the EBCDIC writer feature.
We saw that you were focusing on implementing only the common code page.
We started from there and followed the same approach to implement an encoder for cp1144.
We first refactored the EncoderSelector class to delegate to the convert method.
We then added one parameter to the SingleByteCodePage class signature (see the illustrative sketch after the list below):
asciiToEbcdicMapping: Option[Int => Byte] = None

  • This is an optional function from Int to Byte.
    • The Int represents a Unicode code point (e.g. 65 for 'A').
    • The Byte is the corresponding EBCDIC byte (e.g. 0xC1).
  • Option[Int => Byte] means:
    • Some(f) → a function is provided to encode ASCII → EBCDIC.
    • None → no encoding function provided (perhaps only decoding supported).
  • The = None part gives it a default value, so subclasses can omit it if they only need one direction (decoding).
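For illustration, here is a minimal sketch of how such a mapping can be supplied (a toy partial function of our own, using well-known EBCDIC byte values; the real cp1144 table generated below has 256 entries):

// Toy mapping for illustration only; the real cp1144 table covers 256 entries.
val toyMapping: Int => Byte = {
  case 0x20 => 0x40.toByte // space
  case 0x41 => 0xC1.toByte // 'A'
  case _    => 0x40.toByte // any other code point falls back to EBCDIC space
}

val encodeCapable: Option[Int => Byte] = Some(toyMapping) // encoding supported
val decodeOnly: Option[Int => Byte]    = None             // decoding only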

We focused mainly on SingleByteCodePage, but we also provided a default implementation of these methods for
TwoByteCodePage to preserve compatibility (supportsEncoding = false).

We then used this parameter to implement the convert function, which converts an ASCII string into an EBCDIC byte
array of fixed length. A condensed sketch follows the step-by-step list below.

Step-by-step:

  1. Input validation:
    • The field length must be non-negative.
    • asciiToEbcdicMapping must be defined, otherwise an error is thrown.
  2. Buffer initialization:
    • Pre-fills the byte array with 0x40 (the EBCDIC space character).
  3. Conversion loop:
    • Iterates over the string's characters.
    • Converts each Unicode code point to an EBCDIC byte using the provided mapping for the specific code page.
    • Stops at either the end of the string or the desired output length.
  4. Return value:
    • Returns the EBCDIC byte array.
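A condensed, self-contained sketch of these steps (illustrative only; the PR's actual implementation is quoted verbatim in the review comments further down, where a surrogate-pair caveat of this per-character iteration is also discussed):

import java.util

// Sketch of the steps above; `mapping` stands in for asciiToEbcdicMapping.
def convert(string: String, length: Int, mapping: Option[Int => Byte]): Array[Byte] = {
  require(length >= 0, s"Field length cannot be negative, got $length")  // step 1
  require(mapping.isDefined, "No ASCII to EBCDIC mapping provided")      // step 1

  val buf = new Array[Byte](length)
  util.Arrays.fill(buf, 0x40.toByte)         // step 2: pre-fill with EBCDIC spaces

  var i = 0
  while (i < string.length && i < length) {  // step 3: stop at string or field end
    buf(i) = mapping.get(string.codePointAt(i))
    i += 1
  }
  buf                                        // step 4: the EBCDIC byte array
}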

To generate the mapping for code page 1144 we used this Python script:

import codecs
import ebcdic  # importing this package registers the cp1144 codec

# For each EBCDIC byte value, store the Unicode code point it decodes to.
conversion = list(range(256))

for i in range(256):
    data = bytes([i])
    decoded = codecs.decode(data, "cp1144")

    try:
        encoded = codecs.encode(decoded, "UTF-32-LE")
        conversion[i] = int.from_bytes(encoded, byteorder="little")
    except UnicodeEncodeError:
        print(f"Cannot convert char at index {i}")
        conversion[i] = 0x40

# Emit Scala match cases: Unicode code point -> EBCDIC byte.
for ebcdic_byte, utf in enumerate(conversion):
    print(f"case {utf} => 0x{ebcdic_byte:02x}.toByte")

What this script does (a sample of the generated output follows the list):

  1. Creates a conversion array - Initializes a 256-element list to store UTF-32 codepoints corresponding to each EBCDIC byte value (0-255)

  2. Builds the lookup table - For each possible EBCDIC byte (0-255):

    • Decodes the byte from CP1144 (EBCDIC variant) to Unicode
    • Encodes the Unicode character to UTF-32-LE (little-endian)
    • Converts the UTF-32 bytes to an integer codepoint
    • Stores this codepoint at the index matching the original EBCDIC byte
  3. Handles errors - If a character cannot be encoded to UTF-32, it defaults to 0x40 (the space character in EBCDIC)

  4. Generates Scala code - Outputs a series of case statements mapping UTF-32 codepoints to their corresponding EBCDIC byte values, formatted as hexadecimal
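For reference, the well-known EBCDIC assignments produce output lines such as these (the same pairs are confirmed in the review below):

case 32 => 0x40.toByte  // space
case 48 => 0xf0.toByte  // '0'
case 65 => 0xc1.toByte  // 'A'
case 97 => 0x81.toByte  // 'a'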

Final Notes

  • Changes to the previous common code page implementation were limited to adapting it to the use of code points, so the asciiToEbcdicMapping function has been refactored.
  • New tests were added in FixedLengthEbcdicWriterSuite for the new functionality.
  • All tests pass.

What to do next?

We are open to further communication and collaboration and look forward to your feedback.

Summary by CodeRabbit

  • New Features

    • Enhanced EBCDIC string encoding and decoding with support for multiple code pages.
    • Improved character conversion capabilities for bidirectional ASCII-EBCDIC transformations.
  • Tests

    • Added test validation for EBCDIC data writing using code page 1144.

@Il-Pela requested a review from yruslan as a code owner October 24, 2025 19:07
coderabbitai bot (Contributor) commented Oct 24, 2025

Walkthrough

The PR refactors EBCDIC encoding by migrating encoding logic from EncoderSelector to CodePage classes. The removed encodeEbcdicString method is replaced with new convert() and supportsEncoding() APIs in the CodePage hierarchy. SingleByteCodePage implementations now manage ASCII-to-EBCDIC conversion, enabling bidirectional encoding support.
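In sketch form, the extended contract looks roughly like this (signatures taken from the file summary below; whether the base is a trait or an abstract class, and its remaining members, are not shown here):

abstract class CodePage {
  /** Encodes an ASCII string into an EBCDIC byte array of the given length. */
  def convert(string: String, length: Int): Array[Byte]

  /** Indicates whether this code page supports ASCII-to-EBCDIC encoding. */
  def supportsEncoding: Boolean
}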

Changes

  • Encoding responsibility shift (EncoderSelector.scala): Removed the public encodeEbcdicString() method; EBCDIC encoder selection now checks ebcdicCodePage.supportsEncoding() and delegates to ebcdicCodePage.convert().
  • CodePage base class extensions (CodePage.scala): Added a public convert(string: String, length: Int): Array[Byte] method for ASCII-to-EBCDIC encoding and a supportsEncoding: Boolean property.
  • SingleByteCodePage implementation (SingleByteCodePage.scala): Updated the constructor to accept an optional asciiToEbcdicMapping: Option[Int => Byte]; implemented the convert() encoder and a supportsEncoding override.
  • TwoByteCodePage encoding stub (TwoByteCodePage.scala): Added a convert() method raising IllegalStateException and supportsEncoding returning false (encoding unsupported for multi-byte pages).
  • Bidirectional mappings (CodePage1144.scala, CodePageCommon.scala): Added asciiToEbcdicMapping: Int => Byte functions to the companion objects; updated constructors to pass ASCII-to-EBCDIC mappings to the parent class.
  • Test coverage (FixedLengthEbcdicWriterSuite.scala): Added a test case for writing EBCDIC data using code page 1144 with expected byte sequence validation.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant EncoderSelector
    participant CodePage
    
    rect rgb(220, 240, 255)
    Note over Client,CodePage: Before: Encoding handled in EncoderSelector
    Client->>EncoderSelector: selectEncoder(EBCDIC)
    EncoderSelector->>EncoderSelector: encodeEbcdicString()
    EncoderSelector-->>Client: Array[Byte]
    end
    
    rect rgb(240, 255, 240)
    Note over Client,CodePage: After: Encoding delegated to CodePage
    Client->>EncoderSelector: selectEncoder(EBCDIC)
    EncoderSelector->>CodePage: supportsEncoding?
    CodePage-->>EncoderSelector: true/false
    alt Supports Encoding
        EncoderSelector->>CodePage: convert(string, length)
        CodePage-->>EncoderSelector: Array[Byte]
    else No Encoding Support
        EncoderSelector-->>Client: no encoder
    end
    EncoderSelector-->>Client: Optional[encoder]
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The changes involve moderate complexity: multiple files updated with consistent patterns (adding encoding methods across the CodePage hierarchy), significant API changes (method removal and introduction of new encoding contracts), and mixed edit types (constructor signature updates, new method implementations, mapping function additions). However, changes are well-structured and follow a clear migration pattern, reducing per-file cognitive load.

Poem

🐰 Encoders now rest where codepages stand,
Bidirectional mappings, hand in hand,
ASCII to EBCDIC, a journey complete,
Encoding now owned by pages so neat!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is insufficient (required threshold: 80.00%). Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)

  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title "Extended EBCDIC Encoder - added support for cp1144" accurately captures the primary objective of the changeset, as confirmed by the PR objectives which state the goal is to "add support for EBCDIC code page cp1144." The title is specific and clear, identifying cp1144 as the main deliverable. While the implementation involved significant refactoring of the encoder infrastructure (removing the encodeEbcdicString method, adding convert and supportsEncoding methods across multiple CodePage classes, and extending SingleByteCodePage), these changes are architectural enablers for cp1144 support rather than independent goals. The title appropriately highlights the user-facing outcome without claiming to cover all implementation details.

coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (1)

47-53: Scaladoc: correct grammar and return description.

Update to “an ASCII” and describe returned bytes, not a string.

-  /**
-   * An encoder from a ASCII basic string to an EBCDIC byte array
+  /**
+   * An encoder from an ASCII string to an EBCDIC byte array of fixed length.
    *
    * @param string          An input string
    * @param length          The length of the output (in bytes)
-   * @return A string representation of the binary data
+   * @return EBCDIC bytes, space‑padded (0x40) to the requested length
    */
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1)

82-124: Test addition for cp1144 looks good.

Covers key glyphs and fixed-length behavior; mirrors existing suite patterns.

Suggestion: factor the “read first part file into bytes” boilerplate into a small helper to reduce duplication across tests.
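A possible shape for that helper (hypothetical name and error handling; it only assumes the suite writes Spark output to a local directory):

import java.nio.file.{Files, Path, Paths}

// Hypothetical helper: return the bytes of the first "part-" file Spark wrote.
def readFirstPartFileBytes(outputDir: String): Array[Byte] = {
  val stream = Files.list(Paths.get(outputDir))
  try {
    val it = stream.iterator()
    var found: Option[Path] = None
    while (it.hasNext && found.isEmpty) {
      val p = it.next()
      if (p.getFileName.toString.startsWith("part-")) found = Some(p)
    }
    val partFile = found.getOrElse(
      throw new IllegalStateException(s"No part file found in $outputDir"))
    Files.readAllBytes(partFile)
  } finally stream.close()
}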

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (3)

21-22: Remove unused import.

import java.util isn’t used here.

-import java.util

37-46: Scaladoc: fix return description and article.

Describe returned bytes and use “an ASCII”.

-  /**
-   * An encoder from a ASCII basic string to an EBCDIC byte array
+  /**
+   * An encoder from an ASCII string to an EBCDIC byte array
@@
-   * @return A string representation of the binary data
+   * @return EBCDIC bytes of the given length (space‑padded 0x40)
   */

47-53: Scaladoc: tighten supportsEncoding wording.

Minor: clarify what “able to encode” means.

-  /**
-   * Returns whether the CodePage is able to encode strings to ebcdic
-   *
-   * @return
-   */
+  /**
+   * Indicates whether this CodePage supports ASCII→EBCDIC encoding.
+   */
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6d8a997 and 7133a5b.

📒 Files selected for processing (7)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (6)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/writer/FixedLengthEbcdicWriterSuite.scala (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/writer/RawBinaryOutputFormat.scala (2)
  • write (74-78)
  • close (79-81)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/builder/SparkCobolBuilder.scala (1)
  • copybookContents (23-25)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/Encoding.scala (1)
  • EBCDIC (23-23)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
  • supportsEncoding (73-73)
  • convert (37-45)
  • convert (54-71)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
  • supportsEncoding (98-98)
  • convert (35-92)
  • convert (94-96)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
  • asciiToEbcdicMapping (80-82)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
  • convert (35-92)
  • convert (94-96)
  • supportsEncoding (98-98)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)
  • asciiToEbcdicMapping (80-82)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
  • convert (37-45)
  • convert (54-71)
  • supportsEncoding (73-73)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (3)
  • convert (35-92)
  • convert (94-96)
  • supportsEncoding (98-98)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (3)
  • convert (37-45)
  • convert (54-71)
  • supportsEncoding (73-73)
🔇 Additional comments (6)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala (1)

73-73: Encoding support flag looks good.

supportsEncoding correctly mirrors presence of the mapping.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePageCommon.scala (1)

24-26: LGTM overall; clarify semantics of 0x00 entries.

Wiring Some(asciiToEbcdicMapping) is correct. Please confirm that 0x00 values in data are intended to produce NUL bytes (overriding 0x40 prefill) rather than “unmapped ⇒ space”. If the latter, we may need to treat 0x00 as “unmapped” except for code point 0.

Also applies to: 59-76, 80-82
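A sketch of that suggested guard (our own illustration, not code from the PR):

// Wraps a raw mapping: a 0x00 result for any non-NUL code point is treated as
// "unmapped" and replaced with EBCDIC space 0x40.
def withNulGuard(mapping: Int => Byte)(codePoint: Int): Byte = {
  val b = mapping(codePoint)
  if (b == 0x00.toByte && codePoint != 0) 0x40.toByte else b
}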

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/CodePage1144.scala (4)

25-25: Constructor change enables bidirectional encoding support.

The addition of Some(CodePage1144.asciiToEbcdicMapping) as the second parameter correctly enables ASCII-to-EBCDIC encoding for this code page, aligning with the PR's goal of supporting cp1144 encoding.


99-99: Line ending handling is well-documented.

The comment explains the conscious decision to map Unicode LF (line feed, code point 10) to EBCDIC 0x25. This handles the complexity where EBCDIC has separate NL (New Line) and LF (Line Feed) control codes, while Unicode typically uses a single LF character. The choice to always map to LF is reasonable and maintains consistency.


318-318: Reasonable default for unmapped characters.

Defaulting to 0x40 (EBCDIC space) for unmapped Unicode code points is a sensible fallback strategy. This provides graceful handling of characters outside the cp1144 character set rather than throwing exceptions, which is consistent with typical encoding practices where not all Unicode characters have representations in legacy encodings.


62-319: No issues found. Mapping is comprehensive and bidirectionally consistent.

Verification confirms:

  • ✓ No duplicate case statements in asciiToEbcdicMapping
  • ✓ All common ASCII characters (0-9, A-Z, a-z, space, punctuation) are properly mapped
  • ✓ Both encode and decode mappings exist and are bidirectionally consistent:
    • ASCII 32 (space) ↔ EBCDIC 0x40 ✓
    • ASCII 65 (A) ↔ EBCDIC 0xc1 (193) ✓
    • ASCII 97 (a) ↔ EBCDIC 0x81 (129) ✓
    • ASCII 48 (0) ↔ EBCDIC 0xf0 (240) ✓
  • ✓ Different implementation strategies are intentional and appropriate:
    • asciiToEbcdicMapping: Int → Byte function (pattern matching with 255 cases + default fallback to space)
    • ebcdicToAsciiMapping: Array[Char] (indexed direct lookup, 256 complete entries)

The mapping table is complete, correct, and well-designed with reasonable defaults.

Comment on lines +54 to +71
  def convert(string: String, length: Int): Array[Byte] = {
    require(length >= 0, s"Field length cannot be negative, got $length")
    require(asciiToEbcdicMapping.isDefined, s"Cannot encode strings for Code Page without ASCII to EBCDIC " +
      s"mapping ${this.getClass.getSimpleName}")

    var i = 0
    val buf = new Array[Byte](length)

    // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
    util.Arrays.fill(buf, 0x40.toByte)

    while (i < string.length && i < length) {
      val unicodeCodePoint: Int = string.codePointAt(i)
      buf(i) = asciiToEbcdicMapping.get(unicodeCodePoint)
      i = i + 1
    }
    buf
  }

⚠️ Potential issue | 🔴 Critical

Fix code point iteration (surrogate pairs) and decouple input/output indices.

Using codePointAt(i) but incrementing i += 1 corrupts iteration for non-BMP chars and misaligns output. Keep separate indices and advance by Character.charCount(cp).

Apply:

-  def convert(string: String, length: Int): Array[Byte] = {
+  def convert(string: String, length: Int): Array[Byte] = {
     require(length >= 0, s"Field length cannot be negative, got $length")
     require(asciiToEbcdicMapping.isDefined, s"Cannot encode strings for Code Page without ASCII to EBCDIC " +
       s"mapping ${this.getClass.getSimpleName}")

-    var i = 0
-    val buf = new Array[Byte](length)
-
-    // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
-    util.Arrays.fill(buf, 0x40.toByte)
-
-    while (i < string.length && i < length) {
-      val unicodeCodePoint: Int = string.codePointAt(i)
-      buf(i) = asciiToEbcdicMapping.get(unicodeCodePoint)
-      i = i + 1
-    }
-    buf
+    val mapper = asciiToEbcdicMapping.get
+    val buf = new Array[Byte](length)
+    // PIC X fields are space-filled on mainframe. Use EBCDIC space 0x40.
+    util.Arrays.fill(buf, 0x40.toByte)
+    var in  = 0
+    var out = 0
+    while (in < string.length && out < length) {
+      val cp = string.codePointAt(in)
+      buf(out) = mapper(cp)
+      in  += java.lang.Character.charCount(cp)
+      out += 1
+    }
+    buf
   }
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/SingleByteCodePage.scala
around lines 54 to 71, the loop uses string.codePointAt(i) but increments i by 1
which breaks surrogate-pair handling and couples input and output indices;
change to two indices (inputIndex and outIndex), read val cp =
string.codePointAt(inputIndex), write the mapped byte to buf(outIndex) (use
asciiToEbcdicMapping.getOrElse(cp, 0x40.toByte) or similar fallback), then
advance inputIndex by Character.charCount(cp) and increment outIndex by 1, and
loop while inputIndex < string.length && outIndex < length so non-BMP chars are
iterated correctly and output stays aligned.

Comment on lines +94 to 99
  override def convert(string: String, length: Int): Array[Byte] = throw new IllegalStateException(s"Cannot encode " +
    s"strings for Code Page without ASCII to EBCDIC mapping ${this.getClass.getSimpleName}, did you forget to call" +
    s"supportsEncoding?")

  override def supportsEncoding: Boolean = false
}

⚠️ Potential issue | 🟡 Minor

Fix error message spacing; consider UnsupportedOperationException.

There’s a missing space before “supportsEncoding”. Using UnsupportedOperationException better matches intent.

-  override def convert(string: String, length: Int): Array[Byte] = throw new IllegalStateException(s"Cannot encode " +
-    s"strings for Code Page without ASCII to EBCDIC mapping ${this.getClass.getSimpleName}, did you forget to call" +
-    s"supportsEncoding?")
+  override def convert(string: String, length: Int): Array[Byte] =
+    throw new UnsupportedOperationException(
+      s"Cannot encode strings for Code Page ${this.getClass.getSimpleName}: ASCII→EBCDIC mapping is not provided. " +
+      s"Did you forget to check supportsEncoding()?"
+    )
🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/codepage/TwoByteCodePage.scala
around lines 94 to 99, the thrown IllegalStateException message is missing a
space before "supportsEncoding" and the exception type should be
UnsupportedOperationException to better express the operation isn't supported;
update the exception to throw new UnsupportedOperationException and fix the
message spacing so it reads "...did you forget to call supportsEncoding?"
(ensure spaces around concatenated strings or use a single interpolated string).

Comment on lines +68 to 71

      case EBCDIC if ebcdicCodePage.supportsEncoding =>
        val encoder = (a: Any) => {
-         encodeEbcdicString(a.toString, CodePageCommon.asciiToEbcdicMapping, fieldLength)
+         ebcdicCodePage.convert(a.toString, fieldLength)
        }

⚠️ Potential issue | 🔴 Critical



Handle nulls in EBCDIC string encoder to avoid NPE and maintain null semantics.

Calling toString on null throws NPE at runtime. Other encoders in this class (display, binary, BCD) explicitly handle null using case null => null, then convert downstream. For EBCDIC string fields (PIC X), tests confirm nulls encode to zero bytes.

      case EBCDIC if ebcdicCodePage.supportsEncoding =>
        val encoder = (a: Any) => {
-         ebcdicCodePage.convert(a.toString, fieldLength)
+         if (a == null) new Array[Byte](fieldLength) // nulls -> zero-filled
+         else ebcdicCodePage.convert(a.toString, fieldLength)
        }
        Option(encoder)

Add a unit test covering null alphanumeric field with EBCDIC code page (e.g., cp1144).

🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/encoding/EncoderSelector.scala
around lines 68 to 71, the EBCDIC string encoder calls a.toString which throws
NPE for nulls; change the encoder to pattern-match null (case null => null) and
only call ebcdicCodePage.convert for non-null values so null semantics are
preserved like other encoders, and add a unit test that encodes a null
alphanumeric (PIC X) field using an EBCDIC code page (e.g., cp1144) and asserts
the resulting bytes are zero-filled for the field length.

@yruslan (Collaborator) commented Oct 24, 2025

Hi @Il-Pela, this is a fantastic PR! Thank you so much for your contribution! Going to review and merge very soon.

@yruslan merged commit 8473064 into AbsaOSS:master Oct 25, 2025
5 of 6 checks passed
@Il-Pela (Author) commented Oct 26, 2025

Hi @yruslan, may I ask in which version, and when, you think the implementation will be released?

Also, I'd like to ask you a couple of additional questions related to this PR.

  • What do you think about bringing the encoder implementation to the other code pages currently supported (the single-byte ones)? Do you think the approach is clean enough to be spread to other code pages, or does it require some refactoring?
  • Tests: for the moment I extended an already existing test suite. Do you think that was the right choice, or do you see other ways? I'm thinking that eventually extending it to cover all supported encoders will become a mess.

Thank you

@yruslan (Collaborator) commented Oct 27, 2025

Hi @Il-Pela,
The new encoding is going to be available in 2.9.2, which should be released this week.

I was thinking about adding support for EBCDIC code page tables that are generated on the fly from decoder tables, using lazy computation of these tables, similar to the Python script that you used. But I want to test what would perform better: a pre-built table based on match cases, or arrays generated on the fly. I'll let you know the outcome of the observation in a week or so.
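For instance, an on-the-fly table of that kind could be sketched as follows (purely illustrative; the toy decoder table below is not a real code page):

// Invert a 256-entry EBCDIC -> ASCII decoder table into an ASCII -> EBCDIC
// map, computed lazily once per code page.
object OnTheFlyEncoderSketch {
  val ebcdicToAscii: Array[Char] = {
    val t = Array.fill(256)('\uFFFD')  // placeholder for unmapped positions
    t(0x40) = ' '; t(0xC1) = 'A'; t(0x81) = 'a'; t(0xF0) = '0'
    t
  }

  lazy val asciiToEbcdic: Map[Int, Byte] =
    ebcdicToAscii.zipWithIndex
      .collect { case (ch, idx) if ch != '\uFFFD' => ch.toInt -> idx.toByte }
      .toMap
      .withDefaultValue(0x40.toByte)   // unmapped code points -> EBCDIC space
}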

@yruslan (Collaborator) commented Oct 30, 2025

2.9.2 is released
