Conversation

@yruslan (Collaborator) commented Nov 13, 2025

Closes #795

Summary by CodeRabbit

  • New Features

    • Added configurable EBCDIC processing strategies (InPlace and ToVariableLength) for flexible record handling.
    • Introduced experimental EBCDIC Spark raw record RDD generator to directly create RDDs from COBOL copybooks.
    • New public methods for copybook parsing and RDD conversion.
  • Documentation

    • Updated README with processing strategy documentation and code examples demonstrating configuration options.
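
For orientation, a hedged sketch of selecting a strategy through the new builder API (method names match this PR's tests; copybookContents is a placeholder for the copybook text):

import za.co.absa.cobrix.cobol.processor.{CobolProcessingStrategy, CobolProcessor}

// Build a processor that converts output to variable-length records with RDWs;
// the default strategy is CobolProcessingStrategy.InPlace.
val processor = CobolProcessor.builder
  .withCopybookContents(copybookContents)
  .withProcessingStrategy(CobolProcessingStrategy.ToVariableLength)
  .build()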

@coderabbitai bot (Contributor) commented Nov 13, 2025

Walkthrough

This PR introduces a pluggable EBCDIC processing strategy system for COBOL record transformation. It defines a new CobolProcessingStrategy abstraction with InPlace and ToVariableLength implementations, refactors CobolProcessor to select implementations based on strategy, replaces the monolithic CobolProcessorImpl with strategy-specific subclasses, and extends both cobol-parser and Spark integration layers to support optional RawRecordProcessor and strategy-driven routing with optional index-aware processing.
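
Based on the walkthrough, the abstraction itself is small; a minimal sketch of its likely shape (the sealed modifier and companion-object layout are assumptions, the names come from the walkthrough):

sealed trait CobolProcessingStrategy

object CobolProcessingStrategy {
  // Transform records in place, preserving the original record format.
  case object InPlace extends CobolProcessingStrategy
  // Convert output to variable-length records, each prefixed with an RDW.
  case object ToVariableLength extends CobolProcessingStrategy
}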

Changes

Cohort / File(s) Summary
Documentation
README.md
Updated with new EBCDIC processing strategy examples and added experimental EBCDIC Spark RDD generator feature documentation.
Strategy Abstraction
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessingStrategy.scala
New file defining CobolProcessingStrategy trait with two case object implementations: InPlace and ToVariableLength.
Processor Orchestration
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala
Updated builder and loader to support strategy selection; added withProcessingStrategy() method; replaced fixed CobolProcessorImpl instantiation with strategy-based branching.
Base Processor
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala
New abstract base class extending CobolProcessor with shared getRecordExtractor helper method for record extraction logic.
Strategy Implementations
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala, CobolProcessorToRdw.scala
Renamed CobolProcessorImpl to CobolProcessorInPlace, now extending CobolProcessorBase; introduced CobolProcessorToRdw for variable-length RDW record processing.
Stream Processing
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala
Renamed processStream to processStreamInPlace; added new processStreamToRdw method for RDW-wrapped output generation.
Parameter Validation
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala
Added optional validateRedundantOptions parameter (default true) to parse and validateSparkCobolOptions methods for conditional validation of redundant configuration keys.
Spark Processor Integration
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala
Extended builder with withProcessingStrategy() support; made RawRecordProcessor optional at load time; added getParsedCopybook() and toRDD() public methods; refactored internal logic with index-aware processing paths via getCobolParameters and getRecordRdd helpers.
Source Integration
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala
Extracted file listing logic into the public companion object method getListFilesWithOrder(); removed the private instance-method version.
Data Source Helpers
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala
Exposed previously private reader construction helpers (buildEitherReader, createTextReader, createFixedLengthReader, createVariableLengthReader) as public companion object methods; added logging support.
Index Visibility
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala
Changed visibility scope from private[source] to private[cobol].
Processor Base Tests
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBaseSuite.scala
Renamed from CobolProcessorImplSuite; updated to use the static CobolProcessorBase.getRecordExtractor() instead of the builder pattern.
In-Place Strategy Tests
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlaceSuite.scala
New test validating in-place processing with InPlace strategy and per-record byte transformation.
RDW Strategy Tests
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdwSuite.scala
New test validating variable-length RDW processing with ToVariableLength strategy and record transformation.
Spark Integration Tests
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala
Added import of CobolProcessingStrategy; introduced new test cases for VRL+RDW conversion and RDD generation with and without indexes.

Sequence Diagram(s)

sequenceDiagram
    participant App as Application
    participant Builder as CobolProcessor.Builder
    participant Strategy as CobolProcessingStrategy
    participant Loader as CobolProcessorLoader
    participant Processor as CobolProcessor<br/>(InPlace|ToRdw)
    participant StreamProc as StreamProcessor

    App->>Builder: withProcessingStrategy(strategy)
    Builder->>Builder: store strategy
    App->>Builder: load(files)
    Builder->>Loader: create with strategy
    App->>Loader: save(output)
    
    Loader->>Strategy: match on strategy
    alt InPlace
        Loader->>Processor: new CobolProcessorInPlace()
    else ToVariableLength
        Loader->>Processor: new CobolProcessorToRdw()
    end
    
    Processor->>StreamProc: select processing method
    alt InPlace
        StreamProc->>StreamProc: processStreamInPlace()
    else ToVariableLength
        StreamProc->>StreamProc: processStreamToRdw()
    end
    
    StreamProc-->>Processor: record count
    Processor-->>Loader: completed
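
The alt branches above amount to a pattern match in the loader; a hedged sketch (constructor arguments elided, as the loader internals are not shown here):

val processor: CobolProcessor = strategy match {
  case CobolProcessingStrategy.InPlace          => new CobolProcessorInPlace(/* ... */)
  case CobolProcessingStrategy.ToVariableLength => new CobolProcessorToRdw(/* ... */)
}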
sequenceDiagram
    participant App as Application
    participant SparkBuilder as SparkCobolProcessor.Builder
    participant SparkLoader as SparkCobolProcessorLoader
    participant CobolProcBase as CobolProcessorBase
    participant IndexBuilder as IndexBuilder
    participant RDD as RDD[Array[Byte]]

    App->>SparkBuilder: withProcessingStrategy(strategy)
    App->>SparkBuilder: load(files)
    SparkBuilder->>SparkLoader: create with strategy & optional processor
    
    App->>SparkLoader: toRDD()
    
    alt Enable Indexes
        SparkLoader->>IndexBuilder: build index from files
        IndexBuilder-->>SparkLoader: index metadata
        SparkLoader->>CobolProcBase: getRecordExtractor(indexed ranges)
    else Disable Indexes
        SparkLoader->>SparkLoader: getRecordRdd(whole files)
    end
    
    SparkLoader->>RDD: create from record extractors
    RDD-->>App: RDD[Array[Byte]]
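
A hedged sketch of the RDD path in the diagram (getParsedCopybook and toRDD come from this PR; the builder chaining, copybookContents, and inputPath are illustrative):

import org.apache.spark.rdd.RDD
import za.co.absa.cobrix.spark.cobol.SparkCobolProcessor

// A RawRecordProcessor is optional at load time, so the schema and raw
// records can be obtained without one.
val loader = SparkCobolProcessor.builder
  .withCopybookContents(copybookContents)
  .option("enable_indexes", "true")              // index-aware path; "false" reads whole files
  .load(inputPath)                               // exact load signature may differ

val copybook = loader.getParsedCopybook()        // parsed copybook structure
val records: RDD[Array[Byte]] = loader.toRDD()   // raw EBCDIC records as byte arrays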

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Areas requiring extra attention:

  • Strategy pattern implementation consistency: Verify all instantiation paths correctly branch on CobolProcessingStrategy (CobolProcessor builder/loader, SparkCobolProcessor builder/loader, StreamProcessor method selection).
  • Stream processing methods: Ensure processStreamInPlace() and processStreamToRdw() both correctly handle record extraction, transformation, and output serialization without data loss.
  • Optional RawRecordProcessor lifecycle: Confirm that making RawRecordProcessor optional in SparkCobolProcessor does not introduce null-pointer risks or unexpected behavior downstream.
  • Index-aware processing paths: Validate that the new index-based and non-indexed paths in SparkCobolProcessor (via getRecordRdd helpers) handle all copybook types and edge cases correctly.
  • CobolProcessorBase.getRecordExtractor(): Ensure the shared helper correctly constructs VarLenNestedReader and handles missing extractors with appropriate error messaging.
  • Visibility scope changes: Confirm that elevating IndexBuilder visibility from private[source] to private[cobol] and exposing reader factories in DefaultSource do not breach intended encapsulation or introduce unintended coupling.

Poem

🐰 Two paths now bloom where once was one,
InPlace and RDW, strategies spun.
Records transform with flexible grace,
Strategy pattern fills every space.
Variable lengths, no more constraints—
The refactor ascends without faints! 🎉

Pre-merge checks

✅ Passed checks (5 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title clearly and specifically summarizes the main change: adding in-place EBCDIC processing with VRL format conversion capabilities.
  • Linked Issues Check (✅ Passed): The PR implements the core requirement from #795: introducing an in-place EBCDIC processor with VRL format conversion via the CobolProcessingStrategy abstraction with InPlace and ToVariableLength strategies.
  • Out of Scope Changes Check (✅ Passed): Additional changes include refactoring parameter validation, exposing internal helpers, and improving logging, but these are supporting changes that enable the main feature without introducing unrelated functionality.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage; check skipped.

@github-actions

JaCoCo code coverage report - 'cobol-parser'

Overall Project 91.52% -0.16% 🍏
Files changed 77.4% 🍏

File Coverage
CobolProcessorInPlace.scala 100% 🍏
CobolProcessorToRdw.scala 100% 🍏
StreamProcessor.scala 89.81% 🍏
CobolProcessorBase.scala 82.14% -20.24% 🍏
CobolProcessingStrategy.scala 73.68% 🍏
CobolProcessor.scala 68.71% -5.9% 🍏

@github-actions

JaCoCo code coverage report - 'spark-cobol'

Files changed 94.15% 🍏

File Coverage
IndexBuilder.scala 97.06% 🍏
SparkCobolProcessor.scala 96.68% 🍏
DefaultSource.scala 93.03% 🍏
CobolRelation.scala 82.8% 🍏
Total Project Coverage 81.1% 🍏

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (1)

83-89: Remove duplicate validation check.

Lines 83-85 and 87-89 both check if rawRecordProcessorOpt.isEmpty. This is redundant.

Apply this diff:

     if (rawRecordProcessorOpt.isEmpty) {
       throw new IllegalArgumentException("A RawRecordProcessor must be provided.")
     }

-    if (rawRecordProcessorOpt.isEmpty) {
-      throw new IllegalArgumentException("A RawRecordProcessor must be provided.")
-    }
-
🧹 Nitpick comments (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (1)

33-38: Consider stream copying efficiency.

When headerStreamOpt is None, this method creates two copies of inputStream (lines 34 and 37). Additionally, calling code in CobolProcessorInPlace.scala (line 49) creates yet another copy. This could be inefficient for large streams.

Consider whether all these copies are necessary, or if the logic could be optimized to reduce redundant stream duplication.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (1)

47-60: Potential redundant stream copy.

Line 49 creates dataStream as a copy of inputStream, but line 47's call to CobolProcessorBase.getRecordExtractor already creates copies of the input stream internally (lines 34 and 37 of CobolProcessorBase.scala).

Consider whether the dataStream copy at line 49 is necessary, or if you can pass inputStream directly to processStreamInPlace to avoid redundant copying.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (1)

198-203: Consider documenting ignoreRedundantOptions parameter.

Line 199 adds a PARAM_GENERATE_RECORD_ID option and line 201 uses !ignoreRedundantOptions for validation. The logic is correct, but it might be helpful to add a comment explaining when redundant options should be ignored vs validated.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87d5205 and b4f5707.

📒 Files selected for processing (16)
  • README.md (5 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessingStrategy.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (6 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdw.scala (1 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala (2 hunks)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (4 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBaseSuite.scala (3 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlaceSuite.scala (1 hunks)
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdwSuite.scala (1 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (6 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (2 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (6 hunks)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1 hunks)
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (13)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)
  • ReaderParameters (80-135)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
  • recordExtractor (54-89)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdwSuite.scala (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (8)
  • processor (136-138)
  • processor (140-144)
  • processor (146-146)
  • CobolProcessor (48-206)
  • builder (203-205)
  • withProcessingStrategy (107-110)
  • withCopybookContents (97-100)
  • build (55-67)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/mock/ByteStreamMock.scala (1)
  • ByteStreamMock (21-55)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorContext.scala (1)
  • CobolProcessorContext (21-23)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdw.scala (1)
  • process (42-52)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (2)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
  • sqlContext (121-121)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/utils/FileUtils.scala (3)
  • FileUtils (35-235)
  • getFiles (51-53)
  • getFiles (60-84)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/decoders/BinaryUtils.scala (1)
  • cobol (165-209)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/internal/Logging.scala (1)
  • logger (33-38)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (1)
  • CobolProcessorInPlace (40-63)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdw.scala (1)
  • CobolProcessorToRdw (38-53)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessingStrategy.scala (3)
  • CobolProcessingStrategy (21-24)
  • InPlace (22-22)
  • ToVariableLength (23-23)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (3)
  • withProcessingStrategy (99-102)
  • options (131-134)
  • save (144-166)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (2)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (3)
  • CobolProcessorBase (30-30)
  • CobolProcessorBase (32-50)
  • getRecordExtractor (33-49)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala (2)
  • StreamProcessor (26-108)
  • processStreamInPlace (39-70)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdw.scala (5)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)
  • ReaderParameters (80-135)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (1)
  • process (44-60)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/VarLenNestedReader.scala (1)
  • recordExtractor (54-89)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (3)
  • CobolProcessorBase (30-30)
  • CobolProcessorBase (32-50)
  • getRecordExtractor (33-49)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala (2)
  • StreamProcessor (26-108)
  • processStreamToRdw (83-106)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlaceSuite.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (7)
  • processor (136-138)
  • processor (140-144)
  • processor (146-146)
  • CobolProcessor (48-206)
  • withProcessingStrategy (107-110)
  • withCopybookContents (97-100)
  • build (55-67)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/mock/ByteStreamMock.scala (1)
  • ByteStreamMock (21-55)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (1)
  • process (44-60)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala (4)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (10)
  • processor (136-138)
  • processor (140-144)
  • processor (146-146)
  • save (156-200)
  • withRecordProcessor (102-105)
  • builder (203-205)
  • withCopybookContents (97-100)
  • withProcessingStrategy (107-110)
  • load (69-95)
  • option (119-123)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessingStrategy.scala (2)
  • CobolProcessingStrategy (21-24)
  • ToVariableLength (23-23)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (11)
  • save (144-166)
  • withRecordProcessor (94-97)
  • SparkCobolProcessor (55-322)
  • builder (180-182)
  • withCopybookContents (89-92)
  • withProcessingStrategy (99-102)
  • load (66-71)
  • load (73-87)
  • option (119-123)
  • getParsedCopybook (168-172)
  • toRDD (174-177)
spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/source/fixtures/BinaryFileFixture.scala (3)
  • withTempDirectory (71-78)
  • writeBinaryFile (116-118)
  • readBinaryFile (120-122)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorContext.scala (1)
  • CobolProcessorContext (21-23)
cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBaseSuite.scala (3)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (3)
  • CobolProcessorBase (30-30)
  • CobolProcessorBase (32-50)
  • getRecordExtractor (33-49)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/ReaderParameters.scala (1)
  • ReaderParameters (80-135)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/parser/recordformats/RecordFormat.scala (2)
  • RecordFormat (21-41)
  • VariableLength (24-24)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (8)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (12)
  • cobol (47-331)
  • cobol (68-107)
  • cobol (113-133)
  • cobol (138-153)
  • cobol (155-180)
  • cobol (182-204)
  • cobol (206-270)
  • cobol (272-292)
  • cobol (297-302)
  • cobol (307-319)
  • cobol (321-323)
  • buildIndex (48-62)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (3)
  • CobolProcessorBase (30-30)
  • CobolProcessorBase (32-50)
  • getRecordExtractor (33-49)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/index/entry/SparseIndexEntry.scala (1)
  • SparseIndexEntry (19-19)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (3)
  • CobolParametersParser (39-1008)
  • getReaderProperties (366-453)
  • parse (225-311)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParameters.scala (1)
  • CobolParameters (68-111)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (7)
  • schema (78-80)
  • source (105-112)
  • CobolRelation (65-113)
  • CobolRelation (115-132)
  • SerializableConfiguration (38-56)
  • getListFilesWithOrder (122-131)
  • isRecursiveRetrieval (100-103)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (6)
  • schema (101-101)
  • schema (122-122)
  • DefaultSource (38-129)
  • DefaultSource (131-189)
  • createVariableLengthReader (180-188)
  • sqlContext (121-121)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/schema/CobolSchema.scala (1)
  • fromReaderParameters (75-150)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Spark 2.4.8 on Scala 2.11.12
  • GitHub Check: test (2.12.20, 2.12, 3.3.4, 0, 80, 20)
  • GitHub Check: Spark 3.5.7 on Scala 2.12.20
  • GitHub Check: Spark 3.4.4 on Scala 2.12.20
  • GitHub Check: Spark 3.5.7 on Scala 2.13.17
🔇 Additional comments (22)
spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/index/IndexBuilder.scala (1)

47-47: LGTM! Visibility widening supports package-level reuse.

The change from private[source] to private[cobol] broadens the visibility of IndexBuilder from the source package to the wider cobol package, letting index-based processing be reused by components such as SparkCobolProcessor while keeping the class out of the public API.

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlaceSuite.scala (1)

1-60: LGTM! Well-structured test for InPlace processing strategy.

The test effectively validates the InPlace processing path with:

  • Clear arrangement of test data (4 input bytes)
  • Explicit strategy selection via CobolProcessingStrategy.InPlace
  • Simple transformation logic (decrement each byte)
  • Comprehensive assertions on count, length, and output values

The test aligns well with the new strategy-based processing architecture.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessingStrategy.scala (1)

19-24: LGTM! Clean strategy pattern implementation.

The abstraction is well-designed:

  • Minimal trait definition provides clear type hierarchy
  • Two concrete strategies (InPlace and ToVariableLength) clearly express intent
  • No unnecessary complexity or premature abstraction

This provides a solid foundation for strategy-driven processing paths throughout the codebase.

README.md (2)

1665-1666: LGTM! Documentation clearly explains the new processing strategies.

The updated description effectively communicates:

  • InPlace strategy preserves original format
  • ToVariableLength strategy converts to VRL with RDWs
  • Clear distinction between the two approaches

1747-1774: LGTM! Well-documented experimental RDD feature.

The new section provides:

  • Clear explanation of the feature's purpose
  • Complete working example showing copybook parsing, RDD creation, and field extraction
  • Practical use case (extracting segment identifiers)

The example aligns with the new getParsedCopybook and toRDD methods introduced in SparkCobolProcessor.

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdwSuite.scala (1)

34-58: LGTM! Test correctly validates RDW format conversion.

The test properly verifies ToVariableLength processing:

  • Transforms 4 single-byte records into 2-byte records with RDW headers
  • Expected output: 4 records × (4-byte RDW + 2-byte payload) = 24 bytes
  • RDW format validation: [0, 2, 0, 0] indicates big-endian length of 2 bytes

The test aligns with the RDW generation logic in StreamProcessor.processStreamToRdw.
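
To make the expected layout concrete, a small sketch (payload bytes are hypothetical):

// One output record: 4-byte RDW followed by the 2-byte payload.
val payload = Array[Byte](0x0A, 0x0B)             // hypothetical transformed record
val rdw     = Array[Byte](0x00, 0x02, 0x00, 0x00) // big-endian length = 2, payload only
val record  = rdw ++ payload                      // 6 bytes; 4 such records => 24 bytes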

spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala (3)

109-150: LGTM! Comprehensive end-to-end test for ToVariableLength strategy.

This test effectively validates:

  • RDW header generation with proper format (big-endian, 4-byte headers)
  • Round-trip verification by reading the output file back via Spark COBOL format
  • JSON representation matches expected transformed values

The validation approach using Spark's COBOL reader provides strong confidence in the RDW format correctness.


153-184: LGTM! RDD generation test validates non-indexed path.

The test confirms:

  • RDD can be generated directly via toRDD method
  • Parsed copybook structure is accessible via getParsedCopybook
  • Record count and distinct values are correct
  • Non-indexed processing works as expected

186-215: LGTM! Index-based RDD generation test validates indexed path.

The test verifies:

  • Index generation works with enable_indexes=true and input_split_records=2
  • Record extraction is correct across index boundaries
  • Partitioning logic respects the split configuration

This complements the non-indexed test and validates the index-aware processing path.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorToRdw.scala (1)

42-52: LGTM! Clean implementation of ToVariableLength strategy.

The implementation correctly:

  • Reuses record extraction via CobolProcessorBase.getRecordExtractor
  • Delegates to StreamProcessor.processStreamToRdw for RDW format conversion
  • Avoids unnecessary stream copying (unlike InPlace which needs to preserve headers)

The simpler stream handling compared to CobolProcessorInPlace is appropriate since RDW conversion doesn't need to preserve original record boundaries.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/CobolRelation.scala (1)

74-74: LGTM! Refactoring improves code reusability.

The extraction of file listing logic to a public companion object method:

  • Makes file discovery logic reusable across components (e.g., SparkCobolProcessor)
  • Maintains identical functionality while improving organization
  • Follows idiomatic Scala patterns for utility methods
  • Supports the broader index-aware processing infrastructure

Also applies to: 115-131

cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBaseSuite.scala (1)

26-26: LGTM! Clean test refactoring.

The test class has been properly renamed and updated to use the new CobolProcessorBase.getRecordExtractor API. All test cases consistently pass the required parameters and maintain good coverage of fixed-length, variable-length, and error scenarios.

Also applies to: 36-36, 49-52, 61-64

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorBase.scala (1)

40-48: LGTM! Clear error handling.

The record extractor creation logic is well-structured with a clear error message that guides users to check their configuration.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala (1)

39-70: LGTM! Header/footer preservation in place.

The renamed processStreamInPlace method correctly preserves headers and footers from the input stream while processing records in place. The logic properly handles offset calculations and maintains the original format.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/parameters/CobolParametersParser.scala (1)

225-225: LGTM! Clean conditional validation.

The addition of the validateRedundantOptions parameter with a default value of true preserves backward compatibility while allowing callers to skip redundant option validation when needed. This is useful for scenarios where options are dynamically extended or when using custom processors.

Also applies to: 309-309, 756-756, 949-961
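
A hedged caller-side sketch (the parameter name and default come from the walkthrough; params stands in for the options wrapper normally passed to the parser):

import za.co.absa.cobrix.cobol.reader.parameters.CobolParametersParser

// Skip validation of redundant option keys, e.g. when options are
// extended dynamically; the default (true) keeps existing behavior.
val cobolParameters = CobolParametersParser.parse(params, validateRedundantOptions = false)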

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala (1)

40-43: LGTM! Clear class naming and inheritance.

The class rename from CobolProcessorImpl to CobolProcessorInPlace better reflects its purpose, and extending CobolProcessorBase promotes code reuse.

cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/CobolProcessor.scala (2)

53-53: LGTM! Clean strategy pattern implementation.

The addition of the processing strategy pattern with a sensible default (InPlace) is well-designed. The builder pattern integration is clean and allows flexible processor instantiation.

Also applies to: 63-66, 107-110


154-160: LGTM! Strategy-based processor instantiation.

The loader correctly instantiates the appropriate processor implementation based on the selected strategy.

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/DefaultSource.scala (1)

131-132: Public API expansion - verify intent.

The helper methods buildEitherReader, createTextReader, createFixedLengthReader, and createVariableLengthReader are now public in the companion object. This expands the public API surface.

Ensure this is intentional and that these methods are meant to be used externally. If they're only for internal use, consider keeping them private or package-private.

Also applies to: 139-139, 157-157, 167-167, 180-180

spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala (3)

63-63: LGTM! Good design: optional processor at load, required at save.

Making rawRecordProcessor optional at load time but required at save() time is a good design decision. This enables use cases like schema inspection via getParsedCopybook without requiring a processor upfront.

Also applies to: 99-102, 139-140, 145-147


168-177: LGTM! Useful utility methods added.

The getParsedCopybook and toRDD methods provide valuable utilities for advanced use cases, allowing users to inspect the schema and generate RDDs without persisting to disk.


215-246: LGTM! Well-structured index-aware processing.

The getRecordRdd method correctly handles both indexed and non-indexed processing paths, properly using headerStream for indexed paths and omitting it for non-indexed paths. The logging provides good visibility into the processing logic.

Comment on lines +83 to +103
def processStreamToRdw(copybook: Copybook,
options: Map[String, String],
recordExtractor: RawRecordExtractor,
recordProcessor: RawRecordProcessor,
outputStream: OutputStream): Long = {
var recordCount = 0L

while (recordExtractor.hasNext) {
recordCount += 1
val record = recordExtractor.next()
val recordSize = record.length

val ctx = CobolProcessorContext(copybook, options, recordExtractor.offset)

val updatedRecord = recordProcessor.processRecord(record, ctx)

val rdw = Array[Byte](((updatedRecord.length >> 8) & 0xFF).toByte, ((updatedRecord.length) & 0xFF).toByte, 0, 0)

outputStream.write(rdw)
outputStream.write(updatedRecord)
}
@coderabbitai bot (Contributor) commented Nov 13, 2025
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify RDW length calculation includes RDW header.

Line 99 calculates the RDW using only updatedRecord.length. According to IBM mainframe VRL standards, the RDW length field typically includes the 4-byte RDW header itself. This means the length should be updatedRecord.length + 4.

Verify that the current calculation matches your intended RDW format specification. If this needs to be compatible with standard mainframe VRL readers, the length should likely include the RDW header.


🏁 Script executed:

#!/bin/bash
# Check if there are any tests or documentation that clarify the expected RDW format
rg -n -C3 "RDW|rdw_adjustment|Record Descriptor Word" --type=scala

Length of output: 131567




🏁 Script executed:

#!/bin/bash
# Find tests for StreamProcessor.processStreamToRdw method
rg -n "processStreamToRdw|StreamProcessor" --type=scala -A5 -B5 | head -100

Length of output: 6151




🏁 Script executed:

#!/bin/bash
# Search for tests of CobolProcessorToRdw
find . -name "*Test*" -type f -exec grep -l "CobolProcessorToRdw\|processStreamToRdw" {} \;

Length of output: 0




🌐 Web query:

IBM mainframe RDW Record Descriptor Word length field includes header standard

💡 Result:

Short answer: The RDW is 4 bytes and the length field (the first two bytes of the RDW) contains the logical record length including those 4 RDW bytes. The second two RDW bytes are the segment/spanning descriptor (zero if not spanned). [1][2]

Sources:

  • IBM z/OS SMF header documentation (RDW description). [1]
  • Variable-record format / RECFM documentation explaining 4‑byte RDW and that LRECL includes the 4 bytes. [2]

Correct the RDW length calculation to include the 4-byte header.

The RDW length field must include the 4-byte RDW header itself per IBM mainframe standard. The RDW length field contains the logical record length including those 4 RDW bytes. Line 99 should calculate the RDW as:

val rdw = Array[Byte](
  (((updatedRecord.length + 4) >> 8) & 0xFF).toByte, 
  ((updatedRecord.length + 4) & 0xFF).toByte, 
  0, 
  0
)

Without this adjustment, the output will not conform to standard mainframe VRL format and will be incompatible with standard RDW readers.

🤖 Prompt for AI Agents
In
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala
around lines 83 to 103, the RDW length calculation currently omits the 4-byte
RDW header; update the RDW byte array computation to use (updatedRecord.length +
4) when computing the two length bytes so the RDW length field equals the
logical record length including the 4-byte header, then write the RDW and record
as before.

@yruslan (Collaborator, Author) replied:

We have observed both cases:

  • when RDW includes the size of the header itself
  • when RDW does not include the header size, only the payload.

Historically, Cobrix uses the second option, and loading files whose RDWs include the header size requires an adjustment of -4 (option("rdw_adjustment", "-4")). Going to keep this implementation because it matches Cobrix defaults. But if different types of RDWs are required later, we can always add an RDW adjustment option for the writer.
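
For completeness, a hedged read-side example of compensating for header-inclusive RDWs (these are standard Cobrix options; the path and copybookContents are placeholders):

// Reading a file whose RDW length fields include the 4-byte header:
val df = spark.read
  .format("cobol")
  .option("copybook_contents", copybookContents)
  .option("record_format", "V")
  .option("rdw_adjustment", "-4")
  .load("/path/to/data")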

@coderabbitai bot (Contributor) replied:

@yruslan, thank you for the clarification! That makes perfect sense. Maintaining consistency with Cobrix's historical RDW format (payload-only) ensures backward compatibility and aligns with the existing rdw_adjustment option behavior. The flexibility to add writer-side RDW adjustment options later if needed sounds like a solid approach.

I've noted this convention for future reference.

(\__/)
(='.'=)  *noted*
(")_(")

✏️ Learnings added
Learnt from: yruslan
Repo: AbsaOSS/cobrix PR: 799
File: cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/StreamProcessor.scala:83-103
Timestamp: 2025-11-13T14:28:55.848Z
Learning: In Cobrix, RDW (Record Descriptor Word) length field contains only the payload size, not including the 4-byte RDW header itself. This is a historical convention. When reading files with RDW that includes the header size, users should use `option("rdw_adjustment", "-4")`.


@yruslan yruslan merged commit 9cb4a42 into master Nov 13, 2025
7 checks passed
@yruslan yruslan deleted the feature/795-in-place-conversion-to-vrl branch November 13, 2025 14:31

Linked issue #795: Add in-place EBCDIC processor that converts data to VRL format
2 participants