Variable OCCURS clause #239

@tr11

Describe the bug

A variable-size OCCURS clause is not read correctly when the data carries no record-length information. This may be very similar to the discussion in #156, but I think the issue there is slightly different.

To Reproduce

Run the following with PySpark:

# Assumes `spark` is an active SparkSession with the "cobol" data source available.
import tempfile
from pyspark.sql.functions import explode

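with tempfile.NamedTemporaryFile('wb') as f:
    # Two 24-byte records: a 4-byte COUNT of 5 followed by five 4-byte GROUP elements.
    f.write(b'   5ABC1ABC2ABC3ABC4ABC5   5DEF1DEF2DEF3DEF4DEF5')
    f.flush()

    # Case 1: up to 5 occurrences, so the maximum record size is 4 + 5 * 4 = 24 bytes.
    (spark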
      .read
      .format("cobol")
      .option("copybook_contents", """
          01 RECORD.
              02 COUNT PIC 9(4).
              02 GROUP OCCURS 0 TO 5 TIMES DEPENDING ON COUNT.
                  03 TEXT   PIC X(3).
                  03 FIELD  PIC 9.
      """)
      .option("encoding", 'ascii')
      .option("variable_size_occurs", "true")
      .load(f.name)
    ).select('RECORD.COUNT', explode('RECORD.GROUP')).show()

    # Case 2: up to 11 occurrences, so the maximum record size is 4 + 11 * 4 = 48 bytes.
    (spark
      .read
      .format("cobol")
      .option("copybook_contents", """
          01 RECORD.
              02 COUNT PIC 9(4).
              02 GROUP OCCURS 0 TO 11 TIMES DEPENDING ON COUNT.
                  03 TEXT   PIC X(3).
                  03 FIELD  PIC 9.
      """)
      .option("encoding", 'ascii')
      .option("variable_size_occurs", "true")
      .load(f.name)
    ).select('RECORD.COUNT', explode('RECORD.GROUP')).show()

    # Case 3: up to 10 occurrences, so the maximum record size is 4 + 10 * 4 = 44 bytes.
    (spark
      .read
      .format("cobol")
      .option("copybook_contents", """
          01 RECORD.
              02 COUNT PIC 9(4).
              02 GROUP OCCURS 0 TO 10 TIMES DEPENDING ON COUNT.
                  03 TEXT   PIC X(3).
                  03 FIELD  PIC 9.
      """)
      .option("encoding", 'ascii')
      .option("variable_size_occurs", "true")
      .load(f.name)
    ).select('RECORD.COUNT', explode('RECORD.GROUP')).show()

Expected behavior

All three examples above should return the same output:

+------------+--------+
|RECORD.COUNT|     col|
+------------+--------+
|           5|[ABC, 1]|
|           5|[ABC, 2]|
|           5|[ABC, 3]|
|           5|[ABC, 4]|
|           5|[ABC, 5]|
|           5|[DEF, 1]|
|           5|[DEF, 2]|
|           5|[DEF, 3]|
|           5|[DEF, 4]|
|           5|[DEF, 5]|
+------------+--------+

but the second one returns only the first record:

+------------+--------+
|RECORD.COUNT|     col|
+------------+--------+
|           5|[ABC, 1]|
|           5|[ABC, 2]|
|           5|[ABC, 3]|
|           5|[ABC, 4]|
|           5|[ABC, 5]|
+------------+--------+

and the third one fails with a file size check.
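
The three outcomes line up with simple size arithmetic. The sketch below assumes, as a guess from the symptoms rather than anything confirmed in the reader's source, that without record-length information the file is cut into fixed-size chunks of the maximum record length. The helper name `max_record_size` is hypothetical.

```python
# Sizes follow from the copybook: COUNT is PIC 9(4) (4 bytes in ASCII),
# and each GROUP element is PIC X(3) + PIC 9 (4 bytes).
COUNT_SIZE = 4
ELEMENT_SIZE = 3 + 1


def max_record_size(max_occurs):
    """Size of a record when the OCCURS array is at its declared maximum."""
    return COUNT_SIZE + max_occurs * ELEMENT_SIZE


data = b'   5ABC1ABC2ABC3ABC4ABC5   5DEF1DEF2DEF3DEF4DEF5'

for max_occurs in (5, 11, 10):
    size = max_record_size(max_occurs)
    # Assumption: the file is split into chunks of the maximum record size.
    chunks, leftover = divmod(len(data), size)
    print(f"OCCURS 0 TO {max_occurs}: max record = {size} bytes, "
          f"{chunks} chunk(s), {leftover} byte(s) left over")
```

Under that assumption, a maximum of 5 gives two 24-byte chunks, a maximum of 11 gives a single 48-byte chunk that swallows the second record, and a maximum of 10 gives one 44-byte chunk with 4 bytes left over, which would explain a failed file size check.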

Additional context

This issue seems to happen because the files I have to process have neither a leading RDW block nor a field specifying the record length. With a variable OCCURS clause, the reader has to read each record at its full (maximum) size and then backtrack by the number of bytes that were not needed before reading the following record.
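
The behaviour described above can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the function `split_records` and its parameters are hypothetical, and it hard-codes the layout of the copybook from the reproduction (a 4-byte ASCII COUNT followed by COUNT 4-byte elements).

```python
def split_records(data, count_size=4, element_size=4):
    """Split a byte string into variable-length records.

    Each record is a COUNT field followed by COUNT fixed-size elements,
    so the record length is derived from COUNT rather than from an RDW.
    """
    records = []
    pos = 0
    while pos < len(data):
        # COUNT is space-padded ASCII digits, e.g. b'   5'.
        count = int(data[pos:pos + count_size].decode('ascii'))
        length = count_size + count * element_size
        record = data[pos:pos + length]
        if len(record) < length:
            raise ValueError(f"truncated record at offset {pos}")
        records.append(record)
        # Advance by the actual record length, not the declared maximum.
        pos += length
    return records


data = b'   5ABC1ABC2ABC3ABC4ABC5   5DEF1DEF2DEF3DEF4DEF5'
for rec in split_records(data):
    print(rec)
```

Because the position advances by the length computed from COUNT, the declared upper bound of the OCCURS clause no longer affects where the next record starts.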

Labels: accepted, bug