Closed
Labels
accepted (Accepted for implementation), bug (Something isn't working)
Description
Describe the bug
Reading a variable-size OCCURS (OCCURS ... DEPENDING ON) fails if we don't specify variable record lengths. This may be very similar to the discussion in #156, but I think the issue there is slightly different.
To Reproduce
Run the following with PySpark:

    # Assumes an existing SparkSession named `spark` with the "cobol" data source available.
    import os
    import tempfile
    from pyspark.sql.functions import explode

    with tempfile.NamedTemporaryFile('wb') as f:
        # Two 24-byte records: COUNT is PIC 9(4), so each count is space-padded to 4 bytes.
        f.write(b'   5ABC1ABC2ABC3ABC4ABC5   5DEF1DEF2DEF3DEF4DEF5')
        f.flush()

        # Example 1: OCCURS 0 TO 5
        (spark
            .read
            .format("cobol")
            .option("copybook_contents", """
            01 RECORD.
                02 COUNT PIC 9(4).
                02 GROUP OCCURS 0 TO 5 TIMES DEPENDING ON COUNT.
                    03 TEXT PIC X(3).
                    03 FIELD PIC 9.
            """)
            .option("encoding", 'ascii')
            .option("variable_size_occurs", "true")
            .load(f.name)
        ).select('RECORD.COUNT', explode('RECORD.GROUP')).show()

        # Example 2: OCCURS 0 TO 11
        (spark
            .read
            .format("cobol")
            .option("copybook_contents", """
            01 RECORD.
                02 COUNT PIC 9(4).
                02 GROUP OCCURS 0 TO 11 TIMES DEPENDING ON COUNT.
                    03 TEXT PIC X(3).
                    03 FIELD PIC 9.
            """)
            .option("encoding", 'ascii')
            .option("variable_size_occurs", "true")
            .load(f.name)
        ).select('RECORD.COUNT', explode('RECORD.GROUP')).show()

        # Example 3: OCCURS 0 TO 10
        (spark
            .read
            .format("cobol")
            .option("copybook_contents", """
            01 RECORD.
                02 COUNT PIC 9(4).
                02 GROUP OCCURS 0 TO 10 TIMES DEPENDING ON COUNT.
                    03 TEXT PIC X(3).
                    03 FIELD PIC 9.
            """)
            .option("encoding", 'ascii')
            .option("variable_size_occurs", "true")
            .load(f.name)
        ).select('RECORD.COUNT', explode('RECORD.GROUP')).show()
Expected behavior
All three examples above should return the same result:
+------------+--------+
|RECORD.COUNT|     col|
+------------+--------+
|           5|[ABC, 1]|
|           5|[ABC, 2]|
|           5|[ABC, 3]|
|           5|[ABC, 4]|
|           5|[ABC, 5]|
|           5|[DEF, 1]|
|           5|[DEF, 2]|
|           5|[DEF, 3]|
|           5|[DEF, 4]|
|           5|[DEF, 5]|
+------------+--------+
but the second one (OCCURS 0 TO 11) returns only the first record:
+------------+--------+
|RECORD.COUNT|     col|
+------------+--------+
|           5|[ABC, 1]|
|           5|[ABC, 2]|
|           5|[ABC, 3]|
|           5|[ABC, 4]|
|           5|[ABC, 5]|
+------------+--------+
and the third one (OCCURS 0 TO 10) fails with a file size check.
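The three outcomes are consistent with simple record-size arithmetic. This is only a sketch and rests on an assumption not confirmed from the library source: that without record length information the file is treated as fixed-length records sized for the maximum OCCURS count. The field sizes come from the copybooks above; the helper names are hypothetical.

```python
# Sizes taken from the copybooks in this report:
# COUNT is PIC 9(4) (4 bytes); each GROUP element is TEXT PIC X(3) + FIELD PIC 9 (4 bytes).
COUNT_SIZE = 4
ELEMENT_SIZE = 3 + 1

# The test file holds two records, each with COUNT = 5.
FILE_SIZE = 2 * (COUNT_SIZE + 5 * ELEMENT_SIZE)


def max_record_size(max_occurs: int) -> int:
    """Record size if every record carried the maximum number of elements."""
    return COUNT_SIZE + max_occurs * ELEMENT_SIZE


def describe(max_occurs: int):
    """Return (record size, whole records in the file, leftover bytes)."""
    size = max_record_size(max_occurs)
    whole, leftover = divmod(FILE_SIZE, size)
    return size, whole, leftover


for max_occurs in (5, 11, 10):
    size, whole, leftover = describe(max_occurs)
    print(f"OCCURS 0 TO {max_occurs}: record size {size}, "
          f"{whole} whole record(s), {leftover} leftover byte(s)")
```

Under that assumption, a maximum of 5 splits the file into two records, a maximum of 11 makes the whole file look like a single record, and a maximum of 10 leaves bytes that do not fit a whole record, which would explain a failed size check.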
Additional context
This issue seems to be a consequence of the fact that the files I have to process have neither a leading RDW (record descriptor word) block nor a field specifying record lengths. Essentially, with a variable OCCURS clause each record has to be read in full (at its maximum size), and then the reader needs to back up by the number of bytes that were not needed before reading the following record.
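The reading strategy described above can be sketched in plain Python. This is not how the library implements it; the function and constant names are hypothetical, and the layout is hard-coded from the copybooks in this report. The sketch reads COUNT first and advances only by the bytes that record actually uses, which has the same effect as reading a maximum-size record and backing up over the unused bytes.

```python
# Layout from the copybooks in this report (ASCII data):
# COUNT PIC 9(4), then COUNT elements of TEXT PIC X(3) + FIELD PIC 9.
COUNT_SIZE = 4
TEXT_SIZE = 3
FIELD_SIZE = 1


def split_records(data: bytes):
    """Yield (count, elements) per record, consuming only the bytes each record uses."""
    offset = 0
    while offset < len(data):
        # int() ignores the space padding in front of the count.
        count = int(data[offset:offset + COUNT_SIZE].decode("ascii"))
        offset += COUNT_SIZE
        elements = []
        for _ in range(count):
            text = data[offset:offset + TEXT_SIZE].decode("ascii")
            offset += TEXT_SIZE
            field = int(data[offset:offset + FIELD_SIZE].decode("ascii"))
            offset += FIELD_SIZE
            elements.append((text, field))
        yield count, elements


# Same bytes as the reproduction: two records, each with COUNT = 5.
data = b'   5ABC1ABC2ABC3ABC4ABC5   5DEF1DEF2DEF3DEF4DEF5'
records = list(split_records(data))
for count, elements in records:
    print(count, elements)
```

Because the record boundary is derived from COUNT itself, the result does not depend on the upper bound declared in the OCCURS clause.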