Description
readParquetTable produces a large multidimensional array. While functional, this is not quite equivalent to what the Python version of the Arrow API does.
Take the TinyStories Parquet files as an example: each holds a single column of text. Every row in that column is a string, and the rows may differ in length. Arrow under Python stores this as a list of strings. That saves space compared with the current J version of the API, which stores the strings as a 2-dimensional character table, so J pads every row to the length of the longest string.
A more space-efficient approach, and one that would match the Python implementation, would be to store the column as a list of boxed strings. That way the rows in the Parquet file can vary in length, and each length is preserved with no padding.
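To make the padding cost concrete, here is a minimal sketch in plain J that does not touch the Arrow bindings at all. The three sample strings and the names stories and padded are made up for illustration; they only stand in for rows of a text column.

```j
NB. Hypothetical sample rows standing in for a text column
stories=: 'Once upon a time';'The end';'A cat sat'

NB. Opening the boxes yields a character table padded to the longest row
padded=: > stories
$ padded             NB. 3 16

NB. Boxed storage keeps each row at its own length
#&> stories          NB. 16 7 9

NB. Characters held: padded table versus boxed list
*/ $ padded          NB. 48
+/ #&> stories       NB. 32
```

With real story text, where a few long rows sit among many short ones, the gap between the two totals would be expected to grow accordingly.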
Here is code that performs this type of access, but it seems that Arrow should offer some way of doing this without a whilst. loop that steps through each row and boxes the string.
readbChunks=: {{
  'chunkedArrayPt'=. y
  nChunks=. ret@garrow_chunked_array_get_n_chunks < chunkedArrayPt
  arrayPts=. readChunk each <"1 (<chunkedArrayPt),.(<"0 i. nChunks)
  NB. res=. readArray each arrayPts
  NB. Seed box, dropped by }. below; set once so rows from every chunk accumulate
  res=. <''
  for_i. arrayPts do.
    aLen=. readArrayLength >i
    j=. 0
    NB. Box one row at a time so each string keeps its own length
    whilst. j < aLen do.
      res=. res, <readArrayRows i,<j
      j=. j+1
    end.
  end.
  removeObject each arrayPts
  }. res
}}"0
NB. Read every column of the table as boxed rows
readbData=: {{
  'tablePt'=. y
  ncols=. tableNCols tablePt
  chunkedArrayPts=. ptr"1 garrow_table_get_column_data tablePt ;"0 i. ncols
  res=. ,. readbChunks chunkedArrayPts
  removeObject"0 chunkedArrayPts
  res
}}
NB. Column names laminated on top of the boxed data
readbTable=: {{
  'tablePt'=. y
  (,@readTableNames ,: readbData) tablePt
}}
readbFileTable=: {{readbTable@u ] filepath =. y}}
readbParquetTable=: (readParquet readbFileTable)
NB. add readbParquetTable to the transfers list
transfers=. 0 : 0
printTableSchema
readTableNames
readTableSchema
readTableColName
readArrowTable
readFileBufferTable
readBufferTable
readParquetSchema
printParquetSchema
readParquetData
readParquetTable
readsParquetTable
readParquetDataframe
readParquetCol
readbParquetTable
readFeatherSchema
printFeatherSchema
readFeatherData
readFeatherTable
readsFeatherTable
readFeatherDataframe
readFeatherCol
readCSVSchema
printCSVSchema
readCSVData
readCSVTable
readsCSVTable
readCSVDataframe
readCSVCol
readJsonSchema
printJsonSchema
readJsonData
readJsonTable
readsJsonTable
readJsonDataframe
readJsonCol
readFeatherSchema
printFeatherSchema
readFeatherData
readFeatherTable
readsFeatherTable
readFeatherDataframe
readFeatherCol
)
TinyStories can be found on Hugging Face: roneneldan/TinyStories