read parquet Table with TinyStories single column table #9

@tmcguirefl

readParquetTable produces a large multidimensional array. While functional, this is not quite equivalent to what the Python version of the Arrow API does.

Take the TinyStories Parquet files as an example: they contain a single column of text. Each row in that column is a string, and the rows vary in length. Arrow under Python stores the column as a list of strings, which saves space. The J API, in its current version, stores the strings as a two-dimensional character table, so J pads every row of the table to the length of the longest string.

A more space-efficient approach, and one that would match the Python implementation, is to store the column as a list of boxed strings. That way the rows in the Parquet file can vary in size, and each size is preserved with no padding.
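The difference between the two layouts can be seen with plain J and no Arrow calls at all. This is only a small sketch: the three strings and the names stories and padded are made up for illustration and stand in for rows of the TinyStories column.

```j
NB. Three rows of different lengths (hypothetical stand-ins for TinyStories rows).
stories=: 'Once upon a time' ; 'The end' ; 'A cat sat'

NB. Boxed list: each box keeps its own length, so nothing is padded.
# &> stories            NB. 16 7 9

NB. Opening the boxes builds a character table; J fills the short rows
NB. with spaces up to the length of the longest string.
padded=: > stories
$ padded                NB. 3 16
```

In the table form, every row costs as many character cells as the longest row, which is what becomes expensive when a few stories are much longer than the rest.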

Here is code that performs this type of access. It seems, though, that Arrow should offer some way of doing this without an explicit loop that steps through each row and boxes the string.

readbChunks=: {{
'chunkedArrayPt'=. y
nChunks=. ret@garrow_chunked_array_get_n_chunks < chunkedArrayPt
arrayPts=. readChunk each <"1 (<chunkedArrayPt),.(<"0 i. nChunks)
NB. res=. readArray each arrayPts
res=. 0$a:            NB. start empty once, so rows accumulate across all chunks
for_i. arrayPts do.
aLen=. readArrayLength >i
j=. 0
while. j < aLen do.   NB. while. rather than whilst., so an empty chunk reads no rows
res=. res, <readArrayRows i,<j
j=.j+1
end.
end.
removeObject each arrayPts
res
}}"0

readbData=: {{
'tablePt'=. y
ncols=. tableNCols tablePt
chunkedArrayPts=. ptr"1 garrow_table_get_column_data tablePt ;"0 i. ncols
res=. ,. readbChunks chunkedArrayPts
removeObject"0 chunkedArrayPts
res
}}

readbTable=: {{
'tablePt'=. y
(,@readTableNames ,: readbData) tablePt
}}

readbFileTable=: {{readbTable@u ] filepath =. y}}

readbParquetTable=: (readParquet readbFileTable)

NB. add readbParquetTable to the transfers list
transfers=. 0 : 0
printTableSchema
readTableNames
readTableSchema
readTableColName

readArrowTable
readFileBufferTable
readBufferTable

readParquetSchema
printParquetSchema
readParquetData
readParquetTable
readsParquetTable
readParquetDataframe
readParquetCol
readbParquetTable

readFeatherSchema
printFeatherSchema
readFeatherData
readFeatherTable
readsFeatherTable
readFeatherDataframe
readFeatherCol

readCSVSchema
printCSVSchema
readCSVData
readCSVTable
readsCSVTable
readCSVDataframe
readCSVCol

readJsonSchema
printJsonSchema
readJsonData
readJsonTable
readsJsonTable
readJsonDataframe
readJsonCol

)

TinyStories can be found on Hugging Face: roneneldan/TinyStories
