[R] Can not parse file #34291

@elgabbas

Hello,

I am trying to load a large CSV file (tab-delimited; 23 GB, 16M rows, 259 columns) using the arrow R package. I get the following error early on, while the file content is being read.

Error in `read_delim_arrow()`: ! Invalid: CSV parse error: Row #834603: Expected 259 columns, got 322: 2417934775 DSS00439000014FB CC0_1_0 National Museum of Nat ...

This is the content of the line shown in the previous error:

2417934775\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tDSS00439000014FB\t\t\t\t\t\t\t\t\t\tCC0_1_0\t\t\t\t\tNational Museum of Natural History, Luxembourg\t\t\t\t\t\t\t\t\t\t\t\t\t\t\thttps://ror.org/05natt857\tMnhnL\t\t\tMNHNL-HERB-LUX\tHerbarium\t\tPRESERVED_SPECIMEN\t\t\tTaxon status for Luxembourg: [\"Least concern - IUCN (2001)\"]\tDSS00439000014FB\t20471\t\tLéopold Reichling\t\t\t\t\t\t\t\t\t\t\t\t\tPRESENT\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t1953-08-06T00:00:00\t\t\t\t1953\t8\t6\t1953-8-6/1953-8-6\t\tUnknown\t\t\t\t\t\t\t\t\tEUROPE\t\t\t\tLU\t\t\t\tGarnich\t\"Entre Garnich et Windhof, chemin longeant la lisière du bois dit \"\"Lange Rés\"\" sur marnes liasiques\t\t\t\t\t\t\t\t49.6275\t5.96049\t\t\t\tLUGR\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\tOriginal\t\tLéopold Reichling\t\t\t\t\t\t\t\t2702084\t\t\t\t\t\tJuncus tenuis Willd.\t\t\t\t\t\t\t\tPlantae\tTracheophyta\tLiliopsida\tPoales\tJuncaceae\t\tJuncus\tJuncus\t\t\ttenuis\t\t\tSPECIES\t\t\t\tACCEPTED\t\t\t962f59bc-f762-11e1-a439-00145eb45e9a\tLU\t2023-01-24T22:54:17.514Z\t\t\t\t\t\t\t\tCOUNTRY_DERIVED_FROM_COORDINATES;CONTINENT_DERIVED_FROM_COORDINATES;COLLECTION_MATCH_FUZZY\tStillImage\ttrue\tfalse\t2702084\t2702084\t6\t7707728\t196\t1369\t5353\t2701072\t\t2702084\tJuncus tenuis\tJuncus tenuis Willd.\tJuncus tenuis\t\tEML\t2023-01-24T22:54:17.514Z\t2023-01-06T10:14:02.331Z\tfalse\t\tLUX\tLuxembourg\tLUX.3_1\tLuxembourg\tLUX.3.1_1\tCapellen\tLUX.3.1.4_1\tGarnich\tNE\t

Could the problem be caused by the runs of one, two, or three quote characters in the text, or by the square brackets?
Could it be an encoding issue?

Thanks.
Ahmed


EDIT: Here is a reprex for the issue:

library(arrow)
Occ <- read_delim_arrow(file = "https://github.com/apache/arrow/files/10802973/Arrow_parse_Example.txt", delim = "\t")
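
One possible way to test the quote hypothesis without arrow is to count the tab-separated fields per row in base R while treating quote characters as ordinary text. The sketch below is only an illustration: the three-column sample file, `sample_path`, and `expected_cols` are made up for the example and stand in for the real 259-column file.

```r
# Hypothetical stand-in for the real file: three columns, and the last row
# has a field that opens with a double quote and never closes it.
sample_path <- tempfile(fileext = ".txt")
writeLines(
  c("id\tlocality\tcountry",
    "1\tGarnich\tLU",
    "2\t\"Entre Garnich et Windhof\tLU"),
  sample_path
)

expected_cols <- 3L  # would be 259 for the real file

# quote = "" makes count.fields() treat quote characters as ordinary text,
# and comment.char = "" stops "#" from starting a comment, so only the tab
# delimiter decides where a field ends.
n_fields <- count.fields(sample_path, sep = "\t", quote = "", comment.char = "")

# Rows whose field count differs from the expected number of columns.
bad_rows <- which(n_fields != expected_cols)
```

If `bad_rows` comes back empty on the real file when quotes are treated as literal text, that would point to quote handling rather than encoding or square brackets as the likely cause. Scanning a 23 GB file this way may be slow, so running it on an extract around the failing row is probably more practical.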

Component(s)

R
