read_csv parse issue with newline in quoted items combined with skiprows #10911

Closed
@cstegemann

Description

I don't know whether this is known or intended behaviour, but when I read certain rows from a large file that uses "~" (tilde) as the quotechar and pass skiprows at the same time, the parser produces wrong output, as follows.
Note: I wrap the quoted values in "" in the output below even though pandas does not print them that way; without them the markup would get messed up, sorry...

>>> pd.read_csv(StringIO.StringIO('a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'), quotechar="~", skiprows=range(1,2) )
     a                  b        c
   "b~"           "e\n d"  "f\n f"
1    2     "12\n 13\n 14"     NaN

whereas in this artificial case the output I want would be:

      a      b                 c
0     1      2     "12\n 13\n 14"

It seems that when skipping rows, the parser ignores the custom quotechar, which in this case is undesired from my point of view.
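
To illustrate why ignoring the quotechar while skipping matters, here is a minimal sketch using only the standard library (not pandas internals, which I have not inspected). It compares the number of physical lines in the sample data with the number of logical records once "~" quoting is honoured. The variable names are just for illustration.

```python
import csv
import io

# Sample data from the report: records end in '\r', quoted fields contain '\n'.
raw = 'a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'

# Physical lines: every '\r' or '\n' starts a new line, quoted or not.
physical_lines = raw.splitlines()

# Logical records: newline='' leaves line endings untranslated, so csv.reader
# can keep the ones inside ~...~ as part of the field.
records = list(csv.reader(io.StringIO(raw, newline=''), quotechar='~'))

print(len(physical_lines))  # 8
print(len(records))         # 3 (header plus two data records)
print(records[2])           # ['1', '2', '12\n 13\n 14']
```

A skip that counts physical lines therefore lands in the middle of the first quoted record, which would be consistent with the output shown above.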

EDIT: It may well be that newlines inside the quoted text are not always \n but sometimes \r.

EDIT2 (31.8.):
As far as I can see, the lineterminator fix fails with the following example:

>>> a = StringIO.StringIO('Text,url\r~example\r sentence\r one~,url1\r~example\n sentence\n two~,url2')
>>> pd.read_csv(a, quotechar="~", skiprows=range(1,2), lineterminator='\r' )
                            Text        url
0                       sentence        NaN
1                         "one~"       url1
2     "example\n sentence\n two"       url2

The problem is that the csv has a "text" column whose content is HTML-formatted text blocks. There is no telling which kind of newline the creators of the HTML originally used, and the text blocks stem from different sources.
I might also add that the quoting is respected perfectly if one does not use "skiprows".
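
Since quoting is respected when skiprows is not used, one possible workaround is to parse everything first and drop the unwanted records afterwards. The following is a sketch under that assumption, written for Python 3 (io.StringIO instead of StringIO.StringIO); `rows_to_skip` is a hypothetical name, and behaviour may differ between pandas versions.

```python
import io

import pandas as pd

# Sample data from EDIT2: records end in '\r'; quoted fields contain '\r' or '\n'.
raw = ('Text,url\r~example\r sentence\r one~,url1\r'
       '~example\n sentence\n two~,url2')

# Step 1: parse everything, with the custom quotechar but without skiprows.
df = pd.read_csv(io.StringIO(raw), quotechar='~')

# Step 2: drop unwanted records by position. Position 0 is the first data
# record, i.e. the one that skiprows=range(1, 2) was meant to skip.
rows_to_skip = [0]  # hypothetical choice for this example
result = df.drop(df.index[rows_to_skip]).reset_index(drop=True)

print(result)
```

The obvious cost is that the whole file is parsed and held in memory before any record is discarded, which may not be acceptable for a very large file.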

Version info:

python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.9.2
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
