Description
I don't know whether this is known or intended behaviour, but when I read certain rows from a large file that uses "~" (tilde) as the quotechar and pass skiprows at the same time, the parser mis-parses the quoted fields as follows:
Note: I wrote the cell values with "" in the output below even though pandas does not display the quotes; without them the markup would get mangled. Sorry about that.
>>> pd.read_csv(StringIO.StringIO('a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'), quotechar="~", skiprows=range(1,2) )
a b c
"b~" "e\n d" "f\n f"
1 2 "12\n 13\n 14" NaN
whereas the output I would like to get in this artificial case is:
a b c
0 1 2 "12\n 13\n 14"
It seems that when skipping rows, the parser ignores the custom quotechar and counts rows by raw line breaks instead, which is undesired in this case from my point of view.
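To illustrate what I mean by the difference, here is a minimal sketch using only the standard library's csv module, not pandas internals. It is an assumption about the cause, not a confirmed diagnosis. It is written for Python 3 (io.StringIO instead of the StringIO module used above), and the variable names are made up for illustration. Counting physical lines splits the quoted cells apart, while a quote-aware reader sees three logical records, so skipping "row 1" means different things in the two views.

```python
import csv
from io import StringIO

# Same artificial data as in the report: "~" is the quote character,
# "\r" separates records, and "\n" appears inside quoted cells.
data = 'a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'

# View 1: physical lines. Splitting on every newline ignores the quoting,
# so the quoted cells are broken into fragments.
physical_lines = data.splitlines()

# View 2: logical records. newline='' keeps "\r" and "\n" untranslated so
# the csv reader itself decides where a record ends.
records = list(csv.reader(StringIO(data, newline=''), quotechar='~'))

# Skipping the first data record (what skiprows=range(1, 2) is meant to do)
header = records[0]
rows_after_skip = records[2:]

print(len(physical_lines), "physical lines")
print(len(records), "logical records")
print(header, rows_after_skip)
```

With the quote-aware view, what remains after the skip is the header plus the single row 1, 2, "12\n 13\n 14", which matches the output I would like to get.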
EDIT: It may well be that newlines inside the quoted text are not always \n but sometimes \r.
EDIT2 (31.8.):
As far as I can see, the lineterminator fix fails with the following example:
>>> a = StringIO.StringIO('Text,url\r~example\r sentence\r one~,url1\r~example\n sentence\n two~,url2')
>>> pd.read_csv(a, quotechar="~", skiprows=range(1,2), lineterminator='\r' )
Text url
0 sentence NaN
1 "one~" url1
2 "example\n sentence\n two" url2
The problem is that the CSV has a "text" column whose content is HTML-formatted text blocks. There is no telling which kind of newline the creators of the HTML originally used, and the text blocks come from different sources.
I might also add that it respects the quoting perfectly if one does not use "skiprows".
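Based on that observation, a possible workaround is to read the file without skiprows and drop the unwanted rows afterwards. The following is a rough sketch, not a fix for the parser itself. It assumes Python 3 (io.StringIO) and a pandas version that parses the quoted cells correctly when skiprows is absent. dtype=str is my own addition so that the dropped row does not influence type inference, and the names df and result are only illustrative.

```python
from io import StringIO

import pandas as pd

data = 'a,b,c\r~a\n b~,~e\n d~,~f\n f~\r1,2,~12\n 13\n 14~'

# Read everything without skiprows so the "~" quoting is respected.
# dtype=str (an assumption of this sketch) keeps all cells as strings.
df = pd.read_csv(StringIO(data), quotechar='~', dtype=str)

# Drop the first data row afterwards, the equivalent of skiprows=range(1, 2).
result = df.iloc[1:].reset_index(drop=True)

print(result)
```

The downside is that the whole file is parsed before any row is discarded, which may matter for a large file, and the columns would need to be converted to numeric types afterwards where appropriate.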
Version info:
python: 2.7.10.final.0
python-bits: 64
OS: Windows
OS-release: 7
pandas: 0.16.2
nose: None
Cython: None
numpy: 1.9.2
scipy: 0.16.0
statsmodels: None
IPython: 4.0.0
sphinx: None
patsy: None
dateutil: 2.4.2
pytz: 2015.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.4.0
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None