BUG: read_csv, sep=None, wrong separator guessing in case of one column csv #53035
Comments
Hi @louisgeisler, you don't see this warning?
So you need to change your print line to
Thank you for opening the issue, but I will close it now. |
Hi @DeaMariaLeon, I will make the issue clearer: When you use this:
instead of this:
I would appreciate it if you reopened the issue... |
Hi, @louisgeisler,
with open("dummy.csv") as fp:
    print(csv.Sniffer().sniff(fp.readline()).delimiter)
# prints "E"
Therefore, HEADER --> H[sep]AD[sep]R. So this seems to be a problem with the csv library, not with pandas, as in #13839. Furthermore, I found the rules for delimiter analysis in the _guess_delimiter function of the csv library. It seems that the most frequent character is selected, which is why "E" was output this time. The following test outputs "R":
print(csv.Sniffer().sniff("HREARDER").delimiter)
# output : "R" |
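The behaviour described in the comment above can be reproduced without a file. This is a minimal sketch that assumes only the standard-library csv module; the two sample strings are the ones quoted in this thread, and the variable names are made up for illustration.

```python
import csv

# Header-only samples with no conventional separator. The sniffer does not
# raise here; it falls back to a character that occurs in the text.
header_only = "HEADER"
guessed_header = csv.Sniffer().sniff(header_only).delimiter

second_sample = "HREARDER"
guessed_second = csv.Sniffer().sniff(second_sample).delimiter

print(repr(guessed_header), repr(guessed_second))
```

Both calls return a letter from the sample rather than an error, which is what read_csv then uses as the separator when only the first line is sniffed.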
The docs for
I would say this behaves as expected. A line stating that only the first valid line is passed on to |
May I add it? |
For me the issue did come from pandas: why are you passing just the first line of the csv to the sniffer? From what I understand of the Sniffer example (https://docs.python.org/3/library/csv.html), they passed more than the header. And intuitively, it seems to be common sense to detect the separator with more than just the header... |
Ultimately if you are passing files with unknown delimiters, you should expect cases where the right delimiter is not detected. I'm not a pandas maintainer though - so feel free to discuss further! |
My example conforms to lines 212-219 of python_parser.py. I understand your question. However, the issue about how to find the separator when passing the entire csv is still dependent on the csv library. As for why only the first line is passed, I think it is an art as @asishm commented. |
Oh, ok, I understand. But in this case, why not always pass the whole csv to the sniffer? As my example shows, if you pass the whole csv, it will raise an error. So I think we can conclude that passing the whole csv, or let's say a sample of at least 100 lines (you detect '\n'), would really improve the reliability of read_csv when sep=None, wouldn't it? Because I think we would agree that passing only the header to the sniffer is probably the least reliable choice, isn't it? 😅 |
Passing the entire csv to the sniffer may result in an error, as in the following example:
one_col_csv = """col1, col2, col3
hzh,rzhj,trj
rzth,yy
azr23,1
zr,ht,zrh
zrhtz,rh"""
It seems that csv.Sniffer raises an error if the number of delimiters per line differs. (Maybe I don't understand it well enough.) And that behavior is not what I would expect. I agree with some of the claims about reliability, but on the other hand, to do that I need a way to determine from the input csv alone that it is a single-column csv, and I haven't come up with a good way to do that. |
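To make the failure mode above concrete, here is a small sketch that wraps the sniffer so a failed detection becomes None instead of an exception. sniff_or_none is a hypothetical helper written for this illustration, not part of pandas or the csv module; the ragged sample is the one from the comment above, and the consistent sample is made up.

```python
import csv


def sniff_or_none(sample):
    """Return the sniffed delimiter, or None if csv.Sniffer gives up.

    Hypothetical helper for illustration only.
    """
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return None


# Sample from the comment above: the number of commas differs per line.
ragged = """col1, col2, col3
hzh,rzhj,trj
rzth,yy
azr23,1
zr,ht,zrh
zrhtz,rh"""

# Made-up sample in which every line has the same number of semicolons.
consistent = "a;b;c\n1;2;3\n4;5;6\n"

print(sniff_or_none(ragged))
print(sniff_or_none(consistent))
```

A caller could treat None as "delimiter unknown" and decide explicitly what to do next, rather than silently accepting a guess.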
In my opinion, it's a very good thing that it returns an error if the csv isn't well formatted (for example, if the number of separators per line differs) 👍 And for a one-column csv, I also think it's better to get an error stating that it can't infer the delimiter. What do you think about it? |
Passing in the entire csv is also not performant. It is also very easy to come up with a csv that would still cause the sniffer to give you the current results. For example, the contrived example below still outputs 'E':
In [1]: s = '''HEADER
   ...: THEREWAS'''
In [2]: import csv
In [3]: csv.Sniffer().sniff(s).delimiter
Out[3]: 'E'
Hence my suggestion to either
|
Another option would be to add a comment in the docs that the separator found when sep=None, might not be correct. Pinning @phofl for visibility |
Not a fan of the sniffing, that said exploring passing the first header + 1 lines might be interesting. Should definitely clarify the docs |
@Toroi0610 do you still want to improve the docs? |
@DeaMariaLeon yes, I do. |
take |
I think the best compromise may be something like passing the first 30 lines of the csv file to the sniffer.
It would hugely increase the reliability of the sniffer while still being fast, as we wouldn't read the whole file... (I really think that sep=None is a very convenient feature that should be improved instead of removed...) |
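The proposal above could be sketched roughly as follows. sniff_first_lines is a hypothetical helper, and n_lines=30 is simply the value suggested in the comment, not a pandas default; the sketch uses only the standard library and is not how pandas itself is implemented.

```python
import csv
from itertools import islice


def sniff_first_lines(path, n_lines=30):
    """Guess the delimiter from at most n_lines lines of a text file.

    Hypothetical helper for illustration. Raises csv.Error when the
    sniffer cannot determine a delimiter.
    """
    with open(path, newline="") as fp:
        # islice stops after n_lines lines, or earlier at end of file,
        # so the rest of the file is never read.
        sample = "".join(islice(fp, n_lines))
    return csv.Sniffer().sniff(sample).delimiter
```

The result could then be passed explicitly, for example as the sep argument of read_csv, so that the detection step and the parsing step are decoupled.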
Reopening to keep track of the sniffer exploration |
I discovered this beautiful library "CleverCSV" that may solve our (my) problem: https://github.com/alan-turing-institute/CleverCSV It may be interesting to integrate it into pandas directly, or to use it as a backend to read csv instead of the built-in csv library... |
Hi @louisgeisler, I am also beginning to feel that it is important to be able to choose to have no separator in the csv (i.e. read as a single column csv) when using read_csv. Let me comment on your code.
file.readlines(N_LINE_FOR_SNIFFER)
This part of the code works to read
How about the following, for example?
def get_first_n_lines(path, n_lines: int):
    first_n_lines = []
    with open(path, 'r') as file:
        for _ in range(n_lines):
            line = file.readline()
            if not line:
                break
            first_n_lines.append(line)
    return first_n_lines
Also, how about ',\t; :' as a random separator? This is a result of |
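One point worth illustrating about readlines: its optional argument is a size hint, not a line count. The sketch below contrasts it with taking an exact number of lines; it uses io.StringIO and made-up data so that it runs without a file.

```python
import io
from itertools import islice

# 100 made-up lines, 7 characters each including the newline.
text = "".join(f"row{i:03d}\n" for i in range(100))

# readlines(3) treats 3 as a size hint: reading stops once the lines read
# so far exceed that size, which here happens after the first line.
by_hint = io.StringIO(text).readlines(3)

# islice takes exactly the first 3 lines (fewer if the input is shorter).
by_count = list(islice(io.StringIO(text), 3))

print(len(by_hint), len(by_count))
```

Reading by count, as get_first_n_lines does, therefore gives a predictable number of lines for the sniffer regardless of line length.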
You're totally right about readlines 👍 However, I would disagree with you about the default separator to use. I mean, CSV literally means Comma Separated Values... So in my opinion, it seems more logical to fall back to a comma as the default value if the sniffer fails, doesn't it? |
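The two ideas discussed here, a fixed set of candidate separators and a comma fallback, can be combined in a short sketch. guess_separator and CANDIDATES are hypothetical names for this illustration and are not taken from pandas; the delimiters argument of csv.Sniffer.sniff is the standard-library way to restrict which characters may be returned.

```python
import csv

# Candidate separators proposed in the thread: comma, tab, semicolon,
# space and colon.
CANDIDATES = ",\t; :"


def guess_separator(sample, fallback=","):
    """Sniff among CANDIDATES only; return fallback if the sniffer fails.

    Hypothetical helper for illustration only.
    """
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
    except csv.Error:
        return fallback


print(repr(guess_separator("HEADER\nTHEREWAS\n")))
print(repr(guess_separator("a;b\n1;2\n")))
```

With the candidates restricted, a letter such as "E" can no longer be chosen. Note that a single-column file whose values contain spaces or colons could still be split on those characters, so this narrows the problem rather than removing it.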
…3153) docs: pandas-dev#53035 clarify the behavior of sep=None.
How often will there be a different delimiter other than |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When pd.read_csv is used with the parameter sep=None, it fails to automatically detect a single-column csv.
I have already read this conversation (2016): #13839
They had the same problem but came to the conclusion that the error comes from Python's csv sniffer.
But I think this is no longer relevant, as csv.Sniffer now raises an error while pandas still insists on parsing the file incorrectly.
Expected Behavior
It should automatically detect it is a single column file, or at least raise an error.
Installed Versions
INSTALLED VERSIONS
commit : 37ea63d
python : 3.8.16.final.0
python-bits : 64
OS : Linux
OS-release : 5.19.0-41-generic
Version : #42~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Apr 18 17:40:00 UTC 2
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.0.1
numpy : 1.23.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 67.6.1
pip : 23.1
Cython : 0.29.34
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.1
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.0
snappy : None
sqlalchemy : 2.0.9
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None