ENH: Add parameter to read_html() that disables the _remove_whitespace() function #59827
Comments
Where can we find the read_html function? Also, may I take this?
Thanks for the report. Can you supply an HTML table that reproduces the issue you are experiencing?
@rhshadrach Yes, here is an example. Unfortunately, HTML files are apparently not supported for uploading, so I changed the file extension to txt; you should be able to change the extension back to html without any issues. For some reason, when I read this in using read_html(), it reads each column as its own data frame and puts them in a list, but they are easily merged using pd.concat(). Once they're merged, see row 124 and column (label, not index) 6. In the HTML file it is 'eCigUse:  Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)', but it is read into the data frame as 'eCigUse: Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)'. Immediately after the colon, there are two spaces in the HTML document, which are reduced to one space in the data frame. I see how this is normally beneficial, but I would like the option to turn it off in this specific case.
@bruhnugget-nice I'm not sure what your second question is, but here is the link to the read_html function: https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
Hey, sorry for replying so late, but I have been looking through the read_html function, and from my perspective it seems so complex, and I just don't understand what you are trying to do. Look at the file: Line 1010 in 2419343
Thanks for the example. When writing bug reports, it is appreciated to produce minimal examples. The file you've provided has lots of unnecessary javascript that can be stripped out. See https://matthewrocklin.com/minimal-bug-reports for more.
I cannot reproduce:
result = pd.read_html("HINTS.Changes.html")
print(result[0].columns)
# Index(['Question', '6', '5 Cycle 4', '5 Cycle 3', '5 Cycle 2', '5 Cycle 1'], dtype='object')
I believe this is standard for HTML - multiple spaces are treated as a single space. All modern browsers will take two or more spaces and display them as one. When spaces are important, there are various options such as It seems to me having a browser display the table one way and having pandas read it another way is undesirable.
@rhshadrach Sorry for my more complex example. I didn't think of that. I will attach a minimal example with this comment. I'm not a web programmer, so I have not had a large amount of interaction with HTML. However, in the experience I do have, browsers do not treat double spaces as single spaces. Try opening the test file I have provided in a browser. You will see that it preserves the double spaces in the text. Sorry if it's hard to tell. For me, the spaces render at the end of each line, but you can see they're both there if you highlight the text. If you get a different result, I am testing this on Windows with Chrome. Regardless of how browsers normally interpret it, I have a specific case where I would like a parameter that allows me to turn this functionality off. I don't think you need to change the default, because the current behavior definitely seems like it should be the default. On the other hand, it seems unlikely that this edge case will ever come up again for another user. I do think it's very easy to add this parameter, but I don't know what it's like to maintain a huge, very popular Python package, so I would not be upset if this is not important enough to be changed. I already have a functional workaround, so it truly won't be a big deal if this doesn't happen.
@dstone42: If you strip out all the Javascript / CSS, you do not get multiple spaces.
<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<table>
<thead>
<tr>
<th dir="ltr">Question</th>
<th dir="auto">6</th>
<th dir="ltr">5 Cycle 4</th>
<th dir="ltr">5 Cycle 3</th>
<th dir="ltr">5 Cycle 2</th>
<th dir="ltr">5 Cycle 1</th>
</tr>
</thead>
<tbody>
<tr>
<td dir="ltr">Electronic Cigarette Use (Composite of UsedECigEver and UseECigNow)</td>
<td dir="ltr">eCigUse: Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse. Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse. Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse. Electronic Cigarette Use (Derived from UsedECigEver and UseECigNow; see History Document for more information)</td>
<td dir="ltr">eCigUse. Electronic Cigarette Use (Derived from ElectCigLessHarm, UsedECigEver, and UseECigNow; see History Document for more information)</td>
</tr>
</tbody>
</table>
</body>
</html>
From the first column:
That said, such an argument does seem like it could be generally beneficial for certain use cases. I would be supportive of such an argument if the implementation is light-weight. Further investigations are welcome!
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I have an HTML file that shows question names from a surveillance survey and how they have changed over the years, which I use to merge multiple files with slightly different names from those different years. I need to read in this file with the text exactly as it is in the HTML file so that those question names map exactly to the ones I mine from a PDF. The _remove_whitespace() function replaces all of the extra whitespace with single spaces, but there are some errors in these column names where they accidentally put two spaces or other similar things, and I need that text to match exactly so I can properly clean the other files in the dataset.
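To make the mismatch concrete, here is a minimal sketch of the kind of normalization described above. It assumes the private helper strips both ends and replaces newlines and runs of two or more whitespace characters with a single space; `collapse_whitespace` and its pattern are stand-ins written for illustration, not code copied from pandas.

```python
import re

# Assumption: this pattern approximates the rule applied by pandas' private
# _remove_whitespace() helper; it is an illustrative stand-in.
_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def collapse_whitespace(text: str) -> str:
    """Strip both ends, then replace newlines or whitespace runs with one space."""
    return _WHITESPACE.sub(" ", text.strip())


raw = "eCigUse:  Electronic Cigarette Use"  # two spaces after the colon
cleaned = collapse_whitespace(raw)

print(repr(raw))
print(repr(cleaned))
print(raw == cleaned)  # the cleaned label no longer matches the source text
```

Once the double space is collapsed, an exact string match against the labels taken from the other source fails, which is the problem reported here.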
Feature Description
Add a new parameter to the read_html() function that can disable the _remove_whitespace() function.
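One possible shape for such an option, sketched on a toy cell-cleaning function rather than on pandas itself. The parameter name `normalize_whitespace` and the function `clean_cell` are hypothetical; read_html() has no such argument as far as this thread establishes, and the regular expression is only an approximation of the current behavior.

```python
import re

# Assumption: approximates the whitespace rule discussed in this issue.
_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def clean_cell(text: str, normalize_whitespace: bool = True) -> str:
    """Return cell text, optionally leaving its whitespace untouched.

    `normalize_whitespace` is a hypothetical flag name used for illustration.
    """
    if not normalize_whitespace:
        return text  # keep the text exactly as it appears in the document
    return _WHITESPACE.sub(" ", text.strip())


cell = "eCigUse:  Electronic Cigarette Use"
print(repr(clean_cell(cell)))                              # default: collapsed
print(repr(clean_cell(cell, normalize_whitespace=False)))  # opt-out: preserved
```

Keeping the default at `True` would leave existing behavior unchanged, so only callers who opt out would see raw whitespace.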
Alternative Solutions
The file I am using was originally md, which I converted to HTML because I didn't think there was a way to read from md into pandas. I have since found a way, so instead of converting to HTML and reading from that, I am reading straight from the md. However, if my original file were HTML, I would probably create a similar solution of going back down to the read_table() function and manually making the changes I want to the cleaning.
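For a file that really is HTML, one workaround along these lines is to skip read_html() and collect the cell text directly. The sketch below uses only the standard library's `html.parser`; the class name `RawTableParser` and the sample markup are illustrative, and it handles only a single simple table without nested tables, rowspan, or colspan.

```python
from html.parser import HTMLParser
from typing import List, Optional


class RawTableParser(HTMLParser):
    """Collect <th>/<td> text exactly as written, without collapsing spaces."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.rows: List[List[str]] = []
        self._row: Optional[List[str]] = None   # cells of the row being read
        self._cell: Optional[List[str]] = None  # text chunks of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


html = (
    "<table>"
    "<tr><th>Question</th><th>6</th></tr>"
    "<tr><td>Electronic Cigarette Use</td>"
    "<td>eCigUse:  Electronic Cigarette Use</td></tr>"
    "</table>"
)

parser = RawTableParser()
parser.feed(html)
parser.close()

header, *body = parser.rows
print(header)
print(body)
```

The resulting `body` and `header` lists can then be passed to `pd.DataFrame(body, columns=header)` if a data frame is needed.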
Additional Context
No response