Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I have an HTML file that lists question names from a surveillance survey and shows how they have changed over the years. I use it to merge multiple files, one per year, whose question names differ slightly. I need to read this file with the text exactly as it appears in the HTML so that the question names map exactly onto the ones I mine from a PDF. The _remove_whitespace() function replaces all extra whitespace with single spaces, but some of these column names contain mistakes such as an accidental double space. I need that text to match exactly so that I can properly clean the other files in the dataset.
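To make the mismatch concrete, here is a minimal sketch of the kind of normalization described above. The helper `collapse_whitespace` and its regex are my own stand-ins that approximate the behavior and are not copied from the pandas source; the question name is made up.

```python
import re

# Assumed approximation of the normalization: trim the ends, then replace
# line breaks and runs of two or more whitespace characters with one space.
_WHITESPACE = re.compile(r"[\r\n]+|\s{2,}")


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for the whitespace cleanup applied to cell text."""
    return _WHITESPACE.sub(" ", text.strip())


# Hypothetical question name: the PDF keeps the accidental double space.
name_in_pdf = "Ever told  you had asthma"
name_after_read_html = collapse_whitespace(name_in_pdf)

print(repr(name_after_read_html))
print(name_after_read_html == name_in_pdf)
```

Once the double space is collapsed, the name read from the HTML no longer equals the name taken from the PDF, so an exact-match lookup between the two sources fails.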
Feature Description
Add a new parameter to the read_html() function that can disable the _remove_whitespace() function.
Alternative Solutions
The file I am using was originally Markdown; I converted it to HTML because I didn't think there was a way to read Markdown into pandas. I have since found that there is, so instead of converting to HTML and reading that, I now read straight from the Markdown file. If my original file were HTML, though, I would probably build a similar solution: go back down to the read_table() function and manually apply only the cleaning I want.
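import pandas as pd

def read_md(filename):
    # Parse the pipe-delimited Markdown table; drop the empty edge columns
    # and the "---" separator row under the header.
    table = pd.read_table(filename, sep='|').dropna(axis=1, how='all').iloc[1:]
    # Trim only the padding around each name and value; inner spaces stay.
    table.columns = table.columns.str.strip()
    for col in table.columns:
        table[col] = table[col].str.strip()
    return table.reset_index(drop=True)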
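For the case where the original file is HTML, a workaround along the same lines could look like the sketch below. It is not how read_html() works internally. The names `VerbatimTableParser` and `read_html_verbatim` are hypothetical, the sample table is made up, and the parser assumes a single, simple table with explicit closing tags and no nesting.

```python
from html.parser import HTMLParser


class VerbatimTableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell without altering whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep the text exactly as written, including repeated spaces.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def read_html_verbatim(html_text):
    """Return the table as a list of rows with cell text left untouched."""
    parser = VerbatimTableParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


# Made-up sample: the first data cell contains an accidental double space.
sample = (
    "<table>"
    "<tr><th>2019 name</th><th>2020 name</th></tr>"
    "<tr><td>Ever told  you had asthma</td><td>Ever told you had asthma</td></tr>"
    "</table>"
)
rows = read_html_verbatim(sample)
print(rows[1])
```

The rows can then be handed to pandas, for example with `pd.DataFrame(rows[1:], columns=rows[0])`, and any trimming applied selectively as in read_md() above.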
Additional Context
No response