ENH: Add parameter to read_html() that disables the _remove_whitespace() function #59827

Open
@dstone42

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I have an HTML file that lists question names from a surveillance survey and shows how they have changed over the years. I use it to merge multiple files whose question names differ slightly from year to year. I need to read this file with the text exactly as it appears in the HTML, so that the question names match the ones I extract from a PDF. The _remove_whitespace() function replaces all extra whitespace with single spaces, but some of these column names contain errors, such as an accidental double space. I need that text to match exactly so that I can properly clean the other files in the dataset.
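To make the mismatch concrete, here is a minimal sketch of whitespace collapsing. The `collapse_whitespace` helper, its regex, and the sample question name are assumptions for illustration only; they approximate the behavior described above and are not copied from pandas, whose actual implementation may differ.

```python
import re

# Hypothetical stand-in for the normalization described above:
# strip both ends, then replace every run of 2+ whitespace characters
# with a single space.
_EXTRA_WHITESPACE = re.compile(r"\s{2,}")


def collapse_whitespace(text: str) -> str:
    return _EXTRA_WHITESPACE.sub(" ", text.strip())


# Made-up question name containing an accidental double space,
# as it might appear in the PDF.
name_in_pdf = "How often  do you exercise?"
name_after_read = collapse_whitespace(name_in_pdf)

print(repr(name_after_read))
# The normalized name no longer equals the original, so an exact-match
# lookup against the PDF names would fail.
print(name_after_read == name_in_pdf)
```

Once the double space is collapsed, the name read from the table can no longer be used as an exact key for the name taken from the PDF.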

Feature Description

Add a new parameter to the read_html() function that can disable the _remove_whitespace() function.

Alternative Solutions

The file I am using was originally Markdown; I converted it to HTML because I didn't think there was a way to read Markdown into pandas. I recently found out that there is, so instead of converting to HTML and reading from that, I now read straight from the Markdown file. However, if my original file were HTML, I would probably build a similar solution: drop down to the read_table() function and manually apply the cleaning I want.

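import pandas as pd

def read_md(filename):
    # Split on '|', drop the empty edge columns, and skip the '---' separator row.
    table = pd.read_table(filename, sep='|').dropna(axis=1, how='all').iloc[1:]
    # Trim only the padding around each header and cell; inner whitespace is kept.
    table.columns = table.columns.str.strip()
    for col in table.columns:
        table[col] = table[col].str.strip()
    return table.reset_index(drop=True)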
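For the HTML case mentioned above, one possible workaround is to extract the cell text with Python's standard-library `html.parser`, which leaves whitespace inside cells untouched. This is only a sketch under stated assumptions: `RawTableParser`, `read_html_raw`, and the sample markup are hypothetical names and data, and the parser ignores nested tables, `colspan`, and `rowspan`.

```python
from html.parser import HTMLParser


class RawTableParser(HTMLParser):
    """Collect the text of <th>/<td> cells without normalizing whitespace."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep the text exactly as it appears in the markup.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def read_html_raw(html):
    """Return table rows as lists of unmodified cell strings (hypothetical helper)."""
    parser = RawTableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample: the first data cell contains an accidental double space.
sample = (
    "<table>"
    "<tr><th>2019 name</th><th>2020 name</th></tr>"
    "<tr><td>Q1  age</td><td>Q1 age</td></tr>"
    "</table>"
)
rows = read_html_raw(sample)
print(rows)
```

The resulting rows can then be passed to `pd.DataFrame(rows[1:], columns=rows[0])`, treating the first row as the header.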

Additional Context

No response

Metadata

Assignees

No one assigned

    Labels

    Enhancement; IO HTML (read_html, to_html, Styler.apply, Styler.applymap); Needs Discussion (Requires discussion from core team before further action)
