How can I force Pandas read_html function to read digit field as string not integer #30589

Closed
@Zhenye-Na

Is there a way to make read_html keep a field as str instead of converting it to int?

I have looked through related issues such as #10534 and #21379, as well as the test at https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715

I do not think the converters argument fits our use case: the table is updated every day and may gain a new column, in which case we would have to add a new key to the parameter manually.
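One way to avoid maintaining a per-column converters mapping is to skip type inference altogether and collect every cell as text. The sketch below does this with the standard library's html.parser instead of read_html, so it is a swapped-in technique rather than a pandas feature. The class name CellTextParser, the helper table_rows_as_text, and the sample HTML are all hypothetical, and the sketch assumes simple tables without nesting, colspan, or rowspan.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect every table row as a list of cell strings (no type inference)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse newlines and runs of spaces inside a cell to one space.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows_as_text(html):
    """Return all table rows in `html` as lists of str (hypothetical helper)."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample shaped like the table described in this issue.
sample_html = """
<table>
  <tr><th>Date</th><th>DC 1
Location 1</th><th>DC 2
Location 2</th></tr>
  <tr><td>03/04</td><td>1.23.4</td><td>007</td></tr>
</table>
"""
header, *body = table_rows_as_text(sample_html)
print(header)
print(body)
```

Because a new column simply shows up as one more string per row, nothing has to be updated when the table changes. If a DataFrame is still wanted, pd.DataFrame(body, columns=header) should keep these values as strings, since no numeric parsing is applied to a list of str.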

Here is the full stack trace from calling the function:

PS C:\Users\Zhenye.na\Desktop> python3 .\dash-prod.py
.\dash-prod.py:4: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, Iterable
Traceback (most recent call last):
  File ".\dash-prod.py", line 59, in <module>
    df = pd.read_html(response.text, skiprows=1)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1105, in read_html
    displayed_only=displayed_only,
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 915, in _parse
    for table in tables:
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 213, in <genexpr>
    return (self._parse_thead_tbody_tfoot(table) for table in tables)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 411, in _parse_thead_tbody_tfoot
    header = self._expand_colspan_rowspan(header_rows)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 459, in _expand_colspan_rowspan
    colspan = int(self._attr_getter(td, "colspan") or 1)
ValueError: invalid literal for int() with base 10: '\\"1\\"'

The core usage of read_html function code is as follows:

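import pandas as pd
import requests
# url and hdrs are defined elsewhere in the script
response = requests.get(url, headers=hdrs)
df = pd.read_html(response.text, skiprows=1)[0]  # first table on the page
print(df)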

I would like to use read_html to extract the table from the response returned by the REST API. I have tested the function on a small table that contains only digits, and it works. The data returned from the REST API, however, contains both characters and digits.

Here is a demo of what the table looks like (assume DC 1 and Location 1 are separated by a single '\n' character):

Date  | DC 1 Location 1 | DC 2 Location 2 | DC 3 Location 3
03/04 | 1.23.4          | 1.23.4          | 1.23.4
04/05 | 1.23.4          | 1.23.4          | 1.23.4

I assume the error may be caused by the '.' characters in fields like 1.23.4, but I am not sure how to fix it.
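That said, the last frame of the traceback fails while converting a colspan attribute, not a cell value: the attribute arrives as \"1\", with literal backslashes and quotes. One possible explanation (an assumption, since the response body is not shown here) is that the API returns the HTML embedded in a JSON string, so the quotes are still escaped when the text reaches read_html. A minimal sketch of un-escaping the payload first; the helper name unescape_html_payload and the sample payload are hypothetical:

```python
import json


def unescape_html_payload(text):
    """Return HTML with JSON-style quote escapes removed (hypothetical helper)."""
    try:
        decoded = json.loads(text.strip())
    except ValueError:
        decoded = None
    if isinstance(decoded, str):
        # The whole body was a JSON string literal wrapping the HTML.
        return decoded
    # Otherwise only undo the escaped double quotes.
    return text.replace('\\"', '"')


# Hypothetical payload: HTML wrapped in a JSON string, quotes escaped.
raw = '"<table><tr><td colspan=\\"1\\">1.23.4</td></tr></table>"'
clean = unescape_html_payload(raw)
print(clean)
```

If this assumption holds, passing the cleaned text to read_html in place of response.text may get past the ValueError; if the API actually returns a JSON object, the HTML would need to be taken from the relevant key instead.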

Any ideas or thoughts are appreciated!

Thanks!
