How can I force the pandas `read_html` function to read a digit field as a string instead of an integer?

Is there a way to make `read_html` parse a field as `str` instead of `int`?

> I have looked through related issues and tests such as https://github.com/pandas-dev/pandas/issues/10534, https://github.com/pandas-dev/pandas/issues/21379, https://github.com/gte620v/pandas/blob/5cb8243f2dd31cc2155627f29cfc89bbf6d4b84b/pandas/io/tests/test_html.py#L715
>
> I do not think the `converters` argument fits our use case: the table is updated every day and may gain a new column, in which case we would have to add a new key to the parameter by hand.


Here is the full stack trace I get when calling the function:

```
PS C:\Users\Zhenye.na\Desktop> python3 .\dash-prod.py
.\dash-prod.py:4: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
  from collections import Mapping, Iterable
Traceback (most recent call last):
  File ".\dash-prod.py", line 59, in <module>
    df = pd.read_html(response.text, skiprows=1)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 1105, in read_html
    displayed_only=displayed_only,
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 915, in _parse
    for table in tables:
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 213, in <genexpr>
    return (self._parse_thead_tbody_tfoot(table) for table in tables)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 411, in _parse_thead_tbody_tfoot
    header = self._expand_colspan_rowspan(header_rows)
  File "C:\Users\Zhenye.na\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.7_qbz5n2kfra8p0\LocalCache\local-packages\Python37\site-packages\pandas\io\html.py", line 459, in _expand_colspan_rowspan
    colspan = int(self._attr_getter(td, "colspan") or 1)
ValueError: invalid literal for int() with base 10: '\\"1\\"'
```
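The last frame of the traceback can be reproduced without pandas. The following is a minimal sketch, assuming the `colspan` attribute reached the parser as the literal text `\"1\"` (backslashes and quotes included), which is what the error message shows. The helper name `parse_colspan` is hypothetical and only mirrors the expression on the failing line.

```python
# The value shown in the ValueError message: backslash, quote, 1, backslash, quote.
raw_colspan = '\\"1\\"'


def parse_colspan(value):
    """Mirror the failing expression from the traceback: int(value or 1)."""
    return int(value or 1)


try:
    parse_colspan(raw_colspan)
    error = None
except ValueError as exc:
    # int() rejects the escaped quotes around the digit.
    error = str(exc)

print(error)
```

A clean attribute value such as `"1"`, or a missing attribute, goes through the same expression without error, so the failure is tied to the escaped quotes in the attribute rather than to the cell contents.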

The core of the code that calls `read_html` is as follows:

```python
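import pandas as pd
import requests

response = requests.get(url, headers=hdrs)  # url and hdrs are defined elsewhere in the script
df = pd.read_html(response.text, skiprows=1)[0]  # read_html returns a list; take the first table
print(df)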
```

I would like to use the `read_html` function to extract the table from the response returned by the REST API. I have tested the function on a small table that contains only digits, and it works. The data returned by the REST API, however, contains both characters and digits.

Here is a demo of what the table looks like (assume that `DC 1` and `Location 1` are separated by a single `'\n'` character):


|  Date | DC 1  Location 1 | DC 2   Location 2 | DC 3   Location 3 |
|:-----:|:----------------:|:-----------------:|:-----------------:|
| 03/04 |      1.23.4      |       1.23.4      |       1.23.4      |
| 04/05 |      1.23.4      |       1.23.4      |       1.23.4      |


I assume the error may be caused by the `'.'` characters in fields like `1.23.4`, but I am not sure how to fix it.
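As an illustration of the desired behavior (every cell kept as text, with no per-column configuration), here is a minimal sketch that uses only the standard library's `html.parser`. The `CellTextParser` class and the sample HTML are hypothetical and only shaped like the demo table above. This is not a replacement for `read_html`, just a way to show the expected output.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect table rows as lists of strings, with no type inference."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # row currently being built
        self._cell = None  # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hypothetical sample shaped like the demo table.
sample_html = """
<table>
  <tr><th>Date</th><th>DC 1\nLocation 1</th></tr>
  <tr><td>03/04</td><td>1.23.4</td></tr>
  <tr><td>04/05</td><td>1.23.4</td></tr>
</table>
"""

parser = CellTextParser()
parser.feed(sample_html)
parser.close()
rows = parser.rows
print(rows)
```

Because nothing is converted, a value such as `1.23.4` stays exactly as it appears in the HTML, and a new column needs no extra configuration.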

Any ideas or thoughts are appreciated!

Thanks!

