ENH: Add I/O support of XML with pandas.read_xml and DataFrame.to_xml… #39516

Merged
merged 60 commits into from
Feb 27, 2021
b67d876
ENH: Add i/o support of XML with pandas.read_xml and DataFrame.to_xml…
ParfaitG Feb 1, 2021
98e3bcd
Merge branch 'master' into read_xml
ParfaitG Feb 1, 2021
cd79a06
Refactor code for base classes, add tests, adjust whatsnew entry
ParfaitG Feb 3, 2021
6c06dc2
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 3, 2021
fadcb67
Fixed import_optional_dependency() args
ParfaitG Feb 3, 2021
ac5fd3a
Fix fixture and param name collision and check two errors in tests
ParfaitG Feb 3, 2021
25ba341
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 3, 2021
143402a
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 3, 2021
938b0a0
Adjusted tests to handle etree version issues
ParfaitG Feb 3, 2021
a92c21e
Add appropriate etree skips in tests
ParfaitG Feb 3, 2021
51f10f2
Remove check for warnings in tests
ParfaitG Feb 3, 2021
3520d58
Adjust code to conform to mypy and docstring validation
ParfaitG Feb 4, 2021
4832562
Add read_xml to TestPDApi test and fix for etree test
ParfaitG Feb 4, 2021
2914c32
Add read_xml to TestPDApi test and fix for etree test
ParfaitG Feb 4, 2021
72d0e93
Replace lxml ImportWarning for ImportError with added tests
ParfaitG Feb 4, 2021
6453f6e
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 4, 2021
8af695e
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 5, 2021
b80b8ce
Adjust fixture for lxml skip and add error validation in tests
ParfaitG Feb 5, 2021
a6cfc90
Add conditional skips for envs without lxml
ParfaitG Feb 5, 2021
6c4e0b4
Clean up whatnew rst of rebase issue
ParfaitG Feb 5, 2021
a57fd35
Fix unescaped emphasis and wording in read_xml docstring
ParfaitG Feb 5, 2021
16cbcd3
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 8, 2021
23439b4
Add XML section in io.rst and lxml dependency for read_xml in install…
ParfaitG Feb 8, 2021
2effae0
Add section title in whatsnew and tree builder for lxml dependency in…
ParfaitG Feb 10, 2021
878eebe
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 10, 2021
35fa6a6
Clean up merge issue in whatsnew, remove escape in io.rst, adjust exc…
ParfaitG Feb 11, 2021
80d44f9
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 11, 2021
f861d53
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 11, 2021
947840a
Remove redundant try/except and fix default namespace condition
ParfaitG Feb 16, 2021
f8dc56c
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 16, 2021
cb34dde
Replace path or buffer handling with get_handle and add compression a…
ParfaitG Feb 20, 2021
3133486
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 20, 2021
a7716b8
Fix issues in tests from other Python envs
ParfaitG Feb 21, 2021
701d225
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 21, 2021
5b93c16
Fix precommit issue with import line
ParfaitG Feb 21, 2021
9a0dfb4
Adjust code and tests per twoertwein comments
ParfaitG Feb 21, 2021
9556035
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 21, 2021
82ac370
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 22, 2021
c478cb0
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 22, 2021
e23200d
Remove redundancy and object names in XML parse and rename tests for …
ParfaitG Feb 23, 2021
b0b3759
Resolve merge conflict with upstream/master
ParfaitG Feb 23, 2021
b48e257
Add XML table in install.rst
ParfaitG Feb 23, 2021
453ac40
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 23, 2021
9b21636
Streamline filepath_or_buffer handling and add TypeError tests
ParfaitG Feb 23, 2021
bea318c
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 23, 2021
49343b1
Fix lxml test on few Python envs
ParfaitG Feb 23, 2021
ce986bc
Adjust io handling in context manager
ParfaitG Feb 24, 2021
347d58b
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 24, 2021
e2f80db
Add and fix tests for special filepath_or_buffer values
ParfaitG Feb 24, 2021
c7e1e11
Fix tests for better example and wrong parser
ParfaitG Feb 24, 2021
9790e7c
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 24, 2021
df9ecf4
Adjust to handle empty string stylesheet with tests
ParfaitG Feb 24, 2021
46719b7
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 24, 2021
5d75d51
Move methods out of class, adjust xpath check, and data frame formatting
ParfaitG Feb 25, 2021
66c01d2
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 25, 2021
5c0af6e
Update tests to conform to mypy
ParfaitG Feb 25, 2021
2eae8ad
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 25, 2021
603644e
Import methods to avoid duplication and add typing to parse_doc
ParfaitG Feb 27, 2021
3ec7297
Merge remote-tracking branch 'upstream/master' into read_xml
ParfaitG Feb 27, 2021
6194f83
Refactor code and revert changes to avoid optional module type hints
ParfaitG Feb 27, 2021
Clean up merge issue in whatsnew, remove escape in io.rst, adjust exceptions with added tests
ParfaitG committed Feb 11, 2021
commit 35fa6a6b4e0e99217d05a1e5786559a8544dd890
2 changes: 1 addition & 1 deletion doc/source/user_guide/io.rst
Expand Up @@ -3012,7 +3012,7 @@ But assiging *any* temporary name to correct URI allows parsing by nodes.
namespaces={"pandas": "https://example.com"})
df

However, if XPath does not reference node names such as default, ``/\*``, then
However, if XPath does not reference node names such as default, ``/*``, then
``namespaces`` is not required.

With `lxml`_ as parser, you can flatten nested XML documents with an XSLT
Expand Down
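
The namespace rule in this hunk can be illustrated without pandas. The following is a minimal sketch using the standard library's `xml.etree.ElementTree`, not `read_xml` itself; the sample document and its element names are assumptions made for illustration, and only the `https://example.com` URI and the `pandas` prefix come from the docs excerpt above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with a default namespace (URI mirrors the docs example).
xml = """<?xml version="1.0"?>
<data xmlns="https://example.com">
  <row><shape>square</shape><sides>4</sides></row>
  <row><shape>circle</shape><sides/></row>
</data>"""

root = ET.fromstring(xml)

# A step that names a node needs a prefix bound to the default namespace URI.
# The prefix itself is arbitrary; only the URI has to match.
ns = {"pandas": "https://example.com"}
named = root.findall("pandas:row", ns)

# Without the mapping, the bare name matches nothing.
unqualified = root.findall("row")

# A wildcard step does not reference a node name, so no mapping is needed.
wildcard = root.findall("*")

print(len(named), len(unqualified), len(wildcard))
```

The same reasoning is why the docs say ``namespaces`` can be omitted when the XPath only uses wildcards such as ``/*``.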
6 changes: 0 additions & 6 deletions doc/source/whatsnew/v1.3.0.rst
Expand Up @@ -33,12 +33,6 @@ For example:
storage_options=headers
)

.. _whatsnew_130.window_method_table:

:class:`Rolling` and :class:`Expanding` now support a ``method`` argument with a
``'table'`` option that performs the windowing operation over an entire :class:`DataFrame`.
See ref:`window.overview` for performance and functional benefits. (:issue:`15095`)

.. _whatsnew_130.read_to_xml:

Read and write XML documents
Expand Down
9 changes: 3 additions & 6 deletions pandas/io/formats/xml.py
Expand Up @@ -6,7 +6,6 @@
import io
from typing import Any, Dict, List, Optional, Union
from urllib.error import HTTPError, URLError
from warnings import warn

from pandas._typing import FilePathOrBuffer
from pandas.errors import AbstractMethodError
Expand Down Expand Up @@ -252,7 +251,7 @@ def write_output(self) -> Optional[str]:

out_str = None

except (UnicodeDecodeError, OSError, FileNotFoundError) as e:
except (OSError, FileNotFoundError) as e:
raise e
Contributor

why are you catching this specifically? the try/except really isn't doing anything

Contributor Author

Hmmm...actually this does catch exceptions. If user passes in a file name that cannot be saved, FileNotFoundError is raised and there is a test for that: geom_df.to_xml("/my/fake/path/output.xml", parser=parser). I am removing UnicodeDecodeError which relates more to Python 2 (borrowed from to_html). OSError is general class error and to test I need permissions issue, disk full error, or other OS reasons.

Contributor

sure but since you aren't changing the exception (type or message), this is not doing anything. just remove the try/except.

Contributor Author

Got it. Will do.
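
The reviewer's point can be checked directly. This is a minimal sketch with two hypothetical helpers (`write_with_reraise` and `write_plain` are not pandas functions): one wraps the write in a `try/except` that re-raises the same exception, the other does not. The caller sees the same exception type and errno either way.

```python
import os
import tempfile


def write_with_reraise(path: str, text: str) -> None:
    # Hypothetical helper: catches the error only to raise it unchanged.
    try:
        with open(path, "w") as f:
            f.write(text)
    except (OSError, FileNotFoundError) as e:
        raise e


def write_plain(path: str, text: str) -> None:
    # Same write without the wrapper.
    with open(path, "w") as f:
        f.write(text)


def error_from(func) -> OSError:
    # Call func on a path whose parent directory does not exist.
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "missing", "output.xml")
        try:
            func(missing, "<data/>")
        except OSError as e:
            return e
    raise AssertionError("expected an OSError")


wrapped = error_from(write_with_reraise)
plain = error_from(write_plain)
print(type(wrapped).__name__, type(plain).__name__)
```

A wrapper like this only earns its place when it changes the exception type or message, which is the condition the reviewer names.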


return out_str
Expand Down Expand Up @@ -299,10 +298,8 @@ def build_tree(self) -> bytes:
self.out_xml = self.remove_declaration()

if self.stylesheet:
warn(
"To use stylesheet, you need lxml installed. "
"Instead, the non-transformed, original XML is returned.",
UserWarning,
raise ValueError(
"To use stylesheet, you need lxml installed and selected as parser."
)

return self.out_xml
Expand Down
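
To make the behavior change in this hunk concrete, here is a hedged sketch of the two approaches side by side. The function names and the simplified signature are assumptions for illustration; they are not the `build_tree` method from the PR, and the sketch reuses the new message text for both variants.

```python
import warnings
from typing import Optional

MSG = "To use stylesheet, you need lxml installed and selected as parser."


def build_tree_warn(xml: str, stylesheet: Optional[str] = None) -> str:
    # Earlier approach (sketch): warn, then return the untransformed XML.
    if stylesheet:
        warnings.warn(MSG, UserWarning)
    return xml


def build_tree_raise(xml: str, stylesheet: Optional[str] = None) -> str:
    # Approach after this commit (sketch): refuse instead of silently
    # returning output that ignores the stylesheet.
    if stylesheet:
        raise ValueError(MSG)
    return xml


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned_result = build_tree_warn("<data/>", stylesheet="<xsl/>")

try:
    build_tree_raise("<data/>", stylesheet="<xsl/>")
    raised = False
except ValueError:
    raised = True

print(warned_result, len(caught), raised)
```

With the warning, a caller who misses the message still receives XML that looks valid but was never transformed; raising makes the unsupported combination impossible to overlook.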
7 changes: 2 additions & 5 deletions pandas/io/xml.py
Expand Up @@ -6,7 +6,6 @@
import io
from typing import Dict, List, Optional, Union
from urllib.error import HTTPError, URLError
from warnings import warn

from pandas._typing import FilePathOrBuffer
from pandas.compat._optional import import_optional_dependency
Expand Down Expand Up @@ -231,10 +230,8 @@ def __init__(self, *args, **kwargs):
def parse_data(self) -> List[Dict[str, Optional[str]]]:

if self.stylesheet:
warn(
"To use stylesheet, you need lxml installed. "
"Nodes will be parsed on original XML at the xpath.",
UserWarning,
raise ValueError(
"To use stylesheet, you need lxml installed and selected as parser."
)

self.xml_doc = self._parse_doc()
Expand Down
24 changes: 21 additions & 3 deletions pandas/tests/io/formats/test_to_xml.py
Expand Up @@ -24,15 +24,15 @@
[X] - LookupError("unknown encoding")
[X] - KeyError("...is not included in namespaces")
[X] - KeyError("no valid column")
[] - OSError (GENERAL ERROR WITH FileNotFoundError AS SUBCLASS)
[X] - ValueError("To use stylesheet, you need lxml installed...")
[] - OSError (NEED PERMISSION ISSUE, DISK FULL, ETC.)
[X] - FileNotFoundError("No such file or directory")

lxml
[X] - TypeError("...is not a valid type for attr_cols")
[X] - TypeError("...is not a valid type for elem_cols")
[X] - LookupError("unknown encoding")
[] - UnicodeDecodeError (NEED NON-UTF-8 STYLESHEET)
[] - OSError (GENERAL ERROR WITH FileNotFoundError AS SUBCLASS)
[] - OSError (NEED PERMISSION ISSUE, DISK FULL, ETC.)
[X] - FileNotFoundError("No such file or directory")
[X] - KeyError("...is not included in namespaces")
[X] - KeyError("no valid column")
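
The unchecked `OSError` entries note that a real permissions or disk-full failure is hard to produce in a test. One possible way to cover that branch, sketched here under assumptions, is to patch `open` so it raises. `save_text` is a hypothetical stand-in for a writer; the PR's code goes through pandas' own handle helpers, so a real test would likely need to patch a different target.

```python
import builtins
from unittest import mock


def save_text(path: str, text: str) -> None:
    # Hypothetical stand-in for a writer that opens the target path.
    with open(path, "w") as f:
        f.write(text)


def raises_oserror_when_open_fails() -> bool:
    # Patch builtins.open so any call raises PermissionError,
    # which is a subclass of OSError.
    with mock.patch.object(
        builtins, "open", side_effect=PermissionError("denied")
    ):
        try:
            save_text("output.xml", "<data/>")
        except OSError:
            return True
    return False


print(raises_oserror_when_open_fails())
```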
Expand Down Expand Up @@ -1153,6 +1153,24 @@ def test_incorrect_xsl_apply(parser):
geom_df.to_xml(path, stylesheet=xsl)


def test_stylesheet_with_etree(datapath):
    xsl = """\
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="utf-8" indent="yes" />
<xsl:strip-space elements="*"/>

<xsl:template match="@*|node(*)">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>"""

    with pytest.raises(
        ValueError, match=("To use stylesheet, you need lxml installed")
    ):
        geom_df.to_xml(parser="etree", stylesheet=xsl)


@td.skip_if_no("lxml")
def test_style_to_csv():
xsl = """\
Expand Down
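
The new test passes only the start of the error message to `match`. That works because `pytest.raises(match=...)` applies `re.search` to the string form of the exception rather than comparing for equality. A small standard-library sketch of that check follows; `message_matches` is a hypothetical helper name.

```python
import re


def message_matches(pattern: str, message: str) -> bool:
    # Same kind of check pytest.raises(match=...) performs:
    # a regex search, not an exact comparison.
    return re.search(pattern, message) is not None


full_message = "To use stylesheet, you need lxml installed and selected as parser."
prefix = "To use stylesheet, you need lxml installed"

# A leading fragment of the message is enough.
prefix_ok = message_matches(prefix, full_message)

# To match a whole message literally, escape regex metacharacters first.
literal_ok = message_matches(re.escape(full_message), full_message)

# An unrelated pattern does not match.
other_ok = message_matches("unknown encoding", full_message)

print(prefix_ok, literal_ok, other_ok)
```

The parentheses around the string in `match=("...")` are plain grouping and do not create a tuple.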
11 changes: 11 additions & 0 deletions pandas/tests/io/test_xml.py
Expand Up @@ -24,6 +24,7 @@
[X] - ValueError("names does not match length of child elements in xpath.")
[X] - TypeError("...is not a valid type for names")
[X] - ValueError("io is not a url, file, or xml string")
[X] - ValueError("To use stylesheet, you need lxml installed...")
[] - URLError (GENERAL ERROR WITH HTTPError AS SUBCLASS)
[X] - HTTPError("HTTP Error 404: Not Found")
[] - OSError (GENERAL ERROR WITH FileNotFoundError AS SUBCLASS)
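
The checklist treats `URLError` and `OSError` as general classes with `HTTPError` and `FileNotFoundError` as subclasses. The relationships can be confirmed with the standard library. The `classify` helper below is hypothetical and only shows why a handler for the subclass must come before the handler for its parent.

```python
from urllib.error import HTTPError, URLError

# Subclass relationships referenced in the checklist.
http_is_url = issubclass(HTTPError, URLError)
url_is_os = issubclass(URLError, OSError)
fnf_is_os = issubclass(FileNotFoundError, OSError)


def classify(exc: Exception) -> str:
    # Hypothetical helper: most specific handlers first, since a parent
    # class listed earlier would swallow its subclasses.
    try:
        raise exc
    except HTTPError:
        return "HTTPError"
    except URLError:
        return "URLError"
    except FileNotFoundError:
        return "FileNotFoundError"
    except OSError:
        return "OSError"


print(http_is_url, url_is_os, fnf_is_os)
print(classify(URLError("no host")), classify(PermissionError("denied")))
```

This is also why a test for `FileNotFoundError` or `HTTPError` alone does not cover the general `OSError` and `URLError` entries, which remain unchecked in the list.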
Expand Down Expand Up @@ -844,6 +845,16 @@ def test_wrong_stylesheet():
read_xml(kml, stylesheet=xsl)


def test_stylesheet_with_etree(datapath):
    kml = os.path.join("data", "xml", "cta_rail_lines.kml")
    xsl = os.path.join("data", "xml", "flatten_doc.xsl")

    with pytest.raises(
        ValueError, match=("To use stylesheet, you need lxml installed")
    ):
        read_xml(kml, parser="etree", stylesheet=xsl)


@tm.network
@td.skip_if_no("lxml")
def test_online_stylesheet():
Expand Down