XMLSyntaxError during conversion to XML output #375

Closed
fortyfourforty opened this issue Jun 18, 2023 · 2 comments · Fixed by #462
Labels
bug Something isn't working

Comments


fortyfourforty commented Jun 18, 2023

Error URL: https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G

Error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 2, in <cell line: 2>
    content_source1 = trafilatura.extract(ssss, output_format='xml', include_comments=False)
  File "/usr/local/lib/python3.10/dist-packages/trafilatura/core.py", line 1091, in extract
    return determine_returnstring(document, output_format, include_formatting, tei_validation)
  File "/usr/local/lib/python3.10/dist-packages/trafilatura/core.py", line 788, in determine_returnstring
    returnstring = control_xml_output(output, output_format, tei_validation, document)
  File "/usr/local/lib/python3.10/dist-packages/trafilatura/xml.py", line 106, in control_xml_output
    output_tree = fromstring(control_string, CONTROL_PARSER)
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "", line 2
XMLSyntaxError: Failed to parse QName 'padding:', line 2, column 480

Function: trafilatura.extract(source, output_format='xml', include_comments=False)

@adbar adbar added the bug Something isn't working label Jun 19, 2023

adbar commented Jun 19, 2023

I can reproduce the bug; it only happens when the output format is set to XML.

trafilatura -u "https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G" -vv --xml

Archived copy of the page, saved so the bug can be reproduced later: https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G

@adbar adbar changed the title from "extract error" to "XMLSyntaxError during conversion to XML output" Jun 20, 2023
vbarbaresi added a commit to vbarbaresi/trafilatura that referenced this issue Dec 31, 2023
Fixes issue adbar#375

The bug happened when an element attribute name contained a `:` whose prefix didn't match any XML namespace (invalid XML). In the example it was `padding:1px=""; margin:15px=""`.
We can work around it by manually dropping those bad attributes.
I hope it doesn't impact performance too much.

To reproduce:
`trafilatura -u "https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G" --xml`

Minimal reproduction example:
```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml
```
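The cause described in the commit message can be illustrated without trafilatura. The sketch below is only an approximation: it uses Python's standard-library `xml.etree.ElementTree` parser as a stand-in for the lxml control parser that appears in the traceback, so the exact error text differs. The `urn:example` namespace and the `parses` helper are hypothetical, introduced purely for illustration.

```python
import xml.etree.ElementTree as ET

# Three small documents; only the attribute name on <ul> differs.
SAMPLES = {
    # Prefix "padding" is not bound to any namespace.
    "undeclared prefix": '<ul padding:px=""/>',
    # Same attribute, but the prefix is declared (hypothetical namespace URI).
    "declared prefix": '<ul xmlns:padding="urn:example" padding:px=""/>',
    # Attribute name as it appears in the bug report.
    "name from the report": '<ul padding:1px=""/>',
}


def parses(xml_text):
    """Return True if the standard-library XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


results = {label: parses(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'parses' if ok else 'rejected'}")
```

An HTML parser is lenient about such attribute names, which is presumably why the problem only surfaces once the extracted tree is serialized and checked as XML.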
vbarbaresi (Contributor) commented

I extracted a minimal reproducing example

echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b> <li>Saves the cost of two dedicated phone lines.</li> al station using Internet or cellular technology.</li> <li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml

The element needs to be long enough that it is not dropped during extraction. padding:1px; and margin:15px are then treated as attributes, and they are invalid in XML: the prefix before the : symbol doesn't match any XML namespace.

I proposed a fix in #462
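As a rough idea of what dropping the bad attributes can look like, here is a minimal sketch. It is not the code from #462: it uses the standard-library `xml.etree.ElementTree` instead of lxml, and the helpers `drop_colon_attributes` and `roundtrips` are hypothetical names chosen for this illustration. The rule it applies (remove any attribute whose plain name contains a `:`) is an assumption about the simplest possible filter.

```python
import xml.etree.ElementTree as ET


def drop_colon_attributes(root):
    """Remove attributes whose plain name contains ':' (hypothetical helper)."""
    removed = []
    for elem in root.iter():
        # Copy the keys first, since the dict is modified while looping.
        for name in list(elem.attrib):
            # Names in Clark notation ("{uri}local") carry a real namespace; keep them.
            if ":" in name and not name.startswith("{"):
                removed.append(name)
                del elem.attrib[name]
    return removed


def roundtrips(root):
    """Serialize the tree and report whether the result parses again as XML."""
    try:
        ET.fromstring(ET.tostring(root, encoding="unicode"))
    except ET.ParseError:
        return False
    return True


# Build a tree similar to what a lenient HTML parser could produce
# from the minimal example (attribute names taken from the report).
ul = ET.Element("ul", {"style": "", "padding:1px": "", "margin:15px": ""})
ET.SubElement(ul, "li").text = "Saves the cost of two dedicated phone lines."

before = roundtrips(ul)
removed = drop_colon_attributes(ul)
after = roundtrips(ul)
print(before, removed, after)
```

Filtering on the attribute name alone keeps the check cheap, at the cost of also discarding attributes that would have been valid had their prefix been declared.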

@adbar adbar linked a pull request Jan 2, 2024 that will close this issue
@adbar adbar closed this as completed in #462 Jan 2, 2024
adbar added a commit that referenced this issue Jan 2, 2024
* drop invalid XML element attributes

* pin lxml to < 5

* syntax

---------

Co-authored-by: Adrien Barbaresi <adbar@users.noreply.github.com>
Co-authored-by: Adrien Barbaresi <barbaresi@bbaw.de>