XMLSyntaxError during conversion to XML output #375
Comments
I can reproduce the bug; it only happens when the output format is set to XML.
URL saved to reproduce it later: https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G
Fixes issue adbar#375.

The bug happened when an element attribute contained a `:` that didn't match any XML namespace (invalid XML). In the example it was `padding:1px=""; margin:15px=""`. We can work around it by manually dropping those bad elements. I hope it doesn't impact performance too much.

To reproduce: `trafilatura -u https://web.archive.org/web/20230619162141/https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G --xml`

Minimal reproduction example:

```
echo 'Testing<ul style="" padding:1px; margin:15px""><b>Features:</b>
<li>Saves the cost of two dedicated phone lines.</li>
al station using Internet or cellular technology.</li>
<li>Requires no change to the existing Fire Alarm Control Panel configuration. The IPGSM-4G connects directly to the primary and secondary telephone ports.</li>
' | trafilatura --xml
```
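To illustrate the failure mode and the workaround described above, here is a minimal sketch. It is not the patch from the pull request: it uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies, and the helper name `drop_invalid_attributes` is hypothetical. It assumes that any attribute name containing a `:` outside of a resolved namespace (stored as `{uri}local`) can simply be discarded.

```python
import xml.etree.ElementTree as ET


def drop_invalid_attributes(tree):
    """Remove attributes whose raw name contains ':' (hypothetical helper).

    Properly namespaced attributes are stored as '{uri}local' and are kept;
    names such as 'padding:1px' carry an unbound prefix and are dropped.
    """
    for element in tree.iter():
        for name in list(element.attrib):  # copy keys, we mutate the dict
            if not name.startswith("{") and ":" in name:
                del element.attrib[name]
    return tree


# Build an element similar to the one that a lenient HTML parser produces
# from the broken style attribute in the report.
ul = ET.Element("ul", {"style": "", "padding:1px": "", "margin:15px": ""})
ul.text = "Features:"

# Serialization succeeds, but the result is not well-formed namespaced XML,
# so parsing it again fails (the analogue of the XMLSyntaxError in lxml).
broken = ET.tostring(ul, encoding="unicode")
try:
    ET.fromstring(broken)
    reparse_failed = False
except ET.ParseError:
    reparse_failed = True

# After dropping the offending attributes the round trip works.
cleaned = drop_invalid_attributes(ul)
roundtrip = ET.fromstring(ET.tostring(cleaned, encoding="unicode"))
```

The check walks every element once, so its cost grows linearly with the size of the tree; whether that matters in practice depends on the documents being processed.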
I extracted a minimal reproducing example.
The element needs to be long enough not to be dropped. I then proposed a fix in #462.
* drop invalid XML element attributes

  Fixes issue #375

* pin lxml to < 5
* syntax

---------

Co-authored-by: Adrien Barbaresi <adbar@users.noreply.github.com>
Co-authored-by: Adrien Barbaresi <barbaresi@bbaw.de>
Error URL:
https://www.tristatetelecom.com/productdetailI2.aspx?dataid=IPGSM-4G
Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "", line 2, in <cell line: 2>
    content_source1 = trafilatura.extract(ssss, output_format='xml', include_comments=False)
  File "/usr/local/lib/python3.10/dist-packages/trafilatura/core.py", line 1091, in extract
    return determine_returnstring(document, output_format, include_formatting, tei_validation)
  File "/usr/local/lib/python3.10/dist-packages/trafilatura/core.py", line 788, in determine_returnstring
    returnstring = control_xml_output(output, output_format, tei_validation, document)
  File "/usr/local/lib/python3.10/dist-packages/trafilatura/xml.py", line 106, in control_xml_output
    output_tree = fromstring(control_string, CONTROL_PARSER)
  File "src/lxml/etree.pyx", line 3257, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1916, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1796, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1085, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 728, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 657, in lxml.etree._raiseParseError
  File "", line 2
XMLSyntaxError: Failed to parse QName 'padding:', line 2, column 480
Function:
trafilatura.extract(source, output_format='xml', include_comments=False)