fix formatting by correcting order of element generation, space handling #528

dlwh · 2024-03-21T20:33:28Z

Fixes #404 and a bunch of other associated issues around newlines and ordering output when formatting=true. I suspect this fixes most of the issues with formatting==True

trafilatura/xml.py

codecov · 2024-03-22T13:07:51Z

Codecov Report

Attention: Patch coverage is 96.00000% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 97.35%. Comparing base (ff38644) to head (55036fb).

Files	Patch %	Lines
trafilatura/xml.py	94.87%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #528      +/-   ##
==========================================
+ Coverage   97.25%   97.35%   +0.09%     
==========================================
  Files          22       22              
  Lines        3424     3439      +15     
==========================================
+ Hits         3330     3348      +18     
+ Misses         94       91       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

adbar · 2024-03-22T13:39:45Z

Hi @dlwh, thanks for the PR, it seems really helpful!

I see a small regression on my benchmark (see readme in tests/) in terms of accuracy and speed, the latter is probably related to your change in the trimming function, it is necessary?

As for the accuracy I'll try to see if I find something. Integrating the PR bit by bit (say function by function) could also be useful as the changes could reach further than expected.

dlwh · 2024-03-22T16:54:46Z

oh sorry the trim part can be reverted, missed that when cleaning up!

The spaces change is necessary because the spaces trimming is incorrect when outputting with formatting. Two words can get merged into one, etc.

adbar · 2024-03-22T17:24:40Z

I just tried the trim() part and it slows things a bit but that's not all, we need to check what's happening function-by-function.

I can have a look at it next week, feel free to work on the PR if you wish to add or amend anything.

dlwh · 2024-03-22T20:13:20Z

i reverted the changes to trim. My bad!

dlwh · 2024-03-22T20:22:22Z

something seems wrong on master? (I ran pip install -r eval-requirements.txt)

Your branch is up to date with 'origin/master'.
~/src/trafilatura/tests (master)$ python comparison.py
(node:42555) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Traceback (most recent call last):
  File "/Users/dlwh/src/trafilatura/tests/comparison.py", line 613, in <module>
    result = run_resiliparse(htmlstring)
  File "/Users/dlwh/src/trafilatura/tests/comparison.py", line 346, in run_resiliparse
    decoded = bytes_to_str(htmlstring, detect_encoding(htmlstring))
TypeError: Argument 'data' has incorrect type (expected bytes, got str)

adbar · 2024-03-25T12:32:29Z

Thanks for the change!

I don't run the evaluation script on a regular basis so it could be that lines need to be changed to account for updated dependencies or functions.
For the time being you can run a minimal version of it to focus on Trafilatura only. If I compare the master version of Trafilatura with this PR I see a slight drop in the results. Nothing worrying as such but I cannot explain it which and I'd like to pinpoint the change which leads to it.

trafilatura/utils.py

trafilatura/xml.py

adbar · 2024-03-27T15:46:15Z

Hi @dlwh, I checked the code again, could you please address the questions above?

Also, I still don't quite understand the changes made to the xmltotxt() function, maybe the following example is too simple to grasp the point you're addressing but it seems to me that doc.iter("*") already returns all elements in the right order.

from lxml import etree

doc = etree.fromstring("<doc><a/><b><c>Test</c></b><d/></doc>")
assert list(doc) == doc.getchildren()  # first level, neither root nor children: a, b, d
assert doc.xpath("//*") == list(doc.iter("*"))  # depth-first, all elements: doc, a, b, c, d

Since the tests pass and your PR handles more complex elements (e.g. formatting) I'll integrate the PR once the open questions have been addressed, I'd just like to understand the interest of the recursive function.

What do you think?

dlwh · 2024-03-27T18:03:28Z

the issue is with tail and "post formatting". The current code does a preorder traversal of the tree, handling .text and .tail at the same time. The correct order for is pre_format + e.text + ''.join(recurse(c) for c in e.children) + post_format + e.tail but the current code does roughly pre_format + e.text + e.tail + post_format + ''.join(recurse(c) for c in e.children)

The code happens to be correct for no-formatting because the tree is highly flattened, but once you preserve formatting it behaves badly.

As an example:

def test_recurse_formatting():
    htmlstring = '''
    <html><body><article>
    <p> this is a <b>bold </b> test of the <i>italic</i> function</p>
    <a href="test.html">this is a link</a>
    </article></body></html>
    '''

    result = extract(htmlstring, no_fallback=True, output_format='txt', config=ZERO_CONFIG, include_formatting=True, include_links=True)
    print(result)

In master, this produces

this is a
**bold ** test of the *italic* function [this is a link](test.html)

while in my patch, this produces

this is a **bold ** test of the *italic* function
[this is a link](test.html)

This still isn't quite right (there should be another newline) but it's better.

All that said, I've decided to move to readability-->html2text. I think trafilatura is overall marginally better at extracting content, but readability does a much better job at preserving formatting (even comparing to Trafilatura's XML component)

I do appreciate the work you've done. I think this is a good project, it just doesn't quite have the functionality I need.

I think misprocessing of tail is a pretty pervasive problem in this code base, fwiw. For instance, in main (and in this patch), this text gets reordered at the XML level:

def test_recurse_formatting():
    htmlstring = '''
    <html><body><article>
    <p> this is a <a href="test.html"><b>bold </b>zz</a> test of the <i>italic</i> function</p>
    <a href="test.html">this is a link</a>
    </article></body></html>
    '''

    result = extract(htmlstring, no_fallback=True, output_format='xml', config=ZERO_CONFIG, include_formatting=True, include_links=True)
    print(result)

this produces

<doc categories="" tags="" fingerprint="9af9dbb52a4f0283">
  <main><p> this is a <ref target="test.html"> test of the bold zz</ref></p><hi rend="#i">italic</hi>function<p><ref target="test.html">this is a link</ref></p></main>
  <comments/>
</doc>

Note that "test of the" got moved to the beginning of the ref somehow, while it should be inside in the ref's tail.

adbar · 2024-03-28T12:44:12Z

Thanks for the detailed explanation and thanks again for the PR.

There is indeed a problem with nested elements in certain cases, this comes from the fact that I don't primarily focus on formatting or links, these features were added on request. The tests cover most difficulties but not all. The issue with element tails is also deeply linked to LXML, a package which is mainly beneficial but is also difficult to work with at times.

As you say, extraction accuracy depends on one's particular use case, software packages merely try to strike a balance. Existing evaluations focus on html2txt conversion which entails some structure (e.g. paragraph text not garbled, list elements correctly separated, etc.) though not in a full-fledged XML way.
Readability works OKish for English but external evaluations seem to agree on the fact that Trafilatura does a better job at extracting text. I recently found this evaluation on English news sources and Trafilatura compares favorably to the rest, even before work in PR #530.

In any case, feel free to follow the development of Trafilatura as issues with formatting could get fixed over time.

dlwh added 2 commits March 21, 2024 13:17

fix formatting by correcting order of element generation, space handling

707e20d

fix adbar#404

afd34e9

dlwh commented Mar 21, 2024

View reviewed changes

trafilatura/xml.py Show resolved Hide resolved

oops

82bcbf3

revert dumb change

0afe678

adbar reviewed Mar 27, 2024

View reviewed changes

trafilatura/utils.py Outdated Show resolved Hide resolved

adbar reviewed Mar 27, 2024

View reviewed changes

trafilatura/xml.py Show resolved Hide resolved

adbar reviewed Mar 27, 2024

View reviewed changes

trafilatura/xml.py Outdated Show resolved Hide resolved

review code

55036fb

adbar merged commit fa972ab into adbar:master Mar 28, 2024
15 of 16 checks passed

dlwh deleted the badnewlines branch May 24, 2024 18:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix formatting by correcting order of element generation, space handling #528

fix formatting by correcting order of element generation, space handling #528

dlwh commented Mar 21, 2024

codecov bot commented Mar 22, 2024 •

edited

Loading

adbar commented Mar 22, 2024

dlwh commented Mar 22, 2024

adbar commented Mar 22, 2024

dlwh commented Mar 22, 2024

dlwh commented Mar 22, 2024

adbar commented Mar 25, 2024

adbar commented Mar 27, 2024

dlwh commented Mar 27, 2024

adbar commented Mar 28, 2024

fix formatting by correcting order of element generation, space handling #528

fix formatting by correcting order of element generation, space handling #528

Conversation

dlwh commented Mar 21, 2024

codecov bot commented Mar 22, 2024 • edited Loading

Codecov Report

adbar commented Mar 22, 2024

dlwh commented Mar 22, 2024

adbar commented Mar 22, 2024

dlwh commented Mar 22, 2024

dlwh commented Mar 22, 2024

adbar commented Mar 25, 2024

adbar commented Mar 27, 2024

dlwh commented Mar 27, 2024

adbar commented Mar 28, 2024

codecov bot commented Mar 22, 2024 •

edited

Loading