Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix formatting by correcting order of element generation, space handling #528

Merged
merged 5 commits into from
Mar 28, 2024

Conversation

dlwh
Copy link
Contributor

@dlwh dlwh commented Mar 21, 2024

Fixes #404 and a bunch of other associated issues around newlines and ordering output when formatting=true. I suspect this fixes most of the issues with formatting==True

Copy link

codecov bot commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 96.00000% with 2 lines in your changes are missing coverage. Please review.

Project coverage is 97.35%. Comparing base (ff38644) to head (55036fb).

Files Patch % Lines
trafilatura/xml.py 94.87% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #528      +/-   ##
==========================================
+ Coverage   97.25%   97.35%   +0.09%     
==========================================
  Files          22       22              
  Lines        3424     3439      +15     
==========================================
+ Hits         3330     3348      +18     
+ Misses         94       91       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Mar 22, 2024

Hi @dlwh, thanks for the PR, it seems really helpful!

I see a small regression on my benchmark (see readme in tests/) in terms of accuracy and speed, the latter is probably related to your change in the trimming function, it is necessary?

As for the accuracy I'll try to see if I find something. Integrating the PR bit by bit (say function by function) could also be useful as the changes could reach further than expected.

@dlwh
Copy link
Contributor Author

dlwh commented Mar 22, 2024

oh sorry the trim part can be reverted, missed that when cleaning up!

The spaces change is necessary because the spaces trimming is incorrect when outputting with formatting. Two words can get merged into one, etc.

@adbar
Copy link
Owner

adbar commented Mar 22, 2024

I just tried the trim() part and it slows things a bit but that's not all, we need to check what's happening function-by-function.

I can have a look at it next week, feel free to work on the PR if you wish to add or amend anything.

@dlwh
Copy link
Contributor Author

dlwh commented Mar 22, 2024

i reverted the changes to trim. My bad!

@dlwh
Copy link
Contributor Author

dlwh commented Mar 22, 2024

something seems wrong on master? (I ran pip install -r eval-requirements.txt)

Your branch is up to date with 'origin/master'.
~/src/trafilatura/tests (master)$ python comparison.py
(node:42555) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
(Use `node --trace-deprecation ...` to show where the warning was created)
Traceback (most recent call last):
  File "/Users/dlwh/src/trafilatura/tests/comparison.py", line 613, in <module>
    result = run_resiliparse(htmlstring)
  File "/Users/dlwh/src/trafilatura/tests/comparison.py", line 346, in run_resiliparse
    decoded = bytes_to_str(htmlstring, detect_encoding(htmlstring))
TypeError: Argument 'data' has incorrect type (expected bytes, got str)

@adbar
Copy link
Owner

adbar commented Mar 25, 2024

Thanks for the change!

I don't run the evaluation script on a regular basis so it could be that lines need to be changed to account for updated dependencies or functions.
For the time being you can run a minimal version of it to focus on Trafilatura only. If I compare the master version of Trafilatura with this PR I see a slight drop in the results. Nothing worrying as such but I cannot explain it which and I'd like to pinpoint the change which leads to it.

trafilatura/utils.py Outdated Show resolved Hide resolved
trafilatura/xml.py Outdated Show resolved Hide resolved
@adbar
Copy link
Owner

adbar commented Mar 27, 2024

Hi @dlwh, I checked the code again, could you please address the questions above?

Also, I still don't quite understand the changes made to the xmltotxt() function, maybe the following example is too simple to grasp the point you're addressing but it seems to me that doc.iter("*") already returns all elements in the right order.

from lxml import etree

doc = etree.fromstring("<doc><a/><b><c>Test</c></b><d/></doc>")
assert list(doc) == doc.getchildren()  # first level, neither root nor children: a, b, d
assert doc.xpath("//*") == list(doc.iter("*"))  # depth-first, all elements: doc, a, b, c, d

Since the tests pass and your PR handles more complex elements (e.g. formatting) I'll integrate the PR once the open questions have been addressed, I'd just like to understand the interest of the recursive function.

What do you think?

@dlwh
Copy link
Contributor Author

dlwh commented Mar 27, 2024

the issue is with tail and "post formatting". The current code does a preorder traversal of the tree, handling .text and .tail at the same time. The correct order for is pre_format + e.text + ''.join(recurse(c) for c in e.children) + post_format + e.tail but the current code does roughly pre_format + e.text + e.tail + post_format + ''.join(recurse(c) for c in e.children)

The code happens to be correct for no-formatting because the tree is highly flattened, but once you preserve formatting it behaves badly.

As an example:

def test_recurse_formatting():
    htmlstring = '''
    <html><body><article>
    <p> this is a <b>bold </b> test of the <i>italic</i> function</p>
    <a href="test.html">this is a link</a>
    </article></body></html>
    '''

    result = extract(htmlstring, no_fallback=True, output_format='txt', config=ZERO_CONFIG, include_formatting=True, include_links=True)
    print(result)

In master, this produces

this is a
**bold ** test of the *italic* function [this is a link](test.html)

while in my patch, this produces

this is a **bold ** test of the *italic* function
[this is a link](test.html)

This still isn't quite right (there should be another newline) but it's better.

All that said, I've decided to move to readability-->html2text. I think trafilatura is overall marginally better at extracting content, but readability does a much better job at preserving formatting (even comparing to Trafilatura's XML component)

I do appreciate the work you've done. I think this is a good project, it just doesn't quite have the functionality I need.

I think misprocessing of tail is a pretty pervasive problem in this code base, fwiw. For instance, in main (and in this patch), this text gets reordered at the XML level:

def test_recurse_formatting():
    htmlstring = '''
    <html><body><article>
    <p> this is a <a href="test.html"><b>bold </b>zz</a> test of the <i>italic</i> function</p>
    <a href="test.html">this is a link</a>
    </article></body></html>
    '''

    result = extract(htmlstring, no_fallback=True, output_format='xml', config=ZERO_CONFIG, include_formatting=True, include_links=True)
    print(result)

this produces

<doc categories="" tags="" fingerprint="9af9dbb52a4f0283">
  <main><p> this is a <ref target="test.html"> test of the bold zz</ref></p><hi rend="#i">italic</hi>function<p><ref target="test.html">this is a link</ref></p></main>
  <comments/>
</doc>

Note that "test of the" got moved to the beginning of the ref somehow, while it should be inside in the ref's tail.

@adbar
Copy link
Owner

adbar commented Mar 28, 2024

Thanks for the detailed explanation and thanks again for the PR.

There is indeed a problem with nested elements in certain cases, this comes from the fact that I don't primarily focus on formatting or links, these features were added on request. The tests cover most difficulties but not all. The issue with element tails is also deeply linked to LXML, a package which is mainly beneficial but is also difficult to work with at times.

As you say, extraction accuracy depends on one's particular use case, software packages merely try to strike a balance. Existing evaluations focus on html2txt conversion which entails some structure (e.g. paragraph text not garbled, list elements correctly separated, etc.) though not in a full-fledged XML way.
Readability works OKish for English but external evaluations seem to agree on the fact that Trafilatura does a better job at extracting text. I recently found this evaluation on English news sources and Trafilatura compares favorably to the rest, even before work in PR #530.

In any case, feel free to follow the development of Trafilatura as issues with formatting could get fixed over time.

@adbar adbar merged commit fa972ab into adbar:master Mar 28, 2024
15 of 16 checks passed
@dlwh dlwh deleted the badnewlines branch May 24, 2024 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Corrupted Markdown output when TXT+formatting
2 participants