-
-
Notifications
You must be signed in to change notification settings - Fork 261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix formatting by correcting order of element generation, space handling #528
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #528 +/- ##
==========================================
+ Coverage 97.25% 97.35% +0.09%
==========================================
Files 22 22
Lines 3424 3439 +15
==========================================
+ Hits 3330 3348 +18
+ Misses 94 91 -3 ☔ View full report in Codecov by Sentry. |
Hi @dlwh, thanks for the PR, it seems really helpful! I see a small regression on my benchmark (see readme in As for the accuracy I'll try to see if I find something. Integrating the PR bit by bit (say function by function) could also be useful as the changes could reach further than expected. |
oh sorry the trim part can be reverted, missed that when cleaning up! The spaces change is necessary because the spaces trimming is incorrect when outputting with formatting. Two words can get merged into one, etc. |
I just tried the I can have a look at it next week, feel free to work on the PR if you wish to add or amend anything. |
i reverted the changes to trim. My bad! |
something seems wrong on master? (I ran
|
Thanks for the change! I don't run the evaluation script on a regular basis so it could be that lines need to be changed to account for updated dependencies or functions. |
Hi @dlwh, I checked the code again, could you please address the questions above? Also, I still don't quite understand the changes made to the
Since the tests pass and your PR handles more complex elements (e.g. formatting) I'll integrate the PR once the open questions have been addressed, I'd just like to understand the interest of the recursive function. What do you think? |
the issue is with tail and "post formatting". The current code does a preorder traversal of the tree, handling .text and .tail at the same time. The correct order for is The code happens to be correct for no-formatting because the tree is highly flattened, but once you preserve formatting it behaves badly. As an example: def test_recurse_formatting():
htmlstring = '''
<html><body><article>
<p> this is a <b>bold </b> test of the <i>italic</i> function</p>
<a href="test.html">this is a link</a>
</article></body></html>
'''
result = extract(htmlstring, no_fallback=True, output_format='txt', config=ZERO_CONFIG, include_formatting=True, include_links=True)
print(result) In master, this produces
while in my patch, this produces
This still isn't quite right (there should be another newline) but it's better. All that said, I've decided to move to readability-->html2text. I think trafilatura is overall marginally better at extracting content, but readability does a much better job at preserving formatting (even comparing to Trafilatura's XML component) I do appreciate the work you've done. I think this is a good project, it just doesn't quite have the functionality I need. I think misprocessing of
this produces
Note that "test of the" got moved to the beginning of the ref somehow, while it should be inside in the ref's tail. |
Thanks for the detailed explanation and thanks again for the PR. There is indeed a problem with nested elements in certain cases, this comes from the fact that I don't primarily focus on formatting or links, these features were added on request. The tests cover most difficulties but not all. The issue with element tails is also deeply linked to LXML, a package which is mainly beneficial but is also difficult to work with at times. As you say, extraction accuracy depends on one's particular use case, software packages merely try to strike a balance. Existing evaluations focus on html2txt conversion which entails some structure (e.g. paragraph text not garbled, list elements correctly separated, etc.) though not in a full-fledged XML way. In any case, feel free to follow the development of Trafilatura as issues with formatting could get fixed over time. |
Fixes #404 and a bunch of other associated issues around newlines and ordering output when formatting=true. I suspect this fixes most of the issues with formatting==True