Corrupted Markdown output when TXT+formatting #404

Closed
clach04 opened this issue Aug 6, 2023 · 2 comments · Fixed by #528
Labels
bug Something isn't working

Comments

clach04 commented Aug 6, 2023

I wrote a fairly complicated testcase.. then realized I could use the command line tool :-D

The docs indicate Markdown is an option

# TXT/Markdown output

  • The plain text output (no Markdown) looks good.
  • In the examples I've tried so far, the Markdown output is not usable: it appears to have the same content as the plain text, BUT the formatting is incorrect, with new paragraph (line) breaks appearing at odd places (e.g. at the 2nd character on a line).

Demo

Session 1 - server test data

Fetch the test data once and serve it locally to avoid repeatedly hitting the website (I could not see a way to pass a file to trafilatura)

wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
echo http://localhost:1234/wget_output.html
python3 -m http.server 1234

Session 2 - scrape data

cd /tmp
mkdir trafilatura_demo
cd trafilatura_demo/

python3 -m venv py3venv
. py3venv/bin/activate
python -m pip install trafilatura

trafilatura --version

Then:

# good text output, without formatting
trafilatura -u http://localhost:1234/wget_output.html 


# not great - some new lines show up
trafilatura --links -u http://localhost:1234/wget_output.html 
trafilatura --links --images -u http://localhost:1234/wget_output.html 

# messed up paragraphs and newlines in markdown
trafilatura --formatting --links --images -u http://localhost:1234/wget_output.html 
trafilatura --formatting -u http://localhost:1234/wget_output.html 

Partial extract showing problem:

In
[Skyrim]...
....
"
*Legends ....

There are others in the same document but I'm reluctant to include too much of the content. Hopefully the test case above is enough to reproduce for other people.

It's really obvious there is odd formatting when converting back into html (e.g. using pandoc in gfm mode, or any other md2html tool).
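
The stray breaks can also be spotted without converting back to HTML. The following is a rough sketch, not part of trafilatura: `find_stray_breaks` is a hypothetical helper, and the two-character threshold is an assumption based on the partial extract above. It flags non-blank lines that are suspiciously short, which can also produce false positives on legitimately short Markdown lines.

```python
def find_stray_breaks(markdown_text, max_len=2):
    """Return (line_number, text) pairs for non-blank lines of at most max_len characters."""
    suspicious = []
    for number, line in enumerate(markdown_text.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines are normal paragraph separators, so only very short
        # non-blank lines are treated as a sign of a misplaced break.
        if stripped and len(stripped) <= max_len:
            suspicious.append((number, stripped))
    return suspicious


# The partial extract from this report, with the elided parts left as-is.
sample = 'In\n[Skyrim]...\n....\n"\n*Legends ....\n'
print(find_stray_breaks(sample))
```

On the sample above it reports the lone `In` and the lone `"` as suspicious lines.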


There is no HTML output option (only XML), which was my idea for a workaround.

I did poke around the code but I can't get a handle on why white space is being injected in the XML cleaning code (I can see there are reasons for it; my ham-fisted attempt to remove them all was unsuccessful :-D).

Thanks for making this tool available. I'm using the Python readability module, and trafilatura does a much better job at metadata extraction (so far, readability works better for me for content extraction). I'm not sure if I'm misusing the library.

adbar (Owner) commented Aug 7, 2023

Hi @clach04, thanks for your feedback.

First, I think you could simplify the test:

wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
cat wget_output.html | trafilatura --formatting
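
The same reproduction can be driven from Python instead of the shell. This is a minimal sketch under two assumptions: that the `include_formatting` keyword of `trafilatura.extract` corresponds to the CLI's `--formatting` flag, and that `wget_output.html` is the file saved by the `wget` command above. `extract_with_formatting` is a hypothetical helper name, not part of the library.

```python
from pathlib import Path


def extract_with_formatting(html_path):
    """Read a saved HTML page and return trafilatura's text output with formatting kept."""
    # Imported lazily so the helper can be defined even where trafilatura
    # (a third-party package) is not installed.
    import trafilatura

    html = Path(html_path).read_text(encoding="utf-8", errors="replace")
    # Assumption: include_formatting=True mirrors the CLI's --formatting flag.
    return trafilatura.extract(html, include_formatting=True)
```

Calling `print(extract_with_formatting("wget_output.html"))` in the directory holding the downloaded page should then show the same output as the piped command.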

Then there are two different types of issues with the output you get, right?

  1. Minor issues with links and images
  2. Real mess with formatting

Did I understand the problem or do you have anything to add?

clach04 (Author) commented Aug 8, 2023

lol, that's a much better test case @adbar :-D

Yes, you got it. Number 2 is the main problem I'm experiencing.

adbar added the bug label on Aug 8, 2023
adbar changed the title from "corrupted Markdown formatting" to "Corrupted Markdown output when TXT+formatting" on Aug 8, 2023
dlwh added a commit to dlwh/trafilatura that referenced this issue Mar 21, 2024
adbar added a commit that referenced this issue Mar 28, 2024
…ing (#528)

* fix formatting by correcting order of element generation, space handling

* fix #404

* oops

* revert dumb change

* review code

---------

Co-authored-by: Adrien Barbaresi <barbaresi@bbaw.de>