Corrupted Markdown output when TXT+formatting #404

clach04 · 2023-08-06T22:38:28Z

I wrote a fairly complicated test case, then realized I could use the command-line tool :-D

The docs indicate Markdown is an option

trafilatura/docs/usage-python.rst

Line 71 in d78fbb5

# TXT/Markdown output

The plain text output (no Markdown) looks good.
In the examples I've tried so far, the Markdown output is not usable: it appears to have the same content as the plain text, but the formatting is incorrect. New paragraph (line) breaks appear at odd places (e.g. at the 2nd character on a line).

Demo

Session 1 - serve test data

Get the test data (once) and serve it locally to avoid repeatedly hitting the web site (I could not see a way to pass a file to trafilatura).

wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
echo http://localhost:1234/wget_output.html
python3 -m http.server 1234

Session 2 - scrape data

cd /tmp
mkdir trafilatura_demo
cd trafilatura_demo/

python3 -m venv py3venv
. py3venv/bin/activate
python -m pip install trafilatura

trafilatura --version

Then:

# good text output, without formatting
trafilatura -u http://localhost:1234/wget_output.html 


# not great - some new lines show up
trafilatura --links -u http://localhost:1234/wget_output.html 
trafilatura --links --images -u http://localhost:1234/wget_output.html 

# messed up paragraphs and newlines in markdown
trafilatura --formatting --links --images -u http://localhost:1234/wget_output.html 
trafilatura --formatting -u http://localhost:1234/wget_output.html

Partial extract showing problem:

In
[Skyrim]...
....
"
*Legends ....

There are others in the same document but I'm reluctant to include too much of the content. Hopefully the test case above is enough to reproduce for other people.
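For anyone trying to confirm the symptom without reading the whole output, a rough check can be sketched in Python. This is only a heuristic: the helper name `suspicious_breaks`, the `max_len` threshold, and the sample string are made up for illustration (the sample is modeled on the shape of the partial extract above, not copied from the article).

```python
import re

# Lines that are legitimately short in Markdown: heading markers,
# list bullets, and horizontal rules.
SHORT_BUT_VALID = re.compile(r"(#{1,6}|[-*+]|-{3,}|\*{3,}|_{3,})")


def suspicious_breaks(markdown, max_len=3):
    """Return (line_number, text) for very short non-empty lines.

    Such lines are often fragments split off from a paragraph by a
    stray newline. Hypothetical helper, not part of trafilatura.
    """
    hits = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        text = line.strip()
        if not text or SHORT_BUT_VALID.fullmatch(text):
            continue
        if len(text) <= max_len:
            hits.append((number, text))
    return hits


# Made-up sample with the same kind of breakage as the extract above.
sample = 'In\n[Skyrim](http://example.com) I play a mage.\n"\n*Legends* say otherwise.\n'
print(suspicious_breaks(sample))
```

Saving the output of the `--formatting` run to a file and passing its contents to the function should list the stray fragments, assuming the breakage looks like the extract above.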

The odd formatting is really obvious when converting the output back into HTML (e.g. using pandoc in gfm mode, or any other md2html tool).

There is no HTML output option (only XML), which was my idea for a workaround.

I did poke around the code but I can't get a handle on why whitespace is being injected in the XML cleaning code (I can see there are reasons for it; my ham-fisted attempt to remove it all was unsuccessful :-D).

Thanks for making this tool available. I'm using the Python readability module, and trafilatura does a much better job at metadata extraction (so far, readability works better for me for content extraction). I'm not sure if I'm misusing the library.


adbar · 2023-08-07T12:07:22Z

Hi @clach04, thanks for your feedback.

First, I think you could simplify the test:

wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
cat wget_output.html | trafilatura --formatting
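The same comparison can probably be done from Python as well. The sketch below assumes that `trafilatura.extract` accepts an `include_formatting` keyword (the Python counterpart of `--formatting`); the wrapper `extract_both` and its `extract` parameter are hypothetical and exist only so the helper can be tried with a stub.

```python
from typing import Callable, Dict, Optional


def extract_both(html: str,
                 extract: Optional[Callable[..., Optional[str]]] = None) -> Dict[str, str]:
    """Return plain and formatted extractions of the same HTML string.

    Hypothetical wrapper: `extract` defaults to trafilatura.extract and
    can be replaced by a stub when trafilatura is not installed.
    """
    if extract is None:
        # Imported lazily so the module loads without trafilatura.
        from trafilatura import extract
    return {
        "plain": extract(html) or "",
        "formatted": extract(html, include_formatting=True) or "",
    }


# Possible usage with the downloaded file (requires trafilatura):
#   from pathlib import Path
#   html = Path("wget_output.html").read_text(encoding="utf-8", errors="replace")
#   result = extract_both(html)
#   print(result["formatted"])
```

Comparing `result["plain"]` with `result["formatted"]` side by side should make the extra line breaks easy to spot.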

Then there are two different types of issues with the output you get, right?

1. Minor issues with links and images
2. Real mess with formatting

Did I understand the problem or do you have anything to add?

clach04 · 2023-08-08T03:14:38Z

lol, that's a much better test case @adbar :-D

Yes, you got it. The number 2 is the main problem I'm experiencing.

…ing (#528)
* fix formatting by correcting order of element generation, space handling
* fix #404
* oops
* revert dumb change
* review code
Co-authored-by: Adrien Barbaresi <barbaresi@bbaw.de>

clach04 mentioned this issue Aug 8, 2023

Idea implement compatable API to Postlight (nee Mercury) Parser Soontao/trafilatura-srv#5

Open

adbar added the bug Something isn't working label Aug 8, 2023

adbar changed the title ~~corrupted Markdown formatting~~ Corrupted Markdown output when TXT+formatting Aug 8, 2023

dlwh added a commit to dlwh/trafilatura that referenced this issue Mar 21, 2024

fix adbar#404

afd34e9

dlwh mentioned this issue Mar 21, 2024

fix formatting by correcting order of element generation, space handling #528

Merged

adbar closed this as completed in #528 Mar 28, 2024
