Corrupted Markdown output when TXT+formatting #404
Labels: bug (Something isn't working)
Comments
Hi @clach04, thanks for your feedback. First, I think you could simplify the test:
Then there are two different types of issues with the output you get, right?
Did I understand the problem or do you have anything to add?
lol, that's a much better test case @adbar :-D Yes, you got it. Number 2 is the main problem I'm experiencing.
adbar changed the title from "corrupted Markdown formatting" to "Corrupted Markdown output when TXT+formatting" on Aug 8, 2023
dlwh added a commit to dlwh/trafilatura that referenced this issue on Mar 21, 2024
I wrote a fairly complicated test case, then realized I could use the command line tool :-D
The docs indicate Markdown is an option (trafilatura/docs/usage-python.rst, line 71 in d78fbb5).
Demo
Session 1 - server test data
Get the test data (once) and serve it locally to avoid repeatedly hitting the website (I could not see a way to pass a file to trafilatura)
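For anyone who wants to reproduce the "serve it locally" step without extra tooling, here is a minimal sketch using only the Python standard library. It assumes the page has already been saved into a directory; the names `serve_directory`, `fetch` and `QuietHandler` are made up for illustration and are not part of trafilatura.

```python
import functools
import http.server
import threading
import urllib.request


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def serve_directory(directory, port=0):
    """Serve `directory` on 127.0.0.1 in a background thread.

    port=0 asks the OS for any free port. Returns (server, actual_port).
    """
    handler = functools.partial(QuietHandler, directory=str(directory))
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, server.server_address[1]


def fetch(url):
    """Download `url` and return the body decoded as UTF-8."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")
```

Once the server is up, the `http://127.0.0.1:<port>/<file>` address can be handed to the scraper in place of the live site. Call `server.shutdown()` followed by `server.server_close()` when done.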
Session 2 - scrape data
Then:
Partial extract showing problem:
There are others in the same document but I'm reluctant to include too much of the content. Hopefully the test case above is enough to reproduce for other people.
It's really obvious that the formatting is odd when converting back into HTML (e.g. using pandoc in gfm mode, or any other md2html tool).
There is no HTML output option (only XML), which was my idea for a workaround.
I did poke around the code but I can't get a handle on why whitespace is being injected in the XML cleaning code (I can see there are reasons for it; my ham-fisted attempt to remove it all was unsuccessful :-D).
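A rough way to spot this kind of damage in the output, as a sketch only: it assumes the injected whitespace ends up just inside strong-emphasis markers (for example `** bold **`), which is a guess about the symptom, not something taken from trafilatura's code. Under CommonMark, a `**` followed by whitespace cannot open emphasis, so such spans show up as literal asterisks after conversion. The function name and pattern are hypothetical, and the heuristic will miss some cases.

```python
import re

# Heuristic: a ** or __ pair on one line where whitespace sits directly
# inside the opening or the closing delimiter. The lookarounds skip
# delimiters that are glued to a word or to another asterisk.
PADDED_STRONG = re.compile(
    r"(?<![\*\w])(\*\*|__)(\s+[^*_\n]+?|[^*_\n]+?\s+)\1(?![\*\w])"
)


def find_padded_strong(markdown):
    """Return strong-emphasis spans whose delimiters are padded with whitespace."""
    return [match.group(0) for match in PADDED_STRONG.finditer(markdown)]
```

Running it over the extracted text gives a list of suspicious spans to compare against the source page; an empty list only means this particular pattern was not found.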
Thanks for making this tool available. I'm using the python readability module, and trafilatura does a much better job at metadata extraction (so far, readability works better for me for content extraction). I'm not sure if I'm misusing the library.