- Fix `sanitizeTree` and real-world test.
- Add additional selector rules.
- Restructure `cmd`.
- Update README.
- Port `comparison.py`. At this point all of the code has been ported.
- Strip text elements that contain only spaces.
- Fix HTML language element filter.
- Fix `postCleaning`.
- Improve test coverage.
- Add support for details/summary tags.
- Refine metadata title selector.
- Include page license in metadata extraction.
- Fix: don't remove tail of discarded elements.
- Define generic function to remove nodes.
- Fix wrong constant in `collectLinkInfo`.
- Add license header to each file.
- Improve charset handling to make sure the HTML document is always parsed as UTF-8.
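  A minimal sketch of the idea, assuming `golang.org/x/net/html/charset` handles the transcoding; the function below is illustrative, not the port's actual code:

  ```go
  package main

  import (
      "fmt"
      "io"
      "os"

      "golang.org/x/net/html"
      "golang.org/x/net/html/charset"
  )

  // parseAsUTF8 sniffs the document encoding (from the Content-Type header,
  // a <meta charset> tag, or the raw bytes) and transcodes the input to
  // UTF-8 before handing it to the HTML parser.
  func parseAsUTF8(r io.Reader, contentType string) (*html.Node, error) {
      utf8Reader, err := charset.NewReader(r, contentType)
      if err != nil {
          return nil, err
      }
      return html.Parse(utf8Reader)
  }

  func main() {
      f, err := os.Open("page.html")
      if err != nil {
          fmt.Fprintln(os.Stderr, err)
          return
      }
      defer f.Close()

      doc, err := parseAsUTF8(f, "text/html; charset=windows-1252")
      if err != nil {
          fmt.Fprintln(os.Stderr, err)
          return
      }
      fmt.Println("parsed:", doc != nil)
  }
  ```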
- In CLI, add flags to fetch only the URLs from the sitemap.
- In CLI, implement feed finder and downloader.
- In CLI, add flags for a custom user agent.
- Move `etree` and `selector` packages to the internal directory so they can't be reached by users.
- Remove finished Python code.
- In CLI, implement sitemap finder and downloader.
- In CLI, add support for several types of output.
- In CLI, add a subcommand for batch download from a file that contains a list of URLs.
- Make the log less verbose.
- Implement initial CLI.
- Modify paragraph handling, since our output is in HTML, not XML like the original Trafilatura.
- Put whitespace in place of void elements when writing text using `etree.IterText`.
- Don't strip image elements when sanitizing the extraction result.
- Add initial example.
- Implement real-world test from `tests/realworld_test.py`. In the original Trafilatura this test runs the extraction with fallback extractors enabled. However, since the fallback extractors used in this port are different from the original ones, the results differ as well, so the test can't be ported as is. To solve this, I've changed the test in the original Trafilatura to disable the fallback extractors. This way the test focuses on the capability of Trafilatura alone, which makes it compatible and portable.
- Since our port uses `go-readability` as one of its fallbacks, update it to a more recent version of Readability.js.
- Fix external `dom` package so it doesn't append children to void elements (elements that can't have any children, e.g. `<br/>`).
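  For illustration, a rough sketch of such a guard; the `appendChild` helper and the void-element list below are a stand-in, not the external `dom` package's actual API:

  ```go
  package main

  import (
      "fmt"

      "golang.org/x/net/html"
  )

  // voidElements lists tags that can never have children per the HTML spec.
  var voidElements = map[string]struct{}{
      "area": {}, "base": {}, "br": {}, "col": {}, "embed": {},
      "hr": {}, "img": {}, "input": {}, "link": {}, "meta": {},
      "param": {}, "source": {}, "track": {}, "wbr": {},
  }

  // appendChild adds child to parent unless parent is a void element,
  // in which case the child is silently dropped.
  func appendChild(parent, child *html.Node) {
      if parent.Type == html.ElementNode {
          if _, isVoid := voidElements[parent.Data]; isVoid {
              return
          }
      }
      parent.AppendChild(child)
  }

  func main() {
      br := &html.Node{Type: html.ElementNode, Data: "br"}
      text := &html.Node{Type: html.TextNode, Data: "should not be appended"}
      appendChild(br, text)
      fmt.Println("br has children:", br.FirstChild != nil) // false
  }
  ```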
- Now `Extract` also returns metadata along with the extracted content.
- Add advanced config in extraction `Options`.
- Minor change in `etree.ToString` to make it more readable.
- Implement unit tests.
- Finish implementing the `Extract` function. At this point the port is kind of finished, but it's still not tested, so there is still a long way to go.
- Restructure test files.
- Fix implementation of `IterText` in the `etree` package.
- Implement fallback extraction using `go-readability` and `go-domdistiller`.
- Restructure selector files.
- Implement comments extraction.
- Implement content extraction.
- Port some of LXML's functionality to the `etree` package.
- Fix a major issue when appending or replacing nodes in the external `dom` package. Apparently this issue went unnoticed in both `go-readability` and `go-domdistiller`.
- Restart porting process from zero 😢.
- Reimplement `cache`.
- Reimplement metadata extractor.
- No code today. Looks like I've made a wrong assumption about the LXML library used by the original Trafilatura. In functionality it's really similar to the `dom` package, however there are several differences in how it works. Might need to port some code.
- Port `link_density_test` and `link_density_test_tables` from `htmlprocessing.py`.
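  For context, a simplified stand-in showing what a link-density check generally computes (the share of an element's text that sits inside links); this is not the ported `link_density_test` itself:

  ```go
  package main

  import (
      "fmt"
      "strings"

      "golang.org/x/net/html"
  )

  // textLen returns the total length of text under node; if linksOnly is
  // true, only text inside <a> elements is counted.
  func textLen(node *html.Node, insideLink, linksOnly bool) int {
      if node.Type == html.TextNode {
          if !linksOnly || insideLink {
              return len(strings.TrimSpace(node.Data))
          }
          return 0
      }
      if node.Type == html.ElementNode && node.Data == "a" {
          insideLink = true
      }
      total := 0
      for child := node.FirstChild; child != nil; child = child.NextSibling {
          total += textLen(child, insideLink, linksOnly)
      }
      return total
  }

  // linkDensity returns the share of an element's text that sits inside
  // links. Elements with a high link density are usually navigation or
  // other boilerplate rather than content.
  func linkDensity(node *html.Node) float64 {
      all := textLen(node, false, false)
      if all == 0 {
          return 0
      }
      return float64(textLen(node, false, true)) / float64(all)
  }

  func main() {
      doc, _ := html.Parse(strings.NewReader(
          `<div>short intro <a href="/a">one</a> <a href="/b">two</a></div>`))
      fmt.Printf("link density: %.2f\n", linkDensity(doc))
  }
  ```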
- Port `DISCARD_XPATH` in `xpaths.py`.
- Port `LRUCache` in `lru.py`.
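  As a reminder of the underlying data structure, a generic LRU cache sketch in Go built from `container/list` plus a map; this is not the ported `lru.py` code:

  ```go
  package main

  import (
      "container/list"
      "fmt"
  )

  type entry struct {
      key   string
      value int
  }

  // LRUCache keeps at most capacity entries, evicting the least recently
  // used one when a new key is inserted.
  type LRUCache struct {
      capacity int
      order    *list.List               // front = most recently used
      items    map[string]*list.Element // key -> node in order
  }

  func NewLRUCache(capacity int) *LRUCache {
      return &LRUCache{
          capacity: capacity,
          order:    list.New(),
          items:    make(map[string]*list.Element),
      }
  }

  func (c *LRUCache) Get(key string) (int, bool) {
      el, ok := c.items[key]
      if !ok {
          return 0, false
      }
      c.order.MoveToFront(el)
      return el.Value.(*entry).value, true
  }

  func (c *LRUCache) Put(key string, value int) {
      if el, ok := c.items[key]; ok {
          el.Value.(*entry).value = value
          c.order.MoveToFront(el)
          return
      }
      if c.order.Len() >= c.capacity {
          oldest := c.order.Back()
          c.order.Remove(oldest)
          delete(c.items, oldest.Value.(*entry).key)
      }
      c.items[key] = c.order.PushFront(&entry{key, value})
  }

  func main() {
      cache := NewLRUCache(2)
      cache.Put("a", 1)
      cache.Put("b", 2)
      cache.Get("a")    // "a" becomes most recently used
      cache.Put("c", 3) // evicts "b"
      _, ok := cache.Get("b")
      fmt.Println("b still cached:", ok) // false
  }
  ```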
- Port `textfilter` in `filters.py`.
- Port `duplicate_test` in `filters.py`.
- Port `extract_comments` in `core.py`. It's still not tested though, since there is no specific unit test for it.
- Port `CONTENT_XPATH` in `xpaths.py`.
- Port `check_html_lang` function in `filters.py`.
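  The general idea behind that filter, sketched here with a hypothetical `htmlLangMatches` helper and simplified matching rules (not the ported code):

  ```go
  package main

  import (
      "fmt"
      "strings"

      "golang.org/x/net/html"
  )

  // htmlLangMatches reports whether the document's <html lang="..."> attribute
  // starts with the target language code (e.g. "en" matches "en-US").
  // Documents without a lang attribute are accepted.
  func htmlLangMatches(doc *html.Node, target string) bool {
      var htmlNode *html.Node
      var find func(*html.Node)
      find = func(n *html.Node) {
          if n.Type == html.ElementNode && n.Data == "html" {
              htmlNode = n
              return
          }
          for c := n.FirstChild; c != nil && htmlNode == nil; c = c.NextSibling {
              find(c)
          }
      }
      find(doc)
      if htmlNode == nil {
          return true
      }
      for _, attr := range htmlNode.Attr {
          if attr.Key == "lang" {
              return strings.HasPrefix(strings.ToLower(attr.Val), strings.ToLower(target))
          }
      }
      return true
  }

  func main() {
      doc, _ := html.Parse(strings.NewReader(`<html lang="de"><body>Hallo</body></html>`))
      fmt.Println("matches en:", htmlLangMatches(doc, "en")) // false
  }
  ```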
- Port metadata extraction in `metadata.py`. There is a minor modification in the metadata extraction from JSON-LD data: in the original Trafilatura this step is done with regular expressions, which is not exactly ideal for handling JSON data. Instead, here we use a proper JSON parser with a fallback to the original regular expressions. This way the extraction should be more accurate yet still give the same results when tested.
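  A condensed sketch of that approach: try `encoding/json` first, and fall back to a regular expression only when the payload isn't valid JSON. The field names and fallback pattern below are illustrative:

  ```go
  package main

  import (
      "encoding/json"
      "fmt"
      "regexp"
  )

  // titleFromJSONLD tries to read the article title ("headline" or "name")
  // from a JSON-LD payload using a real JSON parser, and falls back to a
  // regex when the payload can't be parsed as JSON.
  func titleFromJSONLD(payload string) string {
      var data map[string]interface{}
      if err := json.Unmarshal([]byte(payload), &data); err == nil {
          for _, key := range []string{"headline", "name"} {
              if s, ok := data[key].(string); ok && s != "" {
                  return s
              }
          }
          return ""
      }

      // Fallback: crude regex extraction, similar in spirit to the original
      // Trafilatura approach, for payloads that are not valid JSON.
      re := regexp.MustCompile(`"headline"\s*:\s*"([^"]+)"`)
      if match := re.FindStringSubmatch(payload); match != nil {
          return match[1]
      }
      return ""
  }

  func main() {
      valid := `{"@type": "NewsArticle", "headline": "A valid title"}`
      broken := `{"@type": "NewsArticle", "headline": "A broken title", }`
      fmt.Println(titleFromJSONLD(valid))  // parsed as JSON
      fmt.Println(titleFromJSONLD(broken)) // recovered via regex fallback
  }
  ```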
- Port `tree_cleaning` and `prune_html` in `htmlprocessing.py`.
- Good news: we might not need to port Python's `courlan` package since Go's `net/url` is good enough.
- Bad news: we might need to port Python's `htmldate`, which is used to find the publish date of a web page and is needed for metadata extraction.
- Porting process started.