Skip to content

Latest commit

 

History

History
133 lines (89 loc) · 5.01 KB

CHANGELOG.md

File metadata and controls

133 lines (89 loc) · 5.01 KB

Changelog

22 May 2021

  • Fix sanitizeTree and real world test.
  • Add additional selector rules.
  • Restructure cmd.
  • Update README.

21 May 2021

  • Port the comparison.py. At this point all code have been ported.
  • Strip text elements containing only spaces.
  • Fix HTML language element filter.
  • Fix postCleaning.
  • Improve test coverage.
  • Add support for details/summary tags.
  • Refined metadata title selector.
  • Include page license in metadata extraction.
  • Fix: don't remove tail of discarded elements.
  • Define generic function to remove nodes.
  • Fix wrong constant in collectLinkInfo.

20 May 2021

  • Add license header in each file
  • Improve charset encoding to make sure parsing HTML document always done in UTF-8.

19 May 2021

  • In CLI, add flags to fetch only the urls from sitemap.
  • In CLI, implement feed finder and downloader.
  • In CLI, add flags for custom user agent.
  • Move etree and selector package to internal dir so it can't be reached by user.
  • Remove finished python codes.

18 May 2021

  • In CLI, implement sitemap finder and downloader.

17 May 2021

  • In CLI, add support for several type of output.
  • In CLI, add subcommand for batch download from file that contains list of url.

16 May 2021

  • Make the log less verbose.
  • Implement initial CLI.

12 May 2021

  • Modify paragraphs handling since our output is in HTML, not XML like the original Trafilatura.
  • Put whitespace in place of void element when writing text using etree.IterText.
  • Dont strip image elements when sanitizing extraction result.
  • Add initial example.

11 May 2021

  • Implement real world test from tests/realworld_test.py. In the original Trafilatura, in this test the extraction is done while enabling fallback extractors. However, since the fallback extractors in the original Trafilatura is different with the one that used in this port, obviously the result is different as well which make the test can't be ported as is.

    To solve this, I've changed the test in the original Trafilatura to disable the fallback extractors. This way the test is more focused on the capability of Trafilatura alone, which make the test is compatible and can be ported.

10 May 2021

  • Since our port use go-readability as one of its fallback, here we updated it to more recent version of Readability.js.
  • Fix external dom package to not appending child to void elements (elements that can't have any children, eg <br/>).

9 May 2021

  • Now Extract also returns metadata along the extracted content.
  • Add advanced config in extraction Options.
  • Minor change in etree.ToString to make it more readable.
  • Implement unit tests.

8 May 2021

  • Finished implementing Extract function. At this point the port is kind of finished, but it's still not tested, so there is still a long way to go.
  • Restructure test files.

7 May 2021

  • Fix implementation of IterText in etree package.
  • Implement fallback extraction using go-readability and go-domdistiller.

6 May 2021

  • Restructure selector files.
  • Implement comments extraction.
  • Implement content extraction.

5 May 2021

  • Port some of LXML functionality to etree package.
  • Fix major issue when appending or replacing node in external dom package. Apparently this issue goes unnoticed in both go-readability and go-domdistiller.
  • Restart porting process from zero 😢.
  • Reimplement cache.
  • Reimplement metadata extractor.

4 May 2021

  • No code today. Looks like I've made a wrong assumptions about LXML library that used by the original Trafilatura. In functionality it's really similar with dom package, however there are several difference in how it works. Might need to port some codes.

3 May 2021

  • Port link_density_test and link_density_test_tables from htmlprocessing.py

2 May 2021

  • Port DISCARD_XPATH in xpaths.py

1 May 2021

  • Port LRUCache in lru.py
  • Port textfilter in filters.py
  • Port duplicate_test in filters.py
  • Port extract_comments in core.py. It's still not tested though since there are no specific unit test for this.
  • Port CONTENT_XPATH in xpaths.py

29 April 2021

  • Port check_html_lang function in filters.py
  • Port metadata extraction in metadata.py. There is a minor modification in metadata extraction from JSON+LD data. In the original Trafilatura, this step is done using regular expressions which is not exactly ideal for handling JSON data. Instead, here we use a proper JSON parser with fallback to the original regular expressions. This way, the extraction should be more accurate yet still give the same result when tested.
  • Port tree_cleaning and prune_html in htmlprocessing.py
  • Good news: we might not need to port Python's courlan package since Go's net/url is good enough.
  • Bad news: we might need to port Python's htmldate which used to find publish date of a web page, which used in metadata extraction.

25 April 2021

  • Porting process started