Releases: adbar/trafilatura
trafilatura-1.12.2
- downloads: add support for SOCKS proxies with @gremid (#682)
- extraction fix: ValueError in table spans (#685)
- spider: `prune_xpath` parameter added by @felipehertzer (#684)
- spider: relax strict parameter for link extraction (#687)
- sitemaps: `max_sitemaps` parameter added by @felipehertzer (#690)
- maintenance: make compression libraries optional (#691)
- metadata: review and lint code (#694)
trafilatura-1.12.1
trafilatura-1.12.0
Breaking change:
- enforce fixed list of output formats, deprecate `-out` on the CLI (#647)
Faster, more accurate extraction:
- review link and structure checks (#653)
- improve justext fallback (#652)
- baseline: prevent LXML error in JSON-LD (#643), do not use as backup extraction (#646)
- review XPaths for undesirable content (#645)
Bugfixes and maintenance:
- CLI fix: markdown format should trigger `include_formatting` (#649)
- images fix: use a length threshold on src attribute (#654)
- XML-TEI: replace RelaxNG by DTD, remove pickle, and update (#655)
- formatting & markdown fix: add newlines (#656)
- table fix: prevent `MemoryError` & `ValueError` during conversion to text (#658)
Documentation:
trafilatura-1.11.0
Breaking change:
- metadata now skipped by default (#613), to trigger inclusion in all output formats:
  - `with_metadata=True` (Python)
  - `--with-metadata` (CLI)
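A minimal sketch of what this change means for Python callers, assuming the `extract()` function exposed at the top level of the package. The sample HTML and the helper name `extract_both` are made up for illustration, and the import is guarded so the snippet also runs where trafilatura is not installed.

```python
import importlib.util

# Hypothetical sample document, only here to have something to extract from.
HTML = """<html><head><title>Sample page</title></head>
<body><article><h1>Sample page</h1>
<p>This paragraph is deliberately long enough that an extractor is likely
to treat it as the main content of the page rather than as boilerplate.</p>
</article></body></html>"""


def extract_both(html):
    """Return (default_output, output_with_metadata).

    Returns (None, None) when trafilatura is not installed.
    """
    if importlib.util.find_spec("trafilatura") is None:
        return None, None
    from trafilatura import extract

    default_output = extract(html)  # metadata skipped by default
    with_metadata = extract(html, with_metadata=True)  # explicit opt-in
    return default_output, with_metadata


default_output, with_metadata = extract_both(HTML)
```

Code that relied on metadata appearing in the output without asking for it would need the explicit `with_metadata=True` after upgrading.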
Extraction:
- add HTML as output format (#614)
- better and faster baseline extraction (#619)
- better handling of HTML/XML elements (#628)
- XPath rules added with @felipehertzer (#540)
- fix: avoid faulty readability_lxml content (#635)
Evaluation:
- new scripts and data with @LydiaKoerber (#606, #615)
- additional data with @swetepete (#197)
Maintenance:
trafilatura-1.10.0
Breaking changes:
- raise errors on deprecated CLI and function arguments (#581)
- regroup classes and functions linked to deduplication (#582): `trafilatura.hashing` → `trafilatura.deduplication`
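For code that has to run against versions on both sides of the rename, one possible approach is to resolve the module by name at import time. This is only a sketch: the helper `load_dedup_module` is hypothetical, and it makes no assumption about which functions the module contains.

```python
import importlib
import importlib.util


def load_dedup_module():
    """Return the deduplication helpers module under whichever name exists.

    Tries the new name first, then the old one. Returns None if neither
    can be found (for example when trafilatura is not installed).
    """
    for name in ("trafilatura.deduplication", "trafilatura.hashing"):
        try:
            if importlib.util.find_spec(name) is not None:
                return importlib.import_module(name)
        except ModuleNotFoundError:
            # Raised when the parent package itself is missing.
            break
    return None


dedup = load_dedup_module()
```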
Extraction:
- port of is_probably_readerable from readability.js by @zirkelc in #587
- Markdown table fixes by @naktinis in #601
- fix list spacing in TXT output (#598)
- CLI fixes: file processing options, mtime, and tests (#605)
- CLI fix: read standard input as binary (#607)
Downloads:
- fix deflate and add optional zstd to accepted encodings (#594)
- spider fix: use internal download utilities for robots.txt (#590)
Maintenance:
- add author XPaths (#567)
- update justext and lxml dependencies (#593)
- simplify code: unique function for length tests (#591)
Docs:
trafilatura-1.9.0
Extraction:
- add markdown as explicit output (#550)
- improve recall preset (#571)
- speedup for readability-lxml (#547)
- add global options object for extraction and use it in CLI (#552)
- fix: better encoding detection (#548)
- recall: fix for lists inside tables with @mikhainin (#534)
- add symbol to preserve vertical spacing in Markdown (#499)
- fix: table cell separators in non-XML output (#563)
- slightly better accuracy and execution speed overall
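As an illustration of the explicit Markdown output mentioned above, the sketch below passes `output_format="markdown"` to `extract()`. The sample HTML and the helper name `to_markdown` are assumptions made for the example, and the result depends on the installed version, so no expected output is shown.

```python
import importlib.util

# Hypothetical sample document with some inline formatting.
HTML = """<html><body><article><h1>Release notes</h1>
<p>This paragraph contains <b>bold text</b> and is long enough that the
extractor is likely to keep it as part of the main content of the page.</p>
</article></body></html>"""


def to_markdown(html):
    """Return the extracted text as Markdown, or None if unavailable."""
    if importlib.util.find_spec("trafilatura") is None:
        return None
    from trafilatura import extract

    return extract(html, output_format="markdown")


markdown_text = to_markdown(HTML)
```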
Metadata:
- add file creation date (date extraction, JSON & XML-TEI) (#561)
- fix: empty content in meta tag by @felipehertzer (#545)
Maintenance:
- restructure and simplify code (#543, #556)
- CLI & downloads: revamp and use global options (#565)
- eval: review code, add guidelines and small benchmark (#542)
- fix: raise error if config file does not exist (#554)
- deprecate `process_record()` (#549)
- docs: convert readme to markdown and update info (#564, #578)
trafilatura-1.8.1
trafilatura-1.8.0
Extraction:
- Better precision by @felipehertzer (#509, #520)
- Code formatting in TXT/Markdown output added (#498)
- Improved CSV output (#496)
- LXML: compile XPath expressions (#504)
- Overall speedup of about 5%
Downloads and Navigation:
- More robust scans with `is_live_page()` (#501)
- Better sitemap start and safeguards (#503, #506)
- Fix for headers in response object (#513)
Maintenance:
trafilatura-1.7.0
trafilatura-1.6.4
Maintenance:
- MacOS: fix setup, update htmldate and add tests (#460)
- drop invalid XML element attributes with @vbarbaresi in #462
- remove cyclic imports (#458)
Navigation:
- introduce `MAX_REDIRECTS` config setting and fix urllib3 redirect handling by @vbarbaresi in #461
- improve feed detection (#457)
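A possible way to override `MAX_REDIRECTS` from Python is to copy the default configuration and change the value, assuming `DEFAULT_CONFIG` in `trafilatura.settings` is a `ConfigParser` object with the setting in its `DEFAULT` section. The helper name `config_with_redirect_limit` is hypothetical, and whether a given download function honors the override may depend on the installed version.

```python
import importlib.util
from copy import deepcopy


def config_with_redirect_limit(limit):
    """Return a copy of the default config with MAX_REDIRECTS overridden.

    Returns None when trafilatura is not installed.
    """
    if importlib.util.find_spec("trafilatura") is None:
        return None
    from trafilatura.settings import DEFAULT_CONFIG

    config = deepcopy(DEFAULT_CONFIG)  # leave the shared default untouched
    config["DEFAULT"]["MAX_REDIRECTS"] = str(limit)  # values are stored as strings
    return config


no_redirects = config_with_redirect_limit(0)
```

The resulting object would then be passed through the `config` argument of the functions that accept one.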
Documentation: