Tags: coryhacking/goose

1.4.1

Resolving a goofy Maven issue; it required a new version to fully update.

1.4.0

Major: DefaultOutputFormatter#getFormattedText now unescapes HTML, including all HTML entities

Minor: I have begun converting the usage of DefaultOutputFormatter so that you only need a single method: getFormattedText(Element topNode)

Bug fixes:
  * Cleaning by class name was too aggressive and removed actual content elements. Modified the list of names to only remove classes
    that end in "meta" instead of just containing the word "meta".

  * Modified DefaultDocumentCleaner#cleanBadTags to only select from within the body element to avoid removing it.

  * Added a helper method for removing nodes to handle cases where the node's parentNode is null (already removed). This was previously
    throwing an IllegalArgumentException from within jsoup and thus failing the extraction.
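
The class-name fix above can be illustrated with a minimal, stdlib-only sketch. It is not the project's actual code: the class MetaClassFilter, both method names, and the choice to test each whitespace-separated class token are assumptions made for illustration. It only shows why "contains meta" catches more elements than "ends in meta".

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of goose: contrasts the two matching rules
// described in the 1.4.0 notes.
final class MetaClassFilter {
    // Older, broader rule: the class attribute merely contains "meta".
    static final Pattern CONTAINS_META = Pattern.compile("meta");
    // Narrower rule: a class name must end in "meta".
    static final Pattern ENDS_IN_META = Pattern.compile("meta$");

    static boolean containsMeta(String classAttr) {
        return CONTAINS_META.matcher(classAttr).find();
    }

    // Assumption: each whitespace-separated class token is checked on its own.
    static boolean endsInMeta(String classAttr) {
        for (String token : classAttr.trim().split("\\s+")) {
            if (ENDS_IN_META.matcher(token).find()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> samples = Arrays.asList("post-meta", "metadata-body", "article content");
        for (String s : samples) {
            System.out.println(s + " contains=" + containsMeta(s) + " endsIn=" + endsInMeta(s));
        }
    }
}
```

Under these assumptions, an element with class "metadata-body" would have been removed by the broader rule but is kept by the narrower one, while "post-meta" is still removed.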

1.3.14

Version 1.3.14

1.3.13

Upping to version 1.3.13, which contains a minor fix to tag extraction

1.3.12

Adding tag 1.3.12

1.3.11

Including the ability to define custom extractors as well as regex cleanups