Add new feature to extract annotated text in xml structure and sorting rules #3038

Open: nicoell wants to merge 3 commits into base: master

nicoell
Contributor

@nicoell nicoell commented Mar 20, 2025

In addition to the existing method of text extraction using Inscriptis and get_text, this PR adds a new method that extracts text with annotations.

Annotated Text

Annotated text maps CSS selectors to tags. Every element matched by a CSS selector is annotated with the corresponding XML tags.

Why is this helpful?

Annotated text preserves structure found in the HTML source.
This can be extremely helpful (it is mandatory for my use cases) when the content of detected changes needs further processing down the line, e.g. when it is sent to custom notification endpoints.

I also added a way to sort annotated text preserving structure, as described further below.

Example

<html>
   <head><title>Title</title></head>
   <body>
     Some initial text<br>
     <p>Which is across multiple lines</p>
     <a href="/first_link"> More Text </a>
     <br>
     So let's see what happens.  <br>
     <a href="second_link.com"> Even More Text </a>
     <p class="item">This is an item with title <span class="test">Item Title</span></p>
   </body>
</html>

With the annotation rules:

{
    "a": ["hyperlink", "a"],
    "span[class*='test']": ["item-title"],
    "p[class*='item']": ["item"]
}

Creates the following annotated text:

<text>Title
Some initial text
Which is across multiple lines
<hyperlink><a>More Text</a></hyperlink>
So let's see what happens.
<hyperlink><a>Even More Text</a></hyperlink><item>
This is an item with title<item-title>Item Title</item-title>
</item></text>
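
A minimal sketch of the idea, not the code in this PR: the PR uses a custom DOM traversal built on BeautifulSoup with full CSS selectors, while this standalone version uses only the standard library and a simplified, hypothetical rule format of (tag name, class substring or None, annotation tags). It also does not reproduce the PR's whitespace rules, and it only shows how matched elements end up wrapped in nested XML tags.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Elements without a closing tag; they are skipped entirely in this sketch.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Annotator(HTMLParser):
    """Wraps the text of matching elements in XML annotation tags.

    rules: list of (tag_name, class_substring_or_None, [annotation_tags]).
    This rule format is an assumption standing in for real CSS selectors.
    """

    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []    # output fragments, joined at the end
        self.stack = []  # one (tag_name, annotations) entry per open element

    def _annotations_for(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        found = []
        for name, class_part, annotations in self.rules:
            if name == tag and (class_part is None or class_part in classes):
                found.extend(annotations)
        return found

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        annotations = self._annotations_for(tag, attrs)
        self.out.extend(f"<{a}>" for a in annotations)
        self.stack.append((tag, annotations))

    def handle_endtag(self, tag):
        if not any(name == tag for name, _ in self.stack):
            return  # stray end tag, nothing to close
        # Close unclosed inner elements too, until the matching one is popped.
        while self.stack:
            name, annotations = self.stack.pop()
            self.out.extend(f"</{a}>" for a in reversed(annotations))
            if name == tag:
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            self.out.append(escape(text))


def annotate(html, rules):
    parser = Annotator(rules)
    parser.feed(html)
    parser.close()
    while parser.stack:  # close anything the HTML left open
        _, annotations = parser.stack.pop()
        parser.out.extend(f"</{a}>" for a in reversed(annotations))
    return "<text>" + "".join(parser.out) + "</text>"


rules = [
    ("a", None, ["hyperlink", "a"]),
    ("span", "test", ["item-title"]),
    ("p", "item", ["item"]),
]
html = (
    '<a href="/first_link"> More Text </a>'
    '<p class="item">This is an item with title <span class="test">Item Title</span></p>'
)
result = annotate(html, rules)
```

Here `result` nests the tags in rule order, so the link becomes `<hyperlink><a>More Text</a></hyperlink>` and the item title sits inside the `<item>` block.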

Sorting annotated text

Either a single CSS or XPath Selector or a pair of two selectors (parent -> child) can be defined.
These selectors are run on the annotated text, so they need to match the annotation xml tags.

Sort Selectors consist of:

  • an "Element-to-Sort" CSS or XPath Selector matching the annotated tag to be sorted.
  • and an optional "Sort-Identifier" CSS or XPath Selector relative to Element-to-Sort matching a child annotated tag containing the text to base the sorting on.

Example

<html>
  <body>
    <div class="outer">
        <span class="inner">Y-Item <span class="name">B</span></span>
        <span class="inner">Z-Item <span class="name">A</span></span>
    </div>
    <div class="outer">
        <span class="inner">W-Item <span class="name">D</span></span>
        <span class="inner">X-Item <span class="name">C</span></span>
    </div>
  </body>
</html>

Annotation Rules:

{
    "div[class*='outer']": ["outer"],
    "span[class*='inner']": ["inner"],
    "span[class*='name']": ["name"]
}

Sort Selectors

[
        [("outer", ""), ("outer > inner", "name")]
]

In this example the annotated outer tags are sorted first, then the inner tags within each outer tag are sorted by name:

<text><outer><inner>Z-Item<name>A</name></inner>
<inner>Y-Item<name>B</name></inner></outer><outer><inner>X-Item<name>C</name></inner>
<inner>W-Item<name>D</name></inner></outer></text>
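
A rough sketch of sorting by a Sort-Identifier, under stated assumptions: the PR runs CSS or XPath selectors on an lxml tree, whereas this version uses the standard library's ElementTree and plain tag names. The function name `sort_children` and the fallback to an element's full text when no identifier is given are assumptions made for illustration, not the PR's behavior.

```python
import xml.etree.ElementTree as ET


def sort_children(root, tag, key_tag=None):
    """Reorder sibling elements named `tag` under every parent.

    Elements are sorted by the text of their `key_tag` descendant, or by
    their own full text when `key_tag` is not given or not found. The
    whitespace after each slot stays in place, so only elements move.
    """

    def sort_key(element):
        if key_tag:
            found = element.find(f".//{key_tag}")
            if found is not None:
                return "".join(found.itertext()).strip()
        return "".join(element.itertext()).strip()

    # list() so the tree is not mutated while it is being walked.
    for parent in list(root.iter()):
        slots = [i for i, child in enumerate(parent) if child.tag == tag]
        if len(slots) < 2:
            continue
        elements = [parent[i] for i in slots]
        tails = [e.tail for e in elements]  # text following each slot
        for i, element, tail in zip(slots, sorted(elements, key=sort_key), tails):
            element.tail = tail
            parent[i] = element


annotated = (
    "<text><outer><inner>Y-Item<name>B</name></inner>\n"
    "<inner>Z-Item<name>A</name></inner></outer></text>"
)
root = ET.fromstring(annotated)
sort_children(root, "inner", key_tag="name")
sorted_text = ET.tostring(root, encoding="unicode")
```

The whole `<inner>` block moves as a unit, so "Z-Item" travels together with its name "A" instead of being sorted line by line.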

ChangeDetection User Interface

The Extraction Method can now be selected. "Extract text only" is active by default and applies the current way of extracting text.
[Screenshot: Extraction Method selector]

Annotation Rules

When Extract annotated text is selected, additional options that are only available for annotated text become visible.
[Screenshot: Annotation Rules options]

Sort Annotated text by matched tags

[Screenshot: sort settings for annotated text]

Notes

Why not Inscriptis Annotations?

Initially I planned to use Inscriptis annotations, which is also why the final implementation and syntax are close to them.
However, Inscriptis annotations did not work reliably.

So instead I wrote a custom DOM traversal built on BeautifulSoup. This also made it possible to replace the Inscriptis way of selecting elements with fully fledged CSS selectors, which I find a better fit for changedetection.io.

Whitespace

A side effect of this is that whitespace is preserved or added differently in the annotated text.
I spent quite some time looking into this, and I believe the current set of rules is a good fit for annotated text, especially because annotations preserve additional structure that can be used to distinguish text from different elements.

Why not a new Processor?

I thought about this for a while but concluded that there is too much overlap between text extraction and annotated text to separate them.
I think it is more convenient to be able to switch easily between them. I also believe that a new processor should feel like a standalone addition (like Re-stock) that does not share settings with other processors.

watch-settings.js

Being new to change-detection, it cost me a few hours of my life trying to understand why the extraction method was not working in the Preview. It turned out to be the jQuery code that gathers the selected states.

RadioFields now also work correctly when used with the Preview.

Tests

I added tests for both annotated_text and sorting.
More integration tests could be added, but I was not sure whether they are needed.

Future

I would like to look into ways of processing annotated text with Jinja2 in the notification body.

@dgtlmoon
Owner

if i understand correctly, you are implementing cssselect as a new filter type or ?

@dgtlmoon
Owner

hmmm i think its nice but these annotation rulesets are really not well known, the UI would need to educate the users more about what problem it solves

to me its not really clear what the problem is that it solves? its just the format of the text or?

@nicoell
Copy link
Contributor Author

nicoell commented Mar 21, 2025

if i understand correctly, you are implementing cssselect as a new filter type or ?

I added cssselect as a requirement because sort_annotated_text_by_selectors uses it to run CSS selectors on an lxml.html tree, here: return root.cssselect(selector)

to me its not really clear what the problem is that it solves? its just the format of the text or?

For me this is a solution to these problems:
I track different sites with a list of entries where I am interested in multiple values associated with them: Title, Short excerpt, Author, DateCreated, DateEdited, and a few more. If any of these change, or entries are added or removed, I want to detect the change.
But not every site has all of these values and the structure of where they appear on the website varies.
Extracting the plain text form, I lose track of what the extracted text is. Is the date in my text the creation date or the last edited date? Where does the title end and where does the excerpt start? I have all of this information when I set up the include and remove filters in change-detection and I would like to preserve it.
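
To illustrate what preserving this information buys downstream, here is a small sketch of a consumer (for example a custom notification endpoint) reading fields back by tag name. The sample text, the tag names `entry`, `date-created` and `date-edited`, and the helper `entries_as_dicts` are all hypothetical, and the sketch assumes the annotated text is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated text; the tag names are whatever the rules assign.
annotated = (
    "<text>"
    "<entry><title>First post</title><author>Ada</author>"
    "<date-created>2025-03-01</date-created>"
    "<date-edited>2025-03-20</date-edited></entry>"
    "<entry><title>Second post</title><author>Bob</author>"
    "<date-created>2025-03-05</date-created></entry>"
    "</text>"
)


def entries_as_dicts(xml_text, entry_tag="entry"):
    """Turn each annotated entry block into a {tag: text} dictionary."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: "".join(child.itertext()).strip() for child in entry}
        for entry in root.iter(entry_tag)
    ]


entries = entries_as_dicts(annotated)
first_edited = entries[0].get("date-edited")
second_edited = entries[1].get("date-edited")  # None: this site has no edit date
```

With plain text the two dates in the first entry would be indistinguishable, and the missing edit date in the second entry would go unnoticed.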

Entries are often moved around on the sites. There is no actual change, but I also cannot sort the text alphabetically, because a lot of the information belongs together.
Now I can annotate it:

<entry>
<title>Title</title>
<author>Author</author>
...
</entry>

and sort the whole entry-block based on its title.

hmmm i think its nice but these annotation rulesets are really not well known, the UI would need to educate the users more about what problem it solves

Hm, yeah, I think describing what annotation and sorting can be used for, instead of the current explanation of what they are, would be better. I also think linking to a wiki page would be helpful; I could write one.

An attempt at better UI explanations.

Annotation Rules:
Annotations map HTML elements to XML tags in the extracted text, preserving semantics found in the original HTML source (like title, author, price, or dates) that allow more effective downstream processing.

Sort annotated text:
Sorting annotated text allows reordering annotated XML elements while preserving their structure (e.g. sorting <entry><title>Title</title><author>Author</author></entry> blocks by title).

nicoell added 3 commits March 23, 2025 12:38
…ext (default) and annotated_text extraction

- annotated text allows to annotate text of matched css selector with xml tags
- Basic explanation now tries to educate the user more about what problem the feature solves
- Advanced help and tips were improved for more clarity
- Improved formatting to be in line with descriptions of other forms
@nicoell
Contributor Author

nicoell commented Mar 23, 2025

I improved the UI descriptions with a67871e

Let me know what you think.
