Add new feature to extract annotated text in xml structure and sorting rules #3038

nicoell · 2025-03-20T21:58:35Z

In addition to the existing text extraction using Inscriptis and get_text, this PR adds a new method that extracts text with annotations.

Annotated Text

Annotated text maps CSS selectors to tags. Every element matched by a CSS selector is wrapped in the corresponding XML annotation tags.

Why is this helpful?

Annotated text preserves structure found in the HTML source.
This is extremely helpful (and mandatory for my use cases) when the content of detected changes needs to be processed further down the line, e.g. when it is sent to custom notification endpoints.

I also added a way to sort annotated text preserving structure, as described further below.

Example

<html>
   <head><title>Title</title></head>
   <body>
     Some initial text<br>
     <p>Which is across multiple lines</p>
     <a href="/first_link"> More Text </a>
     <br>
     So let's see what happens.  <br>
     <a href="second_link.com"> Even More Text </a>
     <p class="item">This is an item with title <span class="test">Item Title</span></p>
   </body>
</html>

With the annotation rules:

{
    "a": ["hyperlink", "a"],
    "span[class*='test']": ["item-title"],
    "p[class*='item']": ["item"]
}

the following annotated text is created:

<text>Title
Some initial text
Which is across multiple lines
<hyperlink><a>More Text</a></hyperlink>
So let's see what happens.
<hyperlink><a>Even More Text</a></hyperlink><item>
This is an item with title<item-title>Item Title</item-title>
</item></text>
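
The mapping from rules to nested annotation tags can be sketched with the standard library alone. This is not the PR's implementation, which walks a BeautifulSoup DOM and supports full CSS selectors; the sketch below is an assumption-laden simplification that matches only on tag name plus a class substring (mimicking `[class*='…']`). The names `RULES`, `AnnotatingParser` and `annotate` are hypothetical, and the PR's whitespace and line-break rules are not reproduced.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical, simplified rules: (tag, class substring or None) -> annotation tags.
RULES = {
    ("a", None): ["hyperlink", "a"],
    ("span", "test"): ["item-title"],
    ("p", "item"): ["item"],
}

# Elements that never have a closing tag and carry no text of their own.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class AnnotatingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []  # output fragments, joined at the end
        self.stack = []  # one (html_tag, annotation_names) pair per open element

    def _match(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        for (rule_tag, needle), names in RULES.items():
            if rule_tag == tag and (needle is None or needle in classes):
                return names
        return []

    def _close_down_to(self, index):
        # Close every element above `index`, innermost annotation first.
        while len(self.stack) > index:
            _, names = self.stack.pop()
            self.parts.extend(f"</{n}>" for n in reversed(names))

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        names = self._match(tag, attrs)
        self.parts.extend(f"<{n}>" for n in names)
        self.stack.append((tag, names))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open element; ignore stray end tags.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                self._close_down_to(i)
                return

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text:
            self.parts.append(escape(text))


def annotate(html):
    parser = AnnotatingParser()
    parser.feed(html)
    parser.close()
    parser._close_down_to(0)  # close anything the source left open
    return "<text>" + "".join(parser.parts) + "</text>"


print(annotate('<p class="item">This is an item with title <span class="test">Item Title</span></p>'))
```

A rule with several names, such as the one for `a`, opens its tags in list order and closes them in reverse, which is what produces the nested `<hyperlink><a>…</a></hyperlink>` in the example above.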

Sorting annotated text

Either a single CSS or XPath Selector or a pair of two selectors (parent -> child) can be defined.
These selectors are run on the annotated text, so they need to match the annotation xml tags.

Sort Selectors consist of:

an "Element-to-Sort" CSS or XPath selector matching the annotated tag to be sorted,
and an optional "Sort-Identifier" CSS or XPath selector, relative to the Element-to-Sort, matching a child annotated tag whose text the sorting is based on.

Example

<html>
  <body>
    <div class="outer">
        <span class="inner">Y-Item <span class="name">B</span></span>
        <span class="inner">Z-Item <span class="name">A</span></span>
    </div>
    <div class="outer">
        <span class="inner">W-Item <span class="name">D</span></span>
        <span class="inner">X-Item <span class="name">C</span></span>
    </div>
  </body>
</html>

Annotation Rules:

{
    "div[class*='outer']": ["outer"],
    "span[class*='inner']": ["inner"],
    "span[class*='name']": ["name"]
}

Sort Selectors

[
        [("outer", ""), ("outer > inner", "name")]
]

In this example the annotated outer tags are sorted first, then the inner tags within each outer tag are sorted by name:

<text><outer><inner>Z-Item<name>A</name></inner>
<inner>Y-Item<name>B</name></inner></outer><outer><inner>X-Item<name>C</name></inner>
<inner>W-Item<name>D</name></inner></outer></text>
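
The parent -> child pair `("outer > inner", "name")` can be illustrated with a minimal sketch. It assumes the annotated text is well-formed XML and uses `xml.etree.ElementTree` with plain tag names instead of the CSS/XPath selectors the PR supports; the function name `sort_children` and its parameters are hypothetical and not part of the PR. Sorting the outer tags themselves is left out.

```python
import xml.etree.ElementTree as ET


def sort_children(root, parent_path, child_tag, key_tag):
    """Sort `child_tag` elements inside each `parent_path` match by the text of `key_tag`."""
    for parent in root.iterfind(parent_path):
        # Positions currently occupied by the elements to sort.
        slots = [i for i, child in enumerate(parent) if child.tag == child_tag]
        # Text that follows each slot (e.g. the newline between items) stays in place.
        tails = [parent[i].tail for i in slots]
        ordered = sorted(
            (parent[i] for i in slots),
            key=lambda elem: (elem.findtext(key_tag) or "").strip(),
        )
        for slot, elem, tail in zip(slots, ordered, tails):
            parent[slot] = elem
            elem.tail = tail
    return root


annotated = (
    "<text><outer><inner>Y-Item<name>B</name></inner>\n"
    "<inner>Z-Item<name>A</name></inner></outer>"
    "<outer><inner>W-Item<name>D</name></inner>\n"
    "<inner>X-Item<name>C</name></inner></outer></text>"
)
root = sort_children(ET.fromstring(annotated), "outer", "inner", "name")
result = ET.tostring(root, encoding="unicode")
print(result)
```

Only the matched elements change position; their tail text is reassigned by slot so the line breaks between items stay where they were.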

ChangeDetection User Interface

The Extraction Method can now be selected. "Extract text only" is active by default and applies the existing way of extracting text.

Annotation Rules

When Extract annotated text is selected, additional options that are only available for annotated text become visible.

Sort Annotated text by matched tags

Notes

Why not Inscriptis Annotations?

Initially I planned to use Inscriptis annotations, which is also why the final implementation and syntax are close to them.
However, Inscriptis annotations did not work reliably.

I found one bug with the XmlExtractor that I reported here: XmlExtractor and HtmlExtractor produce wrong tag order weblyzard/inscriptis#93
I also got completely wrong results with the label offsets, which might be what is described in Label offest no accurate in case of table weblyzard/inscriptis#92, but I'm not sure about that.

So instead I wrote a custom DOM traversal built on BeautifulSoup. This also allowed me to replace the Inscriptis way of selecting elements with fully fledged CSS selectors, which I find to be a better fit for changedetection.io.

Whitespace

A side effect of this is that whitespace is preserved/added differently in the annotated text.
I spent quite some time looking into this, and I believe the current set of rules is a good fit for annotated text, especially because the annotations preserve additional structure that can be used to distinguish text from different elements.

Why not a new Processor?

I thought about this for a while but concluded that there are too many overlaps between text extraction and annotated text to separate them.
I think it is more convenient to be able to switch between them easily. I also believe that a new processor should feel like a standalone addition (like Re-stock) that does not share settings with other processors.

watch-settings.js

Being new to changedetection.io, I spent a few hours of my life trying to understand why the extraction method was not working in the Preview... Turns out it's because of the jQuery code gathering the selected states.

Well, RadioFields now also work correctly when used with the Preview.

Tests

I added tests for both annotated_text and sorting.
There might be room for some more integration tests, but I was not sure whether they are needed.

Future

I would like to look into ways of processing annotated text with Jinja2 in the notification body.

dgtlmoon · 2025-03-20T22:54:08Z

if i understand correctly, you are implementing cssselect as a new filter type or ?

dgtlmoon · 2025-03-20T22:55:48Z

hmmm i think its nice but these annotation rulesets are really not well known, the UI would need to educate the users more about what problem it solves

to me its not really clear what the problem is that it solves? its just the format of the text or?

nicoell · 2025-03-21T08:27:55Z

if i understand correctly, you are implementing cssselect as a new filter type or ?

I added cssselect as a requirement because sort_annotated_text_by_selectors uses it to run CSS selectors on an lxml.html tree, here: return root.cssselect(selector)

to me its not really clear what the problem is that it solves? its just the format of the text or?

For me this is a solution to these problems:
I track different sites with a list of entries where I am interested in multiple values associated with them: Title, Short excerpt, Author, DateCreated, DateEdited, and a few more. If any of these change, or entries are added or removed, I want to detect the change.
But not every site has all of these values, and the structure of where they appear on the website varies.
When extracting plain text, I lose track of what the extracted text is. Is the date in my text the creation date or the last edited date? Where does the title end and where does the excerpt start? I have all of this information when I set up the include and remove filters in changedetection.io and I would like to preserve it.

Entries are also often moved around on the sites. There's no actual change, but I cannot simply sort the text alphabetically either, because a lot of the information belongs together.
Now I can annotate it:

<entry>
<title>Title</title>
<author>Author</author>
...
</entry>

and sort the whole entry-block based on its title.
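
As a rough illustration of why this helps with change detection: once whole entry blocks are sorted by title, two snapshots that differ only in entry order become identical. The sketch below is a stdlib-only assumption, not the PR's code; the function name `normalized` is hypothetical, the tag names `entry` and `title` come from the example above, and it assumes every direct child of the root is an entry with no text in between.

```python
import xml.etree.ElementTree as ET


def normalized(annotated_text):
    """Return the annotated text with top-level blocks ordered by their <title> text."""
    root = ET.fromstring(annotated_text)
    # Each entry moves as a unit, so title and author stay together.
    root[:] = sorted(root, key=lambda entry: entry.findtext("title") or "")
    return ET.tostring(root, encoding="unicode")


monday = (
    "<text><entry><title>B</title><author>X</author></entry>"
    "<entry><title>A</title><author>Y</author></entry></text>"
)
tuesday = (
    "<text><entry><title>A</title><author>Y</author></entry>"
    "<entry><title>B</title><author>X</author></entry></text>"
)
print(normalized(monday) == normalized(tuesday))
```

A plain alphabetical sort of lines would instead scatter titles and authors across the output and lose which author belongs to which entry.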

hmmm i think its nice but these annotation rulesets are really not well known, the UI would need to educate the users more about what problem it solves

Hm, yeah, I think describing what the annotation and sorting can be used for, instead of the current explanation of what they are, would be better. I also think linking to a wiki page would be helpful; I could write one.

An attempt at better UI explanations.

Annotation Rules:
Annotations map HTML elements to XML tags in the extracted text, preserving semantics found in the original HTML source (like title, author, price, or dates) that allow more effective downstream processing.

Sort annotated text:
Sorting annotated text allows reordering annotated XML elements while preserving their structure (e.g. sorting <entry><title>Title</title><author>Author</author></entry> blocks by title).


nicoell · 2025-03-23T14:15:35Z

I improved the UI descriptions with a67871e

Let me know what you think.

nicoell added 3 commits March 23, 2025 12:38

Add method to extract annotated text and add allow choosing between t…

102ee11

…ext (default) and annotated_text extraction - annotated text allows to annotate text of matched css selector with xml tags

Allow sorting tags in annotated text based on its contents or child t…

be33519

…ag contents

Improve UI descriptions for annotated text extraction and sorting

a67871e

- Basic explanation now tries to educate the user more about what problem the feature solves - Advanced help and tips were improved for more clarity - Improved formatting to be in line with descriptions of other forms

nicoell force-pushed the annotated-text branch from 75221e2 to a67871e Compare March 23, 2025 14:13
