Add new feature to extract annotated text in xml structure and sorting rules #3038
If I understand correctly, you are implementing
Hmm, I think it's nice, but these annotation rulesets are really not well known. The UI would need to educate users more about what problem it solves. To me it's not really clear what the problem is: is it just the format of the text?
I added cssselect as a requirement, because it is used in
For me this is a solution to these problems: entries are often moved around on the sites. There's no actual change, but I also cannot sort the text alphabetically, because I have a lot of information that belongs together.
and sort the whole entry-block based on its title.
Hm, yeah, I think describing what the annotation and sorting can be used for, instead of the current explanation of what it is, would be better. I also think linking to a wiki page would be helpful; I could write one. An attempt at better UI explanations. Annotation Rules: Sort annotated text:
…ext (default) and annotated_text extraction - annotated text allows to annotate text of matched css selector with xml tags
- Basic explanation now tries to educate the user more about what problem the feature solves
- Advanced help and tips were improved for more clarity
- Improved formatting to be in line with descriptions of other forms
I improved the UI descriptions with a67871e. Let me know what you think.
In addition to the existing method of text extraction using Inscriptis and get_text, a new method to extract texts with annotations is added.
Annotated Text
Annotated text lets you map tags to a CSS selector. All elements matched by the CSS selector are wrapped in XML annotations.
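To make the idea concrete, here is a minimal sketch of selector-to-tag annotation. It is not the code from this PR: the PR builds on BeautifulSoup with full CSS selectors, while this sketch swaps in the standard library `html.parser` and only understands `tag` and `tag.class` selectors. The rule format (a dict of selector to tag name), the `RULES` values and the `annotate` helper are all assumptions made for illustration, and the sketch expects well-formed HTML.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical rule format: simple selector ("tag" or "tag.class") -> annotation tag.
RULES = {"h2": "title", "p.price": "price"}

# Elements without a closing tag; they are ignored so the stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Annotator(HTMLParser):
    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.out = []
        self.stack = []  # annotation name (or None) for every open element

    def _match(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for selector, name in self.rules.items():
            sel_tag, _, sel_class = selector.partition(".")
            if sel_tag == tag and (not sel_class or sel_class in classes):
                return name
        return None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        name = self._match(tag, attrs)
        self.stack.append(name)
        if name:
            self.out.append(f"<{name}>")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        name = self.stack.pop()
        if name:
            self.out.append(f"</{name}>")

    def handle_data(self, data):
        # Collapse whitespace and escape characters that are special in XML.
        text = " ".join(data.split())
        if text:
            self.out.append(escape(text))


def annotate(html, rules):
    parser = Annotator(rules)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Text inside unmatched elements is kept as plain text, so only the parts you care about carry tags.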
Why is this helpful?
Annotated text makes it possible to preserve structure found in the HTML source.
This can be extremely helpful (it is mandatory for my use cases) when the content of detected changes needs to be processed further down the line, e.g. when it is sent to custom notification endpoints.
I also added a way to sort annotated text preserving structure, as described further below.
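As an illustration of the downstream-processing point above, a receiver could turn annotated text into structured data with the standard library alone. This is only a sketch under assumptions: the `entry`, `title` and `price` tag names and the `entries_to_payload` helper are hypothetical and not part of this PR, and the annotated text is assumed to be well-formed XML.

```python
import json
import xml.etree.ElementTree as ET


def entries_to_payload(annotated, entry_tag="entry"):
    """Convert annotated text into a JSON payload, one record per entry block."""
    # Wrap the fragment in a synthetic root so it parses as one XML document.
    root = ET.fromstring(f"<root>{annotated}</root>")
    records = [
        # Each direct child of an entry becomes a field: tag name -> stripped text.
        {field.tag: (field.text or "").strip() for field in entry}
        for entry in root.iter(entry_tag)
    ]
    return json.dumps({"entries": records})
```

With plain extracted text the same receiver would have to guess where one entry ends and the next begins.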
Example
With the annotation rules:
Creates the following annotated text:
Sorting annotated text
Either a single CSS or XPath Selector or a pair of two selectors (parent -> child) can be defined.
These selectors are run on the annotated text, so they need to match the annotation xml tags.
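A rough sketch of what structure-preserving sorting can look like, using `xml.etree.ElementTree` and its limited XPath support. This is not the PR's implementation. I am assuming here that the first selector picks the blocks to reorder and the optional second selector picks the element whose text is the sort key; `sort_annotated`, `block_path` and `key_path` are hypothetical names, and the case-insensitive comparison is my own choice.

```python
import xml.etree.ElementTree as ET


def sort_annotated(annotated, block_path, key_path=None):
    """Reorder elements matched by block_path; sort on key_path text if given."""
    # Wrap the fragment in a synthetic root so it parses as one XML document.
    root = ET.fromstring(f"<root>{annotated}</root>")
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    def sort_key(block):
        target = block.find(key_path) if key_path else block
        if target is None:
            return ""
        # Compare on the full text content, ignoring case.
        return "".join(target.itertext()).strip().lower()

    # Group matched blocks by container so siblings are only sorted among themselves.
    groups = {}
    for block in root.findall(block_path):
        groups.setdefault(parent_of[block], []).append(block)

    for container, blocks in groups.items():
        # Put the sorted blocks back into the positions the matches occupied.
        slots = [i for i, child in enumerate(container) if child in blocks]
        for slot, block in zip(slots, sorted(blocks, key=sort_key)):
            container[slot] = block

    return (root.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in root
    )
```

Each block moves as a whole, so information that belongs together stays together while the order becomes stable across page reshuffles.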
Sort Selectors consist of:
Example
Annotation Rules:
Sort Selectors
In this example we will first sort the annotated outer tags, then inner tags sorted by name:
ChangeDetection User Interface
Extraction Method can now be selected.

Extract text only is active by default and applies the current way of extracting text.
Annotation Rules
When Extract annotated text is selected, additional options that are only available for annotated text become visible.

Sort Annotated text by matched tags
Notes
Why not Inscriptis Annotations?
Initially I planned to use Inscriptis annotations, which is also why the final implementation and syntax is close to it.
However, Inscriptis annotations did not work reliably.
So instead I wrote a custom DOM traversal built on BeautifulSoup. This also allowed me to replace the Inscriptis way of selecting elements with fully fledged CSS selectors, which I find fit better into changedetection.io.
Whitespace
A side effect of this is that whitespace is preserved or added differently in the annotated text.
I spent quite some time looking into this, and I believe the current set of rules is a good fit for annotated text, especially because annotations preserve additional structure that can be used to distinguish text from different elements.
Why not a new Processor?
I thought about this for a while but concluded that there are too many overlaps between Text extraction and Annotated Text to separate them.
I think it is more convenient to be able to easily switch between them. And I believe that a new processor should feel like a standalone addition (like Re-stock) that does not share settings with other processors.
watch-settings.js
As I am new to changedetection.io, it cost me a few hours of my life to understand why the extraction method was not working in the Preview... Turns out it was because of the jQuery code that gathers the selected states.
Well, RadioFields now also work correctly when used with the Preview.
Tests
I added tests for both annotated_text and sorting.
There might be an option to add some more integration tests, but I was not sure about the need for that.
Future
I would like to look into ways of processing annotated text with Jinja2 in the notification body.