Add LTO (Legal Tribune Online) publisher #799

Open

elias-polyapp wants to merge 1 commit into flairNLP:master from elias-polyapp:feature/add-lto-publisher

elias-polyapp commented Oct 20, 2025

Description

This PR adds support for Legal Tribune Online (LTO), a German legal news publisher.

Changes

- Added LTOParser with support for:
  - Title extraction from og:title meta tag
  - Article body with summary, paragraphs, and subheadings
  - Author extraction
  - Publishing date parsing
  - Topics extraction from keywords
- Configured publisher with three sources:
  - RSS Feed: https://www.lto.de/rss/feed.xml
  - NewsMap: https://www.lto.de/googlenews-sitemap.xml
  - Sitemap: https://www.lto.de/sitemap.xml
- Generated unit tests for parser validation
- Updated supported publishers documentation

Testing

All unit tests pass successfully:

✅ test_annotations[LTO]
✅ test_parsing[LTO]
✅ test_reserved_attribute_names[LTO]

Checklist

- Parser implemented following attribute guidelines
- Unit tests generated and passing
- Documentation updated
- Code follows project style (would run black, isort, mypy before final submission)


Add LTO (Legal Tribune Online) publisher

29e0ada

- Add LTOParser with support for title, body, authors, publishing_date, and topics
- Configure publisher with RSS feed, NewsMap, and Sitemap sources
- Generate unit tests for parser validation
- Update supported publishers documentation

addie9800 self-assigned this

addie9800 requested changes

Collaborator

addie9800 left a comment

Thank you so much for adding LTO 🚀. I only have a couple of comments before we are ready to merge.

src/fundus/publishers/de/__init__.py

    
    sources=[
        RSSFeed("https://www.lto.de/rss/feed.xml"),
        NewsMap("https://www.lto.de/googlenews-sitemap.xml"),
        Sitemap("https://www.lto.de/sitemap.xml"),

Collaborator

addie9800 Oct 23, 2025

It seems like there are some unnecessary sitemaps in this index sitemap. We should restrict this to URLs of the form: https://www.lto.de/sitemap-type/article/page-x/sitemap.xml using the sitemap_filter argument.
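One way to express that restriction is a regular expression over sitemap URLs. The sketch below is standalone Python rather than fundus code: the names ARTICLE_SITEMAP, is_article_sitemap, and should_skip are hypothetical, and it assumes the "x" in page-x stands for a page number. Whether sitemap_filter expects a predicate that returns True for URLs to skip or for URLs to keep should be checked against the fundus source, so both directions are shown.

```python
import re

# Assumption: "page-x" in the review comment means "page-" followed by digits.
ARTICLE_SITEMAP = re.compile(
    r"^https://www\.lto\.de/sitemap-type/article/page-\d+/sitemap\.xml$"
)


def is_article_sitemap(url: str) -> bool:
    """Return True for sitemap URLs of the article form named in the review."""
    return ARTICLE_SITEMAP.match(url) is not None


def should_skip(url: str) -> bool:
    """Inverse predicate, for filter APIs where True means 'drop this URL'."""
    return not is_article_sitemap(url)
```

A predicate like this could then be passed as the sitemap_filter argument of the Sitemap source, in whichever polarity the library expects.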

src/fundus/publishers/de/lto.py

    
    @attribute
    def publishing_date(self) -> Optional[datetime.datetime]:
        # Try to get date from meta tag or page content

Collaborator

addie9800 Oct 23, 2025

Perhaps you can use your solution as a fallback in case self.precomputed.meta.get("date") fails. I checked, and in most cases the meta value should be sufficient.
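The meta-first, fallback-second order suggested here could look roughly like the following. This is a standalone sketch, not the parser code from the PR: publishing_date_with_fallback and its fallback parameter are hypothetical names, the meta mapping stands in for self.precomputed.meta, and the date value is assumed to be an ISO 8601 string. Inside the parser, the project's own date-parsing utility would likely take the place of fromisoformat.

```python
import datetime
from typing import Callable, Mapping, Optional


def publishing_date_with_fallback(
    meta: Mapping[str, str],
    fallback: Callable[[], Optional[datetime.datetime]],
) -> Optional[datetime.datetime]:
    """Prefer the 'date' meta entry; call the fallback only if that fails."""
    raw = meta.get("date")
    if raw:
        try:
            # Assumption: the meta value is ISO 8601, e.g. 2025-10-20T08:15:00+02:00
            return datetime.datetime.fromisoformat(raw)
        except ValueError:
            pass  # malformed value: fall through to the fallback
    return fallback()
```

The fallback is passed as a callable so that the more expensive page-content lookup only runs when the meta tag is missing or unparsable.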

src/fundus/publishers/de/lto.py

    
    class LTOParser(ParserProxy):
        class V1(BaseParser):

Collaborator

addie9800 Oct 23, 2025

It seems like the implementation of images is missing.

src/fundus/publishers/de/lto.py

    
    class LTOParser(ParserProxy):
        class V1(BaseParser):
            _paragraph_selector = CSSSelector("div.article-text-wrapper > p")

Collaborator

addie9800 Oct 23, 2025

In this article, some bloat (xp/LTO-Redaktion) is also selected erroneously.
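To illustrate the kind of line that needs excluding, here is a small text-level filter. It is only a sketch under assumptions: BYLINE and drop_bylines are hypothetical names, and the pattern assumes such lines consist of short author abbreviations separated by slashes and ending in "LTO-Redaktion", which is inferred from the single example above. In the parser itself, tightening the paragraph selector would probably be the more natural fix.

```python
import re
from typing import List

# Assumed byline shape: one or more short abbreviations, each followed by "/",
# then "LTO-Redaktion" (optionally with a trailing period).
BYLINE = re.compile(r"^(?:\w{1,5}/)+LTO-Redaktion\.?$")


def drop_bylines(paragraphs: List[str]) -> List[str]:
    """Remove paragraphs that consist solely of an editorial byline."""
    return [p for p in paragraphs if not BYLINE.match(p.strip())]
```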

src/fundus/publishers/de/lto.py

    
    @attribute
    def topics(self) -> List[str]:
        keywords = self.precomputed.meta.get("keywords")
        if keywords:

Collaborator

addie9800 Oct 23, 2025

This part is not necessary; you can use the utility function generic_topic_parsing instead.
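For context, a helper of this kind typically reduces to splitting the keyword string and cleaning up the pieces. The function below is a hypothetical stand-in written for illustration, not the fundus implementation; generic_topic_parsing may differ in signature and behaviour.

```python
from typing import List, Optional


def split_keywords(keywords: Optional[str], delimiter: str = ",") -> List[str]:
    """Split a delimited keyword string into unique, stripped topics."""
    if not keywords:
        return []
    seen = set()
    topics: List[str] = []
    for part in keywords.split(delimiter):
        topic = part.strip()
        # Skip empty fragments and repeats while keeping the original order.
        if topic and topic not in seen:
            seen.add(topic)
            topics.append(topic)
    return topics
```

Handling the missing-value case inside the helper is what makes the manual `if keywords:` check in the attribute redundant.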

src/fundus/publishers/de/lto.py

    
    @attribute
    def topics(self) -> List[str]:
        keywords = self.precomputed.meta.get("keywords")

Collaborator

addie9800 Oct 23, 2025

I am afraid the keywords here are primarily used for SEO and don't really reflect the content of the article. It would be better to take the keywords from the section Mehr zum Thema at the bottom of the article.

Labels

None yet