@elias-polyapp

Description

This PR adds support for Legal Tribune Online (LTO), a German legal news publisher.

Changes

  • Add LTOParser with support for title, body, authors, publishing_date, and topics
  • Configure publisher with RSS feed, NewsMap, and Sitemap sources
  • Generate unit tests for parser validation
  • Update supported publishers documentation

Testing

All unit tests pass successfully:

  • ✅ test_annotations[LTO]
  • ✅ test_parsing[LTO]
  • ✅ test_reserved_attribute_names[LTO]

Checklist

  • Parser implemented following attribute guidelines
  • Unit tests generated and passing
  • Documentation updated
  • Code follows project style (would run black, isort, mypy before final submission)

@addie9800 addie9800 self-assigned this Oct 23, 2025

@addie9800 addie9800 left a comment

Thank you so much for adding LTO 🚀. I only have a couple of comments before we are ready to merge.

sources=[
    RSSFeed("https://www.lto.de/rss/feed.xml"),
    NewsMap("https://www.lto.de/googlenews-sitemap.xml"),
    Sitemap("https://www.lto.de/sitemap.xml"),

It seems like there are some unnecessary sitemaps in this index sitemap. We should restrict this to URLs of the form: https://www.lto.de/sitemap-type/article/page-x/sitemap.xml using the sitemap_filter argument.
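
A minimal sketch of the kind of filter meant here, using only the standard library. It assumes that a filter returns True for entries that should be skipped and that "page-x" stands for a numeric page index; the function name and the non-article sample URL are hypothetical.

```python
import re

# Article sitemaps as described in the comment above; "page-x" is assumed
# to be a numeric page index.
ARTICLE_SITEMAP = re.compile(
    r"^https://www\.lto\.de/sitemap-type/article/page-\d+/sitemap\.xml$"
)


def skip_sitemap(url: str) -> bool:
    """Return True when the sitemap should be skipped (assumed convention)."""
    return ARTICLE_SITEMAP.match(url) is None


# Hypothetical entries of the index sitemap, for illustration only.
candidates = [
    "https://www.lto.de/sitemap-type/article/page-1/sitemap.xml",
    "https://www.lto.de/sitemap-type/job/page-1/sitemap.xml",
    "https://www.lto.de/sitemap-type/article/page-12/sitemap.xml",
]

kept = [url for url in candidates if not skip_sitemap(url)]
```

A callable like this would then be passed through the sitemap_filter argument of the Sitemap source; whether the project expects exactly this polarity should be checked against its filter utilities.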


@attribute
def publishing_date(self) -> Optional[datetime.datetime]:
    # Try to get date from meta tag or page content

Perhaps you can keep your solution as a fallback for cases where self.precomputed.meta.get("date") fails. I checked, and in most cases the meta value should be sufficient.
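
The meta-first, fallback-second order could look roughly like the sketch below. It is standalone and assumes the meta value is an ISO 8601 string; the function name and the fallback callable are hypothetical, and inside the parser the project's own date helper would likely replace datetime.fromisoformat.

```python
import datetime
from typing import Callable, Mapping, Optional


def parse_publishing_date(
    meta: Mapping[str, str],
    fallback: Callable[[], Optional[datetime.datetime]],
) -> Optional[datetime.datetime]:
    """Prefer the "date" meta value; use the fallback only if it is unusable."""
    raw = meta.get("date")
    if raw:
        try:
            # Assumption: the meta tag carries an ISO 8601 timestamp.
            return datetime.datetime.fromisoformat(raw)
        except ValueError:
            pass  # malformed value, fall through to the page-content lookup
    return fallback()


# Usage with a hypothetical meta mapping; the lambda stands in for the
# existing page-content extraction.
date = parse_publishing_date(
    {"date": "2025-10-23T09:30:00+02:00"},
    fallback=lambda: None,
)
```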



class LTOParser(ParserProxy):
    class V1(BaseParser):

It seems like the implementation of images is missing.


class LTOParser(ParserProxy):
    class V1(BaseParser):
        _paragraph_selector = CSSSelector("div.article-text-wrapper > p")

In this article, some bloat ("xp/LTO-Redaktion") is also selected erroneously.
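
One possible way to drop such a line after selection, sketched with the standard library only. It assumes the bloat is a standalone paragraph made of short lowercase initials followed by "/LTO-Redaktion"; the pattern and function name are hypothetical, and tightening the selector itself may be the cleaner fix.

```python
import re
from typing import List

# Assumption: the unwanted paragraph consists only of one or more short
# lowercase initials followed by "/LTO-Redaktion", e.g. "xp/LTO-Redaktion".
SIGN_OFF = re.compile(r"^[a-z]{2,5}(?:/[a-z]{2,5})*/LTO-Redaktion$")


def drop_sign_off(paragraphs: List[str]) -> List[str]:
    """Remove paragraphs that match the assumed sign-off pattern."""
    return [p for p in paragraphs if not SIGN_OFF.match(p.strip())]


# Hypothetical paragraph texts, for illustration only.
cleaned = drop_sign_off(
    ["Der BGH hat entschieden.", "xp/LTO-Redaktion", "Die LTO-Redaktion berichtet."]
)
```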

@attribute
def topics(self) -> List[str]:
    keywords = self.precomputed.meta.get("keywords")
    if keywords:

This part is not necessary; you can use the utility function generic_topic_parsing instead.
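
For orientation, the manual handling that such a helper makes redundant is roughly the following. This is a standalone stand-in with a hypothetical name (split_topics), not the implementation of generic_topic_parsing, whose exact behavior may differ.

```python
from typing import List, Optional


def split_topics(keywords: Optional[str], delimiter: str = ",") -> List[str]:
    """Split a keyword string into unique, trimmed topics, keeping order."""
    if not keywords:
        return []  # covers both a missing meta tag and an empty string
    seen = set()
    topics: List[str] = []
    for part in keywords.split(delimiter):
        topic = part.strip()
        if topic and topic not in seen:
            seen.add(topic)
            topics.append(topic)
    return topics


# Hypothetical keyword string, for illustration only.
example = split_topics("BGH, Arbeitsrecht, ,BGH")
```

Because the empty and missing cases are already handled inside the function, a separate `if keywords:` guard at the call site is not needed.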


@attribute
def topics(self) -> List[str]:
    keywords = self.precomputed.meta.get("keywords")

I am afraid the keywords here are primarily used for SEO and don't really reflect the content of the article. It would be better to take the topics from the "Mehr zum Thema" ("More on this topic") section at the bottom of the article.
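
A rough sketch of that idea with the standard library's HTML parser. The markup in SAMPLE is entirely hypothetical (a heading followed by a list of links) and would have to be adapted to the real page structure; in the parser itself a selector on the document tree would be the more natural tool.

```python
from html.parser import HTMLParser
from typing import List


class RelatedTopicsParser(HTMLParser):
    """Collect link texts that follow a "Mehr zum Thema" heading."""

    def __init__(self) -> None:
        super().__init__()
        self.in_section = False  # True once the heading text has been seen
        self.in_link = False  # True while inside an <a> of that section
        self.topics: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.in_section and tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "ul" and self.in_section:
            # Assumption: the topic links live in a single list.
            self.in_section = False

    def handle_data(self, data):
        text = data.strip()
        if text == "Mehr zum Thema":
            self.in_section = True
        elif self.in_link and text:
            self.topics.append(text)


# Hypothetical markup, for illustration only.
SAMPLE = """
<p>Artikeltext</p>
<h3>Mehr zum Thema</h3>
<ul><li><a href="/thema/a">Arbeitsrecht</a></li><li><a href="/thema/b">BGH</a></li></ul>
<a href="/impressum">Impressum</a>
"""

parser = RelatedTopicsParser()
parser.feed(SAMPLE)
topics = parser.topics
```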
