
[BUG] Relative linked categories aren't recognized #666

Open
@BRNMan

Description


Describe the bug
If a website links to its categories with relative links of the form <a href="/[cat name]">, the current code does not find them.
This line filters out the valid categories, because the domain is None for these links, but I'm not sure what a better alternative would be.
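To illustrate why a relative link carries no domain, here is a minimal sketch that uses only the standard library, not newspaper's own URL helpers. The path "/tech" and the helper name resolve_link are hypothetical, chosen for illustration. The standard parser reports an empty string for the host where the library code reports None.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.theverge.com"


def resolve_link(base_url, href):
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base_url, href)


relative_href = "/tech"  # hypothetical category link, for illustration only

# A relative link has no host component of its own.
print(repr(urlparse(relative_href).netloc))

# Once resolved against the source URL, the host is available again.
resolved = resolve_link(BASE_URL, relative_href)
print(resolved, urlparse(resolved).netloc)
```

Resolving each link against the source URL before the domain check is one possible direction, not a confirmed fix.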

theverge.com is used in one of our source tests, and that test no longer works because of this. The site offers few easy indicators of what a category could be, since most class names are obfuscated.

I suspect other websites are affected in the same way.

To Reproduce

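import newspaper

config = newspaper.Config()
# Bypass cached categories so that every run re-extracts them from the page
config.disable_category_cache = True
source = newspaper.Source("https://www.theverge.com", config=config)

# Download the homepage and extract categories and feeds
source.build()
print(source.category_urls())
print(source.feed_urls())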

Output:
['https://www.theverge.com/about-the-verge', 'https://www.theverge.com/', 'https://www.theverge.com/ethics-statement']
['https://www.theverge.com/rss/index.xml']

Expected behavior
Category detection should have another indicator of which categories are valid when they are only linked through relative (internal) links.
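One possible indicator, sketched below under stated assumptions, is to resolve every link against the source URL and keep same-host URLs whose path has exactly one segment. This is not how newspaper currently works. The function name candidate_category_urls and the sample hrefs are hypothetical. The heuristic would still admit pages such as /about-the-verge, so it would need further filtering.

```python
from urllib.parse import urljoin, urlparse


def candidate_category_urls(base_url, hrefs):
    """Return same-host, single-segment URLs as category candidates.

    Hypothetical heuristic: a category is assumed to live directly
    under the site root, e.g. "/tech".
    """
    base_host = urlparse(base_url).netloc
    candidates = []
    for href in hrefs:
        # Relative links inherit scheme and host from the source URL.
        parsed = urlparse(urljoin(base_url, href))
        if parsed.netloc != base_host:
            continue  # external site, or a non-HTTP link such as mailto:
        segments = [part for part in parsed.path.split("/") if part]
        # Keep exactly one path segment and skip file-like names.
        if len(segments) != 1 or "." in segments[0]:
            continue
        url = f"{parsed.scheme}://{parsed.netloc}/{segments[0]}"
        if url not in candidates:
            candidates.append(url)
    return candidates


sample_hrefs = ["/tech", "/tech/", "https://example.com/tech",
                "/2024/1/1/story", "/rss/index.xml", "/"]
print(candidate_category_urls("https://www.theverge.com", sample_hrefs))
```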

System information

  • OS: Linux
  • Python version 3.12
  • Library version 9.3.1


Labels: bug
