Description
Describe the bug
If a website links to its categories with relative links of the form <a href="/[cat name]"/>, the current code will not find them.
This line filters out the valid categories, because the domain is None for these links, but I'm not sure what a better alternative would be.
theverge.com is used in one of our source tests and no longer works because of this. The site offers few easy indicators of what a category is, since most class names are obfuscated.
I suspect the same is true of other websites.
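To illustrate why the domain ends up as None, here is a minimal sketch that uses only Python's standard library. It is not the library's own parsing code; the `/tech` href and the `describe` helper are assumptions made up for the example.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.theverge.com"


def describe(href):
    """Return (host of the raw href, host after resolving against BASE_URL)."""
    raw_host = urlparse(href).hostname  # None for a relative href
    resolved_host = urlparse(urljoin(BASE_URL, href)).hostname
    return raw_host, resolved_host


# "/tech" is a hypothetical category link, used only for illustration.
print(describe("/tech"))
print(describe("https://www.theverge.com/tech"))
```

A relative href carries no host of its own, so any check that relies on the link's domain has to resolve the link against the page URL first.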
To Reproduce
import newspaper

# Disable the category cache so categories are extracted fresh on each run.
config = newspaper.Config()
config.disable_category_cache = True
source = newspaper.Source("https://www.theverge.com", config=config)
source.build()
print(source.category_urls())
print(source.feed_urls())
Output:
['https://www.theverge.com/about-the-verge', 'https://www.theverge.com/', 'https://www.theverge.com/ethics-statement']
['https://www.theverge.com/rss/index.xml']
Expected behavior
We should have another way to identify valid categories when they are only linked through relative (internal) links.
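One possible indicator, sketched below under stated assumptions, is to resolve every href against the source URL and keep same-host links whose path has exactly one segment. The function name `candidate_category_urls` and the sample hrefs are hypothetical; this is not the library's current behaviour, and the heuristic would still accept pages such as /about-the-verge, so it would need further filtering.

```python
from urllib.parse import urljoin, urlparse


def candidate_category_urls(base_url, hrefs):
    """Hypothetical heuristic: same-host links with a single path segment."""
    base_host = urlparse(base_url).hostname
    found = set()
    for href in hrefs:
        # Resolve relative links such as "/tech" against the source URL.
        parsed = urlparse(urljoin(base_url, href))
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if parsed.hostname != base_host:
            continue  # skip links that point to other sites
        segments = [s for s in parsed.path.split("/") if s]
        if len(segments) == 1:
            found.add(f"{parsed.scheme}://{parsed.netloc}/{segments[0]}")
    return sorted(found)


# Sample hrefs are invented for illustration.
sample = ["/tech", "/tech/", "/2024/1/1/story", "https://example.com/news", "#top"]
print(candidate_category_urls("https://www.theverge.com", sample))
```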
System information
- OS: Linux
- Python version 3.12
- Library version 9.3.1