Description
Describe the bug
After installing newspaper4k via pip, importing the package fails with the error:
ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
To Reproduce
Steps to reproduce the behavior, please post any code you used and the website you tried to parse/process:
- pip install newspaper4k
- Import the package (from newspaper import Article) and see the following traceback:
[stderr] from newspaper import Article as NPArticle
[stderr] File "/usr/local/lib/python3.11/site-packages/newspaper/__init__.py", line 17, in <module>
[stderr] from .api import (
[stderr] File "/usr/local/lib/python3.11/site-packages/newspaper/api.py", line 8, in <module>
[stderr] from .article import Article
[stderr] File "/usr/local/lib/python3.11/site-packages/newspaper/article.py", line 21, in <module>
[stderr] from . import network
[stderr] File "/usr/local/lib/python3.11/site-packages/newspaper/network.py", line 15, in <module>
[stderr] from newspaper import parsers
[stderr] File "/usr/local/lib/python3.11/site-packages/newspaper/parsers.py", line 18, in <module>
[stderr] import lxml.html.clean
[stderr] File "/usr/local/lib/python3.11/site-packages/lxml/html/clean.py", line 18, in <module>
[stderr] raise ImportError(
[stderr] ImportError: lxml.html.clean module is now a separate project lxml_html_clean.
[stderr] Install lxml[html_clean] or lxml_html_clean directly.
Expected behavior
After pip install newspaper4k, importing newspaper should work without installing anything else.
System information
- OS: python3.11-slim in Docker
- Python version [3.11]
- newspaper4k [0.9.1]
- lxml [5.1.0]
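Since pip resolves lxml at install time, it can help to confirm which versions actually ended up in the container. A minimal sketch using only the standard library; the helper names installed_version and report_versions are made up for this example, and the distribution names are taken from this report and the error message:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report_versions(dist_names=("newspaper4k", "lxml", "lxml_html_clean")):
    """Map each distribution name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in dist_names}


if __name__ == "__main__":
    # Print one line per package, e.g. inside the Docker image that fails.
    for name, version in report_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this inside the failing image shows whether lxml_html_clean is missing and which lxml release pip picked.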
Workaround
If you are hitting this issue, for now add lxml[html_clean]==5.2.0 to your requirements.txt file.
Quickfix
To quickly fix the issue in this repo, for now we can edit this line in the pyproject.toml file and pin lxml to a version below 5.x:
https://github.com/AndyTheFactory/newspaper4k/blob/b5b20976bd320f89ffa25b8d4a7a94d190ee549a/pyproject.toml#L34C3-L34C15
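As a possible alternative to pinning, the import in newspaper/parsers.py could fall back between the two module locations named in the error message. This is only a sketch of the idea and not what the maintainers have decided on; import_first_available and load_html_clean_module are hypothetical helper names, and it assumes the standalone lxml_html_clean package offers the same cleaning API as the old lxml.html.clean module:

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that imports without error."""
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember why this candidate failed and try the next one.
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate module could be imported: " + "; ".join(errors))


def load_html_clean_module():
    """Prefer the standalone lxml_html_clean package, else the legacy location."""
    return import_first_available("lxml_html_clean", "lxml.html.clean")
```

With this in place, code that currently does import lxml.html.clean would call load_html_clean_module() instead and use the returned module. It still fails if neither location is importable, so the extra dependency would also need to be declared.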