failed to get elements #479

Open
fengsanyunyan opened this issue Sep 4, 2021 · 2 comments

fengsanyunyan commented Sep 4, 2021

I'm new to requests-html and installed it a few days ago.
When I followed the tutorial:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')
about = r.html.find('#about', first=True)
print(about.text)

The expected output, as described in the tutorial, is:

About
Applications
Quotes
Getting Started
Help
Python Brochure

But I actually got the following:

About
Applications
Quotes
Getting Started
Help
Python Brochure
Downloads
All releases
Source code
Windows
macOS
Other Platforms
License
Alternative Implementations
Documentation
.
.
.
Submit Website Bug
Status
Copyright ©2001-2021.  Python Software Foundation  Legal Statements  Privacy Policy  Powered by Heroku
window.jQuery || document.write('<script src="/static/js/libs/jquery-1.8.2.min.js"><\/script>') window.jQuery || document.write('<script src="/static/js/libs/jquery-ui-1.12.1.min.js"><\/script>')

which runs from the element <li id="about" ... to the end of the whole HTML document.

Does anyone know a solution to this issue?

@kennethreitz

@bilalkhann16

I tried the code snippet you provided and got the correct output.

About
Applications
Quotes
Getting Started
Help
Python Brochure

@AbstractDataType

My lxml version is 4.9.0.
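Since the behaviour seems to depend on the installed lxml version, it may help to report versions alongside a reproduction. A minimal sketch using only the standard library; the helper name installed_version and the list of packages checked are just illustrative choices, not something from requests-html:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Packages that plausibly matter for this issue (an assumption, adjust as needed).
for name in ("lxml", "requests-html", "pyquery"):
    print(name, installed_version(name))
```

Pasting that output into a comment makes it easier to tell whether two people are comparing the same setup.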

I ran into this problem too. I found that it happens because find() does not get the correct result.
At requests_html.py:210, find() looks up the matching elements and initializes an Element for each of them, so the problem occurs during the initialization of the target elements. So I checked the code of class Element and BaseParser.
At requests_html.py:107, when you access the html attribute and _html is None, it uses etree.tostring(self.element, encoding='unicode').strip(). self.element is the core content of the whole Element object; it is an lxml.html.HtmlElement object. That means it should return
'<li aria-haspopup="true" class="tier-1 element-1 " id="about">\n<a class="" href="/about/" title="">About</a>\n<ul aria-hidden="true" class="subnav menu" role="menu">\n<li class="tier-2 element-1" role="treeitem"><a href="/about/apps/" title="">Applications</a></li>\n<li class="tier-2 element-2" role="treeitem"><a href="/about/quotes/" title="">Quotes</a></li>\n<li class="tier-2 element-3" role="treeitem"><a href="/about/gettingstarted/" title="">Getting Started</a></li>\n<li class="tier-2 element-4" role="treeitem"><a href="/about/help/" title="">Help</a></li>\n<li class="tier-2 element-5" role="treeitem"><a href="http://brochure.getpython.info/" title="">Python Brochure</a></li>\n</ul>\n</li>'
but it actually returns the whole HTML of the web page.

The problem seems to come from a bug(?) in etree.tostring. If you change
etree.tostring(self.element, encoding='unicode').strip()
to
etree.tostring(self.element, encoding='unicode',method='html').strip()
and make the same change at line 97, the problem is solved.

Also, once html returns the correct markup, text is correct as well, because text uses pq, pq uses lxml, and lxml is ultimately built from html.
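A toy model of that dependency, to show why an over-long html string also produces over-long text. This is not the requests-html code: the standard library HTMLParser stands in for the pq/lxml chain, and both HTML fragments are hypothetical.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def text_from_html(html):
    """Derive text purely from an HTML string (stand-in for text -> pq -> lxml -> html)."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return "\n".join(collector.chunks)


# What html should return for the element, and an over-long serialization of it.
correct_html = '<li id="about"><a href="/about/">About</a></li>'
overlong_html = correct_html + '<li id="downloads"><a href="/downloads/">Downloads</a></li>'

print(text_from_html(correct_html))   # About
print(text_from_html(overlong_html))  # About, then Downloads on a second line
```

Because the text is computed from whatever string html returns, fixing the serialization is enough to fix text too.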

AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022
to fix a bug of wrong parse result when use find() with lxml==4.9.0(maybe lower). see issues psf#469 and psf#479.
AbstractDataType added a commit to AbstractDataType/requests-html that referenced this issue Jun 6, 2022
fix the bug caused by lxml==4.9.0(maybe lower?). see issue psf#469 and psf#479.