Support JMESPath now #181

Merged Apr 11, 2023 (126 commits)
Changes from 119 commits
Commits
63af8c5
Support jpath now
EchoShoot Jan 2, 2020
ddc8e73
Update test_selector_jpath.py
EchoShoot Jan 2, 2020
17f14d5
Update selector.py
EchoShoot Jan 2, 2020
4590544
Update test_selector_jpath.py
EchoShoot Jan 2, 2020
6151e3b
Update test_selector_jpath.py
EchoShoot Jan 2, 2020
546fcc9
Update test_selector_jpath.py
EchoShoot Jan 2, 2020
0a70390
Update selector.py
EchoShoot Jan 2, 2020
85c7b58
rename
EchoShoot Jan 3, 2020
3846ea7
rename
EchoShoot Jan 3, 2020
2741727
Merge branch 'master' into master
Gallaecio Jun 2, 2020
9235304
Update parsel/selector.py
Gallaecio Jun 2, 2020
03289a6
Restore import separation line
Gallaecio Jun 2, 2020
ecf37b8
Improve the API documentation of Selector.jmespath
Gallaecio Jun 2, 2020
c731c35
Remove pointless exception handling
Gallaecio Jun 2, 2020
24cf915
Do not ignore jmespath-found None values when possible
Gallaecio Jun 2, 2020
9508f0e
Improve jmespath documentation
Gallaecio Jun 2, 2020
fe44a3d
Revert unrelated setup.py changes
Gallaecio Jun 2, 2020
d84e957
Remove unorthodox attribution text
Gallaecio Jun 2, 2020
5a57a3c
Style changes to tests
Gallaecio Jun 2, 2020
8ad0330
datas → data
Gallaecio Jun 2, 2020
b01f5a1
Simplify jmespath implementation
Gallaecio Jun 2, 2020
ca3bc23
Fix or silence Pylint issues
Gallaecio Jun 2, 2020
15c9118
Fix test expectation
Gallaecio Jun 2, 2020
344d4df
Refactor jmespath support
Gallaecio Jun 3, 2020
fa7c32e
Simplify Selector.__init__
Gallaecio Jun 3, 2020
7fe4d5e
Return None in case of invalid JSON
Gallaecio Jun 17, 2020
3f9b779
Complete test coverage
Gallaecio Jun 17, 2020
f7cd122
Fix backward compatibility
Gallaecio Jun 17, 2020
58f5b77
Fix Python 2 support in new tests
Gallaecio Jun 17, 2020
a01ebfb
Fix the documentation build
Gallaecio Jun 17, 2020
4e3ce0f
Complete test coverage
Gallaecio Jun 17, 2020
aba3c40
Fix tests in Python 2, again
Gallaecio Jun 17, 2020
ed9e2b5
Do not set a minimum jmespath version
Gallaecio Jun 18, 2020
9cceac2
Merge branch 'master' into master
Gallaecio Mar 20, 2021
bb4f994
Merge remote-tracking branch 'upstream/master' into jmespath
Gallaecio Mar 14, 2022
d0c98b7
Apply Black
Gallaecio Mar 14, 2022
50e18e7
Remove six usage
Gallaecio Mar 14, 2022
ccf6b63
Apply Black
Gallaecio Mar 14, 2022
8a10f23
format → f-string
Gallaecio Mar 14, 2022
9e22492
Apply Black
Gallaecio Mar 14, 2022
2da08ea
Add tests for jmespath functions
felipeboffnunes Jun 14, 2022
5f03cea
Merge branch 'master' into master
felipeboffnunes Jun 14, 2022
fa25f75
Merge branch 'master' into master
felipeboffnunes Jun 14, 2022
06ffdc3
Test for jmespath functions.
felipeboffnunes Jun 14, 2022
f854cc0
Docs for JMESPath Selector
felipeboffnunes Jun 17, 2022
ca14d63
end of line
felipeboffnunes Jun 17, 2022
2465f4e
black
felipeboffnunes Jun 17, 2022
918089c
instantiate jmespath_selector on usage doc
felipeboffnunes Jun 17, 2022
f0d42d3
instantiate jmespath_selector on usage doc
felipeboffnunes Jun 17, 2022
72e7b3c
black, again
felipeboffnunes Jun 17, 2022
51042ca
remove jmes function test
felipeboffnunes Jun 17, 2022
f6a316c
bring jmespath closer to other selectors
felipeboffnunes Jun 17, 2022
5954e79
black
felipeboffnunes Jun 17, 2022
f6e59ce
list format for jmespath selector example
felipeboffnunes Jun 17, 2022
c1c98f0
small adjust on wording for jmespath introduction
felipeboffnunes Jun 17, 2022
b374384
Readme example
felipeboffnunes Jun 20, 2022
6b60a8b
Readme separated examples
felipeboffnunes Jun 20, 2022
98daa08
usage adjusts from feedback
felipeboffnunes Jun 20, 2022
84751de
missing ! char
felipeboffnunes Jun 20, 2022
ef5a6ea
adjustments
felipeboffnunes Jun 20, 2022
475bdf5
adjustments
felipeboffnunes Jun 20, 2022
33e8d38
adjustments
felipeboffnunes Jun 20, 2022
2e3b633
adjustments
felipeboffnunes Jun 20, 2022
654fb26
adjustments
felipeboffnunes Jun 20, 2022
6226397
adjustments
felipeboffnunes Jun 20, 2022
6780a06
Update docs/usage.rst
felipeboffnunes Jun 20, 2022
e1c6838
Update docs/usage.rst
felipeboffnunes Jun 20, 2022
aa803c7
Update docs/usage.rst
felipeboffnunes Jun 20, 2022
2efc71d
adjustments from revisions
felipeboffnunes Jun 20, 2022
3b74896
Merge remote-tracking branch 'felipeboffnunes/jmespath' into jmespath
felipeboffnunes Jun 20, 2022
b6f1e3a
change type on __str__ to `query` for all cases
felipeboffnunes Jun 25, 2022
5034df0
adjusting usage.rst tests to reflect changes
felipeboffnunes Jun 25, 2022
835458c
black selector.py
felipeboffnunes Jun 25, 2022
6b1729b
black selector.py
felipeboffnunes Jun 25, 2022
54065be
missing adjustments
felipeboffnunes Jun 25, 2022
f6f3bf9
removing logic for previous solution
felipeboffnunes Jun 25, 2022
497ecba
black selector.py
felipeboffnunes Jun 25, 2022
045ad12
test_selector update xpath to query
felipeboffnunes Jun 25, 2022
3c81e3f
black selector.py
felipeboffnunes Jun 25, 2022
b424483
go back to original usage.rst get elements examples
felipeboffnunes Jun 25, 2022
276cf9f
black selector.py???
felipeboffnunes Jun 25, 2022
8fe06ce
html_selector
felipeboffnunes Jun 25, 2022
ccc7abb
Readme feedback
felipeboffnunes Jun 25, 2022
a86ca2b
Cover JMESPath support in the documentation
felipeboffnunes Jun 27, 2022
be8dbb3
Merge remote-tracking branch 'upstream/master'
Gallaecio Jun 27, 2022
fa2085e
Remove deprecated-method from pylintsrc
felipeboffnunes Jul 1, 2022
a68fecf
Simpler approach to examples for extracting with selectors
felipeboffnunes Jul 1, 2022
05cec44
Removing type arg for jmespath
felipeboffnunes Jul 1, 2022
b53e71f
JSON encoding jmespath selector results + unit test
felipeboffnunes Jul 1, 2022
4349ce4
Merge branch 'echoshoot-master' into jmespath
felipeboffnunes Jul 1, 2022
14d17fd
merge
felipeboffnunes Jul 1, 2022
470a31b
Merge branches 'master' and 'master' of https://github.com/EchoShoot/…
felipeboffnunes Jul 1, 2022
5f98857
Merge branch 'echoshoot-master' into jmespath
felipeboffnunes Jul 1, 2022
305306f
adjust json encoding to reg with type json
felipeboffnunes Jul 1, 2022
01a4fb2
black
felipeboffnunes Jul 1, 2022
1c519d9
prepending r for re
felipeboffnunes Jul 1, 2022
3c00325
typo
felipeboffnunes Jul 1, 2022
cf8a693
black: shippuden
felipeboffnunes Jul 1, 2022
ecce282
raise ValueError if data from selector is not str when using re
felipeboffnunes Jul 20, 2022
58dedc2
tests for raiseValue and to_string
felipeboffnunes Jul 20, 2022
8895fd7
remove if treatment
felipeboffnunes Aug 4, 2022
f4d19de
Adjust test
felipeboffnunes Aug 4, 2022
599cc79
test for None on data
felipeboffnunes Aug 4, 2022
c75723b
black
felipeboffnunes Aug 4, 2022
39cdad3
Remove lazy loading... I guess?
felipeboffnunes Sep 12, 2022
69a8517
merge isinstance
felipeboffnunes Sep 12, 2022
d000b54
Merge remote-tracking branch 'scrapy/master'
Gallaecio Oct 27, 2022
7536aeb
README: add missing link
Gallaecio Oct 28, 2022
321a3d1
README: simplify the code example
Gallaecio Oct 28, 2022
bddbc6d
Usage: mention cssselect CSS support
Gallaecio Oct 28, 2022
d20b4ba
JMESPATH → JMESPath
Gallaecio Oct 28, 2022
e9d5ada
Update the Selector docstring
Gallaecio Oct 28, 2022
812b2bf
Selector.css: do not mutate state
Gallaecio Oct 28, 2022
c7f53d4
Address coverage warning about mixing --include and --source
Gallaecio Oct 28, 2022
746a0da
Warn about passing both text and root
Gallaecio Oct 28, 2022
ff8e946
Reorganize Selector.__init__
Gallaecio Oct 28, 2022
bd0ad80
Raise ValueError if root=etree and type in {'json', 'text'}
Gallaecio Oct 28, 2022
0fcff9a
Make the start of the jmespath() implementation more readable
Gallaecio Oct 28, 2022
10b78a8
Address issues reported by CI
Gallaecio Oct 28, 2022
dd0a4c9
Optimize _load_json_or_none usage
Gallaecio Oct 28, 2022
c69e0c6
Support text="null" as JSON input
Gallaecio Oct 28, 2022
64569a7
Fix scenario of invalid JSON
Gallaecio Oct 28, 2022
7232241
Solve issues reported by CI
Gallaecio Oct 28, 2022
3b3ec90
Merge remote-tracking branch 'scrapy/master' into jmespath
Gallaecio Apr 10, 2023
9f18a37
Merge remote-tracking branch 'scrapy/master' into jmespath
Gallaecio Apr 10, 2023
53d5146
Add a missing period
Gallaecio Apr 11, 2023
1 change: 0 additions & 1 deletion .coveragerc
@@ -1,6 +1,5 @@
[run]
branch = true
include = parsel/*

[report]
exclude_lines =
1 change: 1 addition & 0 deletions .gitignore
@@ -24,6 +24,7 @@ pip-log.txt

# Unit test / coverage reports
.coverage
/coverage.xml
.tox
nosetests.xml
htmlcov
42 changes: 29 additions & 13 deletions README.rst
@@ -19,9 +19,16 @@ Parsel
:alt: Coverage report


Parsel is a BSD-licensed Python_ library to extract and remove data from HTML_
and XML_ using XPath_ and CSS_ selectors, optionally combined with
`regular expressions`_.
Parsel is a BSD-licensed Python_ library to extract data from HTML_, JSON_, and
XML_ documents.

It supports:

- CSS_ and XPath_ expressions for HTML and XML documents

- JMESPath_ expressions for JSON documents

- `Regular expressions`_

Find the Parsel online documentation at https://parsel.readthedocs.org.

@@ -30,15 +37,18 @@ Example (`open online demo`_):
.. code-block:: python

>>> from parsel import Selector
>>> selector = Selector(text="""<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
</body>
</html>""")
>>> text = """
<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
<script type="application/json">{"a": ["b", "c"]}</script>
</body>
</html>"""
>>> selector = Selector(text=text)
>>> selector.css('h1::text').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').re(r'\w+')
@@ -47,12 +57,18 @@ Example (`open online demo`_):
... print(li.xpath('.//@href').get())
http://example.com
http://scrapy.org

>>> selector.css('script::text').jmespath("a").get()
'b'
>>> selector.css('script::text').jmespath("a").getall()
['b', 'c']

.. _CSS: https://en.wikipedia.org/wiki/Cascading_Style_Sheets
.. _HTML: https://en.wikipedia.org/wiki/HTML
.. _JMESPath: https://jmespath.org/
.. _JSON: https://en.wikipedia.org/wiki/JSON
.. _open online demo: https://colab.research.google.com/drive/149VFa6Px3wg7S3SEnUqk--TyBrKplxCN#forceEdit=true&sandboxMode=true
.. _Python: https://www.python.org/
.. _regular expressions: https://docs.python.org/library/re.html
.. _XML: https://en.wikipedia.org/wiki/XML
.. _XPath: https://en.wikipedia.org/wiki/XPath

2 changes: 2 additions & 0 deletions docs/conf.py
@@ -133,6 +133,8 @@

# nitpicky = True # https://github.com/scrapy/cssselect/pull/110
nitpick_ignore = [
("py:class", "ExpressionError"),
("py:class", "SelectorSyntaxError"),
("py:class", "cssselect.xpath.GenericTranslator"),
("py:class", "cssselect.xpath.HTMLTranslator"),
("py:class", "cssselect.xpath.XPathExpr"),
68 changes: 44 additions & 24 deletions docs/usage.rst
@@ -4,32 +4,38 @@
Usage
=====

Create a :class:`~parsel.selector.Selector` object for the HTML or XML text
that you want to parse::
Create a :class:`~parsel.selector.Selector` object for your input text.

For HTML or XML, use `CSS`_ or `XPath`_ expressions to select data::

>>> from parsel import Selector
>>> text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
>>> selector = Selector(text=text)
>>> html_text = "<html><body><h1>Hello, Parsel!</h1></body></html>"
>>> html_selector = Selector(text=html_text)
>>> html_selector.css('h1')
[<Selector query='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
>>> html_selector.xpath('//h1') # the same, but now with XPath
[<Selector query='//h1' data='<h1>Hello, Parsel!</h1>'>]

Then use `CSS`_ or `XPath`_ expressions to select elements::
For JSON, use `JMESPath`_ expressions to select data::

>>> selector.css('h1')
[<Selector xpath='descendant-or-self::h1' data='<h1>Hello, Parsel!</h1>'>]
>>> selector.xpath('//h1') # the same, but now with XPath
[<Selector xpath='//h1' data='<h1>Hello, Parsel!</h1>'>]
>>> json_text = '{"title":"Hello, Parsel!"}'
>>> json_selector = Selector(text=json_text)
>>> json_selector.jmespath('title')
[<Selector query='title' data='Hello, Parsel!'>]

And extract data from those elements::

>>> selector.css('h1::text').get()
>>> html_selector.xpath('//h1/text()').get()
'Hello, Parsel!'
>>> selector.xpath('//h1/text()').getall()
>>> json_selector.jmespath('title').getall()
['Hello, Parsel!']

.. _CSS: https://www.w3.org/TR/selectors
.. _XPath: https://www.w3.org/TR/xpath
.. _JMESPath: https://jmespath.org/

Learning CSS and XPath
======================
Learning expression languages
=============================

`CSS`_ is a language for applying styles to HTML documents. It defines
selectors to associate those styles with specific HTML elements. Resources to
@@ -39,20 +45,34 @@ learn CSS_ selectors include:

- `XPath/CSS Equivalents in Wikibooks`_

Parsel support for CSS selectors comes from cssselect, so read about `CSS
selectors supported by cssselect`_.

.. _CSS selectors supported by cssselect: https://cssselect.readthedocs.io/en/latest/#supported-selectors

`XPath`_ is a language for selecting nodes in XML documents, which can also be
used with HTML. Resources to learn XPath_ include:

- `XPath Tutorial in W3Schools`_

- `XPath cheatsheet`_

You can use either CSS_ or XPath_. CSS_ is usually more readable, but some
things can only be done with XPath_.
For HTML and XML input, you can use either CSS_ or XPath_. CSS_ is usually
more readable, but some things can only be done with XPath_.

JMESPath_ allows you to declaratively specify how to extract elements from
a JSON document. Resources to learn JMESPath_ include:

- `JMESPath Tutorial`_

- `JMESPath Specification`_

.. _CSS selectors in the MDN: https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors
.. _XPath cheatsheet: https://devhints.io/xpath
.. _XPath Tutorial in W3Schools: https://www.w3schools.com/xml/xpath_intro.asp
.. _XPath/CSS Equivalents in Wikibooks: https://en.wikibooks.org/wiki/XPath/CSS_Equivalents
.. _JMESPath Tutorial: https://jmespath.org/tutorial.html
.. _JMESPath Specification: https://jmespath.org/specification.html


Using selectors
@@ -95,12 +115,12 @@ So, by looking at the :ref:`HTML code <topics-selectors-htmlcode>` of that
page, let's construct an XPath for selecting the text inside the title tag::

>>> selector.xpath('//title/text()')
[<Selector xpath='//title/text()' data='Example website'>]
[<Selector query='//title/text()' data='Example website'>]

You can also ask the same thing using CSS instead::

>>> selector.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Example website'>]
[<Selector query='descendant-or-self::title/text()' data='Example website'>]

To actually extract the textual data, you must call the selector ``.get()``
or ``.getall()`` methods, as follows::
@@ -597,10 +617,10 @@ returns ``True`` for nodes that have all of the specified HTML classes::
... """)
...
>>> sel.xpath('//p[has-class("foo")]')
[<Selector xpath='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector xpath='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
[<Selector query='//p[has-class("foo")]' data='<p class="foo bar-baz">First</p>'>,
<Selector query='//p[has-class("foo")]' data='<p class="foo">Second</p>'>]
>>> sel.xpath('//p[has-class("foo", "bar-baz")]')
[<Selector xpath='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
[<Selector query='//p[has-class("foo", "bar-baz")]' data='<p class="foo bar-baz">First</p>'>]
>>> sel.xpath('//p[has-class("foo", "bar")]')
[]

@@ -1011,8 +1031,8 @@ directly by their names::

>>> sel.remove_namespaces()
>>> sel.xpath("//link")
[<Selector xpath='//link' data='<link rel="alternate" type="text/html...'>,
<Selector xpath='//link' data='<link rel="next" type="application/at...'>,
[<Selector query='//link' data='<link rel="alternate" type="text/html...'>,
<Selector query='//link' data='<link rel="next" type="application/at...'>,
...]

If you wonder why the namespace removal procedure isn't called always by default
@@ -1057,8 +1077,8 @@ And try to select the links again, now using an "atom:" prefix
for the "link" node test::

>>> sel.xpath("//atom:link", namespaces={"atom": "http://www.w3.org/2005/Atom"})
[<Selector xpath='//atom:link' data='<link xmlns="http://www.w3.org/2005/A...'>,
<Selector xpath='//atom:link' data='<link xmlns="http://www.w3.org/2005/A...'>,
[<Selector query='//atom:link' data='<link xmlns="http://www.w3.org/2005/A...'>,
<Selector query='//atom:link' data='<link xmlns="http://www.w3.org/2005/A...'>,
...]

You can pass several namespaces (here we're using shorter 1-letter prefixes)::