Commit 17ee419
Merge pull request #25 from scrapinghub/ae_poet_0.3.0

Upgrade to autoextract-poet 0.3.0

2 parents 82a552a + 4a2bd4f
File tree: 4 files changed, +61 −27 lines
CHANGES.rst

Lines changed: 8 additions & 0 deletions
@@ -1,6 +1,14 @@
 Changes
 =======
 
+0.7.0 (2021-08-05)
+------------------
+
+* Support for all Automatic Extraction API page types by upgrading to
+  ``autoextract-poet`` 0.3.0
+* Rename Scrapinghub references to Zyte
+* Update README
+
 0.6.1 (2021-06-02)
 ------------------
 
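Note: the headline entry above is the new page-type coverage. As a minimal sketch of what that enables in a spider callback (the spider is hypothetical; ``AutoExtractJobPostingPage`` is assumed to follow the ``autoextract_poet.pages`` naming used in the README diff below)::

    import scrapy
    from autoextract_poet.pages import AutoExtractJobPostingPage
    from scrapy_poet import DummyResponse

    class JobsSpider(scrapy.Spider):
        # Hypothetical spider, not part of this commit: annotating the
        # callback with one of the newly supported page types makes the
        # provider inject the AutoExtract result for it.
        name = "jobs"

        def parse(self, response: DummyResponse,
                  job_page: AutoExtractJobPostingPage):
            yield job_page.to_item()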
README.rst

Lines changed: 34 additions & 19 deletions
@@ -81,21 +81,20 @@ library.
 Within the spider, consuming the AutoExtract result is as easy as::
 
     import scrapy
-    from autoextract_poet import AutoExtractArticleData
+    from autoextract_poet.pages import AutoExtractArticlePage
 
     class SampleSpider(scrapy.Spider):
-
         name = "sample"
 
-        def parse(self, response, article: AutoExtractArticleData):
+        def parse(self, response, article_page: AutoExtractArticlePage):
             # We're making two requests here:
             # - one through Scrapy to build the response argument
-            # - another through providers to build the article argument
-            yield article.to_item()
+            # - the other through the providers to build the article_page argument
+            yield article_page.to_item()
 
 Note that on the example above, we're going to perform two requests:
 
-* one goes through Scrapy (it might use Crawlera, Splash or no proxy at all, depending on your configuration)
+* one goes through Scrapy (it might use Smart Proxy, Splash or no proxy at all, depending on your configuration)
 * another goes through AutoExtract API using `zyte-autoextract`_
 
 If you don't need the additional request going through Scrapy,
@@ -105,16 +104,31 @@ This will ignore the Scrapy request and only the AutoExtract API will be fetched
 For example::
 
     import scrapy
-    from autoextract_poet import AutoExtractArticleData
+    from autoextract_poet.pages import AutoExtractArticlePage
     from scrapy_poet import DummyResponse
 
     class SampleSpider(scrapy.Spider):
-
         name = "sample"
 
-        def parse(self, response: DummyResponse, article: AutoExtractArticleData):
+        def parse(self, response: DummyResponse, article_page: AutoExtractArticlePage):
             # We're making a single request here to build the article argument
-            yield article.to_item()
+            yield article_page.to_item()
+
+
+The examples above extract an article from the page, but you may want to
+extract a different type of item, like a product or a job posting. It is
+as easy as using the correct type annotation in the callback. This is
+how the callback looks if we need to extract a real estate listing
+from the page::
+
+    def parse(self,
+              response: DummyResponse,
+              real_estate_page: AutoExtractRealEstatePage):
+        yield real_estate_page.to_item()
+
+You can even use ``AutoExtractWebPage`` if what you need is the raw browser HTML to
+extract some additional data. Visit the full list of `supported page types`_
+to get a better idea of the supported pages.
 
 Configuration
 ^^^^^^^^^^^^^
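Note: the new README text mentions ``AutoExtractWebPage`` without showing code. A minimal sketch, under the assumption that ``AutoExtractWebPage`` behaves like a web-poet ``WebPage`` over the browser-rendered HTML (exposing ``css``/``xpath`` shortcuts); the spider itself is hypothetical::

    import scrapy
    from autoextract_poet.pages import AutoExtractWebPage
    from scrapy_poet import DummyResponse

    class RawHtmlSpider(scrapy.Spider):
        # Hypothetical example, not part of this commit.
        name = "raw_html"

        def parse(self, response: DummyResponse,
                  web_page: AutoExtractWebPage):
            # Assumption: the page object wraps the browser HTML returned
            # by the AutoExtract API and offers selector shortcuts.
            yield {"title": web_page.css("title::text").get()}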
@@ -164,27 +178,30 @@ You can capture those exceptions using an error callback (``errback``)::
 
     import scrapy
     from autoextract.aio.errors import RequestError
+    from autoextract_poet.pages import AutoExtractArticlePage
     from scrapy_autoextract.errors import QueryError
+    from scrapy_poet import DummyResponse
     from twisted.python.failure import Failure
 
     class SampleSpider(scrapy.Spider):
-
         name = "sample"
         urls = [...]
 
         def start_requests(self):
             for url in self.urls:
-                yield scrapy.Request(url, callback=self.parse_article, errback=self.errback_article)
+                yield scrapy.Request(url, callback=self.parse_article,
+                                     errback=self.errback_article)
 
-        def parse_article(self, response: DummyResponse, article: AutoExtractArticleData):
-            yield article.to_item()
+        def parse_article(self, response: DummyResponse,
+                          article_page: AutoExtractArticlePage):
+            yield article_page.to_item()
 
         def errback_article(self, failure: Failure):
             if failure.check(RequestError):
-                self.logger.error(f"RequestError on {failure.request.url})
+                self.logger.error(f"RequestError on {failure.request.url}")
 
             if failure.check(QueryError):
-                self.logger.error(f"QueryError: {failure.message})
+                self.logger.error(f"QueryError: {failure.value.message}")
 
 See `Scrapy documentation <https://docs.scrapy.org/en/latest/topics/request-response.html#using-errbacks-to-catch-exceptions-in-request-processing>`_
 for more details on how to capture exceptions using request's errback.
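Note: besides closing the two unterminated f-strings, the second fix switches to ``failure.value.message`` because a Twisted ``Failure`` stores the raised exception instance on ``failure.value``; the message lives on the exception (here the ``QueryError``), not on the ``Failure`` itself. A self-contained illustration::

    from twisted.python.failure import Failure

    try:
        raise ValueError("boom")
    except ValueError:
        failure = Failure()  # captures the active exception

    assert isinstance(failure.value, ValueError)  # .value is the exception
    assert failure.value.args[0] == "boom"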
@@ -254,9 +271,6 @@ When using the AutoExtract middleware, there are some limitations.
 When using the AutoExtract providers, be aware that:
 
 * With scrapy-poet integration, retry requests don't go through Scrapy
-* Not all data types are supported with scrapy-poet,
-  currently only Articles, Products and Product Lists are supported with
-  `autoextract-poet`_
 
 .. _`web-poet`: https://github.com/scrapinghub/web-poet
 .. _`scrapy-poet`: https://github.com/scrapinghub/scrapy-poet
@@ -267,3 +281,4 @@ When using the AutoExtract providers, be aware that:
 .. _`Scrapy's asyncio documentation`: https://docs.scrapy.org/en/latest/topics/asyncio.html
 .. _`Request-level error`: https://doc.scrapinghub.com/autoextract.html#request-level
 .. _`Query-level error`: https://doc.scrapinghub.com/autoextract.html#query-level
+.. _`supported page types`: https://autoextract-poet.readthedocs.io/en/stable/_autosummary/autoextract_poet.pages.html#module-autoextract_poet.pages

setup.py

Lines changed: 6 additions & 6 deletions
@@ -18,16 +18,16 @@ def get_version():
 setup(
     name=NAME,
     version=get_version(),
-    author='Scrapinghub Inc',
-    author_email='info@scrapinghub.com',
-    maintainer='Scrapinghub Inc',
-    maintainer_email='info@scrapinghub.com',
-    description='Scrapinghub AutoExtract API integration for Scrapy',
+    author='Zyte Group Ltd',
+    author_email='info@zyte.com',
+    maintainer='Zyte Group Ltd',
+    maintainer_email='info@zyte.com',
+    description='Zyte Automatic Extraction API integration for Scrapy',
     long_description=open('README.rst').read(),
     url='https://github.com/scrapinghub/scrapy-autoextract',
     packages=find_packages(),
     install_requires=[
-        'autoextract-poet>=0.2.1',
+        'autoextract-poet>=0.3.0',
         'zyte-autoextract>=0.7.0',
         'scrapy-poet>=0.2.0',
         'aiohttp',
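Note: with the floor raised to ``autoextract-poet>=0.3.0``, a plain upgrade of ``scrapy-autoextract`` pulls in the new page types. A quick sanity check after installing, using only the standard library::

    # Verify that the resolved versions satisfy the new requirements.
    from importlib.metadata import version  # Python 3.8+

    print(version("autoextract-poet"))   # expect 0.3.0 or newer
    print(version("zyte-autoextract"))   # expect 0.7.0 or newer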

tests/test_providers.py

Lines changed: 13 additions & 2 deletions
@@ -15,7 +15,11 @@
 from autoextract_poet import (
     AutoExtractArticleData, AutoExtractProductData, AutoExtractHtml)
 from tests.utils import assert_stats, request_error, async_test
-from autoextract_poet.page_inputs import AutoExtractData
+from autoextract_poet.page_inputs import AutoExtractData, \
+    AutoExtractArticleListData, AutoExtractCommentsData, \
+    AutoExtractForumPostsData, AutoExtractJobPostingData, \
+    AutoExtractProductListData, AutoExtractRealEstateData, \
+    AutoExtractReviewsData, AutoExtractVehicleData
 from scrapy import Spider
 from scrapy.crawler import Crawler
 from scrapy_autoextract.providers import (
@@ -26,10 +30,17 @@
 
 DATA_INPUTS = (
     AutoExtractArticleData,
+    AutoExtractArticleListData,
+    AutoExtractCommentsData,
+    AutoExtractForumPostsData,
+    AutoExtractJobPostingData,
     AutoExtractProductData,
+    AutoExtractProductListData,
+    AutoExtractRealEstateData,
+    AutoExtractReviewsData,
+    AutoExtractVehicleData,
 )
 
-
 def test_stop_spider_on_account_disabled(mocker: MockerFixture):
     class Engine:
         close_spider = mocker.Mock()
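Note: the diff only grows the ``DATA_INPUTS`` tuple. A hypothetical sketch, not part of the suite, of how such a tuple is typically consumed, which is why extending it automatically extends coverage to the new page types::

    import pytest
    from autoextract_poet.page_inputs import (
        AutoExtractArticleData, AutoExtractData, AutoExtractVehicleData)

    DATA_INPUTS = (AutoExtractArticleData, AutoExtractVehicleData)  # abridged

    @pytest.mark.parametrize("data_type", DATA_INPUTS)
    def test_is_autoextract_data(data_type):
        # Every listed input should share the AutoExtractData base class,
        # as the common import in the real test module suggests.
        assert issubclass(data_type, AutoExtractData)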

0 commit comments