@@ -81,21 +81,20 @@ library.
Within the spider, consuming the AutoExtract result is as easy as::

    import scrapy
-    from autoextract_poet import AutoExtractArticleData
+    from autoextract_poet.pages import AutoExtractArticlePage

    class SampleSpider(scrapy.Spider):
-
        name = "sample"

-        def parse(self, response, article: AutoExtractArticleData):
+        def parse(self, response, article_page: AutoExtractArticlePage):
            # We're making two requests here:
            # - one through Scrapy to build the response argument
-            # - another through providers to build the article argument
-            yield article.to_item()
+            # - the other through the providers to build the article_page argument
+            yield article_page.to_item()

Note that in the example above, we're going to perform two requests:

-* one goes through Scrapy (it might use Crawlera, Splash or no proxy at all, depending on your configuration)
+* one goes through Scrapy (it might use Smart Proxy, Splash or no proxy at all, depending on your configuration)
* another goes through AutoExtract API using `zyte-autoextract`_

If you don't need the additional request going through Scrapy,
@@ -105,16 +104,31 @@ This will ignore the Scrapy request and only the AutoExtract API will be fetched
For example::

    import scrapy
-    from autoextract_poet import AutoExtractArticleData
+    from autoextract_poet.pages import AutoExtractArticlePage
    from scrapy_poet import DummyResponse

    class SampleSpider(scrapy.Spider):
-
        name = "sample"

-        def parse(self, response: DummyResponse, article: AutoExtractArticleData):
+        def parse(self, response: DummyResponse, article_page: AutoExtractArticlePage):
            # We're making a single request here to build the article argument
-            yield article.to_item()
+            yield article_page.to_item()
+
+
+The examples above extract an article from the page, but you may want to
+extract a different type of item, like a product or a job posting. It is
+as easy as using the correct type annotation in the callback. This is
+what the callback looks like if we need to extract real estate data
+from the page::
+
+    def parse(self,
+              response: DummyResponse,
+              real_estate_page: AutoExtractRealEstatePage):
+        yield real_estate_page.to_item()
+
+You can even use ``AutoExtractWebPage`` if what you need is the raw browser HTML
+to extract some additional data. Visit the full list of `supported page types`_
+to get a better idea of what is available.
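For instance, a callback working on the raw browser HTML might look roughly like
the sketch below; this assumes ``AutoExtractWebPage`` exposes the usual web-poet
response shortcuts such as ``css``::

    def parse(self, response: DummyResponse, web_page: AutoExtractWebPage):
        # Assumption: web_page wraps the browser HTML returned by AutoExtract
        # and offers the standard CSS/XPath selector shortcuts.
        yield {"title": web_page.css("title::text").get()}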

Configuration
^^^^^^^^^^^^^
@@ -164,27 +178,30 @@ You can capture those exceptions using an error callback (``errback``)::

    import scrapy
    from autoextract.aio.errors import RequestError
+    from autoextract_poet.pages import AutoExtractArticlePage
    from scrapy_autoextract.errors import QueryError
+    from scrapy_poet import DummyResponse
    from twisted.python.failure import Failure

    class SampleSpider(scrapy.Spider):
-
        name = "sample"
        urls = [...]

        def start_requests(self):
            for url in self.urls:
-                yield scrapy.Request(url, callback=self.parse_article, errback=self.errback_article)
+                yield scrapy.Request(url, callback=self.parse_article,
+                                     errback=self.errback_article)

-        def parse_article(self, response: DummyResponse, article: AutoExtractArticleData):
-            yield article.to_item()
+        def parse_article(self, response: DummyResponse,
+                          article_page: AutoExtractArticlePage):
+            yield article_page.to_item()

        def errback_article(self, failure: Failure):
            if failure.check(RequestError):
-                self.logger.error(f"RequestError on {failure.request.url})
+                self.logger.error(f"RequestError on {failure.request.url}")

            if failure.check(QueryError):
-                self.logger.error(f"QueryError: {failure.message})
+                self.logger.error(f"QueryError: {failure.value.message}")

See `Scrapy documentation <https://docs.scrapy.org/en/latest/topics/request-response.html#using-errbacks-to-catch-exceptions-in-request-processing>`_
for more details on how to capture exceptions using a request's errback.
@@ -254,9 +271,6 @@ When using the AutoExtract middleware, there are some limitations.
When using the AutoExtract providers, be aware that:

* With scrapy-poet integration, retry requests don't go through Scrapy
-* Not all data types are supported with scrapy-poet,
-  currently only Articles, Products and Product Lists are supported with
-  `autoextract-poet`_

.. _`web-poet`: https://github.com/scrapinghub/web-poet
.. _`scrapy-poet`: https://github.com/scrapinghub/scrapy-poet
@@ -267,3 +281,4 @@ When using the AutoExtract providers, be aware that:
.. _`Scrapy's asyncio documentation`: https://docs.scrapy.org/en/latest/topics/asyncio.html
.. _`Request-level error`: https://doc.scrapinghub.com/autoextract.html#request-level
.. _`Query-level error`: https://doc.scrapinghub.com/autoextract.html#query-level
+.. _`supported page types`: https://autoextract-poet.readthedocs.io/en/stable/_autosummary/autoextract_poet.pages.html#module-autoextract_poet.pages