Initial implementation #1
To do:
Docs:
Read the entire scrapy-crawl-maps docs and complete them.
Write docs about writing custom nodes.
Document that inputs may be sent to multiple nodes, and so mutable
inputs should not be modified in place. Recommend using copy() or
deepcopy() as needed (see the copy sketch after this list).
Document how it is possible to download from a non-fetch node using
crawler.engine.download() (see the download sketch after this list).
Recommend using a fetch node instead, to let spider middlewares work as
usual, but recommend engine.download() for complex cases where it is
needed, e.g. combining multiple requests in a node.
Test that combining multiple requests in a node works as expected this
way.
In the crawl map definition docs, document how to find supported node
types: link to the reference docs of the built-in node types, explain
how to read the reference (node type ID, ports, args), and give an
example. Also mention the ability to write your own node types. And of
course, explain what a node type is, as opposed to a node.
Make sure that all APIs from the global __init__ are covered in the
reference docs.
Document how to read stats.
Document the I/O logic: how outputs are duplicated when ports point
to multiple nodes, and how spiderless nodes are executed only once
or once per request depending on where on the map they are.
Update the docs to include actual diagrams along with the JSON examples,
as a visual aid.
Use https://www.sphinx-doc.org/en/master/usage/extensions/graphviz.html
See if it is possible to automate the Graphviz code generation from the
JSON, to make things easier going forward (see the DOT sketch after this
list).
Consider rendering the diagram in a separate tab. Figure out if we can
build some custom Sphinx directive that handles all of this automatically.
Move away from having CSS selectors on crawl maps. Remove the node
types and instead promote writing custom nodes.
Document supported port types: request, response, item, string.
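
A minimal copy sketch for the mutable-inputs item above; the node class and its process() method are made up for illustration and are not the actual scrapy-crawl-maps node API:

```python
from copy import deepcopy


class AddFlagNode:
    """Hypothetical custom node; the class name and process() signature are
    illustrative only, not the real scrapy-crawl-maps node API."""

    def process(self, item: dict) -> dict:
        # The same input object may also be routed to other nodes, so modify
        # a deep copy instead of mutating the received dict in place.
        item = deepcopy(item)
        item["processed"] = True
        return item
```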
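A download sketch for the crawler.engine.download() item above, for the case where a node combines multiple requests. How the node gets hold of the crawler is an assumption made here, and maybe_deferred_to_future() requires Scrapy 2.6+:

```python
from scrapy import Request
from scrapy.utils.defer import maybe_deferred_to_future


class CombinePagesNode:
    """Hypothetical node that downloads two related pages itself instead of
    relying on a fetch node; the node API shown here is made up."""

    def __init__(self, crawler):
        self.crawler = crawler

    async def process(self, url: str) -> dict:
        # engine.download() bypasses spider middlewares, which is why a fetch
        # node is usually preferable; it is handy when the responses of
        # several requests must be merged into a single output.
        first = await maybe_deferred_to_future(
            self.crawler.engine.download(Request(url))
        )
        second = await maybe_deferred_to_future(
            self.crawler.engine.download(Request(url + "?page=2"))
        )
        return {
            "url": url,
            "title": first.css("title::text").get(),
            "second_page_size": len(second.body),
        }
```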
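A DOT sketch for automating the Graphviz code generation from the JSON, assuming purely for illustration that a crawl map has a "nodes" mapping of node IDs to node types and a "links" list of "node.port" pairs; the real schema may differ:

```python
import json


def crawl_map_to_dot(spec_json: str) -> str:
    """Turn a crawl map JSON string into Graphviz DOT suitable for the Sphinx
    graphviz directive. The "nodes"/"links" shape is an assumption made for
    illustration, not necessarily the real crawl map schema."""
    spec = json.loads(spec_json)
    lines = ["digraph crawl_map {", "    rankdir=LR;"]
    for node_id, node_type in spec["nodes"].items():
        lines.append(f'    "{node_id}" [shape=box, label="{node_id}\\n({node_type})"];')
    for source, target in spec["links"]:
        # Endpoints are assumed to be "node.port" strings; only the node part
        # matters for the diagram.
        lines.append(f'    "{source.split(".")[0]}" -> "{target.split(".")[0]}";')
    lines.append("}")
    return "\n".join(lines)
```

The generated DOT could then be fed to the graphviz directive, possibly through a small custom directive that reads the JSON example directly.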
Changes related to the schema and crawl map builders:
Design a JSON-based language to support exposing to the UI all
supported nodes, links and settings through spider metadata, and
implement it in all spiders minimizing duplication (i.e. try to reuse
code, e.g. from I/O specs used at the code level).
Support assigning groups to global options. The initial plan is to have
2 groups, Configuration and Job options, the latter of which will
include additional options specific to Scrapy Cloud (e.g. SC units).
Groups render as tabs in the UI next to the main tab containing the
visual crawl map (Builder).
Also allow marking an option to prompt for overriding it when running a
job.
Allow nodes to define extra metadata, e.g.
A documentation URL
An example output item
Support short syntax for port links when there are multiple ports but
only 1 has a matching type.
Support hard-coding an initial crawl map for CrawlMapSpider,
implementing a spider based on the Scrapy tutorial.
In tests, make sure that those initial crawl maps are valid crawl maps.
Work on an API to make existing spiders expose a crawl map and log stats
based on their callbacks, e.g. with some decorators (a rough sketch
follows this list).
Have CrawlMap validate as much as possible as part of Pydantic
validation, before instantiation (see the Pydantic sketch after this
list).
Support inputless fetch nodes.
Enforce and test new restrictions:
Disallow subgraphs involving multiple response input ports, on same
or different nodes. That includes:
A node with 2 input ports of response type.
A node with 1 input port of response type that directly or
indirectly, without a fetch node (request → response) in the
middle, outputs a request or item to another node that also has
a response as input that it gets from a different fetch.
and another one of item type.
However, allow:
Response processing nodes (response → response).
A node with 1 input port of response type that directly or
indirectly, without a fetch node (request → response) in the
middle, outputs a request or item to another node that also has
a response as input that it gets from the same fetch node.
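
One possible shape for the decorator-based API mentioned above; the decorator name, the stat key, and the attribute it sets are all invented for illustration:

```python
import functools


def crawl_map_node(node_type: str):
    """Hypothetical decorator: tags a spider callback as a crawl map node
    and logs a per-node stat. Names and behavior are illustrative only."""

    def decorator(callback):
        @functools.wraps(callback)
        def wrapper(self, response, **kwargs):
            # self is the spider, so the regular Scrapy stats API is available.
            self.crawler.stats.inc_value(f"crawl_map/{node_type}/calls")
            # Assumes a generator callback, the common case in Scrapy spiders.
            yield from callback(self, response, **kwargs)

        # Metadata that the crawl-map-exposing API could later collect.
        wrapper.crawl_map_node_type = node_type
        return wrapper

    return decorator
```

An existing callback would then only gain a line like @crawl_map_node("parser") above its definition.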
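A Pydantic sketch of moving structural checks into validation (Pydantic v2's model_validator), using a toy model in place of the real CrawlMap:

```python
from pydantic import BaseModel, model_validator


class CrawlMapSpec(BaseModel):
    """Toy stand-in for the real model, to show where structural checks can
    run as part of validation, before any node is instantiated."""

    nodes: dict[str, str]  # node ID -> node type ID (simplified)
    links: list[tuple[str, str]]  # (source node ID, target node ID)

    @model_validator(mode="after")
    def check_links(self):
        for source, target in self.links:
            for node_id in (source, target):
                if node_id not in self.nodes:
                    raise ValueError(f"Link references unknown node {node_id!r}")
        return self
```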
Add more node types:
A node type with an output port for matches and an output port for
non-matches. Include a test for it.
Improve test coverage:
Add a test where the same output port is connected to 2 input ports,
causing the crawl map to send the same data to both input ports.
Add a test that checks which requests are sent from start()
and which are sent from a callback.
urls1 → fetch1 → parser1 → fetch1 → parser3
urls2 → fetch2 → parser2 → fetch3 → parser3
URLs from urls1 and urls2 should be start requests, while URLs from
parser1 and parser2 should be sent from a callback.
Add a test for a next page link node that allows limiting the
pagination.
Test all combinations (that make sense) of port types in single-input,
single-output nodes.
| Consider (skip the ones that don't make sense):
| request → request
| request → response
| request → item
| request → string
| response → request
| response → response
| response → item
| response → string
| item → request
| item → response
| item → item
| item → string
| string → request
| string → response
| string → item
| string → string
Test all combinations of multiple inputs and single output or multiple
outputs from a single input that make sense.
Test a scenario where strings and URLs are mixed in a node to build
requests, as a proof of concept for such nodes, testing how the
implementation would have to be done so that requests are sent as soon
as possible while ensuring that all requests are eventually sent. It
probably requires some special handling of the input queue of strings,
since it should only be read once but its output should be reused for
every input request. Maybe we can implement a utility function for
that (see the sketch after this list).
Test scenarios where nodes fail to define the required methods, define
them incorrectly, or raise exceptions.
Test that a crawl map spec defining more than 1 node type with the same
type ID fails (e.g. 2 “fetch” node types, corresponding to different
node classes that both use the same type string).
Test persistent request metadata across 2 fetch nodes.
urls → fetch1 → node1 → fetch2 → node2 → fetch1
If node1 sets metadata, it should be available in requests from fetch2.
Test that a spider node that outputs requests can link to itself.
Review warnings triggered by tox -e py.
Test passing a crawl map as a file path or as a Python dict
instead of a string. Also test passing an invalid file path, a path
to a nonexistent file, a path to a folder, and other error
scenarios. Include testing for leading spaces when passing a
string.
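
A sketch of the utility function suggested in the mixed strings-and-URLs item above: the underlying string input is read only once, but every incoming request can replay all of its items, and items are yielded as soon as they arrive. It is independent of the actual node API and needs Python 3.10+ for aiter()/anext():

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator


class ReplayableInput:
    """Read a shared async input once, while letting each consumer replay
    every item seen so far and the ones still to come."""

    def __init__(self, source: AsyncIterable):
        self._iterator = aiter(source)
        self._cache: list = []
        self._exhausted = False
        self._lock = asyncio.Lock()

    async def replay(self) -> AsyncIterator:
        index = 0
        while True:
            if index < len(self._cache):
                # Items already read from the source are served immediately.
                yield self._cache[index]
                index += 1
                continue
            if self._exhausted:
                return
            async with self._lock:
                # Another replay may have advanced the shared reader meanwhile.
                if index < len(self._cache) or self._exhausted:
                    continue
                try:
                    self._cache.append(await anext(self._iterator))
                except StopAsyncIteration:
                    self._exhausted = True
```

A node could then call replay() once per incoming request, building and sending a request for each string as soon as it becomes available.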
Code cleanup:
Cleanly discard outputs not connected to anything, do not keep them in
memory.
Try to use _resolve_deps in _iter_output to reduce its complexity.
Cache _resolve_deps.
Refactor the implementation, especially _iter_output, to make it more
readable.
Switch to Ruff, e.g. to get __all__ sorted automatically.
Solve the ignored PLR0912 and PLR0915 issues.
Self review:
Rendered docs.
PR.