Initial implementation #1
To do:
Docs:
Read the entire scrapy-crawl-maps docs and complete them.
Write docs about writing custom nodes.
Document that inputs may be sent to multiple nodes, and so mutable
inputs should not be modified in place. Recommend using copy() or
deepcopy() as needed (see the copy sketch after this list).
Document how it is possible to download from a non-fetch node using
crawler.engine.download() (see the download sketch after this list).
Recommend using a fetch node instead, to let spider middlewares work as
usual, but recommend engine.download() for complex cases where it is
needed, e.g. combining multiple requests in a node.
Test that combining multiple requests in a node works as expected this
way.
In the crawl map definition docs, document how to find supported node
types: link to the reference docs of the built-in node types, explain
how to read the reference (node type ID, ports, args), and give an
example. Also mention the ability to write your own node types. And of
course, explain what a node type is, as opposed to a node.
Make sure that all APIs from the global __init__ are covered in the
reference docs.
Document how to read stats.
Document the I/O logic: how outputs are duplicated when ports point
to multiple nodes, and how spiderless nodes are executed only once
or once per request depending on where on the map they are.
Update the docs to include actual diagrams along with the JSON examples,
as a visual aid.
Use https://www.sphinx-doc.org/en/master/usage/extensions/graphviz.html
See if it is possible to automate the Graphviz code generation from the
JSON, to make things easier going forward (see the DOT sketch after this
list).
Consider rendering the diagram in a separate tab. Figure out if we can
build some custom Sphinx directive that handles all of this automatically.
Move away from having CSS selectors on crawl maps. Remove the node
types and instead promote writing custom nodes.
Document supported port types: request, response, item, string.
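
A minimal copy sketch for the mutable-inputs item above; the node class and its process() method are made up for illustration and are not the actual scrapy-crawl-maps node API:

```python
from copy import deepcopy


class AddFlagNode:
    """Hypothetical custom node; the class name and process() signature are
    illustrative only, not the real scrapy-crawl-maps node API."""

    def process(self, item: dict) -> dict:
        # The same input object may also be routed to other nodes, so modify
        # a deep copy instead of mutating the received dict in place.
        item = deepcopy(item)
        item["processed"] = True
        return item
```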
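A download sketch for the crawler.engine.download() item above, for the case where a node combines multiple requests. How the node gets hold of the crawler is an assumption made here, and maybe_deferred_to_future() requires Scrapy 2.6+:

```python
from scrapy import Request
from scrapy.utils.defer import maybe_deferred_to_future


class CombinePagesNode:
    """Hypothetical node that downloads two related pages itself instead of
    relying on a fetch node; the node API shown here is made up."""

    def __init__(self, crawler):
        self.crawler = crawler

    async def process(self, url: str) -> dict:
        # engine.download() bypasses spider middlewares, which is why a fetch
        # node is usually preferable; it is handy when the responses of
        # several requests must be merged into a single output.
        first = await maybe_deferred_to_future(
            self.crawler.engine.download(Request(url))
        )
        second = await maybe_deferred_to_future(
            self.crawler.engine.download(Request(url + "?page=2"))
        )
        return {
            "url": url,
            "title": first.css("title::text").get(),
            "second_page_size": len(second.body),
        }
```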
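A DOT sketch for automating the Graphviz code generation from the JSON, assuming purely for illustration that a crawl map has a "nodes" mapping of node IDs to node types and a "links" list of "node.port" pairs; the real schema may differ:

```python
import json


def crawl_map_to_dot(spec_json: str) -> str:
    """Turn a crawl map JSON string into Graphviz DOT suitable for the Sphinx
    graphviz directive. The "nodes"/"links" shape is an assumption made for
    illustration, not necessarily the real crawl map schema."""
    spec = json.loads(spec_json)
    lines = ["digraph crawl_map {", "    rankdir=LR;"]
    for node_id, node_type in spec["nodes"].items():
        lines.append(f'    "{node_id}" [shape=box, label="{node_id}\\n({node_type})"];')
    for source, target in spec["links"]:
        # Endpoints are assumed to be "node.port" strings; only the node part
        # matters for the diagram.
        lines.append(f'    "{source.split(".")[0]}" -> "{target.split(".")[0]}";')
    lines.append("}")
    return "\n".join(lines)
```

The generated DOT could then be fed to the graphviz directive, possibly through a small custom directive that reads the JSON example directly.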
Changes related to the schema and crawl map builders:
Design a JSON-based language to support exposing to the UI all
supported nodes, links and settings through spider metadata, and
implement it in all spiders minimizing duplication (i.e. try to reuse
code, e.g. from I/O specs used at the code level).
Support assigning groups to global options. The initial plan is to have
2 groups, Configuration and Job options, the latter of which will
include additional options specific to Scrapy Cloud (e.g. SC units).
Groups render as tabs in the UI next to the main tab containing the
visual crawl map (Builder).
Also allow marking an option to prompt for overriding it when running a
job.
Allow nodes to define extra metadata, e.g.
A documentation URL
An example output item
Support short syntax for port links when there are multiple ports but
only 1 has a matching type.
Support hard-coding an initial crawl map for CrawlMapSpider,
implementing a spider based on the Scrapy tutorial.
In tests, make sure that those initial crawl maps are valid crawl maps.
Work on an API to make existing spiders expose a crawl map and log stats
based on their callbacks, e.g. with some decorators (a rough sketch
follows this list).
Have CrawlMap validate as much as possible as part of Pydantic
validation, before instantiation (see the Pydantic sketch after this
list).
Support inputless fetch nodes.
Enforce and test new restrictions:
Disallow subgraphs involving multiple response input ports, on same
or different nodes. That includes:
A node with 2 input ports of response type.
A node with 1 input port of response type that directly or
indirectly, without a fetch node (request → response) in the
middle, outputs a request or item to another node that also has
a response as input that it gets from a different fetch.
and another one of item type.
However, allow:
Response processing nodes (response → response).
A node with 1 input port of response type that directly or
indirectly, without a fetch node (request → response) in the
middle, outputs a request or item to another node that also has
a response as input that it gets from the same fetch node.
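
One possible shape for the decorator-based API mentioned above; the decorator name, the stat key, and the attribute it sets are all invented for illustration:

```python
import functools


def crawl_map_node(node_type: str):
    """Hypothetical decorator: tags a spider callback as a crawl map node
    and logs a per-node stat. Names and behavior are illustrative only."""

    def decorator(callback):
        @functools.wraps(callback)
        def wrapper(self, response, **kwargs):
            # self is the spider, so the regular Scrapy stats API is available.
            self.crawler.stats.inc_value(f"crawl_map/{node_type}/calls")
            # Assumes a generator callback, the common case in Scrapy spiders.
            yield from callback(self, response, **kwargs)

        # Metadata that the crawl-map-exposing API could later collect.
        wrapper.crawl_map_node_type = node_type
        return wrapper

    return decorator
```

An existing callback would then only gain a line like @crawl_map_node("parser") above its definition.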
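A Pydantic sketch of moving structural checks into validation (Pydantic v2's model_validator), using a toy model in place of the real CrawlMap:

```python
from pydantic import BaseModel, model_validator


class CrawlMapSpec(BaseModel):
    """Toy stand-in for the real model, to show where structural checks can
    run as part of validation, before any node is instantiated."""

    nodes: dict[str, str]  # node ID -> node type ID (simplified)
    links: list[tuple[str, str]]  # (source node ID, target node ID)

    @model_validator(mode="after")
    def check_links(self):
        for source, target in self.links:
            for node_id in (source, target):
                if node_id not in self.nodes:
                    raise ValueError(f"Link references unknown node {node_id!r}")
        return self
```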
Add more node types:
A node type with an output port for matches and an output port for
non-matches. Include a test for it.
Improve test coverage:
Add a test where the same output port is connected to 2 input ports,
causing the crawl map to send the same data to both input ports.
Add a test that checks which requests are sent from start()
and which are sent from a callback.
urls1 → fetch1 → parser1 → fetch1 → parser3
urls2 → fetch2 → parser2 → fetch3 → parser3
URLs from urls1 and urls2 should be start requests, while URLs from
parser1 and parser2 should be sent from a callback.
Add a test for a next page link node that allows limiting the
pagination.
Test all combinations (that make sense) of port types in single-input,
single-output nodes.
| Consider (skip the ones that don't make sense):
| request → request
| request → response
| request → item
| request → string
| response → request
| response → response
| response → item
| response → string
| item → request
| item → response
| item → item
| item → string
| string → request
| string → response
| string → item
| string → string
Test all combinations of multiple inputs and single output or multiple
outputs from a single input that make sense.
Test a scenario where strings and URLs are mixed in a node to build
requests, as a proof of concept for such nodes, testing how the
implementation would have to be done so that requests are sent as soon
as possible while ensuring that all requests are eventually sent. It
probably requires some special handling of the input queue of strings,
since it should only be read once but its output should be reused for
every input request. Maybe we can implement a utility function for
that (see the sketch after this list).
Test scenarios where nodes fail to define the required methods, define
them incorrectly, or raise exceptions.
Test that a crawl map spec defining more than 1 node type with the same
type ID fails (e.g. 2 “fetch” node types, corresponding to different
node classes that both use the same type string).
Test persistent request metadata across 2 fetch nodes.
urls → fetch1 → node1 → fetch2 → node2 → fetch1
If node1 sets metadata, it should be available in requests from fetch2.
Test that a spider node that outputs requests can link to itself.
Review warnings triggered by tox -e py.
Test passing a crawl map as a file path or as a Python dict
instead of a string. Also test passing an invalid file path, a path
to a nonexistent file, a path to a folder, and other error
scenarios. Include testing for leading spaces when passing a
string.
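
A sketch of the utility function suggested in the mixed strings-and-URLs item above: the underlying string input is read only once, but every incoming request can replay all of its items, and items are yielded as soon as they arrive. It is independent of the actual node API and needs Python 3.10+ for aiter()/anext():

```python
import asyncio
from collections.abc import AsyncIterable, AsyncIterator


class ReplayableInput:
    """Read a shared async input once, while letting each consumer replay
    every item seen so far and the ones still to come."""

    def __init__(self, source: AsyncIterable):
        self._iterator = aiter(source)
        self._cache: list = []
        self._exhausted = False
        self._lock = asyncio.Lock()

    async def replay(self) -> AsyncIterator:
        index = 0
        while True:
            if index < len(self._cache):
                # Items already read from the source are served immediately.
                yield self._cache[index]
                index += 1
                continue
            if self._exhausted:
                return
            async with self._lock:
                # Another replay may have advanced the shared reader meanwhile.
                if index < len(self._cache) or self._exhausted:
                    continue
                try:
                    self._cache.append(await anext(self._iterator))
                except StopAsyncIteration:
                    self._exhausted = True
```

A node could then call replay() once per incoming request, building and sending a request for each string as soon as it becomes available.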
Code cleanup:
Cleanly discard outputs not connected to anything, do not keep them in
memory.
Try to use _resolve_deps in _iter_output to reduce its complexity.
Cache _resolve_deps.
Refactor the implementation, especially _iter_output, to make it more
readable.
Switch to Ruff, e.g. to get __all__ sorted automatically.
Solve the ignored PLR0912 and PLR0915 issues.
Self review:
Rendered docs.
PR.