docs: Add guide for running crawler in web server #1174


Status: Open. Wants to merge 1 commit into `master`.

Conversation

Pijukatel (Collaborator)

Description

Add guide for running crawler in web server

Issues

@Pijukatel Pijukatel added documentation Improvements or additions to documentation. t-tooling Issues with this label are in the ownership of the tooling team. labels Apr 25, 2025
@github-actions github-actions bot added this to the 113rd sprint - Tooling team milestone Apr 25, 2025
@Pijukatel Pijukatel requested a review from Copilot April 25, 2025 12:17
@Copilot Copilot AI (Contributor) left a comment:

Pull Request Overview

This PR adds a guide for running the crawler in a web server by including new FastAPI server and crawler code examples along with configuration updates.

  • Updated pyproject.toml to include new file paths and disable specific error codes for the web server examples.
  • Added a FastAPI server example (server.py) to illustrate how to run the crawler from a web endpoint.
  • Introduced an asynchronous crawler implementation (crawler.py) with lifecycle management using an async context manager.
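The lifecycle pattern mentioned in the last bullet, where the crawler starts and stops together with the web app, can be sketched with stdlib tools alone. This is an illustrative sketch, not the PR's actual `crawler.py`; the names `crawler_lifespan` and `events` are made up here:

```python
import asyncio
from contextlib import asynccontextmanager

events: list[str] = []

@asynccontextmanager
async def crawler_lifespan():
    # Start-up: this is where the real example would initialize the crawler
    # and kick off its background run.
    events.append("crawler started")
    try:
        yield
    finally:
        # Shut-down: stop the crawler when the server exits.
        events.append("crawler stopped")

async def main() -> None:
    async with crawler_lifespan():
        # The web server would serve requests for the lifetime of this block.
        events.append("serving requests")

asyncio.run(main())
print(events)
```

FastAPI supports exactly this shape via its `lifespan` parameter, which accepts an async context manager, so the same function can manage the crawler in a real app.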

Reviewed Changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `pyproject.toml` | Updated configuration to include new file mappings for docs examples and added mypy overrides. |
| `docs/guides/code_examples/running_in_web_server/server.py` | Introduces a FastAPI server with endpoints for running and interacting with a crawler. |
| `docs/guides/code_examples/running_in_web_server/crawler.py` | Adds an asynchronous crawler setup with a default request handler and lifecycle management. |

Files not reviewed (1):
- `docs/guides/running_in_web_server.mdx`: Language not supported

@Pijukatel Pijukatel requested review from vdusek and Mantisus April 25, 2025 12:20
@Pijukatel Pijukatel marked this pull request as ready for review April 25, 2025 12:20
@Mantisus Mantisus (Collaborator) left a comment:

LGTM


# Set up a web server

There are many popular web server frameworks for Python, such as [Flask](https://flask.palletsprojects.com/), [Django](https://www.djangoproject.com/), and [Pyramid](https://trypyramid.com/). In this guide, we will use [FastAPI](https://fastapi.tiangolo.com/) to keep things simple.
Comment (Collaborator):

links to the mentioned projects?

- `/` - The index endpoint returns a short description of the server, with an example link to the second endpoint.
- `/scrape` - This endpoint receives a `url` parameter and returns the page title scraped from that URL.
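The real example scrapes the title with `ParselCrawler`; the core of what `/scrape` computes can be sketched with the stdlib alone. The `TitleParser`/`extract_title` names below are illustrative, not the PR's code:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def extract_title(html: str) -> str:
    """Return the trimmed contents of the page's <title> element."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()

print(extract_title("<html><head><title>Example Domain</title></head></html>"))
```

In the actual guide this extraction is done by the crawler's request handler, which then makes the title available to the endpoint's response.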

To run the example server, make sure you have installed [fastapi[standard]](https://fastapi.tiangolo.com/#installation), then run `fastapi dev server.py` from the directory where the example code is located.
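For reference, the steps above as shell commands (assuming the `fastapi[standard]` extra provides the `fastapi` CLI, as the FastAPI installation docs describe):

```shell
# Install FastAPI together with its CLI and dev server.
pip install 'fastapi[standard]'

# Start the example server in development mode.
fastapi dev server.py
```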
Comment (Collaborator):

could we have a separate triple-backticks (```) command here for executing the server?


This will be our core server setup:

<CodeBlock className="language-python">
Comment (Collaborator):
since we have 2 files here, could we use filename arg for code block?


We will create a standard <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and use the `keep_alive=True` option to keep the crawler running even if there are no requests currently in the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>. This way it will always be waiting for new requests to come in.
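Conceptually, `keep_alive` turns the crawler into a long-running worker that idles on an empty queue instead of finishing. A stdlib-only sketch of that behavior (illustrative only, not Crawlee's implementation; the sentinel shutdown is an artifact of the sketch):

```python
import asyncio

async def worker(queue: asyncio.Queue, results: list[str]) -> None:
    # With keep_alive, the loop does not exit on an empty queue;
    # it simply waits for the next request to arrive.
    while True:
        url = await queue.get()
        if url is None:  # Sentinel used only to end this sketch.
            break
        results.append(f"processed {url}")
        queue.task_done()

async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    results: list[str] = []
    task = asyncio.create_task(worker(queue, results))

    # Requests can be enqueued at any time, e.g. from a web endpoint.
    await queue.put("https://example.com")
    await queue.join()  # Wait until the enqueued request is handled.

    await queue.put(None)  # Shut the sketch worker down.
    await task
    return results

results = asyncio.run(main())
print(results)
```

This is why the web endpoints can hand URLs to the crawler at any point during the server's lifetime: the crawler never considers its work finished.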

<CodeBlock className="language-python">
Comment (Collaborator):
since we have 2 files here, could we use filename arg for code block?

```
@@ -244,8 +247,15 @@ module = [
    "cookiecutter.*", # Untyped and stubs not available
    "inquirer.*", # Untyped and stubs not available
]
disable_error_code = ["misc"]
```
Comment (Collaborator):
sorry - what is this?

import Crawler from '!!raw-loader!./code_examples/running_in_web_server/crawler.py';
import Server from '!!raw-loader!./code_examples/running_in_web_server/server.py';

# Running in web server
Comment (Collaborator):
This should not be here, as titles are rendered based on the title field in the --- header.

Suggested change: remove the `# Running in web server` heading.

We will build a simple HTTP server that receives a page URL and returns the page title in the response.

# Set up a web server
Comment (Collaborator):
2nd level heading (1st only for page title)


To run the example server, make sure you have installed [fastapi[standard]](https://fastapi.tiangolo.com/#installation), then run `fastapi dev server.py` from the directory where the example code is located.

# Create a crawler
Comment (Collaborator):
2nd level heading (1st only for page title)

Successfully merging this pull request may close these issues.

Feature parity: Support for running Crawlee in a web server environment
3 participants