feat: Add statistics_log_format parameter to BasicCrawler #1061

Merged
vdusek merged 11 commits into apify:master from Mantisus:disable-table-logs on Mar 18, 2025

@Mantisus Mantisus commented Mar 7, 2025

Collaborator

Description

  • Add the use_table_logs parameter that allows disabling tables in logs. This makes log parsing easier when needed.

Issues

Mantisus commented Mar 7, 2025

Collaborator Author

Having thought about this task, I don't think we should add a third-party logger. A flag that disables the tables in logs, which are what make the data hard to parse, is sufficient.

This lets users plug in any logger that is compatible with the standard logging module and customize the log output.

Example for loguru

import inspect
import logging

from crawlee.crawlers import BeautifulSoupCrawler
from loguru import logger

class InterceptHandler(logging.Handler):
    def emit(self, record: logging.LogRecord) -> None:
        # Get corresponding Loguru level if it exists.
        try:
            level: str | int = logger.level(record.levelname).name
        except ValueError:
            level = record.levelno

        # Find caller from where originated the logged message.
        frame, depth = inspect.currentframe(), 0
        while frame:
            filename = frame.f_code.co_filename
            is_logging = filename == logging.__file__
            is_frozen = 'importlib' in filename and '_bootstrap' in filename
            if depth > 0 and not (is_logging or is_frozen):
                break
            frame = frame.f_back
            depth += 1

        logger.opt(depth=depth, exception=record.exc_info).log(level, record.getMessage())


logger.add('crawler.log', serialize=True, level='INFO')
logging.basicConfig(handlers=[InterceptHandler()], level=logging.INFO, force=True)

crawler = BeautifulSoupCrawler(configure_logging=False, use_table_logs=False)

Log record:

{
    "text": "2025-03-07 16:51:09.947 | INFO     | crawlee.crawlers._basic._basic_crawler:run:580 - Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803\n",
    "record": {
        "elapsed": { "repr": "0:00:01.921982", "seconds": 1.921982 },
        "exception": null,
        "extra": {},
        "file": {
            "name": "_basic_crawler.py",
            "path": "/src/crawlee/crawlers/_basic/_basic_crawler.py"
        },
        "function": "run",
        "level": { "icon": "ℹ️", "name": "INFO", "no": 20 },
        "line": 580,
        "message": "Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803",
        "module": "_basic_crawler",
        "name": "crawlee.crawlers._basic._basic_crawler",
        "process": { "id": 32118, "name": "MainProcess" },
        "thread": { "id": 139760540858176, "name": "MainThread" },
        "time": {
            "repr": "2025-03-07 16:51:09.947345+00:00",
            "timestamp": 1741366269.947345
        }
    }
}
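To illustrate why the inline form is easier to parse than a table, here is a minimal sketch that turns the statistics message from the record above into a dictionary. The helper name parse_inline_statistics is hypothetical and not part of crawlee; it assumes the message keeps the "key: value; key: value" layout shown in this log record.

```python
import json

# Abbreviated copy of the "message" field from the log record above.
MESSAGE = (
    'Final request statistics: requests_finished: 1; requests_failed: 0; '
    'retry_histogram: [1]; request_avg_failed_duration: None; '
    'request_avg_finished_duration: 0.795506'
)


def parse_inline_statistics(message: str) -> dict[str, object]:
    """Hypothetical helper: split an inline statistics message into a dict."""
    # Drop the human-readable prefix that precedes the first key.
    _, _, body = message.partition('statistics: ')
    stats: dict[str, object] = {}
    for pair in body.split('; '):
        key, _, raw = pair.partition(': ')
        if not raw:
            # Skip fragments that are not "key: value" pairs.
            continue
        try:
            # Numbers and lists in the message are valid JSON literals.
            stats[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Python's None is not JSON; keep any other text as a string.
            stats[key] = None if raw == 'None' else raw
    return stats


stats = parse_inline_statistics(MESSAGE)
print(stats['requests_finished'], stats['retry_histogram'])
```

With the serialized loguru file, the same helper could be applied to the record.message field of each JSON line.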

@janbuchar

Collaborator

I like this take. Could you add this as an example to the docs about setting up JSON logs?
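For comparison, JSON logs can also be produced with the standard library alone. The following is only a sketch under that assumption, not the example that ended up in the docs: JsonFormatter is a hypothetical class name, and the chosen fields are illustrative.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that renders each log record as one JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            'time': self.formatTime(record),
            'level': record.levelname,
            'name': record.name,
            'message': record.getMessage(),
        }
        if record.exc_info:
            # Include the formatted traceback when an exception was logged.
            payload['exception'] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
# force=True replaces any handlers that were installed earlier.
logging.basicConfig(handlers=[handler], level=logging.INFO, force=True)

logging.getLogger('example').info('requests_finished: %d', 1)
```

As in the loguru example above, the crawler would need configure_logging=False so that it does not replace these handlers with its own.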

@Mantisus Mantisus requested review from janbuchar and vdusek and removed request for janbuchar and vdusek March 7, 2025 17:27
@Mantisus Mantisus self-assigned this Mar 7, 2025

@Pijukatel Pijukatel left a comment

Collaborator

Nice. Maybe add a tiny test that checks the non-default option (the default option is already covered by existing tests).

@vdusek vdusek left a comment

Collaborator

Nice! Only two minor notes 🙂...

Comment thread docs/examples/json_logging.mdx Outdated
Comment thread docs/examples/code_examples/configure_json_logging.py
configure_logging: NotRequired[bool]
"""If True, the crawler will set up logging infrastructure automatically."""

use_table_logs: NotRequired[bool]

Collaborator

I'm thinking, wouldn't it be better to have something like statistics_log_format: Literal["table", "inline"]? I think that most people won't know what a "table log" is...

Collaborator

I like this idea
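The suggestion in this thread can be sketched as a Literal-typed option that selects the rendering. This is an illustration only: format_statistics and the table layout are hypothetical and do not reproduce crawlee's actual implementation, which is not shown in this conversation.

```python
from typing import Literal

# The two values proposed in the review comment.
StatisticsLogFormat = Literal['table', 'inline']


def format_statistics(
    stats: dict[str, object],
    statistics_log_format: StatisticsLogFormat = 'table',
) -> str:
    """Hypothetical helper: render statistics as a table or as a single line."""
    if statistics_log_format == 'inline':
        # One line, easy to split on '; ' and ': ' when parsing logs.
        return '; '.join(f'{key}: {value}' for key, value in stats.items())
    if statistics_log_format == 'table':
        # One row per statistic, with keys padded to a common width.
        width = max((len(key) for key in stats), default=0)
        return '\n'.join(f'{key.ljust(width)} | {value}' for key, value in stats.items())
    raise ValueError(f'Unsupported statistics_log_format: {statistics_log_format!r}')


sample = {'requests_finished': 1, 'requests_failed': 0}
print(format_statistics(sample, 'inline'))
print(format_statistics(sample, 'table'))
```

A named format also leaves room for further values later, which a boolean such as use_table_logs would not.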

Comment thread src/crawlee/crawlers/_basic/_basic_crawler.py Outdated
Mantisus and others added 3 commits March 11, 2025 14:01
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>

@vdusek vdusek left a comment

Collaborator

LGTM - resolve Honza's suggestions before merging

@janbuchar janbuchar left a comment

Collaborator

One nit, feel free to resolve it at will. Otherwise LGTM

Comment thread docs/examples/json_logging.mdx Outdated
@janbuchar janbuchar changed the title from "feat: Add use_table_logs parameter to control using tables in logs" to "feat: Add statistics_log_format parameter to BasicCrawler.__init__" Mar 17, 2025
@Mantisus Mantisus force-pushed the disable-table-logs branch from 06bfb38 to 40c390f Compare March 17, 2025 20:17
@vdusek vdusek changed the title from "feat: Add statistics_log_format parameter to BasicCrawler.__init__" to "feat: Add statistics_log_format parameter to BasicCrawler" Mar 18, 2025
@vdusek vdusek added the t-tooling label Mar 18, 2025
@vdusek vdusek merged commit 635ae4a into apify:master Mar 18, 2025

Labels

t-tooling Issues with this label are in the ownership of the tooling team.

Development

Successfully merging this pull request may close these issues.

Add an option for JSON-compatible logs

4 participants