feat: Add statistics_log_format parameter to BasicCrawler#1061
Conversation
|
Thinking about this task, I believe we shouldn't add any third-party logger. A flag that disables tables in logs, which make data parsing difficult, is sufficient. This will allow users to use any logger that's compatible with the standard one and enables customization of log output. Example for loguru import inspect
import logging
from loguru import logger
class InterceptHandler(logging.Handler):
def emit(self, record: logging.LogRecord) -> None:
# Get corresponding Loguru level if it exists.
try:
level: str | int = logger.level(record.levelname).name
except ValueError:
level = record.levelno
# Find caller from where originated the logged message.
frame, depth = inspect.currentframe(), 0
while frame:
filename = frame.f_code.co_filename
is_logging = filename == logging.__file__
is_frozen = 'importlib' in filename and '_bootstrap' in filename
if depth > 0 and not (is_logging | is_frozen):
break
frame = frame.f_back
depth += 1
logger.opt(depth=depth, exception=record.exc_info).log(level, record.getMessage())
logger.add('crawler.log', serialize=True, level='INFO')
logging.basicConfig(handlers=[InterceptHandler()], level=logging.INFO, force=True)
crawler = BeautifulSoupCrawler(configure_logging=False, use_table_logs=False)Log record: {
"text": "2025-03-07 16:51:09.947 | INFO | crawlee.crawlers._basic._basic_crawler:run:580 - Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803\n",
"record": {
"elapsed": { "repr": "0:00:01.921982", "seconds": 1.921982 },
"exception": null,
"extra": {},
"file": {
"name": "_basic_crawler.py",
"path": "/src/crawlee/crawlers/_basic/_basic_crawler.py"
},
"function": "run",
"level": { "icon": "ℹ️", "name": "INFO", "no": 20 },
"line": 580,
"message": "Final request statistics: requests_finished: 1; requests_failed: 0; retry_histogram: [1]; request_avg_failed_duration: None; request_avg_finished_duration: 0.795506; requests_finished_per_minute: 73; requests_failed_per_minute: 0; request_total_duration: 0.795506; requests_total: 1; crawler_runtime: 0.818803",
"module": "_basic_crawler",
"name": "crawlee.crawlers._basic._basic_crawler",
"process": { "id": 32118, "name": "MainProcess" },
"thread": { "id": 139760540858176, "name": "MainThread" },
"time": {
"repr": "2025-03-07 16:51:09.947345+00:00",
"timestamp": 1741366269.947345
}
}
} |
|
I like this take. Could you add this as an example to the docs about setting up JSON logs? |
Pijukatel
left a comment
There was a problem hiding this comment.
Nice. Maybe add some tiny test that checks the non-default option(default option is already covered by existing tests.)
vdusek
left a comment
There was a problem hiding this comment.
Nice! Only two minor notes 🙂...
| configure_logging: NotRequired[bool] | ||
| """If True, the crawler will set up logging infrastructure automatically.""" | ||
|
|
||
| use_table_logs: NotRequired[bool] |
There was a problem hiding this comment.
I'm thinking, wouldn't it be better to have something like statistics_log_format: Literal["table", "inline"]? I think that most people won't know what a "table log" is...
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
vdusek
left a comment
There was a problem hiding this comment.
LGTM - resolve Honza's suggestions before merging
janbuchar
left a comment
There was a problem hiding this comment.
One nit, feel free to resolve it at will. Otherwise LGTM
use_table_logs parameter to control using tables in logsstatistics_log_format parameter to BasicCrawler.__init__
06bfb38 to
40c390f
Compare
statistics_log_format parameter to BasicCrawler.__init__statistics_log_format parameter to BasicCrawler
Description
use_table_logsparameter that allows disabling tables in logs. This makes log parsing easier when needed.Issues