# Should we handle logs asynchronously?
- Problem statement
- Goals
- Design
- Cases and run modes
- Limitations
- Test environment
- How to run
- Adding your own case
- Results
- Conclusions
- TLDR
## Problem statement

- In practice, logging to `stdout` is usually done via blocking I/O
- Blocking I/O can be offloaded to a separate thread, at some cost
- Under load, does offloading logging to a separate thread pay off in terms of HTTP response time?
## Goals

- Perform a comparative analysis of common logging approaches under a fixed load profile, using the median response time of an ASGI app endpoint as the primary metric
- Find out whether the most efficient logging setups actually process logs off the main thread
## Design

- A set of independent ASGI app factories (`FastAPI`) is defined; each factory exposes a single endpoint that differs only by logging strategy. The handler does nothing but call the logger in a loop (a minimal sketch of such a factory follows this list)
- A separate factory without logging serves as the `baseline` for estimating the overhead on response time
- The benchmarker application receives a list of cases (metadata + app factory) and processes them sequentially
- For each case the FastAPI app is started in a separate process via `uvicorn` on a free port. After a simple healthcheck (a request to the case endpoint with `n=1`), a separate `subprocess` is started with `locust` in headless mode for the same case
- By default `locust` is configured to hit the endpoint at a roughly uniform rate of ~100 RPS:
  - 50 "users" are created
  - each generates around 2 RPS
  - users are spawned at 10 users/sec
  - run time is `60s` by default
- Each case uses the same number of logs per HTTP request (`n_logs_req`), passed as a parameter to the app and controlling how many times the logger is called inside the handler
- The log format is maximally standardized: all loggers are brought to a common JSON format aligned with `aiologger` output (timestamp, level, file path, function, line number, message). The `aiologger` format is chosen as the reference because it is the hardest to customize; other loggers can be matched to it via their formatters. This allows comparing primarily the logging overhead, not differences in log format
- `locust` results for each case are stored in a CSV file. From it, the aggregated row (`Name == "Aggregated"`) is read and the following values are extracted:
  - `Median Response Time`
  - `Average Response Time`
  - actual `Requests/s`
  - number of requests and failures
- Based on these data, a comparative analysis is built:
  - for `baseline`, its median response time is fixed
  - for all other cases, the median and the overhead relative to `baseline` (in milliseconds and percent) are calculated
- Finally, a summary table is printed for all cases, showing `baseline`, the best variant, and the full list of median response times
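For illustration, here is a minimal sketch of such an app factory. The function name, endpoint path and parameter name are assumptions made for the sketch, not the project's actual code; the real factories differ mainly in how the logger is configured.

```python
import logging

from fastapi import FastAPI

logger = logging.getLogger("bench")


def create_app(n_logs_req: int = 400) -> FastAPI:
    """Illustrative factory: one endpoint whose handler only calls the logger in a loop."""
    app = FastAPI()

    @app.get("/log")
    async def log_endpoint(n: int = n_logs_req) -> dict[str, int]:
        # No useful work besides logging `n` times per request
        for i in range(n):
            logger.info("benchmark log %d", i)
        return {"logged": n}

    return app
```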
## Cases and run modes

Each case is a separate ASGI app factory (FastAPI) with a single endpoint that writes `n_logs_req` logs in a loop and does nothing useful beyond that. Cases differ only by logging strategy, including the choice of JSON serializer (standard library `json` vs `orjson`):

- No logging (baseline)
  - `baseline` — endpoint without logging, used as the reference to measure overhead on response time
- Standard `logging` (stdlib)
  - `stdlib_stream_json` / `stdlib_stream_orjson` — synchronous `logging` with a stream handler, JSON via `json` or `orjson`
  - `stdlib_queue_json` / `stdlib_queue_orjson` — `logging` with a queue-based handler/listener, JSON via `json` or `orjson` (see the sketch after this list)
- `aiologger`
  - `aiologger_no_await_json` / `aiologger_no_await_orjson` — `aiologger` without `await` in the handler, JSON via `json` or `orjson`
  - `aiologger_with_await_json` / `aiologger_with_await_orjson` — `aiologger` with `await` in the handler, JSON via `json` or `orjson`
- `structlog`
  - `structlog_sync_json` / `structlog_sync_orjson` — `structlog` with the synchronous API (`log.info(...)`), JSON via `json` or `orjson`
  - `structlog_async_json` / `structlog_async_orjson` — `structlog` with the async API (`await log.ainfo(...)`), JSON via `json` or `orjson`
  - `structlog_experimental_orjson` — `structlog` with the synchronous API and `BytesLoggerFactory`, JSON directly via `orjson.dumps` to minimize overhead
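The general pattern behind the queue-based stdlib strategy is sketched below: the application thread only enqueues records, and a `QueueListener` thread performs the blocking write to `stdout`. This is a minimal illustration of the technique, not the project's exact configuration; the real cases also attach the JSON formatter described above.

```python
import logging
import logging.handlers
import queue
import sys

log_queue: queue.SimpleQueue = queue.SimpleQueue()

# The handler used by the application: non-blocking, it only enqueues records
queue_handler = logging.handlers.QueueHandler(log_queue)

# The listener drains the queue in its own thread and does the blocking write to stdout
stream_handler = logging.StreamHandler(sys.stdout)
listener = logging.handlers.QueueListener(log_queue, stream_handler)
listener.start()

logger = logging.getLogger("bench")
logger.setLevel(logging.INFO)
logger.addHandler(queue_handler)

logger.info("this record reaches stdout from the listener thread")

listener.stop()  # on shutdown, flush any remaining records
```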
For convenience, several predefined case sets ("modes") are provided:

- `all` — full set: `baseline` and all `logging`/`aiologger`/`structlog` variants
- `json` — only cases where JSON logging uses stdlib `json`: `baseline` and all `*_json` variants (`stdlib_*_json`, `aiologger_*_json`, `structlog_*_json`)
- `orjson` — only cases where JSON is serialized via `orjson`: `baseline` and all `*_orjson` variants, including `structlog_experimental_orjson`
- `fastest` — set of cases that, under this methodology, usually end up in the top part of the table by median response time for the given load profile. Useful when you want to quickly inspect the "fast" configurations without running the full `all`
- `stable_vs_unstable` — compact set with `baseline` and several contrasting logging approaches: some behave more stably (the median and the relative ordering of cases change little between runs), others are noticeably more sensitive to load changes and environment noise. This mode is useful if you want to play with `--n-logs-req` and `--time` and manually see how different logging strategies react to increased load and longer runs

`baseline` is always present in all modes so that overhead can be compared against a single reference point.
## Limitations

- Single environment. Results are valid only for a specific combination of hardware, OS, Python version and library versions (`uvicorn`, `fastapi`, `locust`, loggers). Absolute response times are not meant to be universal benchmarks; the primary interest is the relative differences between cases under these conditions
- Single app process. For each case a single-process `uvicorn` is started (one worker). Scenarios with multiple workers, separate logging processes or distributed systems are not considered
- Only `stdout` and JSON logs. All cases write structured JSON logs to `stdout`. Variants with file output, network logging, external log systems or non-JSON/plain-text logging are not compared
- Fixed load profile. Load is defined by a single scenario: ~100 RPS for `60s` with simple `GET` requests doing no useful work in the endpoint besides logging. Behaviour under other load profiles, run durations or request types is not examined
- Limited set of metrics. Analysis relies on aggregated `locust` statistics (row `Name == "Aggregated"`): `Median Response Time`, `Average Response Time`, actual `Requests/s`, number of requests and failures. Tail latencies (p95/p99), full latency distributions, CPU/memory usage etc. are not considered
- Sequential case runs. Cases are executed sequentially in a single process. By default the order is fixed; the `--shuffle` option randomly shuffles it and reduces systematic position bias. There is a 5-second pause between cases (`COOLDOWN_BETWEEN_CASES_S`) to slightly smooth out CPU warm-up effects, but system drift (heating, background processes, etc.) is not fully eliminated
- Limited set of logger configurations. The comparison covers the specific `logging`, `aiologger` and `structlog` configurations used in this project. Results do not automatically generalize to other configurations, handlers or processor chains of the same libraries
- Stochastic results. Even with fixed settings, individual runs may differ significantly due to environment noise (background processes, scheduler, warm-up, etc.). For queue-based and background logging configurations (e.g. `stdlib_queue_*`, `aiologger_no_await_*`) the spread can be especially noticeable. In practice, it makes sense to run the benchmark multiple times and look at the stability of the relative ordering rather than relying on a single run
## Test environment

Intel Core i5-8250U (4c/8t), 8 GB RAM, LMDE 6, Python 3.13.7. Exact dependency versions are pinned in `uv.lock`.
## How to run

- Dev prerequisites:

  ```bash
  uv sync --group dev
  source .venv/bin/activate
  pre-commit install
  ```

- Run and wait for the results:

  ```bash
  benchmark
  # for help, use `benchmark -h`
  ```

It is recommended to run the benchmark in a regular OS terminal, not in an embedded IDE terminal.
```
usage: benchmark [-h] [--mode {all,fastest,json,orjson,stable_vs_unstable}] [--n-logs-req N_LOGS_REQ] [--time RUN_TIME] [--shuffle]
```

- `--mode {all,fastest,json,orjson,stable_vs_unstable}` — preset of cases (see "Cases and run modes"). Default: `fastest`
- `--n-logs-req N_LOGS_REQ` — number of logs per HTTP request (default: `400`)
- `--time RUN_TIME` — `locust` run time, e.g. `30s`, `60s`, `2m` (default: `60s`)
- `--shuffle` — shuffle the case order before running to reduce the impact of a fixed order on results (default: off)
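For example, to compare all `orjson`-based cases under a heavier-than-default profile with a shuffled case order:

```bash
benchmark --mode orjson --n-logs-req 600 --time 2m --shuffle
```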
You can also run an app factory module and locust manually and separately. For example, for the factory `yourfactory` on `http://127.0.0.1:8000`, the locust run may look like:

```bash
locust -f src/benchmark/adapters/locustfile.py --headless -u 50 -r 10 -t 1m --tag yourfactory --n-logs-req 400 --host "http://127.0.0.1:8000"
```

## Adding your own case

- Add a factory under `src/benchmark/adapters/app_factories`
- Add a task to `src/benchmark/adapters/locustfile.py` (see the sketch after this list)
- Add a case to `src/benchmark/cases.py`
- If needed, register a separate `mode` in:
  - `main.parse_args` (`choices`)
  - `main.main.factories`
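As a rough illustration of the locustfile step, a task for `yourfactory` could look like the sketch below. The endpoint path, the query parameter name and the hardcoded log count are assumptions; follow the existing tasks in `src/benchmark/adapters/locustfile.py` for the actual pattern (including how `--n-logs-req` is wired in).

```python
from locust import HttpUser, constant_throughput, tag, task


class BenchmarkUser(HttpUser):
    # ~2 requests per second per user; with 50 users this gives ~100 RPS
    wait_time = constant_throughput(2)

    @tag("yourfactory")  # selected with `--tag yourfactory`
    @task
    def yourfactory(self) -> None:
        # Hypothetical endpoint path and parameter name
        self.client.get("/yourfactory", params={"n": 400})
```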
## Results

- Under the given settings (400 logs per request, ~100 RPS, 60s, JSON to `stdout`), `baseline` consistently showed a median response time of around 20 ms and served as the reference point for estimating overhead
- For every pair "same logging approach, `json` vs `orjson`", the `orjson` variant never had a higher median than its counterpart on stdlib `json`
- `structlog_experimental_orjson`, `structlog_sync_orjson` and `stdlib_stream_orjson` consistently appeared in the upper part of the median ranking among the logging cases
- Configurations using `await` in the handler (`aiologger_with_await_*`, `structlog_async_*`) had medians orders of magnitude higher than `baseline` and their synchronous/no-await counterparts (hundreds to thousands of milliseconds versus tens)
- For `stdlib_queue_*` and `aiologger_no_await_orjson` the median varied noticeably between runs: in some runs they were close to the leaders, in others they lagged significantly behind the stream/structlog-sync cases

As an illustration, here is a sample run for the "stable" orjson-based cases:
| Case | Median (ms) | Overhead vs baseline |
|---|---|---|
| `baseline` | 20 | 0% (+0 ms) |
| `structlog_experimental_orjson` | 110 | +450% (+90 ms) |
| `structlog_sync_orjson` | 120 | +500% (+100 ms) |
| `stdlib_stream_orjson` | 150 | +650% (+130 ms) |
| `aiologger_with_await_orjson` | 570 | +2750% (+550 ms) |
| `structlog_async_orjson` | 1700 | +8400% (+1680 ms) |
The numbers are from a single run on a specific machine and are meant only as an illustration of the relative ordering.
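The overhead columns follow directly from the medians; for example, for `structlog_experimental_orjson` in the sample run above:

```python
baseline_median = 20  # ms, median of the baseline case
case_median = 110     # ms, median of structlog_experimental_orjson

overhead_ms = case_median - baseline_median          # +90 ms
overhead_pct = overhead_ms / baseline_median * 100   # +450 %
```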
To illustrate behaviour under changing load, four cases from the `stable_vs_unstable` mode were considered separately:

- `baseline` — no logging
- `structlog_experimental_orjson` — representative of fast and relatively stable configurations
- `stdlib_queue_orjson` and `aiologger_no_await_orjson` — representatives of potentially fast but more load-/noise-sensitive configurations (queue and fire-and-forget)

These cases were run under four combinations of `n_logs_req` and `run_time`:

- 400 logs, 1 minute (`400 / 1m`)
- 600 logs, 1 minute (`600 / 1m`)
- 400 logs, 2 minutes (`400 / 2m`)
- 600 logs, 2 minutes (`600 / 2m`)
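Each of these profiles corresponds to a `stable_vs_unstable` run with the matching flags (the mode may include cases beyond the four discussed here), for example:

```bash
benchmark --mode stable_vs_unstable --n-logs-req 600 --time 2m
```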
Approximate median response times for one set of runs:
| Profile | `baseline` | `structlog_experimental_orjson` | `stdlib_queue_orjson` | `aiologger_no_await_orjson` |
|---|---|---|---|---|
| 400 logs / 1m | 18 ms | 110 ms | 47 ms | 160 ms |
| 600 logs / 1m | 19 ms | 170 ms | 630 ms | 720 ms |
| 400 logs / 2m | 24 ms | 120 ms | 30 ms | 80 ms |
| 600 logs / 2m | 23 ms | 160 ms | 630 ms | 680 ms |
- `baseline` stayed around 20–25 ms across all profiles
- `structlog_experimental_orjson` grew with load more smoothly: the median increased with more logs and longer runs but stayed in the tens-to-hundreds of milliseconds range
- `stdlib_queue_orjson` and `aiologger_no_await_orjson` looked good under the base load (`400 / 1m`), but as `n_logs_req` increased their medians jumped into the hundreds-of-milliseconds range and significantly exceeded `structlog_experimental_orjson` in overhead
Currently, repeated benchmark runs, collecting statistics across runs and tail analysis (p95/p99) are not automated: this can be done manually by running the benchmark multiple times with the same parameters. When comparing results, it makes sense to look not only at a single metric value (median, p95), but also at the min–max spread across runs to see how much the results vary.
## Conclusions

- Under the given settings (400 logs per request, ~100 RPS, 60s, JSON to `stdout`), logging significantly increases the median response time compared to the `baseline` without logs: there is no "almost free" logging at this load level
- The choice of JSON serializer matters: in all tested configurations the `orjson` variant was never slower than its stdlib `json` counterpart and typically showed lower overhead
- Synchronous `structlog` + `orjson` and stream-based `logging` + `orjson` configurations look like the most robust choice when you need predictable median latency under intensive logging
- Queue-based configurations (`stdlib_queue_*`) and fire-and-forget (`aiologger_no_await_orjson`) can provide low median latency under light load, but show significantly higher run-to-run variability and degrade as the number of logs increases. They are not a good choice when you have strict latency and stability requirements
- Using async logger APIs with `await` in the handler (`aiologger_with_await_*`, `structlog_async_*`) consistently produced median latencies orders of magnitude higher than the baseline and the synchronous/no-await counterparts (hundreds to thousands of milliseconds vs tens). For the tested profile, logging with `await` in the HTTP critical path is the most expensive option and is not recommended when low latency is a priority
- For the tested configurations and load profile, offloading logging from the main thread to a separate thread or background worker does not provide a stable win in HTTP response time. In terms of median response time, simple synchronous solutions with a fast JSON serializer (e.g. `orjson`) look preferable
- All conclusions apply to the specific configurations, load profile and environment of this setup and describe the relative behaviour of the tested cases, not a universal ranking of logging solutions
## TLDR

— So, should we handle logs asynchronously?

— Only if you can afford to sacrifice low and stable latency.

If in doubt, use synchronous logging to `stdout` with `orjson`, and avoid `await` in the hot path.
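As a starting point, a setup along these lines follows that advice. It is a minimal sketch assuming `structlog` and `orjson` are installed; it is not the exact configuration used in the benchmark cases.

```python
import orjson
import structlog

structlog.configure(
    processors=[
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        # Render events to JSON bytes with orjson, avoiding a str round-trip
        structlog.processors.JSONRenderer(serializer=orjson.dumps),
    ],
    # BytesLoggerFactory writes the rendered bytes straight to stdout
    logger_factory=structlog.BytesLoggerFactory(),
    cache_logger_on_first_use=True,
)

log = structlog.get_logger()
log.info("request handled", path="/log", n_logs=400)  # synchronous, no await
```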