CUPS 2.4 appears to freeze indefinitely under Production load #1259

tbigby-kristin · 2025-05-12T06:01:26Z

Hi team,

Unfortunately we have an issue similar to #1128 but, as @vliaskov notes there, the CUPS server becomes unresponsive indefinitely; in our case it stayed frozen for up to an hour before we manually restarted it.

Not sure whether to add this as a new issue or add to #1128.

In our case, we use CUPS as a Linux print server for the school. Windows and Mac students and staff send their print jobs to the CUPS server, we use PaperCut MF's CUPS print provider to track billing, and CUPS then sends each print job to the printer (Canon photocopiers in our environment).

We recently went live with CUPS 2.4.11 and began to notice this issue, so we upgraded to 2.4.12 (specifically commit ac85dcb plus my fix for #1249), and we are still seeing the freeze. We run CUPS in a Docker container on Ubuntu 24.04, using our own custom build rather than the Ubuntu CUPS packages.
Before rolling out 2.4.11, we had a CUPS 1.4 server running on CentOS 6 (quite the upgrade!), but that did not have this freezing issue.

I added a Docker Healthcheck to detect the freeze and auto-restart the container when it occurs, to limit the impact on users. The healthcheck runs inside the container and simply runs a curl command against https://127.0.0.1:631/printers.
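
For anyone who wants to reproduce this kind of probe without curl, here is a minimal Python sketch of the same idea. It is not the healthcheck we actually run: the function name cups_responds, the 5 second default timeout, and the decision to skip certificate verification (assuming a self-signed certificate on the loopback address) are all assumptions made for illustration.

```python
import http.client
import ssl


def cups_responds(host="127.0.0.1", port=631, path="/printers", timeout=5.0):
    """Return True if an HTTPS GET gets any HTTP response within `timeout` seconds.

    Hypothetical helper: the name, defaults and timeout are illustrative only.
    """
    # Assumption: the server uses a self-signed certificate, so verification
    # is disabled. Only do this for a loopback probe like this one.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    conn = http.client.HTTPSConnection(host, port, timeout=timeout, context=ctx)
    try:
        conn.request("GET", path)
        # Any status code counts as "responding"; only silence is a failure.
        conn.getresponse().read(1)
        return True
    except (OSError, http.client.HTTPException):
        # Covers refused connections, TLS errors and socket timeouts.
        return False
    finally:
        conn.close()
```

A wrapper script could exit non-zero when this returns False. Note that a frozen server can still complete the TCP connect in the kernel, so the timeout here is what actually detects the freeze, during the TLS handshake.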

Essentially when the freeze occurs, CUPS stops responding or accepting HTTP connections - we can't open the web UI, print jobs fail to go through, and the Docker healthcheck curl command fails with a timeout. However, I can see that the CUPS process is still running at the time.

Our logs of these restarts show that the freeze occurs about once a day, but only during working hours when there is a reasonable printing load; it never occurs overnight when no one is printing. For that matter, it never occurs on our Staging container, which can stay running for days without any issue.

As it only occurs rarely throughout the day, it's been hard to track down any useful log information. However, when I switched CUPS to Debug logging and added an error_log capture, the last few lines before a freeze showed a Client being accepted, with no "Connection now encrypted" message as would be normal for a new Client:

D [08/May/2025:11:10:50 +1200] cupsdSetBusyState: newbusy="Active clients, printing jobs, and dirty files", busy="Printing jobs and dirty files"
D [08/May/2025:11:10:50 +1200] [Client 133631] Server address is "172.22.0.2".
D [08/May/2025:11:10:50 +1200] [Client 133631] Accepted from 10.x.x.x:55295 (IPv4)
D [08/May/2025:11:10:50 +1200] [Client 133631] Waiting for request.

In the past 5 minutes of logs, that client IP had just run 'IPP Get-Jobs' requests for the two printer queues it has installed, which had all been successful.

There's an awful lot in the Debug error_log file that would need sanitising before uploading, given that it only occurs in Production, but let me know if there's specific debug lines to check for and I can collect those log details.

The Debug Report ran 4 seconds before the end of the log and showed:

D [08/May/2025:11:10:46 +1200] Report: clients=228
D [08/May/2025:11:10:46 +1200] Report: jobs=500
D [08/May/2025:11:10:46 +1200] Report: jobs-active=4
D [08/May/2025:11:10:46 +1200] Report: printers=59
D [08/May/2025:11:10:46 +1200] Report: stringpool-string-count=48997
D [08/May/2025:11:10:46 +1200] Report: stringpool-alloc-bytes=27512
D [08/May/2025:11:10:46 +1200] Report: stringpool-total-bytes=987496

Our MaxClients is set to 1000 and, during the 5 minutes covered by the error_log, the client count fluctuates between 200 and 350.

Before we set up the 'auto-restart' Healthcheck, we ran docker exec to get into the container while it was frozen.

The CUPS process was in Sleeping state:

root@1c100f8abbf2:/# ps ax
    PID TTY      STAT   TIME COMMAND
      1 ?        Ss     0:00 /bin/bash /entrypoint.sh
      8 ?        Ss   500:38 cupsd -F

Netstat showed most connections to CUPS in CLOSE_WAIT status, which I understand means that the end-user's device requested to close the connection, but CUPS hadn't processed that yet. There are a few ESTABLISHED connections in which the 'Receive Queue' has data that hasn't been processed by CUPS yet.

root@1c100f8abbf2:/# netstat -anp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp      197      0 0.0.0.0:631             0.0.0.0:*               LISTEN      8/cupsd
tcp        0      0 127.0.0.11:39341        0.0.0.0:*               LISTEN      -
tcp        1      0 172.22.0.2:631          10.x.x.x:63338      CLOSE_WAIT  8/cupsd
tcp      872      0 172.22.0.2:631          10.x.x.x:60771      CLOSE_WAIT  -
tcp        1      0 172.22.0.2:631          10.x.x.x:52567     CLOSE_WAIT  8/cupsd
tcp        1      0 172.22.0.2:631          10.x.x.x:63476      CLOSE_WAIT  8/cupsd
tcp        1      0 172.22.0.2:631          10.x.x.x:63464      CLOSE_WAIT  8/cupsd
tcp        1      0 172.22.0.2:631          10.x.x.x:63297      CLOSE_WAIT  8/cupsd
tcp        1      0 172.22.0.2:631          10.x.x.x:51634         CLOSE_WAIT  8/cupsd
tcp        1      0 172.22.0.2:631          10.x.x.x:53414       CLOSE_WAIT  8/cupsd
tcp        1      0 172.22.0.2:631          10.x.x.x:52891       CLOSE_WAIT  8/cupsd
tcp      872      0 172.22.0.2:631          10.x.x.x:60737      CLOSE_WAIT  -
tcp        0      0 172.22.0.2:631          10.x.x.x:61621     ESTABLISHED 8/cupsd
tcp      474      0 172.22.0.2:631          10.x.x.x:54340     ESTABLISHED -
...
tcp   123612      0 127.0.0.1:631           127.0.0.1:33168         ESTABLISHED 8/cupsd
...
tcp   3084392      0 172.22.0.2:631          10.x.x.x:62948      ESTABLISHED 8/cupsd
...

(Notes: I have sanitised the individual client IPs on the 10.0.0.0 network, but they are different IP addresses. I have also removed connections from PaperCut on port 9191, which are not relevant to CUPS. And I have provided only a sample of the connections; there are 371 connections that refer to cupsd.)
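
To make a full netstat capture like the one above easier to read, the rows for port 631 can be totalled per TCP state. The sketch below is only an illustration; the function name summarize_netstat and the output shape are assumptions, and it expects the Linux `netstat -anp` column layout shown above. One thing worth knowing when reading the result: on Linux, Recv-Q on a LISTEN row is the number of connections waiting to be accepted, not a byte count, while on other rows it is unread bytes.

```python
from collections import defaultdict


def summarize_netstat(text, port=631):
    """Total connection counts and Recv-Q per TCP state for one local port.

    Hypothetical helper; expects Linux `netstat -anp` style rows:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State PID/Program
    """
    summary = defaultdict(lambda: {"connections": 0, "recv_q": 0})
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, "..." elisions and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        recv_q, local, state = fields[1], fields[3], fields[5]
        if not recv_q.isdigit() or not local.endswith(f":{port}"):
            continue
        entry = summary[state]
        entry["connections"] += 1
        # LISTEN rows: pending un-accepted connections; other rows: unread bytes.
        entry["recv_q"] += int(recv_q)
    return dict(summary)
```

A large Recv-Q on the LISTEN row together with growing Recv-Q totals on ESTABLISHED rows is consistent with a process that is neither accepting nor reading.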

cupsctl showed this:

root@1c100f8abbf2:/# cupsctl
_debug_logging=0
_remote_admin=1
_remote_any=0
_share_printers=0
_user_cancel_any=0
BrowseLocalProtocols=none
DefaultAuthType=Basic
DefaultPaperSize="A4"
ErrorPolicy=stop-printer
IdleExitTimeout=60
JobPrivateAccess=default
JobPrivateValues=default
MaxClients=1000
MaxLogSize=1m
PageLogFormat=
ServerAlias=*
ServerName=****.****.school.nz
SubscriptionPrivateAccess=default
SubscriptionPrivateValues=default
WebInterface=Yes

The scheduler appeared to be running but not responding to HTTP, via these two commands:

root@1c100f8abbf2:/# lpstat -r
scheduler is running
root@1c100f8abbf2:/# lpstat -h localhost:631 -E -r

(no response, gave up and pressed Ctrl-C)

Due to running CUPS in the Docker container, we run it in the foreground with a cupsd -F & command, in case that's relevant.

My suspicion is that the CUPS scheduler loop in main.c has somehow gotten stuck, so that no new requests are accepted and no existing connections are read from; the netstat output showing lots of unread 'Receive Queue' data supports this.
But I can't find any obvious place where that could occur.

I'm hesitant to try to run CUPS in production on 'Debug2' logging level, but if that's all we can do to find out more about the problem, I can try it.
Otherwise I'm thinking to try adding some other Debug messages to the main.c scheduler and start trying to narrow down what code-path is getting stuck.

Since we've got the auto-restart Healthcheck in place, the problem isn't urgent for us, but it can still take up to a minute for the Healthcheck to restart CUPS so it would be good to find a fix.

Do you have any advice on where to look next?

tbigby-kristin · 2025-05-12T06:11:21Z

Author

Sorry, I forgot to mention, in relation to #1128, that from time to time we do see these messages in the logs:

"Unable to encrypt connection: Error in the pull function" and also "Unable to encrypt connection: Error in the push function"

However, they don't always coincide with a CUPS freeze; they can happen at any time without any bad effect. That said, it's not unusual for there to be an "Error in the pull function" within a minute or so of a freeze.

We did also get the "Error in the pull function" message in the old CUPS 1.4 server logs too, which never seemed to have any bad effect.

17 replies

tbigby-kristin Jul 28, 2025
Author

Hi @vliaskov , sorry for leaving it without a reply.

Setting the healthcheck trigger to 15 seconds improved it (as in, reduced the frequency of the freezes), but we are still having freezes. It previously was getting up to a few per day; now the freeze healthcheck triggers an average of once a day. It didn't happen once during the recent school holidays though when only 20-30 admin staff were on campus.

There's no change in the Core Dumps; each freeze is still in _httpWait, called from http_gnutls_read.

I am trying to get the time to update the healthcheck, so that it logs if a freeze occurred for 10 seconds, but only restarts if the freeze continues for 30 seconds. That might get a little more data, although I don't want users to start being affected by being unable to print if the freeze continues for too long.
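
The two-stage behaviour described above (log at 10 seconds, restart only at 30 seconds) could be sketched roughly like this. It is an illustration, not the healthcheck we run: the class name FreezeWatch and the action strings are hypothetical, and how a "restart" result is turned into an actual container restart is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FreezeWatch:
    """Escalate from logging to restarting as a stall gets longer.

    Hypothetical helper; the thresholds default to the 10 s / 30 s values
    mentioned in the text.
    """
    log_after: float = 10.0
    restart_after: float = 30.0
    failing_since: Optional[float] = None
    logged: bool = False

    def observe(self, healthy: bool, now: float) -> str:
        """Feed one probe result; returns "ok", "wait", "log" or "restart"."""
        if healthy:
            # Any successful probe clears the stall.
            self.failing_since = None
            self.logged = False
            return "ok"
        if self.failing_since is None:
            self.failing_since = now
        stalled = now - self.failing_since
        if stalled >= self.restart_after:
            return "restart"
        if stalled >= self.log_after and not self.logged:
            # Log each stall only once, when it first crosses the threshold.
            self.logged = True
            return "log"
        return "wait"
```

The stall is measured from the first failed probe, so it slightly underestimates how long the server has really been frozen; a short probe interval keeps that error small.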

Ideally I would also like to add some sort of log which times from TLS Handshake Start to TLS Handshake Complete. I suspect the vast majority are complete in well under 1 second, but I'm curious whether there are any successful handshakes that take between 1 second and 10 seconds.

That would give me data to reduce the handshake timeout down to 1 or 2 seconds, from the hard-coded 10 seconds, which hopefully would improve the freezes.
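
Until such timing is logged by CUPS itself, a rough approximation can be pulled from an existing Debug error_log by pairing each client's "Accepted from" line with its "Connection now encrypted" line. The sketch below assumes both lines carry the same "[Client N]" prefix and timestamp format as the excerpts earlier in this thread; the exact wording of the "Connection now encrypted" line, and the function name handshake_times, are assumptions. The log timestamps only have one-second resolution and the interval also includes time spent waiting for the client to start the handshake, so this can only separate "well under a second" from "several seconds".

```python
import re
from datetime import datetime

# Matches e.g.: D [08/May/2025:11:10:50 +1200] [Client 133631] Accepted from ...
LINE = re.compile(r"^\w \[([^\]]+)\] \[Client (\d+)\] (.*)$")


def handshake_times(log_text):
    """Return (durations, pending) from a CUPS Debug error_log excerpt.

    durations: client id -> seconds between accept and "now encrypted".
    pending:   client ids that were accepted but never logged encryption.
    Hypothetical helper; the log line wording is assumed, not verified.
    """
    accepted = {}
    durations = {}
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        client = int(match.group(2))
        message = match.group(3)
        if message.startswith("Accepted from"):
            accepted[client] = when
        elif "Connection now encrypted" in message and client in accepted:
            durations[client] = (when - accepted.pop(client)).total_seconds()
    return durations, sorted(accepted)
```

Counting how many durations fall between 1 and 10 seconds would indicate how many otherwise successful handshakes a shorter timeout might cut off. Clients left in the pending list include any that never used TLS, so that list needs checking by hand.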

Lastly, I guess it would be ideal if the TLS Handshake could run in a separate thread. But I think that would be a huge architectural change for CUPS so not likely to be quick!

Thanks, Tony

vliaskov Aug 20, 2025

Just a note that I have added an observation, #827 (comment), in another bug (closed, but likely to be reopened). The handling of TLS error codes, and in particular when and how errno is set from them, may be related to the issue discussed here as well.

tbigby-kristin Sep 21, 2025
Author

Hi all,

As an update, I have recently been collecting data with the TLS Handshake timeout lowered from the original 10 seconds to 2 seconds and to 1 second. I'm still collecting data for the 1s timeout and will have to wait until school restarts in a couple of weeks, as it's the holidays at the moment.

With a 10s timeout, we had approximately 2 freezes per day.
But with either a 2s timeout, or a 1s timeout, we have not had any freezes at all.

It seems to me that the fewest "Error in the pull function" and "Error in the push function" messages (which are a symptom of the TLS Handshake timing out) occur when I have the 2s timeout set, although I haven't collected as much data for the 1s timeout yet.

I am likely, depending on the final data results in a couple of weeks' time, to set a 2s timeout as the permanent workaround for this problem.

If anyone following is interested in trying this, I changed both of:

cups/tls-gnutls.c in _httpTLSStart around line 1765:

-  if (!old_cb || old_timeout < 10.0)
-  {
-    DEBUG_puts("4_httpTLSStart: Setting timeout to 10 seconds.");
-    httpSetTimeout(http, 10.0, NULL, NULL);
-  }
+  DEBUG_puts("4_httpTLSStart: Setting timeout to 1 second.");
+  httpSetTimeout(http, 1.0, NULL, NULL);

and cups/http.c in http_set_wait around line 4614, with some debug prints to confirm:

-  else
+  else if (http->timeout_value > 0.9 && http->timeout_value < 1.1) {
+    DEBUG_printf(("1http_set_wait(%p) TB not blocking, detected custom timeout, timeout_value=%f, wait_value=%d. Setting to %d", (void *)http, http->timeout_value, http->wait_value, (int)(http->timeout_value * 1000)));
+    http->wait_value = (int)(http->timeout_value * 1000);
+  }
+  else {
+    DEBUG_printf(("1http_set_wait(%p) TB not blocking, timeout_value=%f, wait_value=%d. Setting to 10000", (void *)http, http->timeout_value, http->wait_value));
     http->wait_value = 10000;
+  }
+  DEBUG_printf(("1http_set_wait(%p) TB final wait_value=%d", (void *)http, http->wait_value));

michaelrsweet Sep 22, 2025
Maintainer

@tbigby-kristin I have been (separately) testing a similar change in libcups v3 to reduce the default timeout on non-blocking sockets to 1 second, and I think we will be merging those changes for CUPS 2.5 and earlier once we have a little more testing on our end.

tbigby-kristin Sep 23, 2025
Author

Thanks @michaelrsweet , that sounds really promising! I can report that I haven't heard of any problems so far printing to our CUPS server with either the 2 second or 1 second non-blocking timeout value. Of course you should still test as you normally would :)
