Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add SecureDrop installation test #25

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

deeplow
Copy link
Contributor

@deeplow deeplow commented Sep 11, 2024

First attempt at adding a test for SecureDrop.

assert_and_click("menu-vm-xterm");


assert_script_run('gpg --keyserver hkps://keys.openpgp.org --recv-key "2359 E653 8C06 13E6 5295 5E6C 188E DD3B 7B22 E6A3"');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_script_run depends on seeing serial console output - serial console from "work" VM isn't directly connected to the one of the host; for this to work you either need to run something like tail -F /var/log/xen/console/guest-work.log >> /dev/hvc0 in dom0 (we do that here), or do all that from dom0's terminal via qvm-run

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Would type_string and then "ret" work as well? I'm trying not to deviate to much from the original instructions so it's easy to update.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that would work, but your test wouldn't detect if any of those command fails (other than possible some later step dom0 in dom0 failing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point 😔. I'll just go ahead an use qvm-run, then.

@deeplow deeplow force-pushed the add-securedrop-test branch 2 times, most recently from 3502fd7 to 2008e3f Compare September 12, 2024 10:54
@marmarek
Copy link
Member

Hint: add send_key('alt-f10') to see more output at once in xterm. Not relevant much when everything goes right, but helps quite a bit when debugging.

@deeplow deeplow force-pushed the add-securedrop-test branch 6 times, most recently from 5c8b79c to 3a2149a Compare September 12, 2024 12:39
@deeplow
Copy link
Contributor Author

deeplow commented Sep 12, 2024

Hint: add send_key('alt-f10') to see more output at once in xterm. Not relevant much when everything goes right, but helps quite a bit when debugging.

Thanks for the tip. I had seen that in some places and was wondering about its purpose. I'll add it in the next round.

assert_script_run('curl https://raw.githubusercontent.com/freedomofpress/securedrop/d91dc67/securedrop/tests/files/test_journalist_key.sec.no_passphrase | sudo tee /usr/share/securedrop-workstation-dom0-config/sd-journalist.sec');
assert_script_run('sdw-admin --validate');

assert_script_run('xfce4-power-manager -q'); # disable screen blanking during long command
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marmarek there's a command which takes quite a while and in the meantime the screen blanks. I don't think it's xscreensaver because I think that's killed at the beginning of the test. Then I tried to disable XFCE's power management, but didn't help.

Have you encountered this before?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My notes have this line:

x11_start_program('env xset s off', valid => 0);

but I'm not sure if that was enough either.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I had to combine it with env xset -dpms for this to fully work.

And FYI I noticed that just with env xset s off it still blanked for a lot of the slow command (sdw-admin --apply), but oddly enough the screen showed up just the logs upload command (video). No idea what went on there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It unblanked on the key press.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! I totally forgot that it was literally typing each letter. That's why, then.

Copy link
Contributor Author

@deeplow deeplow Oct 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall that the above options were still not working perfectly (the screeen was still bllanking at some point). What seems to have solved it is enabling presentation mode. I haven't look at what it's doing under the hood. But it seems to work. And because the setting is persistent, I think it shouldn't need anymore all the xscreensaver exits.

canvas

@marmarek
Copy link
Member

Anyway:

# Test died: command 'sdw-admin --apply' timed out at /usr/lib/os-autoinst/autotest.pm line 412.

So, longer timeout? This is running virtualized, so runs slower than native.

And also, I recommend collecting and uploading logs. For example wrap it with script, or use tee (see https://github.com/QubesOS/openqa-tests-qubesos/blob/main/tests/update2.pm#L110-L111 for example). You can also do post-fail hook to collect extra info on failure too: https://github.com/QubesOS/openqa-tests-qubesos/blob/main/tests/update2.pm#L168-L174

@deeplow deeplow force-pushed the add-securedrop-test branch 8 times, most recently from bf7f90e to fb294a2 Compare September 13, 2024 16:48
@deeplow
Copy link
Contributor Author

deeplow commented Sep 16, 2024

So, longer timeout? This is running virtualized, so runs slower than native.

Fair point. I have added some timeout.

Now I am running into another issue. I have created a needle through the web interface added for this step an assert_and_click. However, when it runs, it's not even listing the needle. Do I need to add the needle's PNG and respective JSON to the commit?

@marmarek
Copy link
Member

Have you restarted the test after adding the needle? Or did you added it via developer mode?

@deeplow
Copy link
Contributor Author

deeplow commented Sep 16, 2024

I thought I had restarted it afterwards. But will try again. It for sure wasn't via developer mode. Let's see if it now finds the needle.

@marmarek
Copy link
Member

I see the issue: you haven't added the securedrop-launcher tag, it only has desktop tag (which shouldn't be there I think). I guess you added it by clicking on an earlier screenshot (you can do that too, but then you need to adjust tags manually, as the default will be about that other screenshot).

@deeplow
Copy link
Contributor Author

deeplow commented Sep 16, 2024

OK. Makes sense. I was afraid to create new tags. Where can I edit the needle? Or should I create a new one?

@marmarek
Copy link
Member

marmarek commented Sep 16, 2024

For this one I just edited it manually.
But generally create new one, and don't be afraid about adding tags. In fact, do add more of them :) for SD-specific needles add ENV-securedrop tag (in addition to any others).

@deeplow deeplow force-pushed the add-securedrop-test branch 7 times, most recently from de42ab9 to 4860126 Compare November 12, 2024 19:59
@marmarek
Copy link
Member

I see you're increasing timeout for upload_logs, are you uploading some truly huge logs? Otherwise, I think hitting a timeout there is a symptom of some other issue...

@deeplow deeplow force-pushed the add-securedrop-test branch 2 times, most recently from eafce98 to 752f9d7 Compare November 13, 2024 09:44
@deeplow
Copy link
Contributor Author

deeplow commented Nov 13, 2024

I see you're increasing timeout for upload_logs, are you uploading some truly huge logs? Otherwise, I think hitting a timeout there is a symptom of some other issue...

I think so as well. I was trying both approaches: investigating the conditions under which the slow upload happens and trying to see the limits of the timeout. So far I've found that uploading packages at the beginning of the tests does work out just fine. However, when it runs at the end of everything the upload speed is very slow and it times out.

And I don't think the issue is with the new templates having so many new packages because it even fails while uploading dom0. I think at this point I'll need your help figuring this out. I'm also fine with just adding a failok argument for now, but it's as you said before: having the list of packages will come in handy.

One thing that could triggering this is that I'm running curl_via_netvm twice. I did this because I lost dom0 networking after a restart (probably a sys-net restart). Do you think this could be the issue?

@marmarek
Copy link
Member

One thing that could triggering this is that I'm running curl_via_netvm twice.

It shouldn't harm, it just defines curl as a shell function.

Is there some sys-net change? Idk, some firewall, or maybe some weird routing everything via tor? Or maybe it fails on reading the input file for some reason? Maybe some shell startup script (.bashrc etc) that could swallow/redirect all input?

My next idea is to debug it via developer mode, let me try

@deeplow
Copy link
Contributor Author

deeplow commented Nov 13, 2024

Is there some sys-net change? Idk, some firewall, or maybe some weird routing everything via tor? Or maybe it fails on reading the input file for some reason? Maybe some shell startup script (.bashrc etc) that could swallow/redirect all input?

As far as I'm aware, it doesn't change the updates to do them through tor and it doesn't modify sys-net in a way that would affect this. From skimming through the code, all the sys-net impact should be:

But I can ask the team when the sun rises on the other side of the ocean.

@deeplow
Copy link
Contributor Author

deeplow commented Nov 13, 2024

My next idea is to debug it via developer mode, let me try

Thanks! I haven't played around with developer mode, yet. After entering that mode, one needs to use VNC, correct? Or is it though the web interface in some menu I haven't found.

@marmarek
Copy link
Member

After entering that mode, one needs to use VNC, correct?

In this case, yes (which isn't exposed to the internet directly). The other use case for developer mode is to create needles interactively.

@marmarek
Copy link
Member

Ok, this is really weird, it looks like some network packets are lost... My current hypothesis is it doesn't like sys-net restart (kernel or qemu bug). I'll try to get smaller reproducer.
In the meantime, can you try to correlate this issue with worker the test runs on?

@deeplow
Copy link
Contributor Author

deeplow commented Nov 13, 2024

Interesting. Thanks for digging into it.

In the meantime, can you try to correlate this issue with worker the test runs on?

Will do!

@deeplow
Copy link
Contributor Author

deeplow commented Nov 13, 2024

In the meantime, can you try to correlate this issue with worker the test runs on?

There is one instance where it succeeded and another where it failed on salmon. Actually, I think all of the tests ran on salmon.

One thing that may also help here is that this morning I was checking if I ran the package logs upload at the beginning and it it passed that part and this was also on salmon.

So in reality we have only two tests where the uploads succeeded: one at the beginning of the test run and another (very rare, at the end).

@marmarek
Copy link
Member

I just tried to reproduce in a fresh job before SD gets installed, and it always works, also after sys-net restart, qubes restart and qubes+sys-net restart...

@deeplow
Copy link
Contributor Author

deeplow commented Nov 13, 2024

In other words, SD is probably in some way interfering with sys-net. Either that or the VM's network bandwidth is sensitive to long-running tests. My day is full of meetings but I can try locally to see if I get similar packet drops, but I doubt it since I'm using the same system for browsing and it works fine (although TCP could be making packet drops non-noticeable -- which I doubt is the case).

@marmarek
Copy link
Member

Time is unlikely the factor here, we have jobs running for 3h+ that upload a lot of logs (including multi-MB tarball of the whole /var/log) at the end.

@marmarek
Copy link
Member

So, I have no idea what is going on... it's probably some QEMU or e1000e driver issue, but I don't really know why it happens, and why only here.
I kinda suspect sys-net restart is related in some way, but not sure how exactly.
If I restart sys-net at the start of the job, I can get similar failure eventually, but very rarely (maybe 1 in 100 attempts or even less). And then reloading e1000e module in sys-net seems to help. I tried restarting sys-net few more times and results are inconclusive - at some times it made it better, but at other times it made it worse. Maybe it isn't just about restarting sys-net but some other thing at the same time?
But at the end it fails most of the time (idk, 99% of cases or such), and reloading the module doesn't help. Restarting sys-net and reloading the module doesn't help either.

All of that I tested by doing curl call in sys-net directly, to eliminate few moving parts.

I'll test some workarounds like adding --retry and --max-time to the curl call...

Details

This is how it looks on tcpdump from the host side (packets coming from qemu appear as coming from localhost):

04:36:44.072345 IP (tos 0x0, ttl 64, id 16376, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.36244 > 127.0.0.1.20043: Flags [S], cksum 0xfe30 (incorrect -> 0xf201), seq 87966507, win 65495, options [mss 65495,sackOK,TS val 3799537470 ecr 0,nop,wscale 7], length 0
04:36:44.072375 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.20043 > 127.0.0.1.36244: Flags [S.], cksum 0xfe30 (incorrect -> 0x4a01), seq 2951528023, ack 87966508, win 65483, options [mss 65495,sackOK,TS val 3799537470 ecr 3799537470,nop,wscale 7], length 0
04:36:44.072399 IP (tos 0x0, ttl 64, id 16377, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.36244 > 127.0.0.1.20043: Flags [.], cksum 0xfe28 (incorrect -> 0x70bd), ack 1, win 512, options [nop,nop,TS val 3799537470 ecr 3799537470], length 0
04:36:44.074104 IP (tos 0x0, ttl 64, id 16378, offset 0, flags [DF], proto TCP (6), length 280)
    127.0.0.1.36244 > 127.0.0.1.20043: Flags [P.], cksum 0xff0c (incorrect -> 0xa859), seq 1:229, ack 1, win 512, options [nop,nop,TS val 3799537471 ecr 3799537470], length 228
04:36:44.074125 IP (tos 0x0, ttl 64, id 22166, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20043 > 127.0.0.1.36244: Flags [.], cksum 0xfe28 (incorrect -> 0x6fd9), ack 229, win 510, options [nop,nop,TS val 3799537471 ecr 3799537471], length 0
04:36:44.281074 IP (tos 0x0, ttl 64, id 16379, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.36244 > 127.0.0.1.20043: Flags [P.], cksum 0x03dd (incorrect -> 0xc4bc), seq 229:1689, ack 1, win 512, options [nop,nop,TS val 3799537678 ecr 3799537471], length 1460
04:36:44.281098 IP (tos 0x0, ttl 64, id 22167, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20043 > 127.0.0.1.36244: Flags [.], cksum 0xfe28 (incorrect -> 0x6818), ack 1689, win 621, options [nop,nop,TS val 3799537678 ecr 3799537678], length 0
04:36:44.283811 IP (tos 0x0, ttl 64, id 16380, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.36244 > 127.0.0.1.20043: Flags [P.], cksum 0x03dd (incorrect -> 0xaca3), seq 1689:3149, ack 1, win 512, options [nop,nop,TS val 3799537681 ecr 3799537678], length 1460
04:36:44.283833 IP (tos 0x0, ttl 64, id 22168, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20043 > 127.0.0.1.36244: Flags [.], cksum 0xfe28 (incorrect -> 0x625e), ack 3149, win 621, options [nop,nop,TS val 3799537681 ecr 3799537681], length 0

and no further packets for this connection (not even retransmissions), looks like qemu isn't sending them. When looking from the sys-net side, all are sent, and there are retransmissions for the missing ones.

when it works, it looks like this:

06:45:54.364797 IP (tos 0x0, ttl 64, id 65392, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [S], cksum 0xfe30 (incorrect -> 0xae12), seq 2227083767, win 65495, options [mss 65495,sackOK,TS val 3807287762 ecr 0,nop,wscale 7], length 0
06:45:54.364817 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [S.], cksum 0xfe30 (incorrect -> 0xd332), seq 2499070244, ack 2227083768, win 65483, options [mss 65495,sackOK,TS val 3807287762 ecr 3807287762,nop,wscale 7], length 0
06:45:54.364833 IP (tos 0x0, ttl 64, id 65393, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [.], cksum 0xfe28 (incorrect -> 0xf9ee), ack 1, win 512, options [nop,nop,TS val 3807287762 ecr 3807287762], length 0
06:45:54.366026 IP (tos 0x0, ttl 64, id 65394, offset 0, flags [DF], proto TCP (6), length 300)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0xff20 (incorrect -> 0xd07c), seq 1:249, ack 1, win 512, options [nop,nop,TS val 3807287763 ecr 3807287762], length 248
06:45:54.366040 IP (tos 0x0, ttl 64, id 11885, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xf8f6), ack 249, win 510, options [nop,nop,TS val 3807287763 ecr 3807287763], length 0
06:45:54.366237 IP (tos 0x0, ttl 64, id 65395, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x377b), seq 249:1709, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287763], length 1460
06:45:54.366247 IP (tos 0x0, ttl 64, id 11886, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xf2d1), ack 1709, win 621, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366277 IP (tos 0x0, ttl 64, id 65396, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x0746), seq 1709:3169, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366283 IP (tos 0x0, ttl 64, id 11887, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xed24), ack 3169, win 614, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366300 IP (tos 0x0, ttl 64, id 65397, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0xe15d), seq 3169:4629, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366311 IP (tos 0x0, ttl 64, id 11888, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xe777), ack 4629, win 607, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366325 IP (tos 0x0, ttl 64, id 65398, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x61fa), seq 4629:6089, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366330 IP (tos 0x0, ttl 64, id 11889, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xe1ca), ack 6089, win 600, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366344 IP (tos 0x0, ttl 64, id 65399, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x3073), seq 6089:7549, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366349 IP (tos 0x0, ttl 64, id 11890, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xdc1d), ack 7549, win 593, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366491 IP (tos 0x0, ttl 64, id 65400, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0xf323), seq 7549:9009, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366498 IP (tos 0x0, ttl 64, id 11891, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xd670), ack 9009, win 586, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366518 IP (tos 0x0, ttl 64, id 65401, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0xf3b9), seq 9009:10469, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366523 IP (tos 0x0, ttl 64, id 11892, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xd0c3), ack 10469, win 579, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366537 IP (tos 0x0, ttl 64, id 65402, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x0adf), seq 10469:11929, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366542 IP (tos 0x0, ttl 64, id 11893, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xcb16), ack 11929, win 572, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366556 IP (tos 0x0, ttl 64, id 65403, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x3c53), seq 11929:13389, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366561 IP (tos 0x0, ttl 64, id 11894, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xc569), ack 13389, win 565, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366574 IP (tos 0x0, ttl 64, id 65404, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0xdefd), seq 13389:14849, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366579 IP (tos 0x0, ttl 64, id 11895, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xbfbc), ack 14849, win 558, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366686 IP (tos 0x0, ttl 64, id 65405, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0xbc08), seq 14849:16309, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366706 IP (tos 0x0, ttl 64, id 65406, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0x65c1), seq 16309:17769, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366713 IP (tos 0x0, ttl 64, id 11896, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xb9c9), ack 16309, win 621, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366716 IP (tos 0x0, ttl 64, id 11897, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xb420), ack 17769, win 610, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366733 IP (tos 0x0, ttl 64, id 65407, offset 0, flags [DF], proto TCP (6), length 1512)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x03dd (incorrect -> 0xd87e), seq 17769:19229, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1460
06:45:54.366738 IP (tos 0x0, ttl 64, id 11898, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xae73), ack 19229, win 603, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.366753 IP (tos 0x0, ttl 64, id 65408, offset 0, flags [DF], proto TCP (6), length 1280)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [P.], cksum 0x02f5 (incorrect -> 0xbfed), seq 19229:20457, ack 1, win 512, options [nop,nop,TS val 3807287764 ecr 3807287764], length 1228
06:45:54.366758 IP (tos 0x0, ttl 64, id 11899, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [.], cksum 0xfe28 (incorrect -> 0xa9ad), ack 20457, win 597, options [nop,nop,TS val 3807287764 ecr 3807287764], length 0
06:45:54.369007 IP (tos 0x0, ttl 64, id 11900, offset 0, flags [DF], proto TCP (6), length 241)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [P.], cksum 0xfee5 (incorrect -> 0x32ca), seq 1:190, ack 20457, win 621, options [nop,nop,TS val 3807287766 ecr 3807287764], length 189
06:45:54.369021 IP (tos 0x0, ttl 64, id 65409, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [.], cksum 0xfe28 (incorrect -> 0xa942), ack 190, win 511, options [nop,nop,TS val 3807287766 ecr 3807287766], length 0
06:45:54.372973 IP (tos 0x0, ttl 64, id 65410, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [F.], cksum 0xfe28 (incorrect -> 0xa93c), seq 20457, ack 190, win 512, options [nop,nop,TS val 3807287770 ecr 3807287766], length 0
06:45:54.373165 IP (tos 0x0, ttl 64, id 11901, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.20023 > 127.0.0.1.47168: Flags [F.], cksum 0xfe28 (incorrect -> 0xa8ca), seq 190, ack 20458, win 621, options [nop,nop,TS val 3807287770 ecr 3807287770], length 0
06:45:54.373182 IP (tos 0x0, ttl 64, id 65411, offset 0, flags [DF], proto TCP (6), length 52)
    127.0.0.1.47168 > 127.0.0.1.20023: Flags [.], cksum 0xfe28 (incorrect -> 0xa937), ack 191, win 512, options [nop,nop,TS val 3807287770 ecr 3807287770], length 0

@marmarek
Copy link
Member

marmarek commented Nov 14, 2024

as for a temporary band-aid, try to add something like this to your branch:

diff --git a/lib/networking.pm b/lib/networking.pm
index c0053f2..c1d4215 100644
--- a/lib/networking.pm
+++ b/lib/networking.pm
@@ -102,7 +102,7 @@ curl() {
     fi
 
     if [ -n "\$inputfile" ]; then
-        qvm-run --no-gui -p sys-net "curl \$allargs" <\$inputfile
+        qvm-run --no-gui -p sys-net "curl --max-time 10 \$allargs" <\$inputfile
     else
         qvm-run --no-gui -p sys-net "curl \$allargs"
     fi

and failok=>1; I've checked that successful uploads complete within 2s, so 10s timeout feels safe.

It won't fix uploading, but at least won't block the whole thing on this issue. And when we'll manage to fix uploading, it should automatically start working again without test change.

@deeplow
Copy link
Contributor Author

deeplow commented Nov 14, 2024

This is strange indeed. Thanks for investigating in any case. I have some implementation questions:

as for a temporary band-aid, try to add something like this to your branch:

Will do. Should I also add it to the else condition for consistency's sake?

and failok=>1

Do you mean adding the failok parameter on save_and_upload_log() and call it only in tests/securedrop/upload_packages.pm? Or do you mean directly in upload_logs() inside save_and_upload_logs(), thus setting failok to all tests that upload packages?

@marmarek
Copy link
Member

Will do. Should I also add it to the else condition for consistency's sake?

Not sure. The issue we observe is only about sending data (POST requests with some payload), not GET request. I have checked that 10s is more than enough for upload calls, I haven't inspected logs for download calls (I don't remember fetching big files with this function, but probably safer to skip timeout on download for now (see that fetching curl-wrapper works).

Do you mean adding the failok parameter on save_and_upload_log() and call it only in tests/securedrop/upload_packages.pm? Or do you mean directly in upload_logs() inside save_and_upload_logs(), thus setting failok to all tests that upload packages?

I'd prefer only in securedrop test, so add the function parameter.

In a specific set of circumstances [1] dropped packets were causing
the log uploads to take a long time and eventually fail. Even
though the client would send retries, they would not reach the host.

Adding a timeout limit ensures early failures.

[1]: QubesOS#25 (comment)

Co-Authored-By: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants