Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

silence rsync progress in harv-install #641

Merged
merged 2 commits into from
Feb 7, 2024
Merged

silence rsync progress in harv-install #641

merged 2 commits into from
Feb 7, 2024

Conversation

petiepooo
Copy link
Contributor

@petiepooo petiepooo commented Jan 23, 2024

Problem:
Because this is running without a TTY, it may be affecting IPMI-based installs that take hours to read the zst files off of virtual media.

Solution:
This patch removes the progress flag from rsync in the hopes that it prevents rsync from freezing during install.

Related Issue:

harvester/harvester#2651

Test plan:
Test install on slow IPMI virtual media

Because this is running without a TTY, it may be affecting IPMI-based installs that take hours to read the zst files off of virtual media.  This patch removes the progress flag from rsync in the hopes that it prevents rsync from freezing during install.
@bk201
Copy link
Member

bk201 commented Jan 24, 2024

Thanks for the contribution!
@starbops can you take a look, I thought the progress is to better determine if the progress is stuck.

@starbops
Copy link
Member

Hi @petiepooo Would you elaborate more on the problem you encountered? Do you mean mounting the ISO image via BMC will cause installation progress to become stuck if those rsync flags are on? Thank you!

@petiepooo
Copy link
Contributor Author

petiepooo commented Jan 24, 2024

Installation via BMC virtual optical disc is still hanging for me when installing via BMC console on a slow link. Like 2.4GHz WiFi-slow.

What I've done to get around it is to ssh into the install, wait for the second rsync command to be running, and run killall -STOP rsync; time md5sum /run/initramfs/live/bundle/harvester/images/*; killall -CONT rsync That pulls the source files into the page cache so that the rsync commands finish much quicker and the installation completes.

Without that, the installation just seems to hang after several minutes. I know that's been addressed once in #2651, but I believe unsuccessfully. Rsync is normally good at running long syncs in the background, but when trying an installation that is unsuccessful, I no longer see a high volume of traffic between the browser with the ISO mounted and the iDRAC port 5900. It seems to stop part way through rsync of the second image. Sending any sort of signal, such as HUP or USR1, to rsync kills it and harv-install exits with an error.

My working theory is that with --progress present, rsync is sending data to stdout, but the reading process is blocking while waiting for a line feed. Because no LF is sent until the file is 100% transferred (progress output includes ANSI codes but no LF), the stdio buffer is filling up and rsync is blocking waiting on room in its stdio buffer. Removing the --progress indicator would decrease output to stdout, thereby preventing that blocking... I believe.

Note that this only applies to really long rsync sessions, such as transferring that 1.9GB image file over a 240Kbps 802.11g WiFi link. When I ran that killall/md5sync/killall sequence, it took nearly two hours to come back with all five md5 hashes. However, with the source files already cached after that, the rsync completed quickly, as it was just copying them out to local disk then.

Again, rsync blocking on stdio is just a theory, however when running harv-install via the ISO installer, rsync progress is not displayed on the console anyway, so the --progress flag is moot. Removing it would not decrease the output sent to the IPMI console.

@petiepooo
Copy link
Contributor Author

Perhaps there's a way to only disable the --progress output if it's not being sent to a TTY. Or perhaps figure out how to not line-buffer its output so that progress is displayed on the BMC console... that would be even better, but is beyond my skill level.

At least, please try to duplicate the rsync freeze using 2.4GHz WiFi for your virtual optical disc BMC link, and if you can then see if removing the --progress fixes it.

@petiepooo petiepooo changed the title silent rsync progress in harv-install silence rsync progress in harv-install Jan 24, 2024
@tserong
Copy link
Contributor

tserong commented Feb 1, 2024

I don't have real hardware to try this on via IPMI, but I've simulated a very slow ISO install in a VM, and can confirm that the rsync --progress output is basically useless. In the following screenshot, the progress of rke2-images.linux-amd64-v1.25.9-rke2r1.tar.zst only appears at the end, once it's finished copying:

rsync-harvester-images

Same if I check /var/log/console.log:

time="2024-02-01T07:20:15Z" level=info msg="[stdout]: rke2-images.linux-amd64-v1.25.9-rke2r1.tar.zst"
time="2024-02-01T07:36:40Z" level=info msg="[stdout]: \r         32.77K   0%    0.00kB/s    0:00:00  \r
[... giant pile of progress junk...]
        958.56M 100%  951.57kB/s    0:16:23 (xfr#1, to-chk=0/1)"

i.e. all the progress output is batched up and dumped once the rsync is finished.

I'll see if I can figure out how to not buffer this stuff so we can get better console output, but if that fails I'm totally happy to take this - thanks for the PR!

@tserong
Copy link
Contributor

tserong commented Feb 2, 2024

I left my very slow ISO install running overnight, and when I came back in the morning it was still sitting there part way through an rsync (the console was the same as the screenshot in my previous comment). So I did some more detailed testing today. For the record, I was installing harvester in a VM, but I'd capped the speed of the virtual CD-ROM to 1MB/s. Here's the relevant libvirt config snippet:

<disk type="file" device="cdrom">
  <driver name="qemu" type="raw"/>
  <source file="/images/harvester-v1.2.1-amd64.iso" index="1"/>
  <backingStore/>
  <target dev="sda" bus="sata"/>
  <iotune>
    <total_bytes_sec>1000000</total_bytes_sec>
  </iotune>
  <readonly/>
  <boot order="1"/>
  <alias name="sata0-0-0"/>
  <address type="drive" controller="0" bus="0" target="0" unit="0"/>
</disk>

While harv-install was running, I ran an strace on the rsync process:

# pgrep rsync
2582
2583
2584
# strace -p 2582 2>&1|grep 'write(1'
write(1, "\r         11.57M   0%  976.40kB/"..., 46) = 46
write(1, "\r         12.62M   0%  976.40kB/"..., 46) = 46
write(1, "\r         13.66M   0%  976.40kB/"..., 46) = 46
write(1, "\r         14.71M   0%  976.40kB/"..., 46) = 46
write(1, "\r         15.76M   0%  976.63kB/"..., 46) = 46
[...]
write(1, "\r          1.93G  99%  976.87kB/"..., 46) = 46
write(1, "\r          1.93G  99%  976.40kB/"..., 46) = 46
write(1, "\r          1.93G  99%  976.63kB/"..., 46) = 46
write(1, "\r          1.93G 100%  970.50kB/"..., 65) = 65
write(1, "harvester-repo-images-v1.2.1.tar"..., 37) = 37
write(1, "\r         32.77K   0%   23.92kB/"..., 46) = 46
write(1, "\r         65.54K   0%   47.80kB/"..., 46) = 46
write(1, "\r          1.08M   1%  442.58kB/"..., 46) = 46
write(1, "\r          2.13M   3%  605.53kB/"..., 46) = 46
[...]

It prints out one of those progress lines every second. After it had been running for about half an hour, it just stopped dead. My record of the duration here is:

# tail /var/log/console.log
[...]
time="2024-02-02T07:17:35Z" level=info msg="[stdout]: Copying remaining images temporary location..."
time="2024-02-02T07:17:35Z" level=info msg="[stdout]: sending incremental file list"
time="2024-02-02T07:17:35Z" level=info msg="[stdout]: ./"
time="2024-02-02T07:17:35Z" level=info msg="[stdout]: harvester-images-v1.2.1.tar.zst"

# ls -lah /run/cos/target/var/lib/rancher/tmp/images
total 2.8G
drwxr-xr-x 2 root root 4.0K Feb  2 07:50 .
drwxr-xr-x 3 root root 4.0K Feb  2 07:01 ..
-rw------- 1 root root 939M Feb  2 08:07 .rancher-images-v2.7.5.tar.zst.HsibxB
-rw-r--r-- 1 root root 1.9G Oct 26 11:00 harvester-images-v1.2.1.tar.zst
-rw-r--r-- 1 root root  56M Oct 26 11:00 harvester-repo-images-v1.2.1.tar.zst

# stat /run/cos/target/var/lib/rancher/tmp/images/.rancher-images-v2.7.5.tar.zst.HsibxB
  File: /run/cos/target/var/lib/rancher/tmp/images/.rancher-images-v2.7.5.tar.zst.HsibxB
  Size: 984088576 	Blocks: 1922056    IO Block: 4096   regular file
Device: fd15h/64789d	Inode: 11010067    Links: 1
Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2024-02-02 07:50:59.063678060 +0000
Modify: 2024-02-02 08:07:28.702665361 +0000
Change: 2024-02-02 08:07:28.702665361 +0000
 Birth: 2024-02-02 07:50:59.063678060 +0000

The last line logged to the console showed it starting to copy harvester-images-v1.2.1.tar.zst. After that it did harvester-repo-images-v1.2.1.tar.zst and then it got part way through rancher-images-v2.7.5.tar.zst before locking up. Comparing the start time from the log (07:17:35) with the mtime on the last file (08:07:28) gives 50 minutes. Note that the length of most progress lines printed in the strace output is 46 bytes. 46 bytes * 60 lines per minute * 50 minutes == 138000, which is a bit over 128K. Is that a plausible size for a STDOUT buffer? Maybe :-) Whatever the case, I certainly reproduced the stalled rsync.

I also re-ran the install after removing --progress from the rsync invocations and it completed just fine (sure, it took a couple of hours, but it did complete), demonstrating that this change works.

To see if I could fix the progress display I experimented a bit with stdbuf, e.g.: stdbuf -oL rsync -ah --progress [...] and stdbuf -o0 rsync -ah --progress [...], but these don't work to unbuffer the output. It turns out the problem is the way we're reading stdout from harv-install.

Here's where we invoke harv-install:

if err := execute(ctx, g, env, "/usr/sbin/harv-install"); err != nil {

The execute() function in turn calls printToPanelAndLog():

func printToPanelAndLog(g *gocui.Gui, panel string, logPrefix string, reader io.Reader, lock *sync.Mutex) {
scanner := bufio.NewScanner(reader)
for scanner.Scan() {
logrus.Infof("%s: %s", logPrefix, scanner.Text())
lock.Lock()
printToPanel(g, scanner.Text(), panel)
lock.Unlock()
}
}

Note that we're using bufio.Scanner.Scan, which by default will read line-by-line, splitting on '\n'. The rsync --progress output uses '\r' to go back to the start of each line as it updates the progress display, but only emits the final '\n' once each file copy is complete. That's why we see nothing in the console while the copy is in progress, and is also why the final output logged to /var/log/console.log is a giant pile of progress junk.

I'm going to open a separate issue for enhancing the printToPanelAndLog() function to handle progress output, but in the meantime, let's take this PR given that it does fix the problem described. My one suggestion is to maybe tweak this PR to use rsync -ahv [...], because then we get a little summary information, including the transfer speed, once each rsync is complete, e.g.:

time="2024-02-02T08:45:14Z" level=info msg="[stdout]: Copying RKE2 images to the target location..."
time="2024-02-02T08:45:19Z" level=info msg="[stdout]: sending incremental file list"
time="2024-02-02T08:45:19Z" level=info msg="[stdout]: rke2-images.linux-amd64-v1.25.9-rke2r1.tar.zst"
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: "
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: sent 958.79M bytes  received 35 bytes  999.26K bytes/sec"
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: total size is 958.56M  speedup is 1.00"
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: Copying remaining images temporary location..."
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: sending incremental file list"
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: ./"
time="2024-02-02T09:01:18Z" level=info msg="[stdout]: harvester-images-v1.2.1.tar.zst"
time="2024-02-02T09:33:45Z" level=info msg="[stdout]: harvester-repo-images-v1.2.1.tar.zst"
time="2024-02-02T09:34:43Z" level=info msg="[stdout]: rancher-images-v2.7.5.tar.zst"
time="2024-02-02T10:01:24Z" level=info msg="[stdout]: rke2-images-multus.linux-amd64-v1.25.9-rke2r1.tar.zst"
time="2024-02-02T10:04:51Z" level=info msg="[stdout]: rke2-images.linux-amd64-v1.25.9-rke2r1.tar.zst"
time="2024-02-02T10:20:52Z" level=info msg="[stdout]: "
time="2024-02-02T10:20:52Z" level=info msg="[stdout]: sent 4.73G bytes  received 114 bytes  991.30K bytes/sec"
time="2024-02-02T10:20:52Z" level=info msg="[stdout]: total size is 4.73G  speedup is 1.00"

OTOH... Do we really care about logging how fast these transfers are? WDYT @starbops, @bk201?

@petiepooo
Copy link
Contributor Author

I'm glad to hear you were able to reproduce it, and that the PR as written resolves it.

I'm ok with updating rsync options from -a to -ahv if that's the consensus, although the diagnostics it adds are not that helpful IMO. An actual progress would be better so thank you for looking into printToPanelAndLog updates.

@starbops
Copy link
Member

starbops commented Feb 5, 2024

@tserong Thanks for the write-up. It pointed out the root cause very clearly. I'm wondering if the --outbuf=L flag of rsync would solve the issue (stalled rsync) and still provide the information that could benefit us.

       --outbuf=MODE
              This  sets  the  output  buffering mode.  The mode can be None (aka Unbuffered), Line, or Block (aka Full).  You may specify as little as a single letter for the
              mode, and use upper or lower case.

              The main use of this option is to change Full buffering to Line buffering when rsync’s output is going to a file or pipe.

@tserong
Copy link
Contributor

tserong commented Feb 7, 2024

@starbops I've now tested both --outbuf=L and --outbuf=N, but alas these don't solve the problem, i.e. I get the hang in both cases. Interestingly, --outbuf=N results in two calls to write() for each line of output:

write(1, "\r", 1)                       = 1
write(1, "        823.16M  52%  976.87kB/s"..., 45) = 45
write(1, "\r", 1)                       = 1
write(1, "        824.21M  52%  976.63kB/s"..., 45) = 45
write(1, "\r", 1)                       = 1
write(1, "        825.26M  52%  976.63kB/s"..., 45) = 45
write(1, "\r", 1)                       = 1
write(1, "        826.31M  52%  976.87kB/s"..., 45) = 45

...but there's no additional '\n' characters, so we still get stuck.

Copy link
Member

@starbops starbops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@petiepooo Thank you for the contribution! There's just one nit that needs your help. Please see the following.

Comment on lines 164 to 167
rsync -ah --progress ${ISOMNT}/bundle/harvester/images/rke2-images.*.tar.zst $TARGET/$RKE2_IMAGES_DIR
rsync -a ${ISOMNT}/bundle/harvester/images/rke2-images.*.tar.zst $TARGET/$RKE2_IMAGES_DIR
echo "Copying remaining images temporary location..."
rsync -ah --progress ${ISOMNT}/bundle/harvester/images/ $TARGET/$TMP_IMAGES_DIR
rsync -ah --progress ${ISOMNT}/bundle/harvester/images-lists/ $TARGET/$IMAGES_LISTS_DIR
rsync -a ${ISOMNT}/bundle/harvester/images/ $TARGET/$TMP_IMAGES_DIR
rsync -a ${ISOMNT}/bundle/harvester/images-lists/ $TARGET/$IMAGES_LISTS_DIR
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be helpful if you could add the missing -hv flags for rsync calls. Thank you!

Human-readable and verbose flags added per developer request.
@petiepooo
Copy link
Contributor Author

Done.

Thank you both for your work on this.

Copy link
Member

@starbops starbops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you! @petiepooo

@tserong tserong merged commit d6947e0 into harvester:master Feb 7, 2024
5 checks passed
@tserong
Copy link
Contributor

tserong commented Feb 7, 2024

@Mergifyio backport v1.2

Copy link

mergify bot commented Feb 7, 2024

backport v1.2

✅ Backports have been created

mergify bot pushed a commit that referenced this pull request Feb 7, 2024
* silent rsync progress in harv-install

Because this is running without a TTY, it may be affecting IPMI-based installs that take hours to read the zst files off of virtual media.  This patch removes the progress flag from rsync in the hopes that it prevents rsync from freezing during install.

* restore -h and add -v flags

Human-readable and verbose flags added per developer request.

(cherry picked from commit d6947e0)
bk201 pushed a commit that referenced this pull request Feb 20, 2024
* silent rsync progress in harv-install

Because this is running without a TTY, it may be affecting IPMI-based installs that take hours to read the zst files off of virtual media.  This patch removes the progress flag from rsync in the hopes that it prevents rsync from freezing during install.

* restore -h and add -v flags

Human-readable and verbose flags added per developer request.

(cherry picked from commit d6947e0)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants