bootcode.bin randomly doesn't PXE boot correctly. #764

Closed
ali1234 opened this issue Mar 15, 2017 · 40 comments

@ali1234

ali1234 commented Mar 15, 2017

I am using the latest version of bootcode.bin: https://github.com/raspberrypi/firmware/blob/f85646a8831d9579c2a745478149598da1ecfde5/boot/bootcode.bin

It is the only file on my SD card. I am using a Raspberry Pi 3.

Sometimes (but rarely) PXE boot works and sometimes it does not. I have to power cycle the Pi several times to make it boot.

Looking at the failed tcpdump log you can see that dnsmasq is replying to the boot request, but the Pi ignores it and sends another, for a total of 5 requests. Then it tries to request tftp files from 0.0.0.0.

dnsmasq log of failed session:

dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...

tcpdump port tftp or port bootpc from failed session:

07:46:27.892426 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
07:46:27.892915 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..V....@............C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
07:46:32.893294 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
07:46:32.893647 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..V....@..v.........C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
07:46:38.640376 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
07:46:38.640795 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..V.
..@.. .........C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
07:46:44.639980 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
07:46:44.640339 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..V.a..@............C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
07:46:50.639971 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
07:46:50.640411 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..V.b..@............C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
07:46:56.657068 IP 0.0.0.0.49153 > 0.0.0.0.tftp:  29 RRQ "autoboot.txt" octet tsize 0
E..9......:............E.%....autoboot.txt.octet.tsize.0.
07:46:57.657080 IP 0.0.0.0.49154 > 0.0.0.0.tftp:  27 RRQ "config.txt" octet tsize 0
E..7......:............E.#....config.txt.octet.tsize.0.
07:46:58.657351 IP 0.0.0.0.49155 > 0.0.0.0.tftp:  29 RRQ "recovery.elf" octet tsize 0
E..9......:............E.%....recovery.elf.octet.tsize.0.
07:46:59.657412 IP 0.0.0.0.49156 > 0.0.0.0.tftp:  26 RRQ "start.elf" octet tsize 0
E..6......:............E."....start.elf.octet.tsize.0.
07:47:00.773516 IP 0.0.0.0.49157 > 0.0.0.0.tftp:  26 RRQ "fixup.dat" octet tsize 0
E..6......:............E."....fixup.dat.octet.tsize.0.
@ghollingworth
Contributor

You're using a proxy server for the DHCP proxy and TFTP boot. Do you also have a standard DHCP server replying with an IP address?

From the output it looks like you've got everything set up correctly (option 43 looks fine). The boot will only continue if it finds both an option 43, which tells it that the DHCP server is also going to serve the files, and an IP address. From the information above, no IP address has been offered.

Gordon
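
A minimal sketch of the condition described above, assuming the client simply needs to have seen both an offered IP address and an option 43 across the replies it receives. The `DhcpReply` class, the `ready_to_tftp` function and the example address are hypothetical, for illustration only; this is not the actual bootcode logic.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class DhcpReply:
    """Hypothetical model holding only the two fields this sketch cares about."""
    your_ip: Optional[str]   # address offered to the client, if any
    has_option_43: bool      # vendor-encapsulated PXE options present


def ready_to_tftp(replies: Iterable[DhcpReply]) -> bool:
    """Return True once both an offered IP and an option 43 have been seen."""
    got_ip = False
    got_option_43 = False
    for reply in replies:
        if reply.your_ip and reply.your_ip != "0.0.0.0":
            got_ip = True
        if reply.has_option_43:
            got_option_43 = True
    return got_ip and got_option_43


# The proxy reply carries option 43 but no address (illustrative values).
proxy_only = [DhcpReply(your_ip=None, has_option_43=True)]
# A normal DHCP server adds the address but no PXE options.
proxy_and_router = proxy_only + [DhcpReply(your_ip="192.168.0.114", has_option_43=False)]

print(ready_to_tftp(proxy_only))        # False
print(ready_to_tftp(proxy_and_router))  # True
```

In this model a capture that only ever shows the proxy reply would never satisfy the condition, which matches the observation that no IP address was offered.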

@ali1234
Author

ali1234 commented Mar 15, 2017

My ADSL router is a DHCP server, yes.

@ghollingworth
Contributor

So can you dump that DHCP response as well?

@ali1234
Author

ali1234 commented Mar 15, 2017

I ran "sudo tcpdump -A -i eth0 port tftp or port bootpc or port bootps or port 546 or port 547"

I did not capture any DHCP requests from the pxeserver to the main router. Output was the same as before.

@ghollingworth
Contributor

To debug this what I would do is to use the managed switch I have on my desk to mirror all traffic to a port with a Raspberry Pi on it... I'm wondering whether it's a STP problem with the switch in the router?

One thing we're going to do soon is to add the ability to provide serial debug from the bootcode, which should help with this...

Gordon

@ali1234
Author

ali1234 commented Mar 15, 2017

I have a dump from the router itself now:

non working:

08:42:03.321057 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
08:42:08.321401 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
08:42:08.322173 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
08:42:11.069915 IP router.lan.bootps > xxx.xxx.xxx.113.bootpc: BOOTP/DHCP, Reply, length 343
08:42:14.070079 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
08:42:14.070707 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
08:42:14.071108 IP router.lan.bootps > xxx.xxx.xxx.113.bootpc: BOOTP/DHCP, Reply, length 343
08:42:20.069788 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
08:42:20.070493 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
08:42:20.070818 IP router.lan.bootps > xxx.xxx.xxx.113.bootpc: BOOTP/DHCP, Reply, length 343
08:42:25.069862 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
08:42:25.070624 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
08:42:25.070893 IP router.lan.bootps > xxx.xxx.xxx.113.bootpc: BOOTP/DHCP, Reply, length 343
08:42:31.086955 IP 0.0.0.0.49153 > 0.0.0.0.tftp:  29 RRQ "autoboot.txt" octet tsize 0
08:42:32.086983 IP 0.0.0.0.49154 > 0.0.0.0.tftp:  27 RRQ "config.txt" octet tsize 0
08:42:33.087249 IP 0.0.0.0.49155 > 0.0.0.0.tftp:  29 RRQ "recovery.elf" octet tsize 0
08:42:34.087316 IP 0.0.0.0.49156 > 0.0.0.0.tftp:  26 RRQ "start.elf" octet tsize 0
08:42:35.087460 IP 0.0.0.0.49157 > 0.0.0.0.tftp:  26 RRQ "fixup.dat" octet tsize 0
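
For anyone comparing their own captures against the non-working one above, here is a rough sketch that counts DHCP requests and replies in tcpdump's one-line summaries and flags TFTP read requests sent to 0.0.0.0. The `summarize` helper and its regular expression are assumptions written for this exact output format; the sample lines are copied from the capture above.

```python
import re

CAPTURE = """\
08:42:14.070079 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
08:42:14.070707 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
08:42:14.071108 IP router.lan.bootps > xxx.xxx.xxx.113.bootpc: BOOTP/DHCP, Reply, length 343
08:42:20.069788 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
08:42:31.086955 IP 0.0.0.0.49153 > 0.0.0.0.tftp:  29 RRQ "autoboot.txt" octet tsize 0
"""

# timestamp, "IP", source, ">", destination, ":", remainder of the summary
LINE = re.compile(r"^\S+ IP (?P<src>\S+) > (?P<dst>\S+): (?P<rest>.*)$")


def summarize(text):
    """Count DHCP requests/replies and collect TFTP requests sent to 0.0.0.0."""
    requests = 0
    replies = 0
    tftp_to_zero = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # skip payload lines or anything in another format
        rest = match.group("rest")
        if "BOOTP/DHCP, Request" in rest:
            requests += 1
        elif "BOOTP/DHCP, Reply" in rest:
            replies += 1
        elif match.group("dst") == "0.0.0.0.tftp":
            name = re.search(r'RRQ "([^"]+)"', rest)
            if name:
                tftp_to_zero.append(name.group(1))
    return {"requests": requests, "replies": replies, "tftp_to_zero": tftp_to_zero}


print(summarize(CAPTURE))
```

Several requests that were each answered, followed by entries in `tftp_to_zero`, would correspond to the failure mode reported here.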

@ali1234
Author

ali1234 commented Mar 15, 2017

Someone just turned on a Windows laptop on the network and now it started working correctly:

dnsmasq:

dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 available DHCP subnet: xxx.xxx.xxx.255/255.255.255.0
dnsmasq-dhcp: 653460281 vendor class: PXEClient:Arch:00000:UNDI:002001
dnsmasq-dhcp: 653460281 PXE(enp0s31f6) b8:27:eb:xx:xx:xx proxy
dnsmasq-dhcp: 653460281 tags: enp0s31f6
dnsmasq-dhcp: 653460281 broadcast response
dnsmasq-dhcp: 653460281 sent size:  1 option: 53 message-type  2
dnsmasq-dhcp: 653460281 sent size:  4 option: 54 server-identifier  xxx.xxx.xxx.5
dnsmasq-dhcp: 653460281 sent size:  9 option: 60 vendor-class  50:58:xx:xx:xx:xx:xx:xx:xx
dnsmasq-dhcp: 653460281 sent size: 17 option: 97 client-machine-id  00:f5:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-dhcp: 653460281 sent size: 32 option: 43 vendor-encap  06:01:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx:xx...
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/autoboot.txt not found
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/config.txt to xxx.xxx.xxx.114
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery.elf not found
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/start.elf to xxx.xxx.xxx.114
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/fixup.dat to xxx.xxx.xxx.114
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery.elf not found
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/config.txt to xxx.xxx.xxx.114
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/dt-blob.bin not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery.elf not found
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/config.txt to xxx.xxx.xxx.114
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/bootcfg.txt not found
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/cmdline.txt to xxx.xxx.xxx.114
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery8.img not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery8-32.img not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery7.img not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/recovery.img not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/kernel8.img not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/kernel8-32.img not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/armstub8.bin not found
dnsmasq-tftp: error 0 Early terminate received from xxx.xxx.xxx.114
dnsmasq-tftp: failed sending /home/al/pi-tftp/tftpboot/kernel7.img to xxx.xxx.xxx.114
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/armstub8-32.bin not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/armstub7.bin not found
dnsmasq-tftp: file /home/al/pi-tftp/tftpboot/armstub.bin not found
dnsmasq-tftp: sent /home/al/pi-tftp/tftpboot/kernel7.img to xxx.xxx.xxx.114

router, tcpdump -i br0 port tftp or port bootpc or port bootps or port 546 or port 547:

09:00:40.285719 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
09:00:45.725710 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
09:00:45.726339 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
09:00:45.727086 IP router.lan.bootps > rootfs.lan.bootpc: BOOTP/DHCP, Reply, length 343

pxeserver, tcpdump -A -i enp0s31f6 port tftp or port bootpc or port bootps or port 546 or port 547:

09:00:38.856729 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
09:00:38.857410 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..VB...@.uG.........C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
09:00:44.297417 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from b8:27:eb:xx:xx:xx (oui Unknown), length 322
E..^......9..........D.C.J......&..9.....................'..............................................................................................................................................................................................................c.Sc5..7.+<C........B..]...^....a..................< PXEClient:Arch:00000:UNDI:002001.
09:00:44.297964 IP pxeserver.lan.bootps > 255.255.255.255.bootpc: BOOTP/DHCP, Reply, length 314
E..VD5..@.s..........C.D.B......&..9.....................'..............................................................................................................................................................................................................c.Sc5..6.....<	PXEClienta..................+ ...
..PXE	....Raspberry Pi Boot..
09:00:44.316425 IP rootfs.lan.49153 > pxeserver.lan.tftp:  29 RRQ "autoboot.txt" octet tsize 0
E..9...........r.......E.%....autoboot.txt.octet.tsize.0.

@ali1234
Author

ali1234 commented Mar 15, 2017

Also I can run "brctl showstp br0" on my router. It shows a lot of information but I don't really know what I am looking for.

@ghollingworth
Contributor

It looks like a problem we had before where we needed a broadcast packet to trigger the receiving of one of the packets (which is why turning on the Windows machine suddenly starts it...)

Be interesting to see if the receipt of the second DHCP reply was triggered by a broadcast packet...

@ali1234
Author

ali1234 commented Mar 15, 2017

That is probably it. The Windows machine is spamming a lot of autoconfig junk continuously on both IPv4 and IPv6. The Pi and the pxe server are both connected directly to the router switch, but the windows machine is behind a second (unmanaged) switch. None of the machines are on wifi, but the router is bridging ethernet and two wifi radios.

@ghollingworth
Contributor

I might have to see if it's possible the USB->ETH bridge is holding onto a packet for some reason, or that I've dropped a packet... But can't understand why this would be...

In the past we've found doing an occasional broadcast ping will fix the problem...

@ali1234
Author

ali1234 commented Mar 19, 2017

I found it not working again this morning. Broadcast ping brought it to life:

ping -b 192.168.0.255
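
If you want to build the same broadcast ping for another network, the broadcast address can be derived from the interface address and prefix. This is a small sketch using Python's standard `ipaddress` module; the `broadcast_ping_command` helper and the host address 192.168.0.5/24 are assumptions for illustration (the /24 matches the 255.255.255.0 netmask in the dnsmasq log).

```python
import ipaddress


def broadcast_ping_command(interface_cidr: str) -> str:
    """Build a 'ping -b' command for the network the interface belongs to."""
    network = ipaddress.ip_interface(interface_cidr).network
    return f"ping -b {network.broadcast_address}"


print(broadcast_ping_command("192.168.0.5/24"))  # ping -b 192.168.0.255
```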

@puck

puck commented Apr 28, 2017

I'm seeing this roughly one time in 3 when I network boot my RPi 3, tcpdump from the DHCP server:

00:09:54.601333 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from b8:27:eb:52:63:05, length 320
00:09:54.602403 IP 10.1.0.251.67 > 10.1.0.203.68: BOOTP/DHCP, Reply, length 366
00:09:55.861047 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from b8:27:eb:52:63:05, length 320
00:09:55.862176 IP 10.1.0.251.67 > 10.1.0.203.68: BOOTP/DHCP, Reply, length 366
00:09:56.894372 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from b8:27:eb:52:63:05, length 320
00:09:56.895385 IP 10.1.0.251.67 > 10.1.0.203.68: BOOTP/DHCP, Reply, length 366
00:09:57.915143 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from b8:27:eb:52:63:05, length 320
00:09:57.916157 IP 10.1.0.251.67 > 10.1.0.203.68: BOOTP/DHCP, Reply, length 366
00:09:59.606050 IP 0.0.0.0.68 > 255.255.255.255.67: BOOTP/DHCP, Request from b8:27:eb:52:63:05, length 320
00:09:59.607080 IP 10.1.0.251.67 > 10.1.0.203.68: BOOTP/DHCP, Reply, length 366

I find it can take several power-off reboots of the RPi before it works correctly. When it does, it makes only one DHCP request.

I don't have any network switches on my home network which have mirror ports.

@ghollingworth
Contributor

@puck Is your test with the bootcode.bin as a single file on the SD card?

@puck

puck commented May 1, 2017

Hey @ghollingworth, no, I'm network booting and loading it from a TFTP server. I have no SD card in my RPi 3.

@puck

puck commented May 1, 2017

I've realised I can make a poor man's wire tap using a laptop and two USB ethernet dongles. If you'd like a traffic capture, I should be able to do that tonight.

@ghollingworth
Contributor

ghollingworth commented May 1, 2017 via email

@puck

puck commented May 2, 2017

@ghollingworth putting bootcode.bin (only) on an SD card makes the boot process reliable - I rebooted about 10 times with no failures. However, it no longer looks for any files inside the serial number directory; the first boot hung because I had them all in the subdirectory. I had to symlink all the other files into my TFTP root for the boot process to succeed.

@pelwell
Contributor

pelwell commented May 2, 2017

@puck That is what I would expect to see from a bootcode.bin not built in the last week. Try a more recent one: https://github.com/Hexxeh/rpi-firmware/blob/170150d2210a3bb1801ae165d54794101f28fc54/bootcode.bin

@puck

puck commented May 2, 2017

Heh, yes, I just found bug #754 and tested again with the newest bootcode.bin - it now works with the serial directory. I was hoping to update this issue before someone responded. ;)

@pelwell
Contributor

pelwell commented May 2, 2017

Sorry - you caught me on a good day. ;)

@tvk7

tvk7 commented May 18, 2017

Is there a workaround apart from putting bootcode.bin on an SD card? For me it also looks like the Pi tries several times to get a DHCP offer, but it always discards the reply and never reaches out to the TFTP server. Could this be a DHCP implementation issue? Broadcasts happen a lot on my network.

@ghollingworth
Contributor

If you're using bootcode.bin (and only that) on an SD card then it is using the fixed version of the code...

If it is ignoring the reply then that will be because the offer doesn't contain the TFTP server address in a suitably understandable manner. Can you tcpdump (or wireshark) the reply?

Thanks

@tvk7

tvk7 commented May 19, 2017

Here is a tcpdump; hopefully there is just an option missing.
The addresses are statically assigned. What I saw in other discussions is that when you assign addresses from an address pool, the server makes some checks, which results in a delayed DHCP offer that the Pi then accepts?

BOOTP/DHCP, Request from b8:27:eb:eb:cc:6e, length: 320, hops:1, xid:0x26f30339, flags: [none]
Gateway IP: 10.11.108.4
Client Ethernet Address: b8:27:eb:eb:cc:6e
Vendor-rfc1048:
DHCP:DISCOVER
PR:VO+VC+BF+T128+T129+T130+T131+T132+T133+T134+T135+TFTP
ARCH:0
NDI:1.2.1
GUID:0.68.68.68.68.68.68.68.68.68.68.68.68.68.68.68.68
VC:"PXEClient:Arch:00000:UNDI:002001"
16:02:36.394082 00:50:56:b7:34:a8 > 00:00:5e:00:01:0b, ethertype IPv4 (0x0800), length 342: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: UDP (17), length: 328) 10.11.4.11.bootps > 10.11.108.4.bootps: BOOTP/DHCP, Reply, length: 300, hops:1, xid:0x26f30339, flags: [none]
Your IP: 10.11.108.226
Server IP: 10.11.5.141
Gateway IP: 10.11.108.4
Client Ethernet Address: b8:27:eb:eb:cc:6e
Vendor-rfc1048:
DHCP:OFFER
SID:10.11.4.11
LT:900
VO:82.97.115.112.98.101.114.114.121.32.80.105.32.66.111.111.116.32.32.32
TFTP:"10.11.5.141"
SM:255.255.252.0
16:02:37.430627 00:04:96:8b:bd:ad > 00:50:56:b7:34:a8, ethertype IPv4 (0x0800), length 362: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: UDP (17), length: 348) 10.11.108.4.bootps > 10.11.4.11.bootps: BOOTP/DHCP, Request from b8:27:eb:eb:cc:6e, length: 320, hops:1, xid:0x26f30339, flags: [none]
Gateway IP: 10.11.108.4
Client Ethernet Address: b8:27:eb:eb:cc:6e
Vendor-rfc1048:
DHCP:DISCOVER
PR:VO+VC+BF+T128+T129+T130+T131+T132+T133+T134+T135+TFTP
ARCH:0
NDI:1.2.1
GUID:0.68.68.68.68.68.68.68.68.68.68.68.68.68.68.68.68
VC:"PXEClient:Arch:00000:UNDI:002001"

16:02:37.431011 00:50:56:b7:34:a8 > 00:00:5e:00:01:0b, ethertype IPv4 (0x0800), length 342: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: UDP (17), length: 328) 10.11.4.11.bootps > 10.11.108.4.bootps: BOOTP/DHCP, Reply, length: 300, hops:1, xid:0x26f30339, flags: [none]
Your IP: 10.11.108.226
Server IP: 10.11.5.141
Gateway IP: 10.11.108.4
Client Ethernet Address: b8:27:eb:eb:cc:6e
Vendor-rfc1048:
DHCP:OFFER
SID:10.11.4.11
LT:900
VO:82.97.115.112.98.101.114.114.121.32.80.105.32.66.111.111.116.32.32.32
TFTP:"10.11.5.141"
SM:255.255.252.0
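
As a side note for reading the dump above: the `VO:` field is printed as dot-separated decimal bytes. Assuming it is the vendor option payload rendered byte by byte, it can be decoded to text with a few lines of Python. The `decode_dotted_bytes` helper is hypothetical; the byte string is copied from the offer above.

```python
def decode_dotted_bytes(dotted: str) -> str:
    """Turn a dot-separated list of decimal byte values into ASCII text."""
    return bytes(int(part) for part in dotted.split(".")).decode("ascii")


VO = "82.97.115.112.98.101.114.114.121.32.80.105.32.66.111.111.116.32.32.32"
print(repr(decode_dotted_bytes(VO)))  # 'Raspberry Pi Boot   '
```

The decoded text is the same "Raspberry Pi Boot" string that is visible in the dnsmasq proxy replies earlier in this issue, padded with three trailing spaces.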

@ghollingworth
Contributor

Looks to me like you've got the server and client on different subnets; the Raspberry Pi bootrom doesn't support this. The bootcode.bin option does, though.
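
The subnet observation can be checked against the values in the offer above (Your IP 10.11.108.226, SM 255.255.252.0, TFTP 10.11.5.141). A minimal sketch with Python's standard `ipaddress` module; the `same_subnet` helper is a hypothetical name used only for this illustration.

```python
import ipaddress


def same_subnet(client_ip: str, netmask: str, server_ip: str) -> bool:
    """Return True if server_ip lies inside the client's subnet."""
    # strict=False lets us pass a host address and derive its network
    client_net = ipaddress.ip_network(f"{client_ip}/{netmask}", strict=False)
    return ipaddress.ip_address(server_ip) in client_net


print(same_subnet("10.11.108.226", "255.255.252.0", "10.11.5.141"))  # False
```

With that netmask the client's network is 10.11.108.0/22, so the TFTP server at 10.11.5.141 is outside it.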

@tvk7

tvk7 commented May 24, 2017

Yes, works with single bootcode.bin

@ali1234
Author

ali1234 commented Jul 30, 2017

This still does not work for me. With the latest bootcode.bin the results are exactly the same: dnsmasq receives the request and sends the response, and the Pi ignores it, five times, then it stops.

@andig

andig commented Aug 27, 2017

This still does not work for me. With the latest bootcode.bin the results are exactly the same: dnsmasq receives the request and sends the response, and the Pi ignores it, five times, then it stops.

Checking in here after experiencing exactly the same problem with brand new pi3 with latest firmware update here https://www.raspberrypi.org/forums/viewtopic.php?f=28&t=191778&p=1203002#p1203002

Reading through this issue it appears that it's not solved unless an SD card with bootcode.bin is used. Is this the expected behaviour?

@ali1234
Author

ali1234 commented Feb 15, 2018

@andig in my case it does not work reliably even with updated bootcode.bin on an SD card. However, setting dhcp-reply-delay=1 in dnsmasq.conf does help a bit. Sometimes it still takes several tries before it works though.

@ghollingworth
Contributor

DHCP-reply-delay is required in some cases because there is a bug in the silicon such that if it receives both the DHCP reply and the tftpboot server address in less than 2 seconds then the device will lock up forever.

bootcode.bin does not suffer from this problem

But it's possible something else is going wrong or the packets are being dropped by the switch

@bunyevacz

bunyevacz commented Jul 5, 2018

I have built a 30 raspberry pi setup with one 24-port PoE-enabled managed switch (ES-24-250W) and two unmanaged PoE switches (TP-Link TL-SG1008P, TP-Link TL-SF1008P).

I have a raspberry pi acting as master:

  • wifi -> ethernet bridge for internet connectivity
  • DHCP server
  • TFTP server
  • NFS server

And I have 29 raspberry pi 3s, without any SD cards, so they boot entirely from the network and get their power from PoE. All the raspberry pis have the official touchscreen (even the master).

I can switch each port of the managed PoE switch on and off. This is essentially the same as plugging in and unplugging the cable.

Here are my experiences (22 raspberry pi clients, 1 raspberry pi server, 1 laptop):

  1. Switching on all the raspberry pis at the same time from the PoE switch, i.e. within half a second or less.
    3 raspberry pis boot up fine (usually between 1-5)
    14 raspberry pis are stuck at the rainbow screen (usually about half the pis)
    5 raspberry pis stay black (and consume 1.5W or less)

If the screen is on, the raspberry pi consumes between 5.5-7W.

  2. Apply dhcp-reply-delay=1 to dnsmasq.conf
    Almost all the raspberry pis get to the rainbow at least. Usually:
    10 raspberry pis finish booting
    10 raspberry pis are stuck at rainbow
    2 raspberry pis stay at a black screen.

  3. Apply dhcp-reply-delay=1, and power each raspberry pi with a 10 sec delay
    The managed switch applies power (via a bash script from my laptop) to each PoE interface 10 sec apart.

Almost all raspberry pis boot up fine. Worst case was 20/22.
Once a raspberry pi boots up, it sends a heartbeat message to the master. The master figures out which raspberry pis failed to boot, and power-cycles them via the managed switch (the switch can report which MAC address belongs to which physical interface).
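
The staggered power-on and heartbeat check described above could be sketched roughly as follows. This is only an outline of the bookkeeping, assuming a 10 second spacing and a port-to-MAC table obtained from the switch; the function names, port numbers and MAC addresses are hypothetical, and the actual commands to drive the PoE switch are left out because they depend on the switch.

```python
from typing import Dict, Iterable, List


def power_on_offsets(ports: Iterable[int], delay_s: float = 10.0) -> Dict[int, float]:
    """Seconds after start at which each PoE port should be powered on."""
    return {port: index * delay_s for index, port in enumerate(ports)}


def ports_to_power_cycle(mac_by_port: Dict[int, str], heartbeats: Iterable[str]) -> List[int]:
    """Ports whose attached Pi has not sent a heartbeat yet."""
    seen = {mac.lower() for mac in heartbeats}
    return sorted(port for port, mac in mac_by_port.items() if mac.lower() not in seen)


# Hypothetical port-to-MAC table, as reported by the managed switch.
mac_by_port = {
    1: "b8:27:eb:00:00:01",
    2: "b8:27:eb:00:00:02",
    3: "b8:27:eb:00:00:03",
}

print(power_on_offsets(mac_by_port))  # {1: 0.0, 2: 10.0, 3: 20.0}
# Only the Pis on ports 1 and 3 reported in, so port 2 gets power-cycled.
print(ports_to_power_cycle(mac_by_port, ["B8:27:EB:00:00:01", "b8:27:eb:00:00:03"]))  # [2]
```

Keeping the schedule and the failure detection as plain functions makes it easy to plug in whatever switch control and heartbeat transport a given setup uses.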

The timing is as follows:
0-16s: black screen
7s: ethernet port starts to blink
16-18s: rainbow
50s: finished booting into text autologin, and starts the X server
1m3s: X server started, it is now completely grey
1m15s: chromium started in kiosk mode. Boot finished.

I've been on this problem (unreliable starting) for a week now. I was generating ICMPv6 packets because I believed this helped: if I start a computer or a raspberry pi with an SD card, then all the other raspberry pis are more likely to start up.
And judging from tcpdump, I believed ICMPv6 packets made a difference.
In reality it turns out that all this does is keep dnsmasq occupied, which makes the response time go up, which makes it more likely that a raspberry pi starts.

I'm now optimizing boot time from NFS by disabling all services which are not needed. But the hard part was figuring out the unreliability. I think a pointer on the official site or in the netboot tutorial would be more appropriate than the rather vague "packets on the network helps booting up".
Also this

If it doesn't boot on the first attempt, keep trying. It can take a minute or so for 
the Raspberry Pi to boot, so be patient.

phrase is totally misleading and wrong. If a raspberry pi has not started (rainbow picture) after 20 sec, it is 99% sure that it will never start at all, no matter how patient you are.
I booted up the raspberry pis 400 times or more, and only one raspberry pi managed to start after the 1 minute mark, and only once.

When I started this adventure, I was totally unaware how untested this codepath is.:(

@aaronk6

aaronk6 commented Mar 8, 2019

@bunyevacz Thanks for the thorough analysis! I think I’m seeing the same issue here. Is there any news on this matter? Were you able to improve your setup in the meantime, or is the procedure you’ve outlined (dhcp-reply-delay + mechanism to power-cycle non-booting Pis) still the best option?

@puck

puck commented Mar 8, 2019

It would be interesting to know if the latest version of bootcode.bin helps things; I've found it has made my netboot RPis boot up much more reliably. Sadly, you do need to have it on an SD card in each RPi.

@aaronk6

aaronk6 commented Mar 9, 2019

Hi @puck, I just tried this (formatting an SD card with FAT32 and putting the latest bootcode.bin onto it) and I can confirm it works much better. I was able to boot up the Pi 5 times in a row from network which never worked before.

Do I understand this correctly that an older version of the bootcode.bin is burned into the Pi’s chip where it cannot be updated? So when a new Pi model is released some time in the future, can we assume this will have a newer bootcode burned into the chip which will solve this issue, so I can get rid of the SD card?

@puck

puck commented Mar 10, 2019

Hey @aaronk6, Good to hear that you had a positive result!

Yes, your understanding is correct. I'm not sure whether the version of bootcode.bin that is burned into the ROM is updated during each RPi model's lifetime or not; that'd certainly be interesting to know.

I figure that an SD card which contains only bootcode.bin is much less likely to be corrupted (it is only read on boot), and even if it is corrupted, I haven't lost anything, so it's not something I've stressed about having in my netboot RPis.

@aaronk6

aaronk6 commented Mar 10, 2019

Hi @puck, yeah, the “bootcode.bin SD card” is definitely a workaround I can live with for the time being. As you say, when the card is rarely read from and never written to, it shouldn’t suffer from corruption any time soon, and even if it does, it’s very easy to replace with a spare SD card.

Does anyone reading this know about the rollout procedure for the boot code in ROM?

@JamesH65
Contributor

Changing the on-chip ROM requires a respin of the chip (I think just the metal layer, but I'm not sure), and that is only done when a new chip stepping is produced, which is very rare and very expensive. So as you will have seen from Pi history, it only really changes when a new major model with a new chip comes out, so 1->2, 2->3, 3B->3B+. In the 3B->3B+ case, the change was from an A0 to a B0 chip. It would not be worth producing a new chip just to change the boot code because of the cost (est. $500k at least, IIRC).

@aaronk6

aaronk6 commented Mar 11, 2019

@JamesH65, thanks for the insights! So I guess I’ll re-try with the Pi 4 some time in 2020 🙂

@aaronk6

aaronk6 commented Jun 26, 2019

Did anyone try whether things have improved with the Raspberry Pi 4? I don’t have mine yet.

@pelwell
Contributor

pelwell commented Jun 26, 2019

Pi 4 doesn't network- (or USB-) boot yet, and when it does it won't use bootcode.bin, so it's off-topic.

9 participants