Network connection unreliable on nucleo_f767zi #77794

Open
jkrautmacher opened this issue Aug 30, 2024 · 4 comments
Assignees
Labels: area: Ethernet, bug, platform: STM32, priority: low, Waiting for response

Comments

@jkrautmacher
Contributor

jkrautmacher commented Aug 30, 2024

Describe the bug

While working on a fairly minimal Zephyr firmware for the nucleo_f767zi board, I noticed that the network connection is unreliable. As shown below, this is reproducible with the net/telnet sample.

The bug can be noticed by the following indicators:

  • ICMP and any other IP-based communication (IPv4 and IPv6) do not work
  • only the orange LED on the RJ45 connector of the board lights up, not the green one
  • the stm_eth thread has significantly lower stack usage (see shell output of kernel stacks)

The bug appears after roughly 60 % of boot processes, so the network works after only about 40 % of them (see the test results below). So far, network communication has never been observed to break later if it was operational directly after booting the board.

To Reproduce

Steps to reproduce the behavior:

  1. connect the nucleo_f767zi board with an Ethernet cable to a Linux PC
  2. connect the PC also via USB to the onboard ST-Link
  3. configure the static IPv4 address 192.0.2.2/24 on the PC
  4. install the st-info tool (see e.g. source repo)
  5. build and flash samples/net/telnet as described in the Getting Started Guide
  6. execute the Python script below without any arguments (sorry, GitHub did not allow me to attach a Python file)
#!/usr/bin/python3


import argparse
import subprocess
import time


HELP = """This script can be used to debug an issue where the Zephyr-based
firmware of a target microcontroller is not able to communicate via the network
after roughly every second boot process. The Ethernet-capable microcontroller
has to be connected directly to the Linux-based host PC via an Ethernet cable
and an ST-Link. Reboots are triggered via st-info from the open source stlink
tools (https://github.com/stlink-org/stlink). Connectivity is checked with ping."""


def main():
    args = parse_args()

    ok = 0
    not_ok = 0

    for i in range(args.iterations):
        print(f"Iteration #{i+1}")
        subprocess.run(["st-info", "--connect-under-reset"], check=True)
        time.sleep(args.delay)
        try:
            subprocess.run(
                ["ping", "-c", "1", args.address], check=True, stdout=subprocess.DEVNULL
            )
            print("ok")
            ok += 1
        except subprocess.CalledProcessError:
            print("not ok")
            not_ok += 1

    print(
        f"Success rate is {round(100 * ok / (ok + not_ok))} % ({ok} ok and {not_ok} failed)"
    )


def parse_args():
    parser = argparse.ArgumentParser(description=HELP)

    parser.add_argument(
        "-i",
        "--iterations",
        default=100,
        help="how often the test should be executed",
        type=int,
    )

    parser.add_argument(
        "-d",
        "--delay",
        default=10,
        help="how long to wait after reset for the ICMP request",
        type=int,
    )

    parser.add_argument(
        "-a",
        "--address",
        default="192.0.2.1",
        help="address of the microcontroller / target of ICMP request",
        type=str,
    )

    return parser.parse_args()


if __name__ == "__main__":
    main()
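
Since one of the indicators is the LED state on the RJ45 connector, it may help to record the link state as seen by the host next to each ping result. The following is a minimal sketch, assuming a Linux host and that the LED behaviour correlates with the negotiated link, which the report does not confirm. The function name `read_link_state` and the interface name `eth0` are hypothetical; the `carrier` attribute under `/sys/class/net/<interface>/` is a standard Linux sysfs file.

```python
from pathlib import Path


def read_link_state(interface, sysfs_root="/sys/class/net"):
    """Return "up", "down" or "unknown" for a host network interface.

    Reads <sysfs_root>/<interface>/carrier, which contains "1" when the
    host's network device reports a link and "0" when it does not.
    """
    carrier = Path(sysfs_root) / interface / "carrier"
    try:
        value = carrier.read_text().strip()
    except OSError:
        # The interface does not exist, or it is administratively down
        # (reading carrier then fails with an error).
        return "unknown"
    return {"1": "up", "0": "down"}.get(value, "unknown")


if __name__ == "__main__":
    # "eth0" is a placeholder; use the interface the board is cabled to.
    print(read_link_state("eth0"))
```

Calling this right before the ping in each iteration would show whether failed boots coincide with a missing link on the host side or whether the link is up and only IP traffic fails.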

Expected behavior

The script should report a 100 % success rate.

Impact

This bug is a showstopper. A firmware with such an unreliable network connection is useless.

Logs and console output

On my setup, the script summarizes a test with 100 iterations as follows:

  • Success rate is 44 % (44 ok and 56 failed) for zephyr v3.7.0
  • Success rate is 39 % (39 ok and 61 failed) for zephyr v3.6.0
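
To judge whether the two runs above point to a real difference between the releases, a confidence interval for each success rate can be computed. This is a sketch using the Wilson score interval at the 95 % level (z = 1.96); the choice of interval and the helper name `wilson_interval` are assumptions for illustration and not part of the original test script.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return the (low, high) Wilson score interval for a success proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Success counts taken from the two 100-iteration runs reported above.
for version, ok in (("v3.7.0", 44), ("v3.6.0", 39)):
    low, high = wilson_interval(ok, 100)
    print(f"{version}: {ok} % success, 95 % interval {100 * low:.0f} to {100 * high:.0f} %")
```

The two intervals overlap substantially (roughly 35 to 54 % and 30 to 49 %), so these runs alone do not indicate that v3.6.0 and v3.7.0 behave differently.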

Environment (please complete the following information):

  • OS: Arch Linux
  • Toolchain: Zephyr SDK 0.16.5
  • Zephyr version: v3.7.0 and v3.6.0
jkrautmacher added the "bug" label on Aug 30, 2024
@jkrautmacher
Contributor Author

Updated bug report after it turned out that Zephyr v3.6.0 is affected too.

jukkar added the "platform: STM32" label on Aug 30, 2024
erwango added the "priority: low" label on Sep 2, 2024
erwango assigned marwaiehm-st and unassigned erwango on Sep 2, 2024
@marwaiehm-st
Collaborator

marwaiehm-st commented Sep 4, 2024

The bug is reproduced on nucleo_f767zi but not on stm32f769i_disco, so I tried to compare the two:

  • I verified the Devicetree configuration of the Ethernet MAC; it is similar for both nucleo_f767zi and stm32f769i_disco:
		mac: ethernet@40028000 {
			compatible = "st,stm32-ethernet";
			reg = <0x40028000 0x8000>;
			interrupts = <61 0>;
			clock-names = "stmmaceth", "mac-clk-tx",
				      "mac-clk-rx", "mac-clk-ptp";
			clocks = <&rcc STM32_CLOCK_BUS_AHB1 0x02000000>,
				 <&rcc STM32_CLOCK_BUS_AHB1 0x04000000>,
				 <&rcc STM32_CLOCK_BUS_AHB1 0x08000000>,
				 <&rcc STM32_CLOCK_BUS_AHB1 0x10000000>;
			status = "disabled";
		};
  • I verified the clock settings for the Ethernet MAC, and they are correctly configured.

The difference:

  • STM32F769I-DISCO board includes additional components for PoE, such as the PM8800A PoE controller, transformers, and various passive components.
    [Image: Screenshot from 2024-09-04 12-00-06]

  • NUCLEO-F767ZI board lacks these components, which might affect the stability and performance of the Ethernet connection.

Impact on Ethernet:

  • The PoE circuit on the STM32F769I-DISCO board provides a stable power supply to the Ethernet PHY, which can improve the reliability of the Ethernet connection.
  • The NUCLEO-F767ZI board relies on a different power supply configuration, which might be less stable, leading to the observed success rate of roughly 40 %.

@FRASTM
Collaborator

FRASTM commented Sep 18, 2024

@jkrautmacher can you please confirm that?

erwango added the "Waiting for response" label on Sep 18, 2024
@jkrautmacher
Contributor Author

I would be glad to help but am unsure what to confirm. As far as I understand, the current theory is that the power supply of the PHY on nucleo_f767zi might not be stable enough during init, so that initialization fails.

If this is correct, my next debugging step would be to connect an oscilloscope to the supply voltage of the PHY and observe it during init. That, together with a debug GPIO toggled from the kernel code right before and after Ethernet init, should confirm or refute this theory. The next step would be to either fix the hardware design or add a workaround to the Zephyr kernel (longer delays or similar).

I am lacking two things to verify the theory:

  • an oscilloscope at home
  • detailed schematics of nucleo_f767zi

Is there more public information about the board than UM1974 Rev 10 that I may have overlooked? I could probably improve the oscilloscope situation, but it would likely be much faster if you could do the measurement at ST, if that is an option.
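
One check that needs no oscilloscope is to find out whether a failing boot never comes up or only comes up later than the fixed delay used by the test script. The sketch below polls with ping after a reset and reports the time until the first reply. It is an illustration under assumptions: the helper names `ping_once` and `seconds_until_reachable` are hypothetical, and the 60 s timeout and 1 s interval are arbitrary defaults. It also cannot say anything about the PHY supply voltage itself.

```python
import subprocess
import time


def ping_once(address):
    """Send one ICMP echo request; return True if a reply arrived."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def seconds_until_reachable(address, timeout=60.0, interval=1.0,
                            ping=ping_once, clock=time.monotonic,
                            sleep=time.sleep):
    """Poll until the first reply; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if ping(address):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)


if __name__ == "__main__":
    # Reset the board the same way the test script does, then poll.
    subprocess.run(["st-info", "--connect-under-reset"], check=True)
    print(seconds_until_reachable("192.0.2.1"))
```

If failing boots always return None, longer delays alone are unlikely to help; if they return a value larger than the script's delay, the failures may partly be a timing artefact of the test.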
