DLPX-86177 Azure Accelerated networking broken because Mellanox drivers absent in kernel #27


palash-gandhi
Contributor

@palash-gandhi palash-gandhi commented May 20, 2023

Problem

ESCL-4467 was filed because a customer saw no evidence of accelerated networking (AN) in their throughput. Other indications also suggested that the kernel was ignoring the new virtual device.

In 7.0, we disabled a number of kernel modules as part of "DLPX-83442 Disable various kernel modules which we don't use" (prakashsurya, Pull Request #14, delphix/linux-kernel-azure). That change also disabled the Mellanox drivers, which broke AN.

Solution

Re-enable the Mellanox modules required for AN.
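
One way to guard against a regression of this kind is to scan a built kernel's config file for the options that must stay enabled. The following is a minimal sketch, not part of this change: the function names are hypothetical, and the set of required options is an assumption that should be taken from the actual config change (the sample uses two of the MLX5 options shown under Testing Done).

```python
def parse_kernel_config(text):
    """Map each CONFIG_* option in a kernel config file to its value.

    Options written as '# CONFIG_FOO is not set' are recorded as 'n'.
    """
    options = {}
    suffix = " is not set"
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            options[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def disabled_options(options, required):
    """Return the required options that are neither built in (y) nor modules (m)."""
    return [name for name in required if options.get(name, "n") not in ("y", "m")]


# Hypothetical sample; a real check would read /boot/config-<kernel version>.
SAMPLE_CONFIG = """\
CONFIG_MLX5_CORE=m
CONFIG_MLX5_CORE_EN=y
# CONFIG_EXAMPLE_DRIVER is not set
"""

# Assumed requirement list, for illustration only.
REQUIRED = ["CONFIG_MLX5_CORE", "CONFIG_MLX5_CORE_EN"]

if __name__ == "__main__":
    missing = disabled_options(parse_kernel_config(SAMPLE_CONFIG), REQUIRED)
    print("missing options:", missing)
```

An option that is absent from the file is treated the same as one that is explicitly not set, since in both cases the driver is not built.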

Testing Done

ab-pre-push: http://selfservice.jenkins.delphix.com/job/appliance-build-orchestrator-pre-push/5508/

delphix@pg-develop-mlx-fix:~$ get-appliance-version
12.0.0.0-snapshot.20230520044431399+jenkins-selfservice-appliance-build-develop-pre-push-211

delphix@pg-develop-mlx-fix:~$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:fc:df:0c brd ff:ff:ff:ff:ff:ff
    inet 10.39.241.180/20 brd 10.39.255.255 scope global eth0
       valid_lft forever preferred_lft forever
3: enP38618s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 00:0d:3a:fc:df:0c brd ff:ff:ff:ff:ff:ff

delphix@pg-develop-mlx-fix:~$ ethtool -S eth0 | grep vf_
     vf_rx_packets: 523
     vf_rx_bytes: 73079
     vf_tx_packets: 799
     vf_tx_bytes: 166492
     vf_tx_dropped: 0
     ...

delphix@pg-develop-mlx-fix:~$ grep MLX5 /boot/config-5.4.0-1107-dx2023052002-113599c15-azure
CONFIG_MLX5_CORE=m
CONFIG_MLX5_ACCEL=y
CONFIG_MLX5_FPGA=y
CONFIG_MLX5_CORE_EN=y
CONFIG_MLX5_EN_ARFS=y
CONFIG_MLX5_EN_RXNFC=y
CONFIG_MLX5_MPFS=y
CONFIG_MLX5_ESWITCH=y
CONFIG_MLX5_CORE_EN_DCB=y
CONFIG_MLX5_CORE_IPOIB=y
CONFIG_MLX5_FPGA_IPSEC=y
CONFIG_MLX5_EN_IPSEC=y
CONFIG_MLX5_FPGA_TLS=y
CONFIG_MLX5_TLS=y
CONFIG_MLX5_EN_TLS=y
CONFIG_MLX5_SW_STEERING=y


delphix@pg-develop-mlx-fix:~$ ls /lib/modules/5.4.0-1107-dx2023052002-113599c15-azure/kernel/drivers/net/ethernet/mellanox/
mlx4  mlx5  mlxfw  mlxsw
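
The `ethtool -S eth0 | grep vf_` check above could be scripted for an automated test. This is a minimal sketch under stated assumptions: the function names are hypothetical, the counter lines are assumed to have the `name: value` form shown above, and non-zero `vf_rx_packets` and `vf_tx_packets` are taken as the signal that traffic is flowing through the virtual function.

```python
import re


def parse_vf_counters(ethtool_output):
    """Extract the vf_* counters from the text output of `ethtool -S <iface>`."""
    counters = {}
    for line in ethtool_output.splitlines():
        match = re.match(r"\s*(vf_\w+):\s*(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def vf_traffic_seen(counters):
    """True when both the receive and transmit VF packet counters are non-zero."""
    return counters.get("vf_rx_packets", 0) > 0 and counters.get("vf_tx_packets", 0) > 0


# Sample taken from the ethtool output in this description.
SAMPLE_OUTPUT = """\
     vf_rx_packets: 523
     vf_rx_bytes: 73079
     vf_tx_packets: 799
     vf_tx_bytes: 166492
     vf_tx_dropped: 0
"""

if __name__ == "__main__":
    counters = parse_vf_counters(SAMPLE_OUTPUT)
    print("VF traffic seen:", vf_traffic_seen(counters))
```

On a live system the input would come from running `ethtool -S` against the synthetic interface (for example via `subprocess.run`), ideally sampled twice so the test can assert that the counters increase under load.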

@palash-gandhi palash-gandhi force-pushed the dlpx/pr/pgandhi-delphix/4df2fea5-f639-4e7c-a411-61d5a4698992 branch from e0079ca to 13a7350 Compare May 20, 2023 18:16
@palash-gandhi palash-gandhi marked this pull request as ready for review May 20, 2023 18:20
@sebroy
Contributor

sebroy commented May 22, 2023

@pgandhi-delphix @david-mendez1 , do we have no end-to-end automation that tests Azure accelerated networking? If that's a gap, can we plan on filling it?

@palash-gandhi
Contributor Author

> @pgandhi-delphix @david-mendez1 , do we have no end-to-end automation that tests Azure accelerated networking? If that's a gap, can we plan on filling it?

@sebroy we do have a test, but it is incomplete. I have filed https://delphix.atlassian.net/browse/QA-41154 to add some more checks to that test.

@prakashsurya

This might not be necessary, but we might want to port this to all the other platforms simply for consistency.

@palash-gandhi palash-gandhi merged commit 8dcd234 into develop May 22, 2023
@palash-gandhi palash-gandhi deleted the dlpx/pr/pgandhi-delphix/4df2fea5-f639-4e7c-a411-61d5a4698992 branch May 22, 2023 17:48
jwk404 pushed a commit to jwk404/linux-kernel-azure that referenced this pull request Mar 23, 2024
jwk404 pushed a commit to jwk404/linux-kernel-azure that referenced this pull request Mar 25, 2024
pcd1193182 pushed a commit to pcd1193182/linux-kernel-azure that referenced this pull request Aug 19, 2024
delphix-devops-bot pushed a commit that referenced this pull request Feb 27, 2025
BugLink: https://bugs.launchpad.net/bugs/2089272

[ Upstream commit 60f07e2 ]

We use uprobes on aarch64_be and found that the tracee task exits
with SIGILL when the uprobe trace is enabled.
The replacement instruction that uprobe writes is not correct on big-endian aarch64.
Because instruction fetches are always treated as little-endian in Armv8-A,
UPROBE_SWBP_INSN should be treated as little-endian as well.

The test case is as follows.
bash-4.4# ./mqueue_test_aarchbe 1 1 2 1 10 > /dev/null &
bash-4.4# cd /sys/kernel/debug/tracing/
bash-4.4# echo 'p:test /mqueue_test_aarchbe:0xc30 %x0 %x1' > uprobe_events
bash-4.4# echo 1 > events/uprobes/enable
bash-4.4#
bash-4.4# ps
  PID TTY          TIME CMD
  140 ?        00:00:01 bash
  237 ?        00:00:00 ps
[1]+  Illegal instruction     ./mqueue_test_aarchbe 1 1 2 1 100 > /dev/null

Debugging with gdb shows the following:

bash-4.4# gdb attach 155
(gdb) disassemble send
Dump of assembler code for function send:
   0x0000000000400c30 <+0>:     .inst   0xa00020d4 ; undefined
   0x0000000000400c34 <+4>:     mov     x29, sp
   0x0000000000400c38 <+8>:     str     w0, [sp, #28]
   0x0000000000400c3c <+12>:    strb    w1, [sp, #27]
   0x0000000000400c40 <+16>:    str     xzr, [sp, #40]
   0x0000000000400c44 <+20>:    str     xzr, [sp, #48]
   0x0000000000400c48 <+24>:    add     x0, sp, #0x1b
   0x0000000000400c4c <+28>:    mov     w3, #0x0                 // #0
   0x0000000000400c50 <+32>:    mov     x2, #0x1                 // #1
   0x0000000000400c54 <+36>:    mov     x1, x0
   0x0000000000400c58 <+40>:    ldr     w0, [sp, #28]
   0x0000000000400c5c <+44>:    bl      0x405e10 <mq_send>
   0x0000000000400c60 <+48>:    str     w0, [sp, #60]
   0x0000000000400c64 <+52>:    ldr     w0, [sp, #60]
   0x0000000000400c68 <+56>:    ldp     x29, x30, [sp], #64
   0x0000000000400c6c <+60>:    ret
End of assembler dump.
(gdb) info b
No breakpoints or watchpoints.
(gdb) c
Continuing.

Program received signal SIGILL, Illegal instruction.
0x0000000000400c30 in send ()
(gdb) x/10x 0x400c30
0x400c30 <send>:    0xd42000a0   0xfd030091      0xe01f00b9      0xe16f0039
0x400c40 <send+16>: 0xff1700f9   0xff1b00f9      0xe06f0091      0x03008052
0x400c50 <send+32>: 0x220080d2   0xe10300aa
(gdb) disassemble 0x400c30
Dump of assembler code for function send:
=> 0x0000000000400c30 <+0>:     .inst   0xa00020d4 ; undefined
   0x0000000000400c34 <+4>:     mov     x29, sp
   0x0000000000400c38 <+8>:     str     w0, [sp, #28]
   0x0000000000400c3c <+12>:    strb    w1, [sp, #27]
   0x0000000000400c40 <+16>:    str     xzr, [sp, #40]
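
The two values in the gdb output above, 0xd42000a0 in the memory dump and 0xa00020d4 in the disassembly, are byte-swapped copies of each other. The sketch below reproduces that relationship. It assumes that UPROBE_SWBP_INSN is the 32-bit value seen at 0x400c30 in the memory dump and that the buggy path stored it in the CPU's native big-endian byte order; the function name is hypothetical.

```python
import struct

# Value seen at 0x400c30 in the gdb memory dump above (assumed to be UPROBE_SWBP_INSN).
UPROBE_SWBP_INSN = 0xD42000A0


def fetched_as_little_endian(memory_bytes):
    """Decode 4 bytes the way an Armv8-A instruction fetch does: little-endian."""
    return struct.unpack("<I", memory_bytes)[0]


# Buggy case: the breakpoint is stored in big-endian byte order.
buggy_bytes = struct.pack(">I", UPROBE_SWBP_INSN)

# Fixed case: the breakpoint is stored in little-endian byte order.
fixed_bytes = struct.pack("<I", UPROBE_SWBP_INSN)

if __name__ == "__main__":
    print(hex(fetched_as_little_endian(buggy_bytes)))  # 0xa00020d4
    print(hex(fetched_as_little_endian(fixed_bytes)))  # 0xd42000a0
```

The first value matches the undefined `.inst 0xa00020d4` that triggers SIGILL, and the second is the intended breakpoint encoding.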

Signed-off-by: junhua huang <huang.junhua@zte.com.cn>
Link: https://lore.kernel.org/r/202212021511106844809@zte.com.cn
Signed-off-by: Will Deacon <will@kernel.org>
Stable-dep-of: 13f8f1e05f1d ("arm64: probes: Fix uprobes for big-endian kernels")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Manuel Diewald <manuel.diewald@canonical.com>
Signed-off-by: Mehmet Basaran <mehmet.basaran@canonical.com>