Skip to content

[CI run] selftests/bpf: Don't assign outer source IP to host #8

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 971 commits into from

Conversation

pchaigno
Copy link
Member

No description provided.

Jilin Yuan and others added 30 commits July 1, 2022 20:13
Delete the redundant word 'are'.

Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Link: https://lore.kernel.org/r/20220630124639.11420-1-yuanjilin@cdjrlc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Delete the redundant word 'a'.

Signed-off-by: Jilin Yuan <yuanjilin@cdjrlc.com>
Link: https://lore.kernel.org/r/20220630123924.7531-1-yuanjilin@cdjrlc.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Most drivers use "skb_transport_offset(skb) + tcp_hdrlen(skb)"
to compute headers length for a TCP packet, but others
use more convoluted (but equivalent) ways.

Add skb_tcp_all_headers() and skb_inner_tcp_all_headers()
helpers to harmonize this a bit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Enable led-mode configuration for LAN8814 phy

Signed-off-by: Divya Koppera <Divya.Koppera@microchip.com>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
LED support for extended mode where
LED 1: Enhanced Mode 5 (10M/1000M/Activity)
LED 2: Enhanced Mode 4 (100M/1000M/Activity)

By default it supports KSZ9031 LED mode

Signed-off-by: Divya Koppera <Divya.Koppera@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Divya Koppera says:

====================
net: LED feature for LAN8814 PHY

Enable LED mode configuration for LAN8814 PHY

v2 -> v3:
- Fixed compilation issues

v1 -> v2:
- Updated dt-bindings for micrel,led-mode in LAN8814 PHY
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
*-internal-delay-ps properties would be applicable only for RGMII interface
modes.

It is changed as per the request.

Ran dt_binding_check to confirm nothing is broken.

link: https://lore.kernel.org/netdev/d8e5f6a8-a7e1-dabd-f4b4-ea8ea21d0a1d@gmail.com/
Signed-off-by: Prasanna Vengateshan <prasanna.vengateshan@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Documentation in .yaml format and updates to the MAINTAINERS
Also 'make dt_binding_check' is passed.

RGMII internal delay values for the mac is retrieved from
rx-internal-delay-ps & tx-internal-delay-ps as per the feedback from
v3 patch series.
https://lore.kernel.org/netdev/20210802121550.gqgbipqdvp5x76ii@skbuf/

It supports only the delay value of 0ns and 2ns.

Signed-off-by: Prasanna Vengateshan <prasanna.vengateshan@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Microchip LAN937X switches have a tagging protocol which is
very similar to KSZ tagging. So that the implementation is added to
tag_ksz.c and reused common APIs

Signed-off-by: Prasanna Vengateshan <prasanna.vengateshan@microchip.com>
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The ksz9477 and lan937x has few difference in the static and reserved
table register 0x041C. For the ksz9477 if the bit 0 is 1 - read
operation and 0 - write operation. But for lan937x bit 1:0 used for
selecting the read/write operation, 01 - write and 10 - read.
To use ksz9477 mdb add/del and enable_stp_addr for the lan937x, masks &
shifts are introduced for ksz9477 & lan937x in ksz_common.c. Then
updated the function with masks & shifts based on the switch instead of
hard coding it.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Basic DSA driver support for lan937x and the device will be
configured through SPI interface.
It adds the lan937x_dev_ops in ksz_common.c file and tries to reuse the
functionality of ksz9477 series switch.

drivers/net/dsa/microchip/ path is already part of MAINTAINERS &
the new files come under this path. Hence no update needed to the
MAINTAINERS

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch update the ksz_get_tag_protocol to return LAN937x specific
tag if the chip id matches one of LAN937x series switch

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add support for the writing and reading of the phy registers.
LAN937x uses the Vphy indirect addressing method for accessing the phys.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch register mdio-bus for the lan937x series switch. mdio read
and write uses the vphy for accessing the phy register.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add the support for port_max_mtu, port_change_mtu and
port_fast_age dsa functionality.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The internal phy of the LAN937x are capable of 100Mbps Full duplex. The
xMII port of switch is capable of 10Mbps Full & Half Duplex, 100Mbps
Full & Half Duplex and 1000Mbps Half duplex. xMII port also supports Tx
and Rx Flow control.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add support for phylink_mac_link_up. It configures the mac
for the speed, flow control and duplex mode.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add support for phylink_mac_config dsa hook. It configures
the mac for MII/RMII modes. The RGMII mode will be added in the future
patches.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch add the LAN937x part support in the existing ksz_spi_probe.

Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss says:

====================
net: dsa: microchip: DSA Driver support for LAN937x

LAN937x is a Multi-Port 100BASE-T1 Ethernet Physical Layer switch
compliant with the IEEE 802.3bw-2015 specification. The device provides
100 Mbit/s transmit and receive capability over a single Unshielded
Twisted Pair (UTP) cable. LAN937x is successive revision of KSZ series
switch.
This series of patches provide the DSA driver support for
Microchip LAN937X switch through MII/RMII interface. The RGMII interface
support will be added in the follow up series.  LAN937x uses the most of
functionality of KSZ9477.

The LAN937x switch series family consists of following SKUs:

LAN9370:
  - 4 T1 Phys
  - 1 RGMII port

LAN9371:
  - 3 T1 Phys & 1 TX Phy
  - 2 RGMII ports

LAN9372:
  - 5 T1 Phys & 1 TX Phy
  - 2 RGMII ports

LAN9373:
  - 5 T1 Phys
  - 2 RGMII
  - 1 SGMII port

LAN9374:
  - 6 T1 Phys
  - 2 RGMII ports

Changes in v15:
- fixed compilation issue.
- Updated the phylink_mac_link_up to check only for 10/100/1000 speed.

Changes in v14:
- Updated the patch series to latest ksz code refactoring.
- RGMII register configuration is removed from the series. It will be added in
the follow up patch series.

Changes in v13:
- Fixed the compilation issue in patch 5 and 6

Changes in v12:
- Removed the reduntant spi indirect enable in lan937x_init
- Used the ksz_port_stp_state_set function
- Apply rgmii internal delay only if it is rgmii port
- Set the bit for 100baseTx in phylink_get_caps
- Moved the ethtool related API from patch 5 to 7
- Moved lan_alu_entry struct in lan937x_dev.h from patch 5 to 9
- Moved lan_vlan_entry in lan937x_dev.h from patch 5 to 10
- Used the ksz_get_stats64 function for get_stats64 hook
- Splitted the patch 5. one for port configuration, spi driver, phy read &
  write and mtu configuration.
- Updated the indentation in ethernet-controller.yaml
- lan937x.yaml: Removed the blank lines, updated the ethernet handle to macb0.
  Added the rgmii internal delay only for the ports.

Changes in v11:
- Tagged as RFC to get the feedback for the subpatches 1/10, 5/10 and 6/10

Changes in v10:
- dsa.yaml: dropped moving mdio properties to dsa.yaml as per the feedback
https://patchwork.kernel.org/project/netdevbpf/patch/20220318085540.281721-3-prasanna.vengateshan@microchip.com/#24787466
- microchip,lan937x.yaml: Naming convention changes in the example
- lan937x_main.c: Moving configurations from lan937x_reset_switch() to setup()
- lan937x_main.c: helper function has been introduced for
  lan937x_internal_phy_read & write
- lan937x_dev.h: lan_alu_struct struct data type changes
- lan937x_main.c: lan937x_get_stats64 make non blocking
- lan937x_main.c: modified lan937x_port_mirror_add to include extack

Changes in v9:
- lan937x_main.c: of_node_put() correction in lan937x_parse_dt_rgmii_delay
- lan937x_dev.c: removed the interface checks from lan937x_apply_rgmii_delay.
- changes in ethernet-controller.yaml and dsa.yaml

Changes in v8:
- lan937x_dev.c: fixed lan937x_r_mib_pkt warning in the sub patches
- lan937x_main.c: phylink_autoneg_inband() check removed in
  lan937x_phylink_mac_link_up()
- lan937x_main.c: made legacy_pre_march2020 = false as this is non-legacy driver
  and indentation correction in lan937x_phylink_mac_link_up()
- removed unnecessary parenthesis in lan937x_get_strings()

Changes in v7:
- microchip,lan937x.yaml: *-internal-delay-ps enum values & commit messages
  corrections
- lan937x_main.c: removed phylink_validate() and added phylink_get_caps()
- lan937x_main.c: added support for ethtool standard stats   (get_eth_*_stats
  and get_stats64)
- lan937x_main.c: removed unnecessary PVID read from lan937x_port_vlan_del()
- integrated the changes of ksz9477 multi bridging support to lan937x dev and
  tested both multi bridging and STP
- lan937x_port_vlan_del - dummy pvid read removed

Changes in v6:
- microchip_t1.c: There was new merge done in the net-next tree for
  microchip_1.c after the v5 submission. Hence rebased it for v6.

Changes in v5:
- microchip,lan937x.yaml: Added mdio properties detail
- microchip,lan937x.yaml: *-internal-delay-ps added under port node
- lan937x_dev.c: changed devm_mdiobus_alloc from of_mdiobus_register as suggested
  by Vladimir
- lan937x_dev.c: added dev_info for rgmii internal delay & error message to user
  in case of out of range values
- lan937x_dev.c: return -EOPNOTSUPP for C45 regnum values for
  lan937x_sw_mdio_read & write operations
- return from function with out storing in a variable
- lan937x_main.c: Added vlan_enable info in vlan_filtering API
- lan937x_main.c: lan937x_port_vlan_del: removed unintended PVID write

Changes in v4:
- tag_ksz.c: cpu_to_be16 to  put_unaligned_be16
- correct spacing in comments
- tag_ksz.c: NETIF_F_HW_CSUM fix is integrated
- lan937x_dev.c: mdio_np is removed from global and handled locally
- lan937x_dev.c: unused functions removed lan937x_cfg32 & lan937x_port_cfg32
- lan937x_dev.c: lan937x_is_internal_100BTX_phy_port function name changes
- lan937x_dev.c: RGMII internal delay handling for MAC. Delay values are
  retrieved from DTS and updated
- lan937x_dev.c: corrected mutex operations for few dev variables
- microchip,lan937x.yaml: introduced rx-internal-delay-ps &
  tx-internal-delay-ps for RGMII internal delay
- lan937x_dev.c: Unnecessary mutex_lock has been removed
- lan937x_main.c: PHY_INTERFACE_MODE_NA handling for lan937x_phylink_validate
- lan937x_main.c: PORT_MIRROR_SNIFFER check in right place
- lan937x_main.c: memset is used instead of writing 0's individually in
  lan937x_port_fdb_add function
- lan937x_main.c: Removed \n from NL_SET_ERR_MSG_MOD calls

Changes in v3:
- Removed settings of cnt_ptr to zero and the memset()
  added a cleanup patch which moves this into ksz_init_mib_timer().
- Used ret everywhere instead of rc
- microchip,lan937x.yaml: Remove mdio compatible
- microchip_t1.c: Renaming standard phy registers
- tag_ksz.c: LAN937X_TAIL_TAG_OVERRIDE renaming
  LAN937X_TAIL_TAG_BLOCKING_OVERRIDE
- tag_ksz.c: Changed Ingress and Egress naming convention based on
  Host
- tag_ksz.c: converted to skb_mac_header(skb) from
  (is_link_local_ether_addr(hdr->h_dest))
- lan937x_dev.c: Removed BCAST Storm protection settings since we
  have Tc commands for them
- lan937x_dev.c: Flow control setting in lan937x_port_setup function
- lan937x_dev.c: RGMII internal delay added only for cpu port,
- lan937x_dev.c: of_get_compatible_child(node,
  "microchip,lan937x-mdio") to of_get_child_by_name(node, "mdio");
- lan937x_dev.c:lan937x_get_interface API: returned
  PHY_INTERFACE_MODE_INTERNAL instead of PHY_INTERFACE_MODE_NA
- lan937x_main.c: Removed compat interface implementation in
  lan937x_config_cpu_port() API & dev_info corrected as well
- lan937x_main.c: deleted ds->configure_vlan_while_not_filtering
  = true
- lan937x_main.c: Added explanation for lan937x_setup lines
- lan937x_main.c: FR_MAX_SIZE correction in lan937x_get_max_mtu API
- lan937x_main.c: removed lan937x_port_bridge_flags dummy functions
- lan937x_spi.c - mdiobus_unregister to be added to spi_remove
  function
- lan937x_main.c: phy link layer changes
- lan937x_main.c: port mirroring: sniff port selection limiting to
  one port
- lan937x_main.c: Changed to global vlan filtering
- lan937x_main.c: vlan_table array to structure
- lan937x_main.c -Use extack instead of reporting errors to Console
- lan937x_main.c - Remove cpu_port addition in vlan_add api
- lan937x_main.c - removed pvid resetting

Changes in v2:
- return check for register read/writes
- dt compatible compatible check is added against chip id value
- lan937x_internal_t1_tx_phy_write() is renamed to
  lan937x_internal_phy_write()
- lan937x_is_internal_tx_phy_port is renamed to
  lan937x_is_internal_100BTX_phy_port as it is 100Base-Tx phy
- Return value for lan937x_internal_phy_write() is -EOPNOTSUPP
  in case of failures
- Return value for lan937x_internal_phy_read() is 0xffff
  for non existent phy
- cpu_port checking is removed from lan937x_port_stp_state_set()
- lan937x_phy_link_validate: 100baseT_Full to 100baseT1_Full
- T1 Phy driver is moved to drivers/net/phy/microchip_t1.c
- Tx phy driver support will be added later
- Legacy switch checkings in dts file are removed.
- tag_ksz.c: Re-used ksz9477_rcv for lan937x_rcv
- tag_ksz.c: Xmit() & rcv() Comments are corrected w.r.to host
- net/dsa/Kconfig: Family skew numbers altered in ascending order
- microchip,lan937x.yaml: eth is replaced with ethernet
- microchip,lan937x.yaml: spi1 is replaced with spi
- microchip,lan937x.yaml: cpu labelling is removed
- microchip,lan937x.yaml: port@x value will match the reg value now
====================
ipsec_fs.h is not used and can be safely deleted.

Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Smatch complains about this function:

    drivers/net/ethernet/mellanox/mlx5/core/eswitch.c:2000 mlx5_esw_unlock()
    warn: inconsistent returns '&esw->mode_lock'.

Before commit ec2fa47 ("net/mlx5: Lag, use lag lock") there
used to be a matching mlx5_esw_lock() function and the lock and
unlock functions were symmetric.  But now we take the lock
unconditionally and must unlock unconditionally as well.

As near as I can tell this is dead code and can just be deleted.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
… is created

Eswitch vport acl namespace is needed when loading vfs. There is
no need to free and reallocate it when switching eswitch mode.
Introduce flag to indicate if it is created or not. When needed,
create it. Only free it when the driver is unloaded or in bare
metal mode.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Introduce flag to indicate if fdb table is created as a pre-step
to prepare for removing dependency between sriov and eswitch mode
in the downstream patches.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Currently, there are three eswitch modes, none, legacy and switchdev.
None is the default mode. Remove redundant none mode as eswitch mode
should always be either legacy mode or switchdev mode.

With this patch, there are two behavior changes:

1. Legacy becomes the default mode. When querying eswitch mode using
   devlink, a valid mode is always returned.
2. When disabling sriov, the eswitch mode will not change, only vfs
   are unloaded.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Enable or disable switchdev according to the eswitch mode set by
devlink command. So it is not changed by other functions anymore.

Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add a separate API to create SQ and CQ for advanced steering
operations (ASO).

Since the mlx5_en API to create these resources is strongly coupled
with netdev channels and datapath elements, this API provides an
alternative for creating send queues that are used for ASO.

Currently the API allows creating channels with 2 wqbbs only - meaning
the support will be for a single ACCESS_ASO wqe with data at a time.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Add interfaces to use ASO object control channel. The channel consists
of a control SQ and CQ to which user can post ACCESS_ASO work requests
to modify ASO objects. The functions to get wqe from SQ, fill wqe,
post the request, and poll the completion of the work, are provided.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
If flow meter aso object is supported, set the allocated range, and
initialize aso wqe.

The allocated range is indicated by log_meter_aso_granularity in HW
capabilities, and currently is 6.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
The policing rate and burst from user are converted to flow meter
parameters in hardware. These parameters are set or modified by
ACCESS_ASO WQE, add function to support it.

Signed-off-by: Jianbo Liu <jianbol@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Commit 26101f5 ("bpf: Add source ip in "struct bpf_tunnel_key"")
added support for getting and setting the outer source IP of encapsulated
packets via the bpf_skb_{get,set}_tunnel_key BPF helper. This change
allows BPF programs to set any IP address as the source, including for
example the IP address of a container running on the same host.

In that last case, however, the encapsulated packets are dropped when
looking up the route because the source IP address isn't assigned to any
interface on the host. To avoid this, we need to set the
FLOWI_FLAG_ANYSRC flag.

Fixes: 26101f5 ("bpf: Add source ip in "struct bpf_tunnel_key"")
Signed-off-by: Paul Chaignon <paul@isovalent.com>
@pchaigno pchaigno force-pushed the ci-test/fix-anysrc-bpf-set-tunnel-key branch 2 times, most recently from a6df251 to 925725e Compare July 21, 2022 11:15
pchaigno added 2 commits July 21, 2022 13:46
The previous commit fixed a bug in the bpf_skb_set_tunnel_key helper to
avoid dropping packets whose outer source IP address isn't assigned to a
host interface. This commit changes the corresponding selftest to not
assign the outer source IP address to an interface.

With this change and without the bugfix, the ICMP echo packets sent as
part of the test are dropped.

Signed-off-by: Paul Chaignon <paul@isovalent.com>
@pchaigno pchaigno force-pushed the ci-test/fix-anysrc-bpf-set-tunnel-key branch from 925725e to 3c538c8 Compare July 21, 2022 11:46
@pchaigno pchaigno closed this Jul 23, 2022
aspsk pushed a commit that referenced this pull request Nov 30, 2022
In the error path in ata_tport_add(), when calling put_device(),
ata_tport_release() is called, it will put the refcount of 'ap->host'.

And then ata_host_put() is called again, the refcount is decreased
to 0, ata_host_release() is called, all ports are freed and set to
null.

When unbinding the device after failure, ata_host_stop() is called
to release the resources, it leads a null-ptr-deref(), because all
the ports all freed and null.

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
CPU: 7 PID: 18671 Comm: modprobe Kdump: loaded Tainted: G            E      6.1.0-rc3+ #8
pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : ata_host_stop+0x3c/0x84 [libata]
lr : release_nodes+0x64/0xd0
Call trace:
 ata_host_stop+0x3c/0x84 [libata]
 release_nodes+0x64/0xd0
 devres_release_all+0xbc/0x1b0
 device_unbind_cleanup+0x20/0x70
 really_probe+0x158/0x320
 __driver_probe_device+0x84/0x120
 driver_probe_device+0x44/0x120
 __driver_attach+0xb4/0x220
 bus_for_each_dev+0x78/0xdc
 driver_attach+0x2c/0x40
 bus_add_driver+0x184/0x240
 driver_register+0x80/0x13c
 __pci_register_driver+0x4c/0x60
 ahci_pci_driver_init+0x30/0x1000 [ahci]

Fix this by removing redundant ata_host_put() in the error path.

Fixes: 2623c7a ("libata: add refcounting to ata_host")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
aspsk pushed a commit that referenced this pull request Nov 30, 2022
In ata_tport_add(), the return value of transport_add_device() is
not checked. As a result, it causes null-ptr-deref while removing
the module, because transport_remove_device() is called to remove
the device that was not added.

Unable to handle kernel NULL pointer dereference at virtual address 00000000000000d0
CPU: 12 PID: 13605 Comm: rmmod Kdump: loaded Tainted: G        W          6.1.0-rc3+ #8
pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : device_del+0x48/0x39c
lr : device_del+0x44/0x39c
Call trace:
 device_del+0x48/0x39c
 attribute_container_class_device_del+0x28/0x40
 transport_remove_classdev+0x60/0x7c
 attribute_container_device_trigger+0x118/0x120
 transport_remove_device+0x20/0x30
 ata_tport_delete+0x34/0x60 [libata]
 ata_port_detach+0x148/0x1b0 [libata]
 ata_pci_remove_one+0x50/0x80 [libata]
 ahci_remove_one+0x4c/0x8c [ahci]

Fix this by checking and handling return value of transport_add_device()
in ata_tport_add().

Fixes: d902747 ("[libata] Add ATA transport class")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
borkmann pushed a commit that referenced this pull request Apr 30, 2023
Ido Schimmel says:

====================
bridge: Add per-{Port, VLAN} neighbor suppression

Background
==========

In order to minimize the flooding of ARP and ND messages in the VXLAN
network, EVPN includes provisions [1] that allow participating VTEPs to
suppress such messages in case they know the MAC-IP binding and can
reply on behalf of the remote host. In Linux, the above is implemented
in the bridge driver using a per-port option called "neigh_suppress"
that was added in kernel version 4.15 [2].

Motivation
==========

Some applications use ARP messages as keepalives between the application
nodes in the network. This works perfectly well when two nodes are
connected to the same VTEP. When a node goes down it will stop
responding to ARP requests and the other node will notice it
immediately.

However, when the two nodes are connected to different VTEPs and
neighbor suppression is enabled, the local VTEP will reply to ARP
requests even after the remote node went down, until certain timers
expire and the EVPN control plane decides to withdraw the MAC/IP
Advertisement route for the address. Therefore, some users would like to
be able to disable neighbor suppression on VLANs where such applications
reside and keep it enabled on the rest.

Implementation
==============

The proposed solution is to allow user space to control neighbor
suppression on a per-{Port, VLAN} basis, in a similar fashion to other
per-port options that gained per-{Port, VLAN} counterparts such as
"mcast_router". This allows users to benefit from the operational
simplicity and scalability associated with shared VXLAN devices (i.e.,
external / collect-metadata mode), while still allowing for per-VLAN/VNI
neighbor suppression control.

The user interface is extended with a new "neigh_vlan_suppress" bridge
port option that allows user space to enable per-{Port, VLAN} neighbor
suppression on the bridge port. When enabled, the existing
"neigh_suppress" option has no effect and neighbor suppression is
controlled using a new "neigh_suppress" VLAN option. Example usage:

 # bridge link set dev vxlan0 neigh_vlan_suppress on
 # bridge vlan add vid 10 dev vxlan0
 # bridge vlan set vid 10 dev vxlan0 neigh_suppress on

Testing
=======

Tested using existing bridge selftests. Added a dedicated selftest in
the last patch.

Patchset overview
=================

Patches #1-#5 are preparations.

Patch #6 adds per-{Port, VLAN} neighbor suppression support to the
bridge's data path.

Patches #7-#8 add the required netlink attributes to enable the feature.

Patch #9 adds a selftest.

iproute2 patches can be found here [3].

Changelog
=========

Since RFC [4]:

No changes.

[1] https://www.rfc-editor.org/rfc/rfc7432#section-10
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a42317785c898c0ed46db45a33b0cc71b671bf29
[3] https://github.com/idosch/iproute2/tree/submit/neigh_suppress_v1
[4] https://lore.kernel.org/netdev/20230413095830.2182382-1-idosch@nvidia.com/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
lmb pushed a commit that referenced this pull request May 23, 2023
Add support precision backtracking in the presence of subprogram frames in
jump history.

This means supporting a few different kinds of subprogram invocation
situations, all requiring a slightly different handling in precision
backtracking handling logic:
  - static subprogram calls;
  - global subprogram calls;
  - callback-calling helpers/kfuncs.

For each of those we need to handle a few precision propagation cases:
  - what to do with precision of subprog returns (r0);
  - what to do with precision of input arguments;
  - for all of them callee-saved registers in caller function should be
    propagated ignoring subprog/callback part of jump history.

N.B. Async callback-calling helpers (currently only
bpf_timer_set_callback()) are transparent to all this because they set
a separate async callback environment and thus callback's history is not
shared with main program's history. So as far as all the changes in this
commit goes, such helper is just a regular helper.

Let's look at all these situation in more details. Let's start with
static subprogram being called, using an exxerpt of a simple main
program and its static subprog, indenting subprog's frame slightly to
make everything clear.

frame 0				frame 1			precision set
=======				=======			=============

 9: r6 = 456;
10: r1 = 123;						fr0: r6
11: call pc+10;						fr0: r1, r6
				22: r0 = r1;		fr0: r6;     fr1: r1
				23: exit		fr0: r6;     fr1: r0
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6
15: exit

As can be seen above main function is passing 123 as single argument to
an identity (`return x;`) subprog. Returned value is used to adjust map
pointer offset, which forces r0 to be marked as precise. Then
instruction torvalds#14 does the same for callee-saved r6, which will have to be
backtracked all the way to instruction #9. For brevity, precision sets
for instruction torvalds#13 and torvalds#14 are combined in the diagram above.

First, for subprog calls, r0 returned from subprog (in frame 0) has to
go into subprog's frame 1, and should be cleared from frame 0. So we go
back into subprog's frame knowing we need to mark r0 precise. We then
see that insn torvalds#22 sets r0 from r1, so now we care about marking r1
precise.  When we pop up from subprog's frame back into caller at
insn #11 we keep r1, as it's an argument-passing register, so we eventually
find `10: r1 = 123;` and satify precision propagation chain for insn torvalds#13.

This example demonstrates two sets of rules:
  - r0 returned after subprog call has to be moved into subprog's r0 set;
  - *static* subprog arguments (r1-r5) are moved back to caller precision set.

Let's look at what happens with callee-saved precision propagation. Insn torvalds#14
mark r6 as precise. When we get into subprog's frame, we keep r6 in
frame 0's precision set *only*. Subprog itself has its own set of
independent r6-r10 registers and is not affected. When we eventually
made our way out of subprog frame we keep r6 in precision set until we
reach `9: r6 = 456;`, satisfying propagation. r6-r10 propagation is
perhaps the simplest aspect, it always stays in its original frame.

That's pretty much all we have to do to support precision propagation
across *static subprog* invocation.

Let's look at what happens when we have global subprog invocation.

frame 0				frame 1			precision set
=======				=======			=============

 9: r6 = 456;
10: r1 = 123;						fr0: r6
11: call pc+10; # global subprog			fr0: r6
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6;
15: exit

Starting from insn torvalds#13, r0 has to be precise. We backtrack all the way
to insn #11 (call pc+10) and see that subprog is global, so was already
validated in isolation. As opposed to static subprog, global subprog
always returns unknown scalar r0, so that satisfies precision
propagation and we drop r0 from precision set. We are done for insns torvalds#13.

Now for insn torvalds#14. r6 is in precision set, we backtrack to `call pc+10;`.
Here we need to recognize that this is effectively both exit and entry
to global subprog, which means we stay in caller's frame. So we carry on
with r6 still in precision set, until we satisfy it at insn #9. The only
hard part with global subprogs is just knowing when it's a global func.

Lastly, callback-calling helpers and kfuncs do simulate subprog calls,
so jump history will have subprog instructions in between caller
program's instructions, but the rules of propagating r0 and r1-r5
differ, because we don't actually directly call callback. We actually
call helper/kfunc, which at runtime will call subprog, so the only
difference between normal helper/kfunc handling is that we need to make
sure to skip callback simulatinog part of jump history.
Let's look at an example to make this clearer.

frame 0				frame 1			precision set
=======				=======			=============

 8: r6 = 456;
 9: r1 = 123;						fr0: r6
10: r2 = &callback;					fr0: r6
11: call bpf_loop;					fr0: r6
				22: r0 = r1;		fr0: r6      fr1:
				23: exit		fr0: r6      fr1:
12: r1 = <map_pointer>					fr0: r0, r6
13: r1 += r0;						fr0: r0, r6
14: r1 += r6;						fr0: r6;
15: exit

Again, insn torvalds#13 forces r0 to be precise. As soon as we get to `23: exit`
we see that this isn't actually a static subprog call (it's `call
bpf_loop;` helper call instead). So we clear r0 from precision set.

For callee-saved register, there is no difference: it stays in frame 0's
precision set, we go through insn torvalds#22 and torvalds#23, ignoring them until we
get back to caller frame 0, eventually satisfying precision backtrack
logic at insn #8 (`r6 = 456;`).

Assuming callback needed to set r0 as precise at insn torvalds#23, we'd
backtrack to insn torvalds#22, switching from r0 to r1, and then at the point
when we pop back to frame 0 at insn #11, we'll clear r1-r5 from
precision set, as we don't really do a subprog call directly, so there
is no input argument precision propagation.

That's pretty much it. With these changes, it seems like the only still
unsupported situation for precision backpropagation is the case when
program is accessing stack through registers other than r10. This is
still left as unsupported (though rare) case for now.

As for results. For selftests, few positive changes for bigger programs,
cls_redirect in dynptr variant benefitting the most:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results.csv ~/subprog-precise-after-results.csv -f @veristat.cfg -e file,prog,insns -f 'insns_diff!=0'
File                                      Program        Insns (A)  Insns (B)  Insns     (DIFF)
----------------------------------------  -------------  ---------  ---------  ----------------
pyperf600_bpf_loop.bpf.linked1.o          on_event            2060       2002      -58 (-2.82%)
test_cls_redirect_dynptr.bpf.linked1.o    cls_redirect       15660       2914  -12746 (-81.39%)
test_cls_redirect_subprogs.bpf.linked1.o  cls_redirect       61620      59088    -2532 (-4.11%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_tc      109980      86278  -23702 (-21.55%)
xdp_synproxy_kern.bpf.linked1.o           syncookie_xdp      97716      85147  -12569 (-12.86%)

Cilium progress don't really regress. They don't use subprogs and are
mostly unaffected, but some other fixes and improvements could have
changed something. This doesn't appear to be the case:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-cilium.csv ~/subprog-precise-after-results-cilium.csv -e file,prog,insns -f 'insns_diff!=0'
File           Program                         Insns (A)  Insns (B)  Insns (DIFF)
-------------  ------------------------------  ---------  ---------  ------------
bpf_host.o     tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_lxc.o      tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_overlay.o  tail_nodeport_nat_ingress_ipv6       4983       5003  +20 (+0.40%)
bpf_xdp.o      tail_handle_nat_fwd_ipv6            12475      12504  +29 (+0.23%)
bpf_xdp.o      tail_nodeport_nat_ingress_ipv6       6363       6371   +8 (+0.13%)

Looking at (somewhat anonymized) Meta production programs, we see mostly
insignificant variation in number of instructions, with one program
(syar_bind6_protect6) benefitting the most at -17%.

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-fbcode.csv ~/subprog-precise-after-results-fbcode.csv -e prog,insns -f 'insns_diff!=0'
Program                   Insns (A)  Insns (B)  Insns     (DIFF)
------------------------  ---------  ---------  ----------------
on_request_context_event        597        585      -12 (-2.01%)
read_async_py_stack           43789      43657     -132 (-0.30%)
read_sync_py_stack            35041      37599    +2558 (+7.30%)
rrm_usdt                        946        940       -6 (-0.63%)
sysarmor_inet6_bind           28863      28249     -614 (-2.13%)
sysarmor_inet_bind            28845      28240     -605 (-2.10%)
syar_bind4_protect4          154145     147640    -6505 (-4.22%)
syar_bind6_protect6          165242     137088  -28154 (-17.04%)
syar_task_exit_setgid         21289      19720    -1569 (-7.37%)
syar_task_exit_setuid         21290      19721    -1569 (-7.37%)
do_uprobe                     19967      19413     -554 (-2.77%)
tw_twfw_ingress              215877     204833   -11044 (-5.12%)
tw_twfw_tc_in                215877     204833   -11044 (-5.12%)

But checking duration (wall clock) differences, that is the actual time taken
by verifier to validate programs, we see a sometimes dramatic improvements, all
the way to about 16x improvements:

[vmuser@archvm bpf]$ ./veristat -C ~/subprog-precise-before-results-meta.csv ~/subprog-precise-after-results-meta.csv -e prog,duration -s duration_diff^ | head -n20
Program                                   Duration (us) (A)  Duration (us) (B)  Duration (us) (DIFF)
----------------------------------------  -----------------  -----------------  --------------------
tw_twfw_ingress                                     4488374             272836    -4215538 (-93.92%)
tw_twfw_tc_in                                       4339111             268175    -4070936 (-93.82%)
tw_twfw_egress                                      3521816             270751    -3251065 (-92.31%)
tw_twfw_tc_eg                                       3472878             284294    -3188584 (-91.81%)
balancer_ingress                                     343119             291391      -51728 (-15.08%)
syar_bind6_protect6                                   78992              64782      -14210 (-17.99%)
ttls_tc_ingress                                       11739               8176       -3563 (-30.35%)
kprobe__security_inode_link                           13864              11341       -2523 (-18.20%)
read_sync_py_stack                                    21927              19442       -2485 (-11.33%)
read_async_py_stack                                   30444              28136        -2308 (-7.58%)
syar_task_exit_setuid                                 10256               8440       -1816 (-17.71%)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230505043317.3629845-9-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
lmb pushed a commit that referenced this pull request May 23, 2023
Andrii Nakryiko says:

====================
As more and more real-world BPF programs become more complex
and increasingly use subprograms (both static and global), scalar precision
tracking and its (previously weak) support for BPF subprograms (and callbacks
as a special case of that) is becoming more and more of an issue and
limitation. Couple that with increasing reliance on state equivalence (BPF
open-coded iterators have a hard requirement for state equivalence to converge
and successfully validate loops), and it becomes pretty critical to address
this limitation and make precision tracking universally supported for BPF
programs of any complexity and composition.

This patch set teaches BPF verifier to support SCALAR precision
backpropagation across multiple frames (for subprogram calls and callback
simulations) and addresses most practical situations (SCALAR stack
loads/stores using registers other than r10 being the last remaining
limitation, though thankfully rarely used in practice).

Main logic is explained in details in patch #8. The rest are preliminary
preparations, refactorings, clean ups, and fixes. See respective patches for
details.

Patch #8 has also veristat comparison of results for selftests, Cilium, and
some of Meta production BPF programs before and after these changes.

v2->v3:
  - drop bitcnt and ifs from bt_xxx() helpers (Alexei);
v1->v2:
  - addressed review feedback form Alexei, adjusted commit messages, comments,
    added verbose(), WARN_ONCE(), etc;
  - re-ran all the tests and veristat on selftests, cilium, and meta-internal
    code: no new changes and no kernel warnings.
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
lmb pushed a commit that referenced this pull request Jun 20, 2023
The cited commit adds a compeletion to remove dependency on rtnl
lock. But it causes a deadlock for multiple encapsulations:

 crash> bt ffff8aece8a64000
 PID: 1514557  TASK: ffff8aece8a64000  CPU: 3    COMMAND: "tc"
  #0 [ffffa6d14183f368] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14183f3f8] schedule at ffffffffb8ba8418
  #2 [ffffa6d14183f418] schedule_preempt_disabled at ffffffffb8ba8898
  #3 [ffffa6d14183f428] __mutex_lock at ffffffffb8baa7f8
  #4 [ffffa6d14183f4d0] mutex_lock_nested at ffffffffb8baabeb
  #5 [ffffa6d14183f4e0] mlx5e_attach_encap at ffffffffc0f48c17 [mlx5_core]
  #6 [ffffa6d14183f628] mlx5e_tc_add_fdb_flow at ffffffffc0f39680 [mlx5_core]
  #7 [ffffa6d14183f688] __mlx5e_add_fdb_flow at ffffffffc0f3b636 [mlx5_core]
  #8 [ffffa6d14183f6f0] mlx5e_tc_add_flow at ffffffffc0f3bcdf [mlx5_core]
  #9 [ffffa6d14183f728] mlx5e_configure_flower at ffffffffc0f3c1d1 [mlx5_core]
 #10 [ffffa6d14183f790] mlx5e_rep_setup_tc_cls_flower at ffffffffc0f3d529 [mlx5_core]
 #11 [ffffa6d14183f7a0] mlx5e_rep_setup_tc_cb at ffffffffc0f3d714 [mlx5_core]
 #12 [ffffa6d14183f7b0] tc_setup_cb_add at ffffffffb8931bb8
 torvalds#13 [ffffa6d14183f810] fl_hw_replace_filter at ffffffffc0dae901 [cls_flower]
 torvalds#14 [ffffa6d14183f8d8] fl_change at ffffffffc0db5c57 [cls_flower]
 torvalds#15 [ffffa6d14183f970] tc_new_tfilter at ffffffffb8936047
 torvalds#16 [ffffa6d14183fac8] rtnetlink_rcv_msg at ffffffffb88c7c31
 torvalds#17 [ffffa6d14183fb50] netlink_rcv_skb at ffffffffb8942853
 torvalds#18 [ffffa6d14183fbc0] rtnetlink_rcv at ffffffffb88c1835
 torvalds#19 [ffffa6d14183fbd0] netlink_unicast at ffffffffb8941f27
 torvalds#20 [ffffa6d14183fc18] netlink_sendmsg at ffffffffb8942245
 torvalds#21 [ffffa6d14183fc98] sock_sendmsg at ffffffffb887d482
 torvalds#22 [ffffa6d14183fcb8] ____sys_sendmsg at ffffffffb887d81a
 torvalds#23 [ffffa6d14183fd38] ___sys_sendmsg at ffffffffb88806e2
 torvalds#24 [ffffa6d14183fe90] __sys_sendmsg at ffffffffb88807a2
 torvalds#25 [ffffa6d14183ff28] __x64_sys_sendmsg at ffffffffb888080f
 torvalds#26 [ffffa6d14183ff38] do_syscall_64 at ffffffffb8b9b6a8
 torvalds#27 [ffffa6d14183ff50] entry_SYSCALL_64_after_hwframe at ffffffffb8c0007c
 crash> bt 0xffff8aeb07544000
 PID: 1110766  TASK: ffff8aeb07544000  CPU: 0    COMMAND: "kworker/u20:9"
  #0 [ffffa6d14e6b7bd8] __schedule at ffffffffb8ba7f45
  #1 [ffffa6d14e6b7c68] schedule at ffffffffb8ba8418
  #2 [ffffa6d14e6b7c88] schedule_timeout at ffffffffb8baef88
  #3 [ffffa6d14e6b7d10] wait_for_completion at ffffffffb8ba968b
  #4 [ffffa6d14e6b7d60] mlx5e_take_all_encap_flows at ffffffffc0f47ec4 [mlx5_core]
  #5 [ffffa6d14e6b7da0] mlx5e_rep_update_flows at ffffffffc0f3e734 [mlx5_core]
  #6 [ffffa6d14e6b7df8] mlx5e_rep_neigh_update at ffffffffc0f400bb [mlx5_core]
  #7 [ffffa6d14e6b7e50] process_one_work at ffffffffb80acc9c
  #8 [ffffa6d14e6b7ed0] worker_thread at ffffffffb80ad012
  #9 [ffffa6d14e6b7f10] kthread at ffffffffb80b615d
 #10 [ffffa6d14e6b7f50] ret_from_fork at ffffffffb8001b2f

After the first encap is attached, flow will be added to encap
entry's flows list. If neigh update is running at this time, the
following encaps of the flow can't hold the encap_tbl_lock and
sleep. If neigh update thread is waiting for that flow's init_done,
deadlock happens.

Fix it by holding lock outside of the for loop. If neigh update is
running, prevent encap flows from offloading. Since the lock is held
outside of the for loop, concurrent creation of encap entries is not
allowed. So remove unnecessary wait_for_completion call for res_ready.

Fixes: 95435ad ("net/mlx5e: Only access fully initialized flows in neigh update")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
lmb pushed a commit that referenced this pull request Jun 20, 2023
Currently, the per cpu upcall counters are allocated after the vport is
created and inserted into the system. This could lead to the datapath
accessing the counters before they are allocated resulting in a kernel
Oops.

Here is an example:

  PID: 59693    TASK: ffff0005f4f51500  CPU: 0    COMMAND: "ovs-vswitchd"
   #0 [ffff80000a39b5b0] __switch_to at ffffb70f0629f2f4
   #1 [ffff80000a39b5d0] __schedule at ffffb70f0629f5cc
   #2 [ffff80000a39b650] preempt_schedule_common at ffffb70f0629fa60
   #3 [ffff80000a39b670] dynamic_might_resched at ffffb70f0629fb58
   #4 [ffff80000a39b680] mutex_lock_killable at ffffb70f062a1388
   #5 [ffff80000a39b6a0] pcpu_alloc at ffffb70f0594460c
   #6 [ffff80000a39b750] __alloc_percpu_gfp at ffffb70f05944e68
   #7 [ffff80000a39b760] ovs_vport_cmd_new at ffffb70ee6961b90 [openvswitch]
   ...

  PID: 58682    TASK: ffff0005b2f0bf00  CPU: 0    COMMAND: "kworker/0:3"
   #0 [ffff80000a5d2f40] machine_kexec at ffffb70f056a0758
   #1 [ffff80000a5d2f70] __crash_kexec at ffffb70f057e2994
   #2 [ffff80000a5d3100] crash_kexec at ffffb70f057e2ad8
   #3 [ffff80000a5d3120] die at ffffb70f0628234c
   #4 [ffff80000a5d31e0] die_kernel_fault at ffffb70f062828a8
   #5 [ffff80000a5d3210] __do_kernel_fault at ffffb70f056a31f4
   #6 [ffff80000a5d3240] do_bad_area at ffffb70f056a32a4
   #7 [ffff80000a5d3260] do_translation_fault at ffffb70f062a9710
   #8 [ffff80000a5d3270] do_mem_abort at ffffb70f056a2f74
   #9 [ffff80000a5d32a0] el1_abort at ffffb70f06297dac
  #10 [ffff80000a5d32d0] el1h_64_sync_handler at ffffb70f06299b24
  #11 [ffff80000a5d3410] el1h_64_sync at ffffb70f056812dc
  #12 [ffff80000a5d3430] ovs_dp_upcall at ffffb70ee6963c84 [openvswitch]
  torvalds#13 [ffff80000a5d3470] ovs_dp_process_packet at ffffb70ee6963fdc [openvswitch]
  torvalds#14 [ffff80000a5d34f0] ovs_vport_receive at ffffb70ee6972c78 [openvswitch]
  torvalds#15 [ffff80000a5d36f0] netdev_port_receive at ffffb70ee6973948 [openvswitch]
  torvalds#16 [ffff80000a5d3720] netdev_frame_hook at ffffb70ee6973a28 [openvswitch]
  torvalds#17 [ffff80000a5d3730] __netif_receive_skb_core.constprop.0 at ffffb70f06079f90

We moved the per cpu upcall counter allocation to the existing vport
alloc and free functions to solve this.

Fixes: 95637d9 ("net: openvswitch: release vport resources on failure")
Fixes: 1933ea3 ("net: openvswitch: Add support to count upcall packets")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Acked-by: Aaron Conole <aconole@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
lmb pushed a commit that referenced this pull request Jun 20, 2023
When using option -a without --prod-affinity or --cons-affinity, if the
number of producers and consumers is greater than the number of online
CPUs, the benchmark will fail to run as shown below:

  $ getconf _NPROCESSORS_ONLN
  8
  $ ./bench bpf-loop -a -p9
  Setting up benchmark 'bpf-loop'...
  setting affinity to CPU #8 failed: -22

Fix it by returning the remainder of next_cpu divided by the number of
online CPUs in next_cpu().

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20230613080921.1623219-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann pushed a commit that referenced this pull request Jul 19, 2023
Petr Machata says:

====================
mlxsw: Manage RIF across PVID changes

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.

The situation is going to be made better by implementing a range of replays
and post-hoc offloads.

In this patch set, address the ordering issues related to creation of
bridge RIFs. Currently, mlxsw has several shortcomings with regards to RIF
handling due to PVID changes:

- In order to cause RIF for a bridge device to be created, the user is
  expected first to set PVID, then to add an IP address. The reverse
  ordering is disallowed, which is not very user-friendly.

- When such bridge gets a VLAN upper whose VID was the same as the existing
  PVID, and this VLAN netdevice gets an IP address, a RIF is created for
  this netdevice. The new RIF is then assigned to the 802.1Q FID for the
  given VID. This results in a working configuration. However, then, when
  the VLAN netdevice is removed again, the RIF for the bridge itself is
  never reassociated to the PVID.

- PVID cannot be changed once the bridge has uppers. Presumably this is
  because the driver does not manage RIFs properly in face of PVID changes.
  However, as the previous point shows, it is still possible to get into
  invalid configurations.

This patch set addresses these issues and relaxes some of the ordering
requirements that mlxsw had. The patch set proceeds as follows:

- In patch #1, pass extack to mlxsw_sp_br_ban_rif_pvid_change()

- To relax ordering between setting PVID and adding an IP address to a
  bridge, mlxsw must be able to request that a RIF is created with a given
  VLAN ID, instead of trying to deduce it from the current netdevice
  settings, which do not reflect the user-requested values yet. This is
  done in patches #2 and #3.

- Similarly, mlxsw_sp_inetaddr_bridge_event() will need to make decisions
  based on the user-requested value of PVID, not the current value. Thus in
  patches #4 and #5, add a new argument which carries the requested PVID
  value.

- Finally in patch #6 relax the ban on PVID changes when a bridge has
  uppers. Instead, add the logic necessary for creation of a RIF as a
  result of PVID change.

- Relevant selftests are presented afterwards. In patch #7 a preparatory
  helper is added to lib.sh. Patches #8, #9, #10 and #11 include selftests
  themselves.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Jul 28, 2023
Petr Machata says:

====================
mlxsw: Permit enslavement to netdevices with uppers

The mlxsw driver currently makes the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices need to be added to
the bridge before IP addresses are configured on that bridge or SVI added
on top of it. Enslaving a netdevice to another netdevice that already has
uppers is in fact forbidden by mlxsw for this reason. Despite this safety,
it is rather easy to get into situations where the offloaded configuration
is just plain wrong.

As an example, take a front panel port, configure an IP address: it gets a
RIF. Now enslave the port to the bridge, and the RIF is gone. Remove the
port from the bridge again, but the RIF never comes back. There is a number
of similar situations, where changing the configuration there and back
utterly breaks the offload.

Similarly, detaching a front panel port from a configured topology means
unoffloading of this whole topology -- VLAN uppers, next hops, etc.
Attaching the port back is then not permitted at all. If it were, it would
not result in a working configuration, because much of mlxsw is written to
react to changes in immediate configuration. There is nothing that would go
visit netdevices in the attached-to topology and offload existing routes
and VLAN memberships, for example.

In this patchset, introduce a number of replays to be invoked so that this
sort of post-hoc offload is supported. Then remove the vetoes that
disallowed enslavement of front panel ports to other netdevices with
uppers.

The patchset progresses as follows:

- In patch #1, fix an issue in the bridge driver. To my knowledge, the
  issue could not have resulted in a buggy behavior previously, and thus is
  packaged with this patchset instead of being sent separately to net.

- In patch #2, add a new helper to the switchdev code.

- In patch #3, drop mlxsw selftests that will not be relevant after this
  patchset anymore.

- Patches #4, #5, #6, #7 and #8 prepare the codebase for smoother
  introduction of the rest of the code.

- Patches #9, #10, #11, #12, torvalds#13 and torvalds#14 replay various aspects of upper
  configuration when a front panel port is introduced into a topology.
  Individual patches take care of bridge and LAG RIF memberships, switchdev
  replay, nexthop and neighbors replay, and MACVLAN offload.

- Patches torvalds#15 and torvalds#16 introduce RIFs for newly-relevant netdevices when a
  front panel port is enslaved (in which case all uppers are newly
  relevant), or, respectively, deslaved (in which case the newly-relevant
  netdevice is the one being deslaved).

- Up until this point, the introduced scaffolding was not really used,
  because mlxsw still forbids enslavement of mlxsw netdevices to uppers
  with uppers. In patch torvalds#17, this condition is finally relaxed.

A sizable selftest suite is available to test all this new code. That will
be sent in a separate patchset.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Aug 4, 2023
The cited commit holds encap tbl lock unconditionally when setting
up dests. But it may cause the following deadlock:

 PID: 1063722  TASK: ffffa062ca5d0000  CPU: 13   COMMAND: "handler8"
  #0 [ffffb14de05b7368] __schedule at ffffffffa1d5aa91
  #1 [ffffb14de05b7410] schedule at ffffffffa1d5afdb
  #2 [ffffb14de05b7430] schedule_preempt_disabled at ffffffffa1d5b528
  #3 [ffffb14de05b7440] __mutex_lock at ffffffffa1d5d6cb
  #4 [ffffb14de05b74e8] mutex_lock_nested at ffffffffa1d5ddeb
  #5 [ffffb14de05b74f8] mlx5e_tc_tun_encap_dests_set at ffffffffc12f2096 [mlx5_core]
  #6 [ffffb14de05b7568] post_process_attr at ffffffffc12d9fc5 [mlx5_core]
  #7 [ffffb14de05b75a0] mlx5e_tc_add_fdb_flow at ffffffffc12de877 [mlx5_core]
  #8 [ffffb14de05b75f0] __mlx5e_add_fdb_flow at ffffffffc12e0eef [mlx5_core]
  #9 [ffffb14de05b7660] mlx5e_tc_add_flow at ffffffffc12e12f7 [mlx5_core]
 #10 [ffffb14de05b76b8] mlx5e_configure_flower at ffffffffc12e1686 [mlx5_core]
 #11 [ffffb14de05b7720] mlx5e_rep_indr_offload at ffffffffc12e3817 [mlx5_core]
 #12 [ffffb14de05b7730] mlx5e_rep_indr_setup_tc_cb at ffffffffc12e388a [mlx5_core]
 torvalds#13 [ffffb14de05b7740] tc_setup_cb_add at ffffffffa1ab2ba8
 torvalds#14 [ffffb14de05b77a0] fl_hw_replace_filter at ffffffffc0bdec2f [cls_flower]
 torvalds#15 [ffffb14de05b7868] fl_change at ffffffffc0be6caa [cls_flower]
 torvalds#16 [ffffb14de05b7908] tc_new_tfilter at ffffffffa1ab71f0

[1031218.028143]  wait_for_completion+0x24/0x30
[1031218.028589]  mlx5e_update_route_decap_flows+0x9a/0x1e0 [mlx5_core]
[1031218.029256]  mlx5e_tc_fib_event_work+0x1ad/0x300 [mlx5_core]
[1031218.029885]  process_one_work+0x24e/0x510

Actually no need to hold encap tbl lock if there is no encap action.
Fix it by checking if encap action exists or not before holding
encap tbl lock.

Fixes: 37c3b9f ("net/mlx5e: Prevent encap offload when neigh update is running")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Vlad Buslov <vladbu@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
borkmann pushed a commit that referenced this pull request Aug 4, 2023
Petr Machata says:

====================
selftests: New selftests for out-of-order-operations patches in mlxsw

In the past, the mlxsw driver made the assumption that the user applies
configuration in a bottom-up manner. Thus netdevices needed to be added to
the bridge before IP addresses were configured on that bridge or SVI added
on top of it, because whatever happened before a netdevice was mlxsw upper
was generally ignored by mlxsw. Recently, several patch series were pushed
to introduce the bookkeeping and replays necessary to offload the full
state, not just the immediate configuration step.

In this patchset, introduce new selftests that directly exercise the out of
order code paths in mlxsw.

- Patch #1 adds new tests into the existing selftest router_bridge.sh.
- Patches #2-#5 add new generic selftests.
- Patches #6-#8 add new mlxsw-specific selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Aug 24, 2023
Normally, x->replay_esn and x->preplay_esn should be allocated at
xfrm_alloc_replay_state_esn(...) in xfrm_state_construct(...), hence the
xfrm_update_ae_params(...) is okay to update them. However, the current
implementation of xfrm_new_ae(...) allows a malicious user to directly
dereference a NULL pointer and crash the kernel like below.

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 8253067 P4D 8253067 PUD 8e0e067 PMD 0
Oops: 0002 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 PID: 98 Comm: poc.npd Not tainted 6.4.0-rc7-00072-gdad9774deaf1 #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.o4
RIP: 0010:memcpy_orig+0xad/0x140
Code: e8 4c 89 5f e0 48 8d 7f e0 73 d2 83 c2 20 48 29 d6 48 29 d7 83 fa 10 72 34 4c 8b 06 4c 8b 4e 08 c
RSP: 0018:ffff888008f57658 EFLAGS: 00000202
RAX: 0000000000000000 RBX: ffff888008bd0000 RCX: ffffffff8238e571
RDX: 0000000000000018 RSI: ffff888007f64844 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888008f57818
R13: ffff888007f64aa4 R14: 0000000000000000 R15: 0000000000000000
FS:  00000000014013c0(0000) GS:ffff88806d600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000000054d8000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 ? __die+0x1f/0x70
 ? page_fault_oops+0x1e8/0x500
 ? __pfx_is_prefetch.constprop.0+0x10/0x10
 ? __pfx_page_fault_oops+0x10/0x10
 ? _raw_spin_unlock_irqrestore+0x11/0x40
 ? fixup_exception+0x36/0x460
 ? _raw_spin_unlock_irqrestore+0x11/0x40
 ? exc_page_fault+0x5e/0xc0
 ? asm_exc_page_fault+0x26/0x30
 ? xfrm_update_ae_params+0xd1/0x260
 ? memcpy_orig+0xad/0x140
 ? __pfx__raw_spin_lock_bh+0x10/0x10
 xfrm_update_ae_params+0xe7/0x260
 xfrm_new_ae+0x298/0x4e0
 ? __pfx_xfrm_new_ae+0x10/0x10
 ? __pfx_xfrm_new_ae+0x10/0x10
 xfrm_user_rcv_msg+0x25a/0x410
 ? __pfx_xfrm_user_rcv_msg+0x10/0x10
 ? __alloc_skb+0xcf/0x210
 ? stack_trace_save+0x90/0xd0
 ? filter_irq_stacks+0x1c/0x70
 ? __stack_depot_save+0x39/0x4e0
 ? __kasan_slab_free+0x10a/0x190
 ? kmem_cache_free+0x9c/0x340
 ? netlink_recvmsg+0x23c/0x660
 ? sock_recvmsg+0xeb/0xf0
 ? __sys_recvfrom+0x13c/0x1f0
 ? __x64_sys_recvfrom+0x71/0x90
 ? do_syscall_64+0x3f/0x90
 ? entry_SYSCALL_64_after_hwframe+0x72/0xdc
 ? copyout+0x3e/0x50
 netlink_rcv_skb+0xd6/0x210
 ? __pfx_xfrm_user_rcv_msg+0x10/0x10
 ? __pfx_netlink_rcv_skb+0x10/0x10
 ? __pfx_sock_has_perm+0x10/0x10
 ? mutex_lock+0x8d/0xe0
 ? __pfx_mutex_lock+0x10/0x10
 xfrm_netlink_rcv+0x44/0x50
 netlink_unicast+0x36f/0x4c0
 ? __pfx_netlink_unicast+0x10/0x10
 ? netlink_recvmsg+0x500/0x660
 netlink_sendmsg+0x3b7/0x700

This Null-ptr-deref bug is assigned CVE-2023-3772. And this commit
adds additional NULL check in xfrm_update_ae_params to fix the NPD.

Fixes: d8647b7 ("xfrm: Add user interface for esn and big anti-replay windows")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
Noticed with:

  make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

Direct leak of 45 byte(s) in 1 object(s) allocated from:
    #0 0x7f213f87243b in strdup (/lib64/libasan.so.8+0x7243b)
    #1 0x63d15f in evsel__set_filter util/evsel.c:1371
    #2 0x63d15f in evsel__append_filter util/evsel.c:1387
    #3 0x63d15f in evsel__append_tp_filter util/evsel.c:1400
    #4 0x62cd52 in evlist__append_tp_filter util/evlist.c:1145
    #5 0x62cd52 in evlist__append_tp_filter_pids util/evlist.c:1196
    #6 0x541e49 in trace__set_filter_loop_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3646
    #7 0x541e49 in trace__set_filter_pids /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3670
    #8 0x541e49 in trace__run /home/acme/git/perf-tools/tools/perf/builtin-trace.c:3970
    #9 0x541e49 in cmd_trace /home/acme/git/perf-tools/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools/tools/perf/perf.c:421
    torvalds#13 0x4196da in main /home/acme/git/perf-tools/tools/perf/perf.c:537
    torvalds#14 0x7f213e84a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

Free it on evsel__exit().

Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/20230719202951.534582-2-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
To plug these leaks detected with:

  $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

  =================================================================
  ==473890==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 112 byte(s) in 1 object(s) allocated from:
    #0 0x7fdf19aba097 in calloc (/lib64/libasan.so.8+0xba097)
    #1 0x987836 in zalloc (/home/acme/bin/perf+0x987836)
    #2 0x5367ae in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1289
    #3 0x5367ae in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307
    #4 0x5367ae in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468
    #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177
    #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685
    #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712
    #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055
    #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Direct leak of 2048 byte(s) in 1 object(s) allocated from:
    #0 0x7f788fcba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af)
    #1 0x5337c0 in trace__sys_enter /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2342
    #2 0x52bfb4 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3191
    #3 0x52bfb4 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3699
    #4 0x542883 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3726
    #5 0x542883 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4069
    #6 0x542883 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5155
    #7 0x5ef232 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #8 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #9 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    #10 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    #11 0x7f788ec4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Indirect leak of 48 byte(s) in 1 object(s) allocated from:
    #0 0x7fdf19aba6af in __interceptor_malloc (/lib64/libasan.so.8+0xba6af)
    #1 0x77b335 in intlist__new util/intlist.c:116
    #2 0x5367fd in thread_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1293
    #3 0x5367fd in thread__trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:1307
    #4 0x5367fd in trace__sys_exit /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:2468
    #5 0x52bf34 in trace__handle_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3177
    #6 0x52bf34 in __trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3685
    #7 0x542927 in trace__deliver_event /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3712
    #8 0x542927 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:4055
    #9 0x542927 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5141
    #10 0x5ef1a2 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
    #11 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
    #12 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
    torvalds#13 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
    torvalds#14 0x7fdf18a4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/20230719202951.534582-4-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
In 3cb4d5e ("perf trace: Free syscall tp fields in
evsel->priv") it only was freeing if strcmp(evsel->tp_format->system,
"syscalls") returned zero, while the corresponding initialization of
evsel->priv was being performed if it was _not_ zero, i.e. if the tp
system wasn't 'syscalls'.

Just stop looking for that and free it if evsel->priv was set, which
should be equivalent.

Also use the pre-existing evsel_trace__delete() function.

This resolves these leaks, detected with:

  $ make EXTRA_CFLAGS="-fsanitize=address" BUILD_BPF_SKEL=1 CORESIGHT=1 O=/tmp/build/perf-tools-next -C tools/perf install-bin

  =================================================================
  ==481565==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
      #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966)
      #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307
      #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333
      #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458
      #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480
      #6 0x540e8b in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3212
      #7 0x540e8b in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891
      #8 0x540e8b in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156
      #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
      torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  Direct leak of 40 byte(s) in 1 object(s) allocated from:
      #0 0x7f7343cba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x987966 in zalloc (/home/acme/bin/perf+0x987966)
      #2 0x52f9b9 in evsel_trace__new /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:307
      #3 0x52f9b9 in evsel__syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:333
      #4 0x52f9b9 in evsel__init_raw_syscall_tp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:458
      #5 0x52f9b9 in perf_evsel__raw_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:480
      #6 0x540dd1 in trace__add_syscall_newtp /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3205
      #7 0x540dd1 in trace__run /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:3891
      #8 0x540dd1 in cmd_trace /home/acme/git/perf-tools-next/tools/perf/builtin-trace.c:5156
      #9 0x5ef262 in run_builtin /home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #10 0x4196da in handle_internal_command /home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #11 0x4196da in run_argv /home/acme/git/perf-tools-next/tools/perf/perf.c:421
      #12 0x4196da in main /home/acme/git/perf-tools-next/tools/perf/perf.c:537
      torvalds#13 0x7f7342c4a50f in __libc_start_call_main (/lib64/libc.so.6+0x2750f)

  SUMMARY: AddressSanitizer: 80 byte(s) leaked in 2 allocation(s).
  [root@quaco ~]#

With this we plug all leaks with "perf trace sleep 1".

Fixes: 3cb4d5e ("perf trace: Free syscall tp fields in evsel->priv")
Acked-by: Ian Rogers <irogers@google.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Riccardo Mancini <rickyman7@gmail.com>
Link: https://lore.kernel.org/lkml/20230719202951.534582-5-acme@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
…failure to add a probe

Building perf with EXTRA_CFLAGS="-fsanitize=address" a leak is detect
when trying to add a probe to a non-existent function:

  # perf probe -x ~/bin/perf dso__neW
  Probe point 'dso__neW' not found.
    Error: Failed to add events.

  =================================================================
  ==296634==ERROR: LeakSanitizer: detected memory leaks

  Direct leak of 128 byte(s) in 1 object(s) allocated from:
      #0 0x7f67642ba097 in calloc (/lib64/libasan.so.8+0xba097)
      #1 0x7f67641a76f1 in allocate_cfi (/lib64/libdw.so.1+0x3f6f1)

  Direct leak of 65 byte(s) in 1 object(s) allocated from:
      #0 0x7f67642b95b5 in __interceptor_realloc.part.0 (/lib64/libasan.so.8+0xb95b5)
      #1 0x6cac75 in strbuf_grow util/strbuf.c:64
      #2 0x6ca934 in strbuf_init util/strbuf.c:25
      #3 0x9337d2 in synthesize_perf_probe_point util/probe-event.c:2018
      #4 0x92be51 in try_to_find_probe_trace_events util/probe-event.c:964
      #5 0x93d5c6 in convert_to_probe_trace_events util/probe-event.c:3512
      #6 0x93d6d5 in convert_perf_probe_events util/probe-event.c:3529
      #7 0x56f37f in perf_add_probe_events /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:354
      #8 0x572fbc in __cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:738
      #9 0x5730f2 in cmd_probe /var/home/acme/git/perf-tools-next/tools/perf/builtin-probe.c:766
      #10 0x635d81 in run_builtin /var/home/acme/git/perf-tools-next/tools/perf/perf.c:323
      #11 0x6362c1 in handle_internal_command /var/home/acme/git/perf-tools-next/tools/perf/perf.c:377
      #12 0x63667a in run_argv /var/home/acme/git/perf-tools-next/tools/perf/perf.c:421
      torvalds#13 0x636b8d in main /var/home/acme/git/perf-tools-next/tools/perf/perf.c:537
      torvalds#14 0x7f676302950f in __libc_start_call_main (/lib64/libc.so.6+0x2950f)

  SUMMARY: AddressSanitizer: 193 byte(s) leaked in 2 allocation(s).
  #

synthesize_perf_probe_point() returns a "detachec" strbuf, i.e. a
malloc'ed string that needs to be free'd.

An audit will be performed to find other such cases.

Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZM0l1Oxamr4SVjfY@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
borkmann pushed a commit that referenced this pull request Sep 24, 2023
Jiri Pirko says:

====================
expose devlink instances relationships

From: Jiri Pirko <jiri@nvidia.com>

Currently, the user can instantiate new SF using "devlink port add"
command. That creates an E-switch representor devlink port.

When user activates this SF, there is an auxiliary device created and
probed for it which leads to SF devlink instance creation.

There is 1:1 relationship between E-switch representor devlink port and
the SF auxiliary device devlink instance.

Also, for example in mlx5, one devlink instance is created for
PCI device and one is created for an auxiliary device that represents
the uplink port. The relation between these is invisible to the user.

Patches #1-#3 and #5 are small preparations.

Patch #4 adds netnsid attribute for nested devlink if that in a
different namespace.

Patch #5 is the main one in this set, introduces the relationship
tracking infrastructure later on used to track SFs, linecards and
devlink instance relationships with nested devlink instances.

Expose the relation to the user by introducing new netlink attribute
DEVLINK_PORT_FN_ATTR_DEVLINK which contains the devlink instance related
to devlink port function. This is done by patch #8.
Patch #9 implements this in mlx5 driver.

Patch #10 converts the linecard nested devlink handling to the newly
introduced rel infrastructure.

Patch #11 benefits from the rel infra and introduces possiblitily to
have relation between devlink instances.
Patch #12 implements this in mlx5 driver.

Examples:
$ devlink dev
pci/0000:08:00.0: nested_devlink auxiliary/mlx5_core.eth.0
pci/0000:08:00.1: nested_devlink auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.1
auxiliary/mlx5_core.eth.0

$ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 106
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state inactive opstate detached roce enable
$ devlink port function set pci/0000:08:00.0/32768 state active
$ devlink port show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2

$ devlink port show pci/0000:08:00.0/32768
pci/0000:08:00.0/32768: type eth netdev eth4 flavour pcisf controller 0 pfnum 0 sfnum 106 splittable false
  function:
    hw_addr 00:00:00:00:00:00 state active opstate attached roce enable nested_devlink auxiliary/mlx5_core.sf.2 nested_devlink_netns ns1
====================

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
borkmann pushed a commit that referenced this pull request Oct 6, 2023
Fix an error detected by memory sanitizer:
```
==4033==WARNING: MemorySanitizer: use-of-uninitialized-value
    #0 0x55fb0fbedfc7 in read_alias_info tools/perf/util/pmu.c:457:6
    #1 0x55fb0fbea339 in check_info_data tools/perf/util/pmu.c:1434:2
    #2 0x55fb0fbea339 in perf_pmu__check_alias tools/perf/util/pmu.c:1504:9
    #3 0x55fb0fbdca85 in parse_events_add_pmu tools/perf/util/parse-events.c:1429:32
    #4 0x55fb0f965230 in parse_events_parse tools/perf/util/parse-events.y:299:6
    #5 0x55fb0fbdf6b2 in parse_events__scanner tools/perf/util/parse-events.c:1822:8
    #6 0x55fb0fbdf8c1 in __parse_events tools/perf/util/parse-events.c:2094:8
    #7 0x55fb0fa8ffa9 in parse_events tools/perf/util/parse-events.h:41:9
    #8 0x55fb0fa8ffa9 in test_event tools/perf/tests/parse-events.c:2393:8
    #9 0x55fb0fa8f458 in test__pmu_events tools/perf/tests/parse-events.c:2551:15
    #10 0x55fb0fa6d93f in run_test tools/perf/tests/builtin-test.c:242:9
    #11 0x55fb0fa6d93f in test_and_print tools/perf/tests/builtin-test.c:271:8
    #12 0x55fb0fa6d082 in __cmd_test tools/perf/tests/builtin-test.c:442:5
    torvalds#13 0x55fb0fa6d082 in cmd_test tools/perf/tests/builtin-test.c:564:9
    torvalds#14 0x55fb0f942720 in run_builtin tools/perf/perf.c:322:11
    torvalds#15 0x55fb0f942486 in handle_internal_command tools/perf/perf.c:375:8
    torvalds#16 0x55fb0f941dab in run_argv tools/perf/perf.c:419:2
    torvalds#17 0x55fb0f941dab in main tools/perf/perf.c:535:3
```

Fixes: 7b723db ("perf pmu: Be lazy about loading event info files from sysfs")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Link: https://lore.kernel.org/r/20230914022425.1489035-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
borkmann pushed a commit that referenced this pull request Oct 6, 2023
The following call trace shows a deadlock issue due to recursive locking of
mutex "device_mutex". First lock acquire is in target_for_each_device() and
second in target_free_device().

 PID: 148266   TASK: ffff8be21ffb5d00  CPU: 10   COMMAND: "iscsi_ttx"
  #0 [ffffa2bfc9ec3b18] __schedule at ffffffffa8060e7f
  #1 [ffffa2bfc9ec3ba0] schedule at ffffffffa8061224
  #2 [ffffa2bfc9ec3bb8] schedule_preempt_disabled at ffffffffa80615ee
  #3 [ffffa2bfc9ec3bc8] __mutex_lock at ffffffffa8062fd7
  #4 [ffffa2bfc9ec3c40] __mutex_lock_slowpath at ffffffffa80631d3
  #5 [ffffa2bfc9ec3c50] mutex_lock at ffffffffa806320c
  #6 [ffffa2bfc9ec3c68] target_free_device at ffffffffc0935998 [target_core_mod]
  #7 [ffffa2bfc9ec3c90] target_core_dev_release at ffffffffc092f975 [target_core_mod]
  #8 [ffffa2bfc9ec3ca0] config_item_put at ffffffffa79d250f
  #9 [ffffa2bfc9ec3cd0] config_item_put at ffffffffa79d2583
 #10 [ffffa2bfc9ec3ce0] target_devices_idr_iter at ffffffffc0933f3a [target_core_mod]
 #11 [ffffa2bfc9ec3d00] idr_for_each at ffffffffa803f6fc
 #12 [ffffa2bfc9ec3d60] target_for_each_device at ffffffffc0935670 [target_core_mod]
 torvalds#13 [ffffa2bfc9ec3d98] transport_deregister_session at ffffffffc0946408 [target_core_mod]
 torvalds#14 [ffffa2bfc9ec3dc8] iscsit_close_session at ffffffffc09a44a6 [iscsi_target_mod]
 torvalds#15 [ffffa2bfc9ec3df0] iscsit_close_connection at ffffffffc09a4a88 [iscsi_target_mod]
 torvalds#16 [ffffa2bfc9ec3df8] finish_task_switch at ffffffffa76e5d07
 torvalds#17 [ffffa2bfc9ec3e78] iscsit_take_action_for_connection_exit at ffffffffc0991c23 [iscsi_target_mod]
 torvalds#18 [ffffa2bfc9ec3ea0] iscsi_target_tx_thread at ffffffffc09a403b [iscsi_target_mod]
 torvalds#19 [ffffa2bfc9ec3f08] kthread at ffffffffa76d8080
 torvalds#20 [ffffa2bfc9ec3f50] ret_from_fork at ffffffffa8200364

Fixes: 36d4cb4 ("scsi: target: Avoid that EXTENDED COPY commands trigger lock inversion")
Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
Link: https://lore.kernel.org/r/20230918225848.66463-1-junxiao.bi@oracle.com
Reviewed-by: Mike Christie <michael.christie@oracle.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
borkmann pushed a commit that referenced this pull request Oct 20, 2023
Chuyi Zhou says:

====================
Add Open-coded task, css_task and css iters

This is version 6 of task, css_task and css iters support.

--- Changelog ---

v5 -> v6:

Patch #3:
 * In bpf_iter_task_next, return pos rather than goto out. (Andrii)
Patch #2, #3, #4:
 * Add the missing __diag_ignore_all to avoid kernel build warning
Patch #5, #6, #7:
 * Add Andrii's ack

Patch #8:
 * In BPF prog iter_css_task_for_each, return -EPERM rather than 0, and
   ensure stack_mprotect() in iters.c not success. If not, it would cause
   the subsequent 'test_lsm' fail, since the 'is_stack' check in
   test_int_hook(lsm.c) would not be guaranteed.
   (https://github.com/kernel-patches/bpf/actions/runs/6489662214/job/17624665086?pr=5790)

v4 -> v5:https://lore.kernel.org/lkml/20231007124522.34834-1-zhouchuyi@bytedance.com/

Patch 3~4:
 * Relax the BUILD_BUG_ON check in bpf_iter_task_new and bpf_iter_css_new to avoid
   netdev/build_32bit CI error.
   (https://netdev.bots.linux.dev/static/nipa/790929/13412333/build_32bit/stderr)
Patch 8:
 * Initialize skel pointer to fix the LLVM-16 build CI error
   (https://github.com/kernel-patches/bpf/actions/runs/6462875618/job/17545170863)

v3 -> v4:https://lore.kernel.org/all/20230925105552.817513-1-zhouchuyi@bytedance.com/

* Address all the comments from Andrii in patch-3 ~ patch-6
* Collect Tejun's ack
* Add a extra patch to rename bpf_iter_task.c to bpf_iter_tasks.c
* Seperate three BPF program files for selftests (iters_task.c iters_css_task.c iters_css.c)

v2 -> v3:https://lore.kernel.org/lkml/20230912070149.969939-1-zhouchuyi@bytedance.com/

Patch 1 (cgroup: Prepare for using css_task_iter_*() in BPF)
  * Add tj's ack and Alexei's suggest-by.
Patch 2 (bpf: Introduce css_task open-coded iterator kfuncs)
  * Use bpf_mem_alloc/bpf_mem_free rather than kzalloc()
  * Add KF_TRUSTED_ARGS for bpf_iter_css_task_new (Alexei)
  * Move bpf_iter_css_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 3 (Introduce task open coded iterator kfuncs)
  * Change th API design keep consistent with SEC("iter/task"), support
    iterating all threads(BPF_TASK_ITERATE_ALL) and threads of a
    specific task (BPF_TASK_ITERATE_THREAD).(Andrii)
  * Move bpf_iter_task's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_task_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 4 (Introduce css open-coded iterator kfuncs)
  * Change th API design keep consistent with cgroup_iters, reuse
    BPF_CGROUP_ITER_DESCENDANTS_PRE/BPF_CGROUP_ITER_DESCENDANTS_POST
    /BPF_CGROUP_ITER_ANCESTORS_UP(Andrii)
  * Add KF_TRUSTED_ARGS for bpf_iter_css_new
  * Move bpf_iter_css's definition from uapi/linux/bpf.h to
    kernel/bpf/task_iter.c and we can use it from vmlinux.h
  * Move bpf_iter_css_XXX's declaration from bpf_helpers.h to
    bpf_experimental.h
Patch 5 (teach the verifier to enforce css_iter and task_iter in RCU CS)
  * Add KF flag KF_RCU_PROTECTED to maintain kfuncs which need RCU CS.(Andrii)
  * Consider STACK_ITER when using bpf_for_each_spilled_reg.
Patch 6 (Let bpf_iter_task_new accept null task ptr)
  * Add this extra patch to let bpf_iter_task_new accept a 'nullable'
  * task pointer(Andrii)
Patch 7 (selftests/bpf: Add tests for open-coded task and css iter)
  * Add failure testcase(Alexei)

Changes from v1(https://lore.kernel.org/lkml/20230827072057.1591929-1-zhouchuyi@bytedance.com/):
- Add a pre-patch to make some preparations before supporting css_task
  iters.(Alexei)
- Add an allowlist for css_task iters(Alexei)
- Let bpf progs do explicit bpf_rcu_read_lock() when using process
  iters and css_descendant iters.(Alexei)
---------------------

In some BPF usage scenarios, it will be useful to iterate the process and
css directly in the BPF program. One of the expected scenarios is
customizable OOM victim selection via BPF[1].

Inspired by Dave's task_vma iter[2], this patchset adds three types of
open-coded iterator kfuncs:

1. bpf_task_iters. It can be used to
1) iterate all process in the system, like for_each_forcess() in kernel.
2) iterate all threads in the system.
3) iterate all threads of a specific task

2. bpf_css_iters. It works like css_task_iter_{start, next, end} and would
be used to iterating tasks/threads under a css.

3. css_iters. It works like css_next_descendant_{pre, post} to iterating all
descendant css.

BPF programs can use these kfuncs directly or through bpf_for_each macro.

link[1]: https://lore.kernel.org/lkml/20230810081319.65668-1-zhouchuyi@bytedance.com/
link[2]: https://lore.kernel.org/all/20230810183513.684836-1-davemarchevsky@fb.com/
====================

Link: https://lore.kernel.org/r/20231018061746.111364-1-zhouchuyi@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
borkmann pushed a commit that referenced this pull request Oct 27, 2023
Jiri Pirko says:

====================
devlink: finish conversion to generated split_ops

This patchset converts the remaining genetlink commands to generated
split_ops and removes the existing small_ops arrays entirely
alongside with shared netlink attribute policy.

Patches #1-#6 are just small preparations and small fixes on multiple
              places. Note that couple of patches contain the "Fixes"
              tag but no need to put them into -net tree.
Patch #7 is a simple rename preparation
Patch #8 is the main one in this set and adds actual definitions of cmds
         in to yaml file.
Patches #9-#10 finalize the change removing bits that are no longer in
               use.
====================

Link: https://lore.kernel.org/r/20231021112711.660606-1-jiri@resnulli.us
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
borkmann pushed a commit that referenced this pull request Oct 31, 2023
Ido Schimmel says:

====================
Add MDB get support

This patchset adds MDB get support, allowing user space to request a
single MDB entry to be retrieved instead of dumping the entire MDB.
Support is added in both the bridge and VXLAN drivers.

Patches #1-#6 are small preparations in both drivers.

Patches #7-#8 add the required uAPI attributes for the new functionality
and the MDB get net device operation (NDO), respectively.

Patches #9-#10 implement the MDB get NDO in both drivers.

Patch #11 registers a handler for RTM_GETMDB messages in rtnetlink core.
The handler derives the net device from the ifindex specified in the
ancillary header and invokes its MDB get NDO.

Patches #12-torvalds#13 add selftests by converting tests that use MDB dump with
grep to the new MDB get functionality.

iproute2 changes can be found here [1].

v2:
* Patch #7: Add a comment to describe attributes structure.
* Patch #9: Add a comment above spin_lock_bh().

[1] https://github.com/idosch/iproute2/tree/submit/mdb_get_v1
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.