Merge branch 'll_poll'

Eliezer Tamir says:

====================
This patch set adds the ability for the socket layer code to poll directly
on an Ethernet device's RX queue. This eliminates the cost of the interrupt
and context switch and with proper tuning allows us to get very close
to the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from last year
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to lookup a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.

Performance numbers:

setup                               TCP_RR             UDP_RR
kernel   Config     C3/6  rx-usecs  tps  cpu%  S.dem   tps  cpu%  S.dem
patched  optimized  on    100       87k  3.13  11.4    94k  3.17  10.7
patched  optimized  on    0         71k  3.12  14.0    84k  3.19  12.0
patched  optimized  on    adaptive  80k  3.13  12.5    90k  3.46  12.2
patched  typical    on    100       72k  3.13  14.0    79k  3.17  12.8
patched  typical    on    0         60k  2.13  16.5    71k  3.18  14.0
patched  typical    on    adaptive  67k  3.51  16.7    75k  3.36  14.5
3.9      optimized  on    adaptive  25k  1.0   12.7    28k  0.98  11.2
3.9      typical    off   0         48k  1.09  7.3     52k  1.11  4.18
3.9      typical    off   adaptive  35k  1.12  4.08    38k  0.65  5.49
3.9      optimized  off   adaptive  40k  0.82  4.83    43k  0.70  5.23
3.9      optimized  off   0         57k  1.17  4.08    62k  1.04  3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped down config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive, 100 us
When C3/6 states were turned on (via BIOS) the performance governor was used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.

Design:
A global hash table that allows us to look up a struct napi by a unique id
was added.
A napi_id field was added both to struct sk_buff and struct sk.
This is used to track which NAPI we need to poll for a specific socket.
The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the protocol
handler.
When the socket code does not find any more data on the socket queue,
it now may call ndo_ll_poll which will crank the device's rx queue and feed
incoming packets to the stack directly from the context of the socket.
A sysctl value (net.core.low_latency_poll) controls how many microseconds
we busy-wait before giving up. (Setting it to 0 globally disables
busy-polling.)

Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and ndo_ll_poll
is highly device / configuration dependent, we do this inside the
Ethernet driver.
For example, when packets for high priority connections are sent to separate
rx queues, you might not need locking between napi poll and ndo_ll_poll
at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(Earlier versions of this patchset did touch them, but this design is
simpler and works better.)

If a queue is actively polled by a socket (on another CPU) napi poll
will not service it, but will wait until the queue can be locked
and cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it,
either from napi or from another socket polling on the queue,
the socket code can busy wait on the socket's skb queue.

ndo_ll_poll does not have preferential treatment for the data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.

2. napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), the caller must take care to wait an rcu
grace period before freeing the memory containing the napi struct.
(Ixgbe already had this because the queue vector structure uses rcu to
protect the statistics counters in it.)

How to test:

1. The patchset should apply cleanly to net-next.
   (Don't forget to configure NET_LL_RX_POLL.)

2. The ethtool -c setting for rx-usecs should be on the order of 100.

3. Use ethtool -K to disable GRO and LRO.
   (You are encouraged to try it both ways. If you find that your workload
   does better with GRO on do tell us.)

4. Sysctl value net.core.low_latency_poll controls how long
   (in us) to busy-wait for more data. You are encouraged to play
   with this and see what works for you. The default is now 0 so you need
   to set it to turn the feature on. I recommend a value around 50.

5. The benchmark thread and IRQ should be bound to separate cores.
   Both cores should be on the same CPU NUMA node as the NIC.
   When the app and the IRQ run on the same CPU you get a small penalty.
   If interrupt coalescing is set to a low value this penalty
   can be very large.

6. If you suspect that your machine is not configured properly,
   use numademo to make sure that the CPU to memory BW is OK.
   numademo 128m memcpy local copy numbers should be more than
   8GB/s on a properly configured machine.

Change log:

v10
- removed select/poll support.
  (we will work on this some more and try again)

v9
- correct sysctl proc_handler, reported by Eric Dumazet and Amir Vadai.
- more int -> bool changes, reported by Eric Dumazet.
- better mask testing in sock_poll(), reported by Eric Dumazet.

v8
- split out udp and select/poll into separate patches.
  what used to be patch 2/5 is now three patches.
- type corrections from Amir Vadai and Cong Wang:
  one unsigned long that was left when changing to cycles_t
  int -> bool
- more detailed patch descriptions.

v7
- suggested by Ben Hutchings and Eric Dumazet:
  type fixes, static for globals in net/core.c,
  avoid napi_id collisions in napi_hash_add()

v6
- many small fixes suggested by Eric Dumazet:
  data locality, typos, documentation
  protect napi_hash insert/delete with a spinlock
  (napi_gen_id is no longer atomic_t since it's only accessed
  with the spinlock held.)
- added IPv6 TCP and UDP support (only minimally tested)

v5
- corrections suggested by Ben Hutchings: fixed typos,
  moved the config option and sysctl value from IPv4 to net
- moved sk_mark_ll() to the protocol handlers
- removed global id mechanism, replaced with a hashed napi_id.
  based on code sample from Eric Dumazet
  Note that ixgbe_free_q_vector() already waits an rcu grace period
  before freeing the q_vector, so nothing additional needs to be done
  when adding a call to napi_hash_del().
- simple poll/select support

v4
- removed separate config option for TCP as suggested by Eric Dumazet.
- added linux mib counter for packets received through the low latency path,
  as suggested by Andi Kleen.
- re-allow module unloading, remove module param, use a global generation id
  instead to prevent the use of a stale napi pointer, as suggested
  by Eric Dumazet
- updated Documentation/networking/ip-sysctl.txt text

v3
- coding style changes suggested by Dave Miller

v2
- the sysctl knob is now in microseconds. The default value is now 0 (off).
- for now the code depends at configure time on CONFIG_X86_TSC
- the napi reference in struct skb is now a union with the dma cookie
  since the former is only used on RX and the latter on TX,
  as suggested by Eric Dumazet.
- we do a better job at honoring non-blocking operations.
- removed busy-polling support for tcp_read_sock()
- remove dynamic disabling of GRO
- coding style fixes
- disallow unloading the device module after the feature has been used

Credit:
Jesse Brandeburg, Arun Chekhov Ilango, Julie Cummings,
Alexander Duyck, Eric Geisler, Jason Neighbors, Yadong Li,
Mike Polehn, Anil Vasudevan, Don Wood

Special thanks for finding bugs in earlier versions:
Willem de Bruijn and Andi Kleen
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
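
The busy-wait behaviour described under "Design" (spin on the device queue for up to low_latency_poll microseconds, with 0 meaning "off") can be modeled in user space roughly as follows. This is only a sketch under stated assumptions, not the kernel code: struct fake_dev, dev_poll(), now_us() and busy_wait_for_data() are hypothetical names, and CLOCK_MONOTONIC stands in for the cycle counter the patches rely on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for a device RX queue: a packet "arrives"
 * once the queue has been polled polls_until_packet times. */
struct fake_dev {
	unsigned int polls_until_packet;
	unsigned int polls_done;
};

/* Hypothetical stand-in for ndo_ll_poll(): returns packets delivered. */
static int dev_poll(struct fake_dev *dev)
{
	dev->polls_done++;
	return dev->polls_done >= dev->polls_until_packet ? 1 : 0;
}

/* Monotonic time in microseconds (assumed clock source for the model). */
static uint64_t now_us(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Spin on the device for at most budget_us microseconds.
 * A budget of 0 disables busy-polling, mirroring the sysctl default.
 * Returns true if data showed up before the budget ran out. */
static bool busy_wait_for_data(struct fake_dev *dev, unsigned int budget_us)
{
	uint64_t deadline;

	assert(dev != NULL);
	if (budget_us == 0)
		return false;

	deadline = now_us() + budget_us;
	do {
		if (dev_poll(dev) > 0)
			return true;
	} while (now_us() < deadline);

	return false;	/* caller falls back to the normal blocking path */
}
```

A caller would try busy_wait_for_data(&dev, 50) when its receive queue is empty and go to sleep as usual only if it returns false.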
aarch64-laptops · Jun 11, 2013 · 0a4db18
2 parents 6f00a02 + 7e15b90
commit 0a4db18
Showing 23 changed files with 556 additions and 12 deletions.
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
@@ -50,6 +50,13 @@ The maximum number of packets that kernel can handle on a NAPI interrupt,
 it's a Per-CPU variable.
 Default: 64
 
+low_latency_poll
+----------------
+Low latency busy poll timeout. (needs CONFIG_NET_LL_RX_POLL)
+Approximate time in us to spin waiting for packets on the device queue.
+Recommended value is 50. May increase power usage.
+Default: 0 (off)
+
 rmem_default
 ------------
 

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -52,6 +52,11 @@
 #include <linux/dca.h>
 #endif
 
+#include <net/ll_poll.h>
+
+#ifdef CONFIG_NET_LL_RX_POLL
+#define LL_EXTENDED_STATS
+#endif
 /* common prefix used by pr_<> macros */
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -182,6 +187,11 @@ struct ixgbe_rx_buffer {
 struct ixgbe_queue_stats {
 	u64 packets;
 	u64 bytes;
+#ifdef LL_EXTENDED_STATS
+	u64 yields;
+	u64 misses;
+	u64 cleaned;
+#endif  /* LL_EXTENDED_STATS */
 };
 
 struct ixgbe_tx_queue_stats {
@@ -356,9 +366,133 @@ struct ixgbe_q_vector {
 	struct rcu_head rcu;	/* to avoid race with update stats on free */
 	char name[IFNAMSIZ + 9];
 
+#ifdef CONFIG_NET_LL_RX_POLL
+	unsigned int state;
+#define IXGBE_QV_STATE_IDLE        0
+#define IXGBE_QV_STATE_NAPI	   1    /* NAPI owns this QV */
+#define IXGBE_QV_STATE_POLL	   2    /* poll owns this QV */
+#define IXGBE_QV_LOCKED (IXGBE_QV_STATE_NAPI | IXGBE_QV_STATE_POLL)
+#define IXGBE_QV_STATE_NAPI_YIELD  4    /* NAPI yielded this QV */
+#define IXGBE_QV_STATE_POLL_YIELD  8    /* poll yielded this QV */
+#define IXGBE_QV_YIELD (IXGBE_QV_STATE_NAPI_YIELD | IXGBE_QV_STATE_POLL_YIELD)
+#define IXGBE_QV_USER_PEND (IXGBE_QV_STATE_POLL | IXGBE_QV_STATE_POLL_YIELD)
+	spinlock_t lock;
+#endif  /* CONFIG_NET_LL_RX_POLL */
+
 	/* for dynamic allocation of rings associated with this q_vector */
 	struct ixgbe_ring ring[0] ____cacheline_internodealigned_in_smp;
 };
+#ifdef CONFIG_NET_LL_RX_POLL
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+
+	spin_lock_init(&q_vector->lock);
+	q_vector->state = IXGBE_QV_STATE_IDLE;
+}
+
+/* called from the device poll routine to get ownership of a q_vector */
+static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+	bool rc = true;
+	spin_lock(&q_vector->lock);
+	if (q_vector->state & IXGBE_QV_LOCKED) {
+		WARN_ON(q_vector->state & IXGBE_QV_STATE_NAPI);
+		q_vector->state |= IXGBE_QV_STATE_NAPI_YIELD;
+		rc = false;
+#ifdef LL_EXTENDED_STATS
+		q_vector->tx.ring->stats.yields++;
+#endif
+	} else
+		/* we don't care if someone yielded */
+		q_vector->state = IXGBE_QV_STATE_NAPI;
+	spin_unlock(&q_vector->lock);
+	return rc;
+}
+
+/* returns true if someone tried to get the qv while napi had it */
+static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+	bool rc = false;
+	spin_lock(&q_vector->lock);
+	WARN_ON(q_vector->state & (IXGBE_QV_STATE_POLL |
+			       IXGBE_QV_STATE_NAPI_YIELD));
+
+	if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+		rc = true;
+	q_vector->state = IXGBE_QV_STATE_IDLE;
+	spin_unlock(&q_vector->lock);
+	return rc;
+}
+
+/* called from ixgbe_low_latency_poll() */
+static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+	bool rc = true;
+	spin_lock_bh(&q_vector->lock);
+	if ((q_vector->state & IXGBE_QV_LOCKED)) {
+		q_vector->state |= IXGBE_QV_STATE_POLL_YIELD;
+		rc = false;
+#ifdef LL_EXTENDED_STATS
+		q_vector->rx.ring->stats.yields++;
+#endif
+	} else
+		/* preserve yield marks */
+		q_vector->state |= IXGBE_QV_STATE_POLL;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* returns true if someone tried to get the qv while it was locked */
+static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+	bool rc = false;
+	spin_lock_bh(&q_vector->lock);
+	WARN_ON(q_vector->state & (IXGBE_QV_STATE_NAPI));
+
+	if (q_vector->state & IXGBE_QV_STATE_POLL_YIELD)
+		rc = true;
+	q_vector->state = IXGBE_QV_STATE_IDLE;
+	spin_unlock_bh(&q_vector->lock);
+	return rc;
+}
+
+/* true if a socket is polling, even if it did not get the lock */
+static inline bool ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+	WARN_ON(!(q_vector->state & IXGBE_QV_LOCKED));
+	return q_vector->state & IXGBE_QV_USER_PEND;
+}
+#else /* CONFIG_NET_LL_RX_POLL */
+static inline void ixgbe_qv_init_lock(struct ixgbe_q_vector *q_vector)
+{
+}
+
+static inline bool ixgbe_qv_lock_napi(struct ixgbe_q_vector *q_vector)
+{
+	return true;
+}
+
+static inline bool ixgbe_qv_unlock_napi(struct ixgbe_q_vector *q_vector)
+{
+	return false;
+}
+
+static inline bool ixgbe_qv_lock_poll(struct ixgbe_q_vector *q_vector)
+{
+	return false;
+}
+
+static inline bool ixgbe_qv_unlock_poll(struct ixgbe_q_vector *q_vector)
+{
+	return false;
+}
+
+static inline bool ixgbe_qv_ll_polling(struct ixgbe_q_vector *q_vector)
+{
+	return false;
+}
+#endif /* CONFIG_NET_LL_RX_POLL */
+
 #ifdef CONFIG_IXGBE_HWMON
 
 #define IXGBE_HWMON_TYPE_LOC		0
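
The q_vector ownership protocol added to ixgbe.h above may be easier to follow as a small single-threaded model. The sketch below reuses the state bit values from the diff but is otherwise hypothetical: struct qv_model and the qv_*() helpers are illustration names, the spinlock is left out, the two unlock helpers are folded into one, and a single yields counter replaces the per-ring statistics.

```c
#include <assert.h>
#include <stdbool.h>

/* State bits, same values as the IXGBE_QV_* defines in the diff. */
enum {
	QV_IDLE       = 0,
	QV_NAPI       = 1,	/* NAPI owns the queue vector */
	QV_POLL       = 2,	/* a polling socket owns it */
	QV_NAPI_YIELD = 4,	/* NAPI tried to lock and backed off */
	QV_POLL_YIELD = 8,	/* a socket tried to lock and backed off */
	QV_LOCKED     = QV_NAPI | QV_POLL,
};

/* Hypothetical model; the driver guards state with q_vector->lock. */
struct qv_model {
	unsigned int state;
	unsigned int yields;
};

/* NAPI side: take ownership, or record a yield if the qv is busy. */
static bool qv_lock_napi(struct qv_model *qv)
{
	if (qv->state & QV_LOCKED) {
		qv->state |= QV_NAPI_YIELD;
		qv->yields++;
		return false;
	}
	qv->state = QV_NAPI;	/* stale yield marks are dropped */
	return true;
}

/* Socket side: take ownership, or record a yield if the qv is busy. */
static bool qv_lock_poll(struct qv_model *qv)
{
	if (qv->state & QV_LOCKED) {
		qv->state |= QV_POLL_YIELD;
		qv->yields++;
		return false;
	}
	qv->state |= QV_POLL;	/* yield marks are preserved */
	return true;
}

/* Release ownership; returns true if a socket yielded in the meantime. */
static bool qv_unlock(struct qv_model *qv)
{
	bool contended = (qv->state & QV_POLL_YIELD) != 0;

	assert(qv->state & QV_LOCKED);	/* model-only check: must be owned */
	qv->state = QV_IDLE;
	return contended;
}
```

In this model, a socket that calls qv_lock_poll() while NAPI holds the vector gets false and leaves QV_POLL_YIELD set, which the owner then sees as a true return from qv_unlock().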

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -1054,6 +1054,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
 			data[i] = 0;
 			data[i+1] = 0;
 			i += 2;
+#ifdef LL_EXTENDED_STATS
+			data[i] = 0;
+			data[i+1] = 0;
+			data[i+2] = 0;
+			i += 3;
+#endif
 			continue;
 		}
 
@@ -1063,13 +1069,25 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
 			data[i+1] = ring->stats.bytes;
 		} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
 		i += 2;
+#ifdef LL_EXTENDED_STATS
+		data[i] = ring->stats.yields;
+		data[i+1] = ring->stats.misses;
+		data[i+2] = ring->stats.cleaned;
+		i += 3;
+#endif
 	}
 	for (j = 0; j < IXGBE_NUM_RX_QUEUES; j++) {
 		ring = adapter->rx_ring[j];
 		if (!ring) {
 			data[i] = 0;
 			data[i+1] = 0;
 			i += 2;
+#ifdef LL_EXTENDED_STATS
+			data[i] = 0;
+			data[i+1] = 0;
+			data[i+2] = 0;
+			i += 3;
+#endif
 			continue;
 		}
 
@@ -1079,6 +1097,12 @@ static void ixgbe_get_ethtool_stats(struct net_device *netdev,
 			data[i+1] = ring->stats.bytes;
 		} while (u64_stats_fetch_retry_bh(&ring->syncp, start));
 		i += 2;
+#ifdef LL_EXTENDED_STATS
+		data[i] = ring->stats.yields;
+		data[i+1] = ring->stats.misses;
+		data[i+2] = ring->stats.cleaned;
+		i += 3;
+#endif
 	}
 
 	for (j = 0; j < IXGBE_MAX_PACKET_BUFFERS; j++) {
@@ -1115,12 +1139,28 @@ static void ixgbe_get_strings(struct net_device *netdev, u32 stringset,
 			p += ETH_GSTRING_LEN;
 			sprintf(p, "tx_queue_%u_bytes", i);
 			p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+			sprintf(p, "tx_q_%u_napi_yield", i);
+			p += ETH_GSTRING_LEN;
+			sprintf(p, "tx_q_%u_misses", i);
+			p += ETH_GSTRING_LEN;
+			sprintf(p, "tx_q_%u_cleaned", i);
+			p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
 		}
 		for (i = 0; i < IXGBE_NUM_RX_QUEUES; i++) {
 			sprintf(p, "rx_queue_%u_packets", i);
 			p += ETH_GSTRING_LEN;
 			sprintf(p, "rx_queue_%u_bytes", i);
 			p += ETH_GSTRING_LEN;
+#ifdef LL_EXTENDED_STATS
+			sprintf(p, "rx_q_%u_ll_poll_yield", i);
+			p += ETH_GSTRING_LEN;
+			sprintf(p, "rx_q_%u_misses", i);
+			p += ETH_GSTRING_LEN;
+			sprintf(p, "rx_q_%u_cleaned", i);
+			p += ETH_GSTRING_LEN;
+#endif /* LL_EXTENDED_STATS */
 		}
 		for (i = 0; i < IXGBE_MAX_PACKET_BUFFERS; i++) {
 			sprintf(p, "tx_pb_%u_pxon", i);
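
With LL_EXTENDED_STATS defined, each TX and RX queue in the hunks above takes five slots in the ethtool data[] array (packets, bytes, yields, misses, cleaned) instead of two, and the string table has to advance in step. The index arithmetic can be sketched as below; stats_per_queue() and rx_queue_base() are hypothetical helpers for illustration and do not exist in the driver.

```c
#include <stddef.h>

#define BASE_STATS_PER_QUEUE 2	/* packets, bytes */
#define LL_STATS_PER_QUEUE   3	/* yields, misses, cleaned */

/* Hypothetical helper: slots one queue occupies in data[]. */
static size_t stats_per_queue(int ll_extended)
{
	return BASE_STATS_PER_QUEUE +
	       (ll_extended ? LL_STATS_PER_QUEUE : 0);
}

/* Hypothetical helper: index of the first slot of RX queue rx_idx.
 * first_queue_slot is where the per-queue block starts and num_tx is
 * the number of TX queues that precede the RX queues. */
static size_t rx_queue_base(size_t first_queue_slot, size_t num_tx,
			    size_t rx_idx, int ll_extended)
{
	return first_queue_slot +
	       (num_tx + rx_idx) * stats_per_queue(ll_extended);
}
```

For example, with four TX queues and the per-queue block starting at slot 0, the first RX queue starts at slot 20 with the extended statistics and at slot 8 without them.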

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -811,6 +811,7 @@ static int ixgbe_alloc_q_vector(struct ixgbe_adapter *adapter,
 	/* initialize NAPI */
 	netif_napi_add(adapter->netdev, &q_vector->napi,
 		       ixgbe_poll, 64);
+	napi_hash_add(&q_vector->napi);
 
 	/* tie q_vector and adapter together */
 	adapter->q_vector[v_idx] = q_vector;
@@ -931,6 +932,7 @@ static void ixgbe_free_q_vector(struct ixgbe_adapter *adapter, int v_idx)
 		adapter->rx_ring[ring->queue_index] = NULL;
 
 	adapter->q_vector[v_idx] = NULL;
+	napi_hash_del(&q_vector->napi);
 	netif_napi_del(&q_vector->napi);
 
 	/*