Skip to content

Commit

Permalink
Merge branch 'pci/peer-to-peer'
Browse files Browse the repository at this point in the history
  - Add PCI support for peer-to-peer DMA (Logan Gunthorpe)

  - Add sysfs group for PCI peer-to-peer memory statistics (Logan
    Gunthorpe)

  - Add PCI peer-to-peer DMA scatterlist mapping interface (Logan
    Gunthorpe)

  - Add PCI configfs/sysfs helpers for use by peer-to-peer users (Logan
    Gunthorpe)

  - Add PCI peer-to-peer DMA driver writer's documentation (Logan
    Gunthorpe)

  - Add block layer flag to indicate driver support for PCI peer-to-peer
    DMA (Logan Gunthorpe)

  - Map Infiniband scatterlists for peer-to-peer DMA if they contain P2P
    memory (Logan Gunthorpe)

  - Register nvme-pci CMB buffer as PCI peer-to-peer memory (Logan
    Gunthorpe)

  - Add nvme-pci support for PCI peer-to-peer memory in requests (Logan
    Gunthorpe)

  - Use PCI peer-to-peer memory in nvme (Stephen Bates, Steve Wise,
    Christoph Hellwig, Logan Gunthorpe)

* pci/peer-to-peer:
  nvmet: Optionally use PCI P2P memory
  nvmet: Introduce helper functions to allocate and free request SGLs
  nvme-pci: Add support for P2P memory in requests
  nvme-pci: Use PCI p2pmem subsystem to manage the CMB
  IB/core: Ensure we map P2P memory correctly in rdma_rw_ctx_[init|destroy]()
  block: Add PCI P2P flag for request queue
  PCI/P2PDMA: Add P2P DMA driver writer's documentation
  docs-rst: Add a new directory for PCI documentation
  PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
  PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
  PCI/P2PDMA: Add sysfs group to display p2pmem stats
  PCI/P2PDMA: Support peer-to-peer memory
  • Loading branch information
bjorn-helgaas committed Oct 20, 2018
2 parents 0af6166 + c692509 commit 1734715
Show file tree
Hide file tree
Showing 22 changed files with 1,493 additions and 50 deletions.
24 changes: 24 additions & 0 deletions Documentation/ABI/testing/sysfs-bus-pci
Original file line number Diff line number Diff line change
Expand Up @@ -323,3 +323,27 @@ Description:

This is similar to /sys/bus/pci/drivers_autoprobe, but
affects only the VFs associated with a specific PF.

What: /sys/bus/pci/devices/.../p2pmem/size
Date: November 2017
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
If the device has any Peer-to-Peer memory registered, this
file contains the total amount of memory that the device
provides (in decimal).

What: /sys/bus/pci/devices/.../p2pmem/available
Date: November 2017
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
If the device has any Peer-to-Peer memory registered, this
file contains the amount of memory that has not been
allocated (in decimal).

What: /sys/bus/pci/devices/.../p2pmem/published
Date: November 2017
Contact: Logan Gunthorpe <logang@deltatee.com>
Description:
If the device has any Peer-to-Peer memory registered, this
file contains a '1' if the memory has been published for
use outside the driver that owns the device.
2 changes: 1 addition & 1 deletion Documentation/driver-api/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -29,7 +29,7 @@ available subsections can be seen below.
iio/index
input
usb/index
pci
pci/index
spi
i2c
hsi
Expand Down
22 changes: 22 additions & 0 deletions Documentation/driver-api/pci/index.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
.. SPDX-License-Identifier: GPL-2.0
============================================
The Linux PCI driver implementer's API guide
============================================

.. class:: toc-title

Table of contents

.. toctree::
:maxdepth: 2

pci
p2pdma

.. only:: subproject and html

Indices
=======

* :ref:`genindex`
145 changes: 145 additions & 0 deletions Documentation/driver-api/pci/p2pdma.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,145 @@
.. SPDX-License-Identifier: GPL-2.0
============================
PCI Peer-to-Peer DMA Support
============================

The PCI bus has pretty decent support for performing DMA transfers
between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.

The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent so there are
a few corner case gotchas with these pages so developers need to
be careful about what they do with them.


Driver Writer's Guide
=====================

In a given P2P implementation there may be three or more different
types of kernel drivers in play:

* Provider - A driver which provides or publishes P2P resources like
memory or doorbell registers to other drivers.
* Client - A driver which makes use of a resource by setting up a
DMA transaction to or from it.
* Orchestrator - A driver which orchestrates the flow of data between
clients and providers.

In many cases there could be overlap between these three types (i.e.,
it may be typical for a driver to be both a provider and a client).

For example, in the NVMe Target Copy Offload implementation:

* The NVMe PCI driver is both a client, provider and orchestrator
in that it exposes any CMB (Controller Memory Buffer) as a P2P memory
resource (provider), it accepts P2P memory pages as buffers in requests
to be used directly (client) and it can also make use of the CMB as
submission queue entries (orchastrator).
* The RDMA driver is a client in this arrangement so that an RNIC
can DMA directly to the memory exposed by the NVMe device.
* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
to the P2P memory (CMB) and then to the NVMe device (and vice versa).

This is currently the only arrangement supported by the kernel but
one could imagine slight tweaks to this that would allow for the same
functionality. For example, if a specific RNIC added a BAR with some
memory behind it, its driver could add support as a P2P provider and
then the NVMe Target could use the RNIC's memory instead of the CMB
in cases where the NVMe cards in use do not have CMB support.


Provider Drivers
----------------

A provider simply needs to register a BAR (or a portion of a BAR)
as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
This will register struct pages for all the specified memory.

After that it may optionally publish all of its resources as
P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
any orchestrator drivers to find and use the memory. When marked in
this way, the resource must be regular memory with no side effects.

For the time being this is fairly rudimentary in that all resources
are typically going to be P2P memory. Future work will likely expand
this to include other types of resources like doorbells.


Client Drivers
--------------

A client driver typically only has to conditionally change its DMA map
routine to use the mapping function :c:func:`pci_p2pdma_map_sg()` instead
of the usual :c:func:`dma_map_sg()` function. Memory mapped in this
way does not need to be unmapped.

The client may also, optionally, make use of
:c:func:`is_pci_p2pdma_page()` to determine when to use the P2P mapping
functions and when to use the regular mapping functions. In some
situations, it may be more appropriate to use a flag to indicate a
given request is P2P memory and map appropriately. It is important to
ensure that struct pages that back P2P memory stay out of code that
does not have support for them as other code may treat the pages as
regular memory which may not be appropriate.


Orchestrator Drivers
--------------------

The first task an orchestrator driver must do is compile a list of
all client devices that will be involved in a given transaction. For
example, the NVMe Target driver creates a list including the namespace
block device and the RNIC in use. If the orchestrator has access to
a specific P2P provider to use it may check compatibility using
:c:func:`pci_p2pdma_distance()` otherwise it may find a memory provider
that's compatible with all clients using :c:func:`pci_p2pmem_find()`.
If more than one provider is supported, the one nearest to all the clients will
be chosen first. If more than one provider is an equal distance away, the
one returned will be chosen at random (it is not an arbitrary but
truely random). This function returns the PCI device to use for the provider
with a reference taken and therefore when it's no longer needed it should be
returned with pci_dev_put().

Once a provider is selected, the orchestrator can then use
:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
allocating scatter-gather lists with P2P memory.

Struct Page Caveats
-------------------

Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for it. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.

P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
However, as the memory is not cache coherent, if access ever needs to
be protected by a spinlock then :c:func:`mmiowb()` must be used before
unlocking the lock. (See ACQUIRES VS I/O ACCESSES in
Documentation/memory-barriers.txt)


P2P DMA Support Library
=======================

.. kernel-doc:: drivers/pci/p2pdma.c
:export:
File renamed without changes.
11 changes: 9 additions & 2 deletions drivers/infiniband/core/rw.c
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@
*/
#include <linux/moduleparam.h>
#include <linux/slab.h>
#include <linux/pci-p2pdma.h>
#include <rdma/mr_pool.h>
#include <rdma/rw.h>

Expand Down Expand Up @@ -280,7 +281,11 @@ int rdma_rw_ctx_init(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
struct ib_device *dev = qp->pd->device;
int ret;

ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);
if (is_pci_p2pdma_page(sg_page(sg)))
ret = pci_p2pdma_map_sg(dev->dma_device, sg, sg_cnt, dir);
else
ret = ib_dma_map_sg(dev, sg, sg_cnt, dir);

if (!ret)
return -ENOMEM;
sg_cnt = ret;
Expand Down Expand Up @@ -602,7 +607,9 @@ void rdma_rw_ctx_destroy(struct rdma_rw_ctx *ctx, struct ib_qp *qp, u8 port_num,
break;
}

ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
/* P2PDMA contexts do not need to be unmapped */
if (!is_pci_p2pdma_page(sg_page(sg)))
ib_dma_unmap_sg(qp->pd->device, sg, sg_cnt, dir);
}
EXPORT_SYMBOL(rdma_rw_ctx_destroy);

Expand Down
4 changes: 4 additions & 0 deletions drivers/nvme/host/core.c
Original file line number Diff line number Diff line change
Expand Up @@ -3051,7 +3051,11 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns->queue = blk_mq_init_queue(ctrl->tagset);
if (IS_ERR(ns->queue))
goto out_free_ns;

blk_queue_flag_set(QUEUE_FLAG_NONROT, ns->queue);
if (ctrl->ops->flags & NVME_F_PCI_P2PDMA)
blk_queue_flag_set(QUEUE_FLAG_PCI_P2PDMA, ns->queue);

ns->queue->queuedata = ns;
ns->ctrl = ctrl;

Expand Down
1 change: 1 addition & 0 deletions drivers/nvme/host/nvme.h
Original file line number Diff line number Diff line change
Expand Up @@ -343,6 +343,7 @@ struct nvme_ctrl_ops {
unsigned int flags;
#define NVME_F_FABRICS (1 << 0)
#define NVME_F_METADATA_SUPPORTED (1 << 1)
#define NVME_F_PCI_P2PDMA (1 << 2)
int (*reg_read32)(struct nvme_ctrl *ctrl, u32 off, u32 *val);
int (*reg_write32)(struct nvme_ctrl *ctrl, u32 off, u32 val);
int (*reg_read64)(struct nvme_ctrl *ctrl, u32 off, u64 *val);
Expand Down
Loading

0 comments on commit 1734715

Please sign in to comment.