authors | state | discussion
---|---|---
Angela Fong <angela.fong@joyent.com>, Casey Bisson <casey.bisson@joyent.com>, Jerry Jelinek <jerry@joyent.com>, Josh Wilsdon <jwilsdon@joyent.com>, Julien Gilli <julien.gilli@joyent.com>, Trent Mick <trent.mick@joyent.com>, Marsell Kukuljevic <marsell@joyent.com> | publish | |
Table of Contents

- RFD 26 Network Shared Storage for Triton
  - Conventions used in this document
  - Introduction
    - Current prototype
  - Use cases
  - General scope
  - CLI
  - Shared storage implementation
  - Relationship between shared volumes and VMs
  - Allocation (DAPI, packages, etc.)
  - Networking
  - Snapshots (Snapshots milestone)
  - REST APIs
    - Changes to CloudAPI
      - Not exposing NFS volumes' storage VMs via any of the Machines endpoints
      - Volume objects representation
      - New volumes parameter for CreateMachine endpoint (CloudAPI volumes automount milestone)
      - New /volumes endpoints
        - ListVolumes GET /volumes
        - CreateVolume
        - GetVolume GET /volumes/id
        - GetVolumeReferences GET /volumes/id/references (MVP milestone)
        - DeleteVolume DELETE /volumes/id
        - UpdateVolume POST /volumes/id
        - ListVolumeSizes GET /volumesizes
        - AttachVolumeToNetwork POST /volumes/id/attachtonetwork (MVP milestone)
        - DetachVolumeFromNetwork POST /volumes/id/detachfromnetwork (MVP milestone)
        - CreateVolumeSnapshot POST /volumes/id/snapshot (Snapshot milestone)
        - GetVolumeSnapshot GET /volumes/id/snapshots/snapshot-name
        - RollbackToVolumeSnapshot POST /volumes/id/rollbacktosnapshot (Snapshot milestone)
        - DeleteVolumeSnapshot DELETE /volumes/id/snapshots/snapshot-name (Snapshot milestone)
        - ListVolumePackages GET /volumepackages (Volume packages milestone)
        - GetVolumePackage GET /volumepackages/volume-package-uuid (Volume packages milestone)
    - Changes to VMAPI
    - Changes to PAPI
    - New VOLAPI service and API
      - ListVolumes GET /volumes
      - GetVolume GET /volumes/volume-uuid
      - CreateVolume POST /volumes
      - DeleteVolume DELETE /volumes/volume-uuid
      - UpdateVolume POST /volumes/volume-uuid
      - ListVolumeSizes GET /volumesizes
      - GetVolumeReferences GET /volumes/uuid/references
      - AttachVolumeToNetwork POST /volumes/volume-uuid/attachtonetwork (MVP milestone)
      - DetachVolumeFromNetwork POST /volumes/volume-uuid/detachfromnetwork (MVP milestone)
      - Volume references
      - Volume reservations
      - Snapshots (Snapshots milestone)
        - Snapshot objects
        - CreateVolumeSnapshot POST /volumes/volume-uuid/snapshot
        - GetVolumeSnapshot GET /volumes/volume-uuid/snapshots/snapshot-name
        - RollbackToVolumeSnapshot POST /volumes/volume-uuid/rollbacktosnapshot
        - ListVolumeSnapshots GET /volume/volume-uuid/snapshots
        - DeleteVolumeSnapshot DELETE /volumes/volume-uuid/snapshots/snapshot-name
      - Volume objects
        - Volumes state machine
      - Data retention policy
  - Changes to CloudAPI
  - Support for operating shared volumes
  - Integration plan
  - Open Questions
This document describes features and changes that are not meant to be integrated at the same time, but instead progressively, in stages. Each stage is represented by a milestone that has a name.
When no milestone is mentioned for a given change or new feature, it belongs to the default milestone that will be integrated first (named "master integration"). Otherwise, the name of the milestone is mentioned explicitly.
See the integration plan for more details about milestones and when they are planned to be integrated.
In general, support for network shared storage is antithetical to Triton's philosophy. However, for some customers and applications it is a requirement. In particular, network shared storage is needed to support Docker volumes in configurations where the Triton zones are deployed on different compute nodes.
A prototype for what this RFD describes is available at https://github.com/TritonDataCenter/sdc-volapi. It implements all the new features and changes described in this document that belong to the default (master integration) milestone.
It provides:
- all the core services that are relevant to the implementation of this RFD (sdc-docker, CloudAPI, VMAPI, workflow, etc.)
- the `sdcadm` tool that implements the "experimental" commands that can be used to enable/disable features related to NFS volumes.
- a `tools/setup/setup.sh` script that installs all of these components in a given DC.
In addition to that, node-triton at version 5.3.1 added support for creating and managing NFS volumes with the `volume` subcommands. Note however that these subcommands are hidden for now, and thus don't show up when running `triton --help`.
Please note that this prototype is not meant to be used in a production environment, or any environment where data integrity matters.
See its README for more details on how to install and use it.
The stories in this section have been gathered from Triton users. The UGC (User Generated Content) story is genericized because it's been described by many people as something they need. The video conferencing story is more specific because it’s a one-off so far. The names, of course, are fictional.
Danica runs WordPress, Drupal, Ghost, or some other tool that expects a local filesystem that stores user generated content (UGC). These are typically images, but are not limited to them. The size of these filesystems is typically hundreds of megabytes to hundreds of gigabytes (a 1TB per volume limit would likely be very acceptable).
When Danica builds a new version of her app, she builds an image (can be Docker or infrastructure container) with the application code, runtime environment, and other non-UG content. She tests the image with a sample set of content, but needs to bring the UGC into the image in some way when deploying it.
In production, Danica needs to run multiple instances of her app on different physical compute nodes for availability and performance, and each instance needs access to a shared filesystem that includes the UGC and persists across application deploys.
The application supports some manipulation of the UG content (example: an image editor for cropping), but the most common use for the content is to serve it out on the internet. The result is that filesystem performance is not critical to the app or user experience.
Though the UGC is stored on devices that offer RAID protection, Danica further protects against loss of UG content by making regular backups, stored as nightly tarballs in other infrastructure. She often uses those backups in her development workflow as a way to get real content and practice the restore procedures for them.
Danica can’t use an object store for the UGC because there’s no plugin for it in her app, or no plugin that supports Manta, or she depends on features that are incompatible with object storage plugins, or she’s running the app on-prem and can’t send the data to an off-site object store (and she doesn’t have the interest or budget to run Manta on-prem).
Danica does not yet run her application across multiple data centers, but she’d love to do so as a way to ensure availability of her site in the case of a data center failure and so to improve performance by directing requests to the closest DC.
Vinod is building a video conferencing app that allows the optional recording of conversations. The app component that manages individual conferences is ephemeral with the exception of the recordings. The recordings are spooled to disk as the conversation progresses, then the file is closed and enters a task queue for further processing. Because the recordings can become quite large, and because the job queue is easier to design if the recordings are on a shared filesystem, Vinod has chosen to build his application with that expectation. The recordings don’t stay in the shared filesystem indefinitely, but workers in the queue compress and move them to an object store asynchronously to the conversation activity.
Conversation recordings are typically in the low hundreds of gigabytes, and performance requirements for any single conversation are limited by the bit rates for internet video. Vinod currently uses a single filesystem, but expects to add additional filesystems if necessary for performance. By running the application in multiple data centers, each with its own set of conference manager instances, shared filesystems, and work queues, Vinod’s app can remain available despite the loss of individual data centers or shared filesystems, though users would have to restart any conference calls that were interrupted by that failure.
Vinod has chosen not to push the recordings into an object store immediately because doing so would require larger in-instance storage for the video than is available in all but the most expensive Triton packages, because adding support for chunked uploads to the app that is managing the conversation is a big expansion of scope for that component, and because the current process works and makes sense to him.
- The solution must work for both Triton Cloud and Triton Enterprise users (public cloud and on-prem).
- Support a maximum shared file system size of 1TB.
- Although initially targeted toward Docker volume support, the solution should be applicable to non-Docker containers as well.
- Shared volumes do not need to be available across data centers.
- High performance is not critical.
- Dedicated storage server (or servers) hardware is not necessary.
- Robust concurrent read-write access (e.g. as used by a database) is not necessary.
The `triton` CLI currently doesn't support the concept of (shared) volumes, so new commands and options will need to be added.
The docker CLI already supports shared volumes, but some conventions will need to be established to allow Triton users to pass triton-specific input to Triton's Docker API.
triton volumes
triton volume create|list|get|delete|sizes
triton volume create --network mynetwork --name wp-uploads --size 100G
triton volume create --name wp-uploads -s 100g
triton volume create --name wp-uploads nfs1-100g (volume packages milestone)
# Mounting the volume wp-uploads from an instance
triton instance create -v wp-uploads:/uploads ...
triton volume create --network mynetwork --name wp-uploads --size 100G
triton volume create --network mynetwork --name wp-uploads --size 100G -a instance!=wp-server (affinity milestone)
The name of the shared volume. If not specified, a unique name is automatically generated.
The size of the shared volume. If no size is provided, the volume will be created with the smallest size available, as output by the `triton volume sizes` command.
If a size is provided and it is not one of those listed as available in `triton volume sizes`' output, the creation fails.
Specifying a unit in the size parameter is required. The only unit suffix available is `G`. `10G` means 10 gibibytes (2^30 bytes).
Later, users will also be able to specify volume sizes via volume packages' UUIDs and list available volume sizes with CloudAPI's ListVolumePackages endpoint.
The network to which this shared volume will be attached. Only fabric networks can be attached to shared volumes.
See Expressing locality with affinity filters below.
Mounting a volume from an instance is done via the new `-v` or `--volume` command line option of the `triton instance create` subcommand:
$ triton volume create --name my-volume
$ triton instance create --name my-instance -v my-volume:/mountpoint
It supports specifying mode flags that represent read and write permissions: `ro` for "read-only" and `rw` for "read-write":
$ triton instance create --name my-instance -v my-volume:/mountpoint-read-only:ro
`rw` (read-write permissions) is the default mode.
More than one volume can be mounted from an instance:
$ triton volume create --name my-volume-1
$ triton volume create --name my-volume-2
$ triton instance create --name my-instance -v my-volume-1:/mountpoint-1 -v my-volume-2:/mountpoint-2
$ triton volume create -n foo ...
$ triton volume list
NAME SIZE NETWORK RESOURCE
foo 100G My Default Network nfs://10.0.0.1/foo # nfs://host[:port]/pathname
$
$ triton volume get my-volume
{
"name": "my-volume",
"owner_uuid": "a08085c4-1624-45c1-a004-c16e91efae1e",
"size": 20480,
"type": "tritonnfs",
"create_timestamp": "2017-08-24T22:35:03.109Z",
"state": "creating",
"networks": [
"c5d34272-afae-41e5-b014-a204b05435f6"
],
"id": "24b90e7a-cd55-e706-cfe6-8d9146b2414c"
}
$ triton volume create --name my-volume
$ triton volume rm my-volume
Delete volume my-volume? [y/n] y
Deleting volume my-volume
$
This command fails if one or more VMs are referencing the volume to be deleted:
$ triton volume create --name my-volume
$ triton instance create --name my-instance -v my-volume:/mountpoint
$ triton volume rm my-volume
Delete volume my-volume? [y/n] y
Deleting volume my-volume
triton volume rm: error: first of 1 error: Error when deleting volume: Volume with name my-volume is used
$ echo $?
1
$
$ triton volume sizes
TYPE SIZE
tritonnfs 10G
tritonnfs 20G
tritonnfs 30G
tritonnfs 40G
tritonnfs 50G
tritonnfs 60G
tritonnfs 70G
tritonnfs 80G
tritonnfs 90G
tritonnfs 100G
tritonnfs 200G
tritonnfs 300G
tritonnfs 400G
tritonnfs 500G
tritonnfs 600G
tritonnfs 700G
tritonnfs 800G
tritonnfs 900G
tritonnfs 1000G
$ triton volume sizes -j
[
{
"type": "tritonnfs",
"size": 10240
},
{
"type": "tritonnfs",
"size": 20480
},
{
"type": "tritonnfs",
"size": 30720
},
{
"type": "tritonnfs",
"size": 40960
},
{
"type": "tritonnfs",
"size": 51200
},
{
"type": "tritonnfs",
"size": 61440
},
{
"type": "tritonnfs",
"size": 71680
},
{
"type": "tritonnfs",
"size": 81920
},
{
"type": "tritonnfs",
"size": 92160
},
{
"type": "tritonnfs",
"size": 102400
},
{
"type": "tritonnfs",
"size": 204800
},
{
"type": "tritonnfs",
"size": 307200
},
{
"type": "tritonnfs",
"size": 409600
},
{
"type": "tritonnfs",
"size": 512000
},
{
"type": "tritonnfs",
"size": 614400
},
{
"type": "tritonnfs",
"size": 716800
},
{
"type": "tritonnfs",
"size": 819200
},
{
"type": "tritonnfs",
"size": 921600
},
{
"type": "tritonnfs",
"size": 1024000
}
]
$
This command lists available volume sizes. Trying to create a volume with a different size results in an error:
$ triton -i volume create --name my-volume --size 21G
triton volume create: error: volume size not available, use triton volume sizes command for available sizes
$
It is implemented using the `ListVolumeSizes` CloudAPI endpoint for the master integration milestone, and will use volume packages when those become available (currently planned as part of the "volume packages" milestone).
Creating a shared volume results in creating a VM object and an instance with the `sdc:system_role` `internal_metadata` property set to `'nfsvolumestorage'`.
As such, a user could list all their "resources" (including instances and
shared volumes) by listing instances.
However, the fact that shared volumes have a 1 to 1 relationship with their underlying containers is an implementation detail that should not be publicly exposed.
Shared volumes should instead be considered as a separate resource type, and a new `triton report` command could list all resources of any type for a given user, including:

- actual compute instances.
- NAT zones.
- shared volume zones.
The Docker CLI already has support for volumes. This section describes what commands and command line options will be used by Triton users to manage their shared volumes on Triton.
docker network create mynetwork ...
docker volume create --driver tritonnfs --name wp-uploads \
--opt size=100G --opt network=mynetwork
docker run -d --network mynetwork -v wp-uploads:/var/wp-uploads wp-server
The `tritonnfs` driver is the default driver on Triton. If not specified, the network to which a newly created volume is attached is the user's default fabric network.
Creating a shared volume can be done using the following shorter command line:
docker volume create --name wp-uploads
The name of the shared volume. If not specified, a unique name is automatically generated.
The size of the shared volume. This option is passed using the docker CLI's `--opt` command line switch:
docker volume create --name wp-uploads --opt size=100G
The size of the shared volume. If no size is provided, the volume will be created with the smallest size available as specified by CloudAPI's ListVolumeSizes endpoint.
If a size is provided and it is not one of those listed as available by sending a request to CloudAPI's ListVolumeSizes endpoint, the creation fails and outputs the list of available sizes:
$ docker volume create --name my-volume --opt size=21G
Error response from daemon: Volume size not available. Available sizes: 10G, 20G, 30G, 40G, 50G, 60G, 70G, 80G, 90G, 100G, 200G, 300G, 400G, 500G, 600G, 700G, 800G, 900G, 1000G (a95e7141-9783-41da-908f-302811323fcf)
$
Specifying a unit in the size parameter is required. The only available unit suffix is `G`. `10G` means 10 gibibytes (2^30 bytes).
Later, users will also be able to specify volume sizes via volume packages' UUIDs and list available volume sizes with CloudAPI's ListVolumePackages endpoint.
The network to which this shared volume will be attached. This option is passed using the docker CLI's `--opt` command line switch:
docker volume create --name wp-uploads --opt network=mynetwork
Shared volumes only support being attached to fabric networks.
The Triton shared volume driver is named `tritonnfs`. It is the default driver when creating shared volumes using Triton's Docker API with the docker client.
Local volumes, created with e.g `docker run -v /foo...`, are already supported by Triton. A container will be able to mount both local and NFS shared volumes.
When mounting local and shared NFS volumes on the same mountpoint, e.g with:
docker run -v /bar -v foo:/bar...
the command will result in an error. Otherwise, the NFS volume would be implicitly mounted over the lofs filesystem created with `-v /bar`, which is likely not what the user expects.
For reliability reasons, compute nodes never use network block storage and Triton zones are rarely configured with any additional devices. Thus, shared block storage is not an option.
Shared file systems are a natural fit for zones and SmartOS has good support for network file systems. A network file system also aligns with the semantics for a Docker volume. Thus, this is the chosen approach. NFSv3 is used as the underlying protocol. This provides excellent interoperability among various client and server configurations. Using NFS even provides for the possibility of sharing volumes between zones and kvm-based services.
- The NFS server does not need to be HA.
- Dedicated NFS server (or servers) hardware is not necessary.
- Robust locking for concurrent read-write access (e.g. as used by a database) is not necessary.
Because the file system(s) must be served on the customer's VXLAN, it makes sense to provision an NFS server zone, similar to the NAT zone, which is owned by the customer and configured on their VXLAN. Container zones can only talk to NFS server zones that are on the same customer's network.
The current design has a one-to-one mapping between shared volumes and NFS server zones, but this is an implementation detail: it is possible that in the future more than one volume could be served from a single NFS server zone.
Serving NFS from within a zone using the in-kernel NFS implementation is not currently supported by SmartOS, although we have reason to believe that we could fix this in the future. Instead, a user-mode NFS server will be deployed within the NFS server zone. Because the server runs as a user-level process, it will be subject to all of the normal resource controls that are applicable to any zone. The user-mode solution can be deployed without the need for a new platform or a CN reboot (but other features of this RFD require new platforms and/or reboots).
The NFS server will serve files out of a ZFS delegated dataset, which allows for the following use cases:

- Upgrading the NFS server zone (e.g to upgrade the NFS server software) without needing to throw away users' data.
- Snapshotting.
- Using ZFS send to send users' data to a different host.
The user-mode server must be installed in the zone and configured to export the appropriate file system. The Triton docker tools must support the mapping of the user's logical volume name to the zone name and share.
When a container uses a shared volume, and if the platform where the container runs has been updated to one that supports it, the NFS file system is automatically mounted in the container from the shared volume zone at startup.
- Users cannot ssh into shared volume containers.
- Users cannot list shared volume containers when listing compute instances (they can list them using CloudAPI's /volumes endpoint).
The user-mode NFS server is a solution that provides for quick implementation but may not offer the best performance. In the future we could do the work to enable kernel-based NFS servers within zones. Switching over to that should be transparent to all clients, but would require a new platform on the server side.
Although HA is not a requirement, an NFS server zone clearly represents a SPOF for the application. There are various possibilities we could consider to improve the availability of the NFS service for a specific customer. These ideas are not discussed here but could be considered as future projects.
Integration with Manta is not discussed here. However, we do have our manta-nfs server and it dovetails neatly with the proposed approach. Integrating manta access into an NFS server zone could be considered as a future project.
Hardware configurations that are optimized for use as NFS servers are not discussed here. It is likely that a parallel effort will be needed to identify and recommend server hardware that is more appropriate than our current configurations.
Concurrent access use cases need to be supported, but support for robust high concurrency ones (such as database storage) is not critical.
The dependency of a VM on a given set of shared volumes can only be specified at the VM's creation time. Once a VM is created, it is not possible to change the set of shared volumes it uses/depends on.
With the docker client, users can specify that a docker container uses shared volumes with the following commands:
$ docker create -v volume-1:/mountpoint-1 -v volume-2:/mountpoint-2:ro ...
$ docker run -v volume-1:/mountpoint-1 -v volume-2:/mountpoint-2:ro ...
With the node-triton client, users are able to do the same using the `triton instance create` command:
$ triton create -v volume-1:/mountpoint-1 -v volume-2:/mountpoint-2:ro ...
Using CloudAPI's REST API, clients can specify that a VM uses a given set of volumes by using `CreateMachine`'s `volumes` input parameter.
When a VM uses shared volumes and that VM becomes "active", a reference is added to each volume it uses so that these volumes can't be deleted until no VM references them.
In addition to preventing referenced volumes from being deleted, a VM that uses shared volumes automatically mounts them when it starts. There is one exception for KVM VMs: they cannot automatically mount volumes (more details on that below).
How the mounting operation is performed depends on what initialization system the VM uses, and is different for different types of VMs.
Docker containers are initialized by the `dockerinit` program. This program gets the `docker:nfsvolumes` metadata to determine what NFS volumes it needs to mount, and mounts them.
In the future, it will be changed to use the `sdc:volumes` metadata instead of `docker:nfsvolumes`, but for now we will continue to set the relevant metadata in `docker:nfsvolumes` for backward compatibility.
As part of this work, we'll modify the metadata agent to respond to metadata requests for the key `sdc:volumes`. It will respond by returning the value from internal_metadata for the `sdc:volumes` key there. Since keys prefixed with `sdc:` are already reserved, we're not adding any additional restrictions on keys that can be used by containers.
We'll add a stringified version of an array of volume objects, each of which looks like:
{
"mountpoint": "/data",
"name": "mydata",
"nfsvolume": "192.168.128.234:/exports/data",
"mode": "rw",
"type": "tritonnfs"
}
The internal_metadata will be set by VMAPI when provisioning the container if volumes are used.
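For illustration, mounting the volume described by that object by hand from inside the zone would look roughly like the following (a sketch only; dockerinit performs the equivalent natively, and the exact mount options it uses may differ):

# Manual equivalent of what dockerinit does for the object above:
# create the mountpoint and NFS-mount the remote filesystem read-write
# over NFSv3, the protocol chosen earlier in this document.
mkdir -p /data
mount -F nfs -o vers=3,rw 192.168.128.234:/exports/data /data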
Making infrastructure containers automatically mount shared volumes requires changes to the platform. Since the mounting needs to happen within the zone (because it needs to be on the zone's network), the `mdata-fetch` service's start script is modified so that, on zone startup, the zone reads the list of required NFS volumes using:
mdata-get sdc:volumes
for which a typical response will look something like:
[{"mountpoint":"/data","name":"mydata","nfsvolume":"192.168.128.234:/exports/data","mode":"rw","type":"tritonnfs"}]
It then ensures each of these volumes has an entry in /etc/vfstab, adding entries as necessary. The entries added will look like:
192.168.1.1:/exports/data - /data nfs - yes
If any of these entries are added, the following SMF services will also be enabled:
svc:/network/nfs/nlockmgr:default
svc:/network/nfs/status:default
svc:/network/rpc/bind:default
svc:/network/nfs/client:default
which will cause the volumes to be mounted once networking has been started. On future boots, since the service will already be enabled and the entries will remain in vfstab, the volumes will continue to be mounted automatically.
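A minimal sketch of the kind of logic such a start-script change could perform, assuming the platform's `mdata-get` and `json` tools inside the zone (this is an illustration, not the actual platform change):

# Sketch only: read the required volumes, make sure each one has a
# vfstab entry (using the entry format shown above), then enable the
# NFS client services. The real change only enables the services when
# entries were actually added.
volumes="$(/usr/sbin/mdata-get sdc:volumes 2>/dev/null || true)"
[ -n "$volumes" ] || exit 0

echo "$volumes" | json -a nfsvolume mountpoint | while read -r remote mntpt; do
    grep -q "^$remote " /etc/vfstab || \
        echo "$remote - $mntpt nfs - yes" >> /etc/vfstab
done

for fmri in svc:/network/nfs/nlockmgr:default svc:/network/nfs/status:default \
            svc:/network/rpc/bind:default svc:/network/nfs/client:default; do
    svcadm enable "$fmri"
done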
For LX containers, `lxinit` is responsible for getting the `sdc:volumes` metadata and mounting the relevant volumes, in a way that is similar to `dockerinit`. However, unlike dockerinit, lxinit relies on a new program, `volumeinfo`, to return information about volumes. This volumeinfo tool reads the metadata and converts the results to `|`-separated lines that look like:
tritonnfs|192.168.128.234:/exports/data|/data|mydata|rw
These lines are parsed by the lxinit tool, which then calls the appropriate mount command for each volume found. This happens before the container's "real" init process is called, so the volumes will be mounted before the user's programs are running.
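For illustration only, consuming that output boils down to splitting on `|` and mounting each volume, roughly like the following shell sketch (lxinit does this natively, and the exact mount invocation it uses may differ):

# Field order: type|nfsvolume|mountpoint|name|mode
volumeinfo | while IFS='|' read -r type remote mntpt name mode; do
    [ "$type" = "tritonnfs" ] || continue
    mkdir -p "$mntpt"
    mount -F nfs -o "$mode" "$remote" "$mntpt"
done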
For the master integration milestone, users of KVM containers will not be able to mount volumes. Using the `--volume` command line option of the `triton instance create` subcommand to create instances from KVM images will always generate an error.
Eventually, this limitation might be addressed by updating the sdc-vmtools programs to automatically mount volumes when creating instances from KVM images provided by Joyent.
KVM instances created from custom KVM images would still not support automatically mounting volumes, since it would not be possible to determine whether the image would include support for mounting NFS filesystems.
For these instances of custom KVM images, users could still use user-scripts to query the `sdc:volumes` metadata and mount the relevant volumes however they'd like to.
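For example, a user-script along the following lines could do this in a Linux guest; the tool path, the presence of `jq`, and the availability of an NFS client in the image are all assumptions:

#!/bin/bash
# Hypothetical user-script: mount every volume listed in sdc:volumes.
# Assumes the sdc-vmtools mdata-get client, jq, and an NFS client are
# installed in the guest image.
set -euo pipefail

/usr/sbin/mdata-get sdc:volumes |
    jq -r '.[] | "\(.nfsvolume) \(.mountpoint) \(.mode)"' |
    while read -r remote mntpt mode; do
        mkdir -p "$mntpt"
        mount -t nfs -o "vers=3,$mode" "$remote" "$mntpt"
    done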
NFS shared volumes will have separate packages used when provisioning their underlying storage VMs. These packages will be owned by the administrator, and thus won't be available to end users when provisioning compute instances. The storage VM packages will share the following settings:
{
cpu_cap: 100,
max_lwps: 1000,
max_physical_memory: 256,
max_swap: 256,
version: '1.0.0',
zfs_io_priority: 100,
default: false
}
NFS volumes' storage VM packages will have names of the following form:
sdc_volume_nfs_$size
where `$size` represents the quota of the package in GiB.
A first set of NFS volumes packages will be introduced. They will have a minimum size of 10 GiB, and support sizing in units of 10 GiB and then 100 GiB as the volume size increases.
The list of all available NFS volumes packages sizes initially available will be:
- 10 GiB
- 20 GiB
- 30 GiB
- 40 GiB
- 50 GiB
- 60 GiB
- 70 GiB
- 80 GiB
- 90 GiB
- 100 GiB
- 200 GiB
- 300 GiB
- 400 GiB
- 500 GiB
- 600 GiB
- 700 GiB
- 800 GiB
- 900 GiB
- 1,000 GiB
As users provide feedback regarding package sizes during the first phase of deployment and testing, new packages may be created to better suit their needs, and previous packages will be retired (not deleted, but deactivated).
The current prototype makes NFS server zones have a `zfs_io_priority` of `100`, which seems to be the default value. However, there is a wide spectrum of values for this property in TPC, ranging from `1` to `16383`.
If we consider only 4th generation packages (g4, k4 and t4), we can run the following command:
sdc-papi /packages | json -Ha zfs_io_priority name | egrep '(g4-|k4-|t4-)' | awk '{pkgs[$1]=pkgs[$1]" "$2} END{for (pkg_size in pkgs) print pkg_size","pkgs[pkg_size]}' | sort -n -k1,1
in east1 to obtain the following distribution:
zfs_io_priority | package names |
---|---|
8 | t4-standard-128M |
16 | t4-standard-256M g4-highcpu-128M |
32 | t4-standard-512M g4-highcpu-256M |
64 | t4-standard-1G g4-highcpu-512M k4-highcpu-kvm-250M |
128 | t4-standard-2G g4-highcpu-1G k4-highcpu-kvm-750M |
256 | t4-standard-4G g4-general-4G g4-highcpu-2G k4-general-kvm-3.75G k4-highcpu-kvm-1.75G |
512 | t4-standard-8G g4-bigdisk-16G g4-general-8G g4-highcpu-4G g4-highram-16G k4-general-kvm-7.75G k4-highcpu-kvm-3.75G k4-highram-kvm-15.75G k4-bigdisk-kvm-15.75G |
1024 | t4-standard-16G g4-bigdisk-32G g4-fastdisk-32G g4-general-16G g4-highcpu-8G g4-highram-32G k4-bigdisk-kvm-31.75G k4-fastdisk-kvm-31.75G k4-general-kvm-15.75G k4-highcpu-kvm-7.75G k4-highram-kvm-31.75G |
2048 | t4-standard-32G g4-bigdisk-64G g4-fastdisk-64G g4-general-32G g4-highcpu-16G g4-highram-64G k4-bigdisk-kvm-63.75G k4-fastdisk-kvm-63.75G k4-general-kvm-31.75G k4-highcpu-kvm-15.75G k4-highram-kvm-63.75G |
4096 | t4-standard-64G g4-bigdisk-110G g4-fastdisk-110G g4-highcpu-32G g4-highram-110G |
6144 | t4-standard-96G |
8192 | t4-standard-128G g4-highcpu-110G g4-highcpu-160G g4-highcpu-192G g4-highcpu-64G g4-highcpu-96G g4-highram-222G g4-fastdisk-222G |
10240 | t4-standard-160G |
12288 | t4-standard-192G g4-highcpu-222G |
14336 | t4-standard-224G |
It seems that for compute VMs, the `zfs_io_priority` value is proportional to at least the `max_physical_memory` value.
What should NFS volume packages' `zfs_io_priority` value be? And since those packages currently have a constant `max_physical_memory` value, should they have different `zfs_io_priority` values?
It seems the answers to these questions would depend on the placement of NFS volumes (or more precisely, of their storage VMs). Since the placement strategy can change over time for a given deployment, the `zfs_io_priority` value of packages used by storage VMs must be able to change. However, `zfs_io_priority` is an immutable property of packages.
Thus, when operators need to change the `zfs_io_priority` value for storage VMs used by NFS volumes of a given size, they'll need to:

- deactivate the current package used by NFS volumes' storage VMs
- create a new package with the same quota value, the desired `zfs_io_priority` value, and a name that matches the pattern `sdc_nfs_volumes*`.
The placement of NFS server zones during provisioning is a complicated decision which depends, among other things, on:
- hardware resources available
- operation practices
- use cases
For instance, dedicating specific hardware resources to NFS volumes' storage VMs could make rebooting CNs due to platform updates less frequent, as these are typically driven by LX and other OS fixes.
However, mixing NFS volumes' storage VMs and compute VMs would have the benefits of soaking up unused storage space on compute nodes and improving availability of NFS volumes when CNs go down.
As a result, the system should allow for either using dedicated hardware or for server zones to be interspersed with other zones.
Using dedicated hardware will be achieved by setting traits on NFS shared volumes' storage VMs packages that match the traits of CNs where these storage VMs should be provisioned.
In practice, this can be done by sending `UpdatePackage` requests to PAPI to set the desired traits on NFS volume packages, and sending `ServerUpdate` requests to set those traits on the CNs that should act as dedicated hardware for NFS volumes.
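As a sketch of what that could look like from the headnode, using the sdc-papi and sdc-cnapi wrappers (the trait name `nfs_volume_storage` and the exact payloads are made up for illustration):

# Mark a volume package as requiring dedicated NFS-volume CNs (UpdatePackage).
sdc-papi /packages/$NFS_VOLUME_PKG_UUID -X PUT \
    -d '{"traits": {"nfs_volume_storage": true}}'

# Mark a CN as dedicated NFS-volume hardware (ServerUpdate).
sdc-cnapi /servers/$CN_UUID -X POST \
    -d '{"traits": {"nfs_volume_storage": true}}'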
Using dedicated hardware has disadvantages:
- CN reboots are potentially more disruptive since a lot of volumes would be affected
- it requires a larger upfront equipment purchase
- utilization can be low if demand is low
- affinity constraints that express the requirement for two volumes to be on different CNs can be harder to fulfill, unless there's a large number of storage CNs
- affinity constraints that express the requirement for a volume to be colocated with a compute VM are impossible to fulfill
Mixing compute VMs and volume storage VMs is the default behavior, and doesn't require operators to do anything. When volume storage VMs' packages do not have any trait set, placement will be done as if storage VMs were compute VMs.
Mixing compute and volume containers has some advantages: since the server fleet does not need to be split, there is more potential to spread volumes across the entire datacenter; the failure of individual servers is less likely to have a wide-spread effect. There are also disadvantages: dedicated volume servers likely need platform updates less often, thus having the potential for better uptime.
The main challenge with mixing compute and volume containers from DAPI's perspective is that volume containers are disk-heavy, while packages for compute containers have (and assume) a more balanced mix between memory, disk, and CPU. E.g. if a package gets 1/4 of a server's RAM, it typically also gets roughly 1/4 of the disk. The disk-heavy volume containers upset this balance, with at least three (non-exclusive) solutions:
- We accept that mixing compute and volume containers on the same servers will leave more memory not reserved to containers. This would leave more for the ZFS ARC, but how much is too much?
- We allow the overprovisioning of disk. A lot of disk in typical cloud deployments remains unused, thus this would increase utilization, but sometimes those promises are all called in on the same server. Programs often degrade more poorly under low-disk conditions than low-RAM ones, which can still page out.
- We add a tier of packages that are RAM-heavy. These could be more easily slotted in on servers which have one or more volumes. It's uncertain how much the demand for RAM-heavy packages can compensate for volume containers.
In order to have better control on performance and availability, Triton users need to be able to express where in the data center their shared volumes are located in relation with their compute containers.
For instance, a user might need to place a shared volume on a compute node that is as close as possible to the compute containers that use it.
The same user might need to place another shared volume on a different compute node than the compute containers that use it, to avoid a single point of failure.
Finally, the same user might need to place several different or identical shared volumes on different compute nodes to avoid a single point of failure.
These locality constraints are expressed in terms of relative affinity between shared volumes and compute containers.
Affinity filters are already supported by the docker CLI. Triton users should be able to express affinity using:
- partial names
- labels
and the following operators:
- `==`
- `!=`
- `==~`
- `!=~`
Volume containers will not be resizable in place. Resizing VMs presents a lot of challenges that usually end up breaking users' expectations. See https://mo.joyent.com/docs/engdoc/master/resize/index.html for more information about challenges in resizing compute VMs that can be applied to storage VMs.
The recommended way to resize NFS volumes is to create a new volume with the desired size and copy the data from the original volume to the new volume.
While provisioning volume containers does not depend on platform changes, provisioning VMs that automatically mount volumes does depend on platform changes.
To avoid requiring all CNs of a datacenter to be upgraded to platform versions that include those changes before VMs can automatically mount volumes, allocation algorithms will be updated to filter out CNs that don't match minimum platform versions when provisioning VMs that automatically mount volumes.
Each feature flag related to automatically mounting volumes (docker-automount for Docker containers and cloudapi-automount for non-Docker containers) will have its own SAPI flag that indicates the minimum platform version required to support it.
If no CN or only a limited number of CNs meet these minimum platform requirements, it is possible that provisioning a certain type of VMs that depend on volumes will fail all the time, or that the capacity will fill up very quickly.
To make sure that operators are aware of this, the `sdcadm experimental nfs-volumes` command will output a warning whenever at least one CN of a given data center does not match the minimum platform requirements for a given feature flag.
NFS volumes will be reachable only on fabric networks. The rationale is that only the network provides isolation for NFS volumes, and that it makes using and managing NFS volumes simpler.
Users should not be able to set firewall rules on NFS volumes and should instead make their NFS volumes available on selected fabric networks.
Making NFS volumes available on non-fabric networks would, without the ability for users to set firewall rules, potentially make them reachable by other users.
Triton users need to be able to snapshot the content of their shared volumes, and roll back to any of these snapshots at any time. The typical use case that needs to be supported is a user who needs to be able to make changes to the data in a shared volume while being able to roll back to a known snapshot. Snapshot backups are out of scope.
Shared volumes' user data is stored in a delegated dataset. This gives the nice property of being able to update the underlying zone's software (such as the NFS server) without having to recreate the shared volume.
Snapshotting a shared volume involves snapshotting only the delegated dataset that contains the actual user data, not the whole root filesystem of the underlying zone.
Snapshotting a shared volume is done by using VOLAPI's `CreateVolumeSnapshot` endpoint. See the section about REST APIs changes for more details on APIs that allow users to manage volume snapshots.
As snapshots consume storage space even if no file is present in the delegated dataset, limiting the number of snapshots that a user can create may be needed. This limit could be implemented at various levels:
- In the Shared Volume API (VOLAPI): VOLAPI could maintain a count of how many snapshots of a given volume exist and a maximum number of snapshots, and send a request error right away without even reaching the compute node to try and create a new snapshot.
- At the zfs level: snapshotting operation results would bubble up to VOLAPI, which would result in a request error in case of a failure.
Several existing APIs are directly involved when manipulating shared volumes: sdc-docker, VMAPI, CloudAPI, NAPI, etc. This section presents the changes that will need to be made to existing APIs, as well as new APIs that will need to be created to support shared volumes and their use cases.
Machines acting as shared volumes' storage zones are an implementation detail that should not be exposed to end users. As such, they need to be filtered out from any of the `*Machines` endpoints (e.g `ListMachines`, `GetMachine`, etc.).
For the `ListMachines` endpoint, filtering out NFS volumes' storage VMs will be done by filtering on the `sdc:system_role` `internal_metadata` key: VMs with a value of `nfsvolumestorage` for their `sdc:system_role` `internal_metadata` will be filtered out.
For instance, CloudAPI's `ListMachines` endpoint will always pass -- in addition to any other search predicate set due to other `ListMachines` parameters -- the following predicate to VMAPI's ListVms endpoint:
{ne: ['tags', '*triton.system_role=nfsvolumestorage*']}
All other endpoints will result in an error when used on an NFS volume's storage VM.
Volume objects are represented in CloudAPI in a way similar to their internal representation in VOLAPI, for both their common properties and their type-specific ones.
The only differences are:
- the `uuid` field is named `id` to adhere to current conventions between the representation of Triton objects in CloudAPI and internal APIs
- the `vm_uuid` property that represents the UUID of a volume's storage VM is not exposed to CloudAPI users
When creating a machine via CloudAPI, the new `volumes` parameter allows one to specify a list of volumes to mount in the new machine. This would look as follows in the `CreateMachine` payload:
"volumes": [
{
"name": "volume-name-1",
"type": "tritonnfs",
"mode": "rw",
"mountpoint": "/foo"
},
{
"name": "volume-name-2",
"mode": "ro",
"mountpoint": "/bar"
}
]
The new machine has the specified volumes mounted when it starts, and the appropriate volume references are added to indicate that this machine uses the listed volumes.
The `type` property of each object of the `volumes` array is optional. Its default and only currently valid value is `'tritonnfs'`.

The `mode` property of each object of the `volumes` array is also optional. Its default value is `'rw'`, and valid values for volumes of type `'tritonnfs'` are `'rw'` and `'ro'`.
Users need to be able to manage their shared volumes from CloudAPI. Most of the volume-related endpoints of CloudAPI will be implemented by forwarding the initial requests to the Volumes API service and adding the `owner_uuid` input parameter that corresponds to the user making the request.
Param | Type | Milestone | Description |
---|---|---|---|
name | String | master-integration | Allows to filter volumes by name. |
predicate | String | master-integration | URL encoded JSON string representing a JavaScript object that can be used to build a LDAP filter. This LDAP filter can search for volumes on arbitrary indexed properties. More details below. |
size | String | master-integration | Allows to filter volumes by size, e.g size=10240 . |
state | String | master-integration | Allows to filter volumes by state, e.g state=failed . |
tag.key | String | mvp | A string representing the value for the tag with key key to match. More details below. |
type | String | master-integration | Allows to filter volumes by type, e.g tritonnfs . |
`name` is a string containing either a full volume name or a partial volume name prefixed and/or suffixed with a `*` character. For example:

- `foo`
- `foo*`
- `*foo`
- `*foo*`

are all valid `name=` searches, which will match respectively:

- the exact name `foo`
- any name that starts with `foo`, such as `foobar`
- any name that ends with `foo`, such as `barfoo`
- any name that contains `foo`, such as `barfoobar`
The `predicate` parameter is a JSON string that can be used to build a LDAP filter to search on the following indexed properties:

- `name`
- `billing_id` (volume packages milestone)
- `type`
- `state`
- `tags` (mvp milestone)
Important: when using a predicate, you cannot include the same parameter in both the predicate and the non-predicate query parameters. For example, if your predicate includes any checks on the `name` field, passing the `name=` query parameter is an error.
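For example, a predicate that matches ready `tritonnfs` volumes could look like the following before URL encoding (the `and`/`eq` operators shown here mirror the predicate used in the VMAPI example earlier in this document, and the exact grammar is an assumption):

{
    "and": [
        { "eq": ["type", "tritonnfs"] },
        { "eq": ["state", "ready"] }
    ]
}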
Note: searching for tags is to be implemented as part of the mvp milestone.
It is also possible to search for volumes matching one tag by using the `tag` parameter as follows:
/volumes?tag.key=value
For instance, to search for a volume with the tag `foo` set to `bar`:
/volumes?tag.foo=bar
This form only allows to specify one tag name/value pair. The predicate search can be used to perform a search with multiple tags.
A list of volume objects of the following form:
[
{
"id": "e435d72a-2498-8d49-a042-87b222a8b63f",
"name": "my-volume",
"owner_uuid": "ae35672a-9498-ed41-b017-82b221a8c63f",
"type": "tritonnfs",
"nfs_path": "host:port/path",
"state": "ready",
"networks": [
"1537d72a-949a-2d89-7049-17b2f2a8b634"
],
"snapshots": [
{
"name": "my-first-snapshot",
"create_timestamp": "2017-08-22T00:11:44.123Z",
"state": "created"
},
{
"name": "my-second-snapshot",
"create_timestamp": "2017-08-22T00:11:54.123Z",
"state": "created"
}
],
"tags: {
"foo": "bar",
"bar": "baz"
}
},
{
"id": "a495d72a-2498-8d49-a042-87b222a8b63c",
"name": "my-other-volume",
"owner_uuid": "d1c673f2-fe9c-4062-bf44-e13959d26407",
"type": "someothervolumetype",
"state": "ready",
"networks": [
"4537d92a-149c-6d83-104a-97b2f2a8b635"
],
"tags: {
"foo": "bar",
"bar": "baz"
}
}
...
]
Param | Type | Mandatory | Description |
---|---|---|---|
name | String | No | The desired name for the volume. If missing, a unique name for the current user will be generated. |
size | Number | No | The desired minimum storage capacity for that volume in mebibytes. Default value is 10240 mebibytes (10 gibibytes). |
type | String | Yes | The type of volume. Currently only 'tritonnfs' is supported. |
networks | Array | Yes | A list of UUIDs representing networks on which the volume is reachable. These networks must be fabric networks owned by the user sending the request. |
labels (mvp milestone) | Object | No | An object representing key/value pairs that correspond to label names/values. |
A volume object representing the volume being created. When the response is sent, the volume and all its resources are not yet created and its state is `creating`. Users need to poll the newly created volume with the `GetVolume` API to determine when it's ready to use (its state transitions to `ready`).
If the creation process fails, the volume object has its state set to `failed`.
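From the CLI, waiting for a newly created volume to become usable can be done by polling its state, e.g. with something like the following sketch (this assumes the `-j` JSON output flag and the `json` tool; it is not part of the API itself):

# Poll until the volume leaves the "creating" state.
while [ "$(triton volume get -j my-volume | json state)" = "creating" ]; do
    sleep 5
done
triton volume get -j my-volume | json state   # "ready" on success, "failed" otherwise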
GetVolume can be used to get data from an already created volume, or to determine when a volume being created is ready to be used.
Param | Type | Description |
---|---|---|
id | String | The uuid of the volume object |
A volume object representing the volume with UUID `id`.
`GetVolumeReferences` can be used to list VMs that are using the volume with ID `id`.
A list of VM UUIDs that are using the volume with ID `id`:
[
"a495d72a-2498-8d49-a042-87b222a8b63c",
"b135a72a-1438-2829-aa42-17b231a6b63e"
]
Param | Type | Description |
---|---|---|
id | String | The id of the volume object |
force | Boolean | If true, the volume can be deleted even if there are still non-deleted containers that reference it. |
If `force` is not specified or `false`, deletion of a shared volume is not allowed if it has at least one "active user". If `force` is `true`, the constraint on having no active user of that volume doesn't apply.
See the section "Deletion and usage semantics" for more information.
The output is empty and the status code is 204 if the deletion was scheduled successfully.
A volume is always deleted asynchronously. In order to determine when the volume is actually deleted, users need to poll the volume using the `GetVolume` endpoint until it returns a 404 response.
If resources are using the volume to be deleted, the request results in a `VolumeInUse` error.
The `UpdateVolume` endpoint can be used to update the following properties of a shared volume:

- `name`, to rename a volume. See the section on renaming volumes for further details.
- `tags`, to add/remove tags for a given volume
Param | Type | Description |
---|---|---|
id | String | The id of the volume object |
name | String | The new name of the volume with id id |
tags (mvp milestone) | Array of string | The new tags for the volume with id id |
Sending any other input parameter will result in an error. Updating other properties of a volume, such as the networks it's attached to, must be performed by using other separate endpoints.
The `ListVolumeSizes` endpoint can be used to determine in what sizes volumes of a certain type are available.
Param | Type | Description |
---|---|---|
type | String | the type of the volume (e.g tritonnfs ). Default value is tritonnfs |
Sending any other input parameter will result in an error.
The response is an array of objects having two properties:

- `size`: a number in mebibytes that represents the size of a volume
- `type`: the type of volume for which the size is available
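For instance, a response listing the smallest tritonnfs sizes shown earlier in this document could look like:

[
    { "size": 10240, "type": "tritonnfs" },
    { "size": 20480, "type": "tritonnfs" },
    { "size": 30720, "type": "tritonnfs" }
]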
`AttachVolumeToNetwork` can be used to make a volume reachable on a given fabric network. Non-fabric networks are not supported, and passing a `network_id` that doesn't represent a fabric network will result in an error.
Param | Type | Description |
---|---|---|
id | String | The id of the volume object |
network_id | String | The id of the network to which the volume with id id should be attached |
A volume object representing the volume with ID `id`.
`DetachVolumeFromNetwork` can be used to make a volume that used to be reachable on a given network not reachable on that network anymore.
Param | Type | Description |
---|---|---|
id | String | The id of the volume object |
network_id | String | The id of the network from which the volume with id id should be detached |
A volume object representing the volume with ID `id`.
Param | Type | Description |
---|---|---|
name | String | The desired name for the snapshot to create. The name must be unique per volume. |
The volume object representing the volume with ID `id`, with the newly created snapshot added to its `snapshots` list property. Note that creating a snapshot can fail as no space might be left in the corresponding zfs dataset.
The snapshot object with name `snapshot-name` for the volume with ID `id`.
Note that rolling back an NFS shared volume to a given snapshot requires its underlying storage VM to be stopped and restarted, making the storage provided by that volume unavailable for some time.
Param | Type | Description |
---|---|---|
snapshot_id | String | The id of the snapshot object that represents the state to which to rollback. |
name | String | The name of the snapshot object that represents the state to which to rollback. |
The volume object that represents the volume with ID `id`, with its `state` property set to `rolling_back`. When the volume has been rolled back to the snapshot with name `name`, the volume's `state` property is `ready`.
The volume object that represents the volume with ID `id`.
This volume object can be polled to determine when the snapshot with name `snapshot-name` is not present in the `snapshots` list anymore, which means the snapshot was deleted successfully.
Param | Type | Description |
---|---|---|
type | String | The type of the volume package object, e.g 'tritonnfs' |
An array of objects representing volume packages with the type set to `type`.
An object representing the volume package with UUID `volume-package-uuid`.
A new `sdc:volumes` metadata property will be added that will contain all the data needed for an instance to determine what volumes it requires/mounts.
Machines acting as shared volumes' storage zones will have the value `nfsvolumestorage` for their `sdc:system_role` `internal_metadata` key. To know how this new value is used by CloudAPI to prevent users from performing operations on these storage VMs, refer to the section "Not exposing NFS volumes' storage VMs via any of the `Machines` endpoints".
When creating a VM via VMAPI, the new `volumes` parameter will allow one to specify a list of volumes to mount in the new VM. This would look as follows in the `CreateVM` payload:
"volumes": [
{
"name": "volume-name-1",
"type": "tritonnfs",
"mode": "rw",
"mountpoint": "/foo"
},
{
"name": "volume-name-2",
"mode": "ro",
"mountpoint": "/bar"
}
]
and the new VM would then have the specified volumes mounted when it starts, and the appropriate volume references will be added to indicate that this VM uses the listed volumes.
The `type` property of each object of the `volumes` array is optional. Its default and only valid value is `'tritonnfs'`.

The `mode` property of each object of the `volumes` array is also optional. Its default value is `'rw'`, and valid values for volumes of type `'tritonnfs'` are `'rw'` and `'ro'`.
Shared volumes names are unique per account. Thus, in order to be able to easily identify and search for shared volume zones without getting conflicts on VM aliases, shared volume zones' aliases will have the following form:
alias='volume-$volumename-$volumeuuid'
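For example, the storage VM backing the `my-volume` volume shown in the `triton volume get` output earlier would get an alias along these lines:

alias='volume-my-volume-24b90e7a-cd55-e706-cfe6-8d9146b2414c'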
Storage volumes sharing the same traits will be grouped into volume types. For instance, NFS shared volumes will have the volume type "tritonnfs". Potential future "types" of volumes are "tritonebs" for "Triton Elastic Block Storage" or "tritonefs" for "Triton Elastic File System".
Different settings (volume size, QoS, hardware, etc.) for a given volume type will be represented as volume packages. For instance, a 10 GiB tritonnfs volume will have its own package, and a different package will be used for a 20 GiB tritonnfs volume.
Each volume package has a UUID and a name, similarly to current packages used to provision compute instances. In order to avoid confusion, this document uses the term "compute packages" to explicitly distinguish these packages from volume packages.
Here's an example of a volume package object represented as a JSON string:
{
"uuid": "df40d4bf-b4f4-4409-84e7-daa04a347c18",
"name": "nfs4-ssd-10g", // As in 10 GiB
"size": 10, // In GiB,
"type": "tritonnfs",
"created_at": "2016-05-30T17:54:51.511Z",
"description": "A shared NFS volume providing 10 GiB of storage",
"active": true
}
Common properties of volume packages objects are:

- `uuid`: a unique identifier for a volume package.
- `name`: a unique name that represents a volume package. This name can be used in lieu of the UUID to provide an ID that is easier to use and remember.
- `type`: the kind of volume that a package is associated to. Currently, there is only one type of volumes that users can create: `'tritonnfs'` volumes. However in the future it is likely that other types of volumes will be available.
- `created_at`: a timestamp that represents when this volume package was created. Note that a volume package cannot be deleted.
- `description`: a string that gives further details to users about the package.
- `owner_uuids`: if present, an array of user UUIDs that represents what users can use (list, get and provision volumes with) a package. If empty, the package can be used by everyone.
- `active`: a boolean that determines whether a package can be used. Packages with `active === false` cannot be used by any user. They can only be managed by operators using VOLAPI directly.
Volume packages are polymorphic. Different volume types are associated with different forms of volume packages. For instance, `'tritonnfs'` volumes are associated with packages that have a `size` property because these volumes are not elastic.
Currently, there is just one volume type named `'tritonnfs'`, and its associated volume packages objects have the following specific properties:

- `size`: a Number representing the storage capacity in mebibytes provided by volumes created with this package.
- `compute_package_uuid`: the UUID of the compute package to use to provision the storage VMs when a new volume using this package is created.
Note that, when introduced, volume packages will not be used by DAPI to provision any VM. Instead, when creating a volume requires provisioning a storage VM, a compute package that matches the volume package used when creating the volume will be used.
However, volume packages' UUIDs will be used for billing purposes. This means that in the case of `tritonnfs` volume packages, their corresponding storage VMs' packages will not be used for billing.
Volume packages data are stored in a separate Moray bucket from the one used to store compute packages. This has the advantage of not requiring changes to, or migration of, the existing compute packages data, and overall makes the management of both compute and volume packages objects in Moray simpler.
It implies some potential limitations, such as making listing all packages (compute and volume packages) and ordering them by e.g creation time cumbersome, and not performing as well as a single indexed search. However, that use case doesn't seem to be common enough to be a concern.
Volume packages will have the following naming conventions:
$type$generation-$property1-$property2-$propertyN
where `$type` is a volume type such as `nfs`, `$generation` is a monotonically increasing integer starting at `1`, and `$propertyX` are values for differentiating properties of the volume type `$type`, such as the size.
For instance, the proposed names for tritonnfs volume packages are:
- nfs1-10g
- nfs1-20g
- nfs1-30g
- nfs1-40g
- nfs1-50g
- nfs1-60g
- nfs1-70g
- nfs1-80g
- nfs1-90g
- nfs1-100g
- nfs1-200g
- nfs1-300g
- nfs1-400g
- nfs1-500g
- nfs1-600g
- nfs1-700g
- nfs1-800g
- nfs1-900g
- nfs1-1000g
Param | Type | Description |
---|---|---|
type | String | Required. The type of the volume package object, e.g 'tritonnfs' . |
size | Number | Required when type is tritonnfs , otherwise irrelevant. A number in GiB representing the size of volumes created with this package. |
compute_package_uuid | String | Required when type is tritonnfs , otherwise irrelevant. It represents the UUID of the compute package to use to provision the storage VMs when a new volume using this package is created. |
description | String | Required. A string that gives further details to users about the package. |
owner_uuids | Array of strings | Optional. An array of user UUIDs that represents which users can use (list, get and provision volumes with) a package. If empty, the package can be used by everyone. |
active | Boolean | Required. A boolean that determines whether a package can be used. Packages with active === false cannot be used by any user. They can only be managed by operators using VOLAPI directly. |
An object representing a volume package:
{
"uuid": "df40d4bf-b4f4-4409-84e7-daa04a347c18",
"name": "nfs4-ssd-10g", // As in 10 GiB
"size": 10, // In GiB,
"type": "tritonnfs",
"created_at": "2016-05-30T17:54:51.511Z",
"description": "A shared NFS volume providing 10 GiB of storage",
"active": true,
"storage_vm_pkg": "ab21792b-6852-4dab-8c78-6ef899172fab"
}
GET /volumepackages/uuid
{
"uuid": "df40d4bf-b4f4-4409-84e7-daa04a347c18",
"name": "nfs4-ssd-10g", // As in 10 GiB
"hardware": "ssd",
"size": 10, // In GiB,
"type": "tritonnfs",
"created_at": "2016-05-30T17:54:51.511Z",
"description": "A shared NFS volume providing 10 GiB of storage",
"active": true,
"storage_vm_pkg": "ab21792b-6852-4dab-8c78-6ef899172fab"
}
As with compute packages, volume packages cannot be destroyed.
Even though shared volumes are implemented as actual zones in a way similar to regular instances, they represent an entirely different concept with different constraints, requirements and life cycle. As such, they need to be represented in Triton as different "Volume" objects.
The creation, modification and deletion of these "Volume" objects could technically be managed by VMAPI, but implementing this API as a separate service has the advantage of building the foundation for supporting volumes that are not implemented in terms of Triton VMs, such as volumes backed by storage appliances or external third party services.
Implementing the Volumes API as a separate service also has some nice side effects of:
- not growing the surface area of VMAPI, which is already quite large
- being able to actually decouple the Volumes API's development and deployment from VMAPI's
As a result, this section proposes to add a new API/service named "Volume API" or "VOLAPI".
Param | Type | Milestone | Description |
---|---|---|---|
name | String | master-integration | Allows filtering volumes by name. |
size | Stringified Number | master-integration | Allows filtering volumes by size. |
owner_uuid | String | master-integration | When not empty, only volume objects with an owner whose UUID is owner_uuid will be included in the output |
billing_id | String | volume packages | When not empty, only volume objects with a billing_id whose UUID is billing_id will be included in the output |
type | String | master-integration | Allows filtering volumes by type, e.g type=tritonnfs . |
state | String | master-integration | Allows filtering volumes by state, e.g state=failed . |
predicate | String | master-integration | URL encoded JSON string representing a JavaScript object that can be used to build an LDAP filter. This LDAP filter can search for volumes on arbitrary indexed properties. More details below. |
tag.key | String | mvp | A string representing the value for the tag with key key to match. More details below. |
vm_uuid | String | mvp | Allows getting the volume whose storage VM's uuid is vm_uuid. This applies to NFS volumes, and may not apply to other types of volumes in the future. |
name
is a string containing either a full volume name or a partial volume name
prefixed and/or suffixed with a *
character. For example:
- foo
- foo*
- *foo
- *foo*
are all valid name=
searches which will match respectively:
- the exact name foo
- any name that starts with foo, such as foobar
- any name that ends with foo, such as barfoo
- any name that contains foo, such as barfoobar
The predicate
parameter is a JSON string that can be transformed into an LDAP filter to search
on the following indexed properties:
- name
- owner_uuid
- billing_id (volume packages milestone)
- type
- size
- state
- tags (mvp milestone)
Important: when using a predicate, you cannot include the same parameter in both the predicate and the non-predicate query parameters. For example, if your predicate includes any checks on the name field, passing the name= query parameter is an error.
Note: searching for tags is to be implemented as part of the mvp milestone.
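For illustration only, and assuming the predicate syntax is the same as the one supported by VMAPI's ListVms endpoint (this is an assumption, not a requirement stated elsewhere in this document), a request listing only ready tritonnfs volumes could send the following predicate:

{
  "and": [
    { "eq": ["type", "tritonnfs"] },
    { "eq": ["state", "ready"] }
  ]
}

which, URL encoded, would be passed as:

GET /volumes?predicate=%7B%22and%22%3A%5B%7B%22eq%22%3A%5B%22type%22%2C%22tritonnfs%22%5D%7D%2C%7B%22eq%22%3A%5B%22state%22%2C%22ready%22%5D%7D%5D%7D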
A list of volume objects of the following form:
[
{
"uuid": "e435d72a-2498-8d49-a042-87b222a8b63f",
"name": "my-volume",
"owner_uuid": "ae35672a-9498-ed41-b017-82b221a8c63f",
"type": "tritonnfs",
"nfs_path": "host:port/path",
"state": "ready",
"networks": [
"1537d72a-949a-2d89-7049-17b2f2a8b634"
],
"snapshots": [
{
"name": "my-first-snapshot",
"create_timestamp": "1562802062480",
"state": "created"
},
{
"name": "my-second-snapshot",
"create_timestamp": "1572802062480",
"state": "created"
}
],
"tags: {
"foo": "bar",
"bar": "baz"
}
},
{
"uuid": "a495d72a-2498-8d49-a042-87b222a8b63c",
"name": "my-other-volume",
"owner_uuid": "d1c673f2-fe9c-4062-bf44-e13959d26407",
"type": "someothervolumetype",
"state": "ready",
"networks": [
"4537d92a-149c-6d83-104a-97b2f2a8b635"
],
"tags: {
"foo": "bar",
"bar": "baz"
}
}
...
]
GetVolume can be used to get data from an already created volume, or to determine when a volume being created is ready to be used.
Param | Type | Description |
---|---|---|
uuid | String | The uuid of the volume object |
owner_uuid | String | The uuid of the volume's owner |
A volume object representing the volume with UUID uuid
.
Param | Type | Description |
---|---|---|
name | String | The desired name for the volume. If missing, a unique name for the current user will be generated |
owner_uuid | String | The UUID of the volume's owner. |
size | Number | The desired storage capacity for that volume in mebibytes. Default value is 10240 mebibytes (10 gibibytes). |
type | String | The type of volume. Currently only 'tritonnfs' is supported. |
networks | Array | A list of UUIDs representing networks on which the volume will be reachable. These networks must be owned by the user with UUID owner_uuid and must be fabric networks. |
server_uuid (mvp milestone) | String | For tritonnfs volumes, a compute node (CN) UUID on which to provision the underlying storage VM. Useful for operators when performing tritonnfs volumes migrations. |
ip_address (mvp milestone) | String | For tritonnfs volumes, the IP address to set for the VNIC of the underlying storage VM. Useful for operators when performing tritonnfs volumes migrations to reuse the IP address of the migrated volume. |
tags (mvp milestone) | Object | An object representing key/value pairs that correspond to tags names/values. Docker volumes' labels are implemented with tags. |
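For illustration, a CreateVolume request for a 10 GiB tritonnfs volume could have a body like the following (all UUIDs and values are hypothetical):

POST /volumes

{
  "name": "my-volume",
  "owner_uuid": "ae35672a-9498-ed41-b017-82b221a8c63f",
  "type": "tritonnfs",
  "size": 10240,
  "networks": ["1537d72a-949a-2d89-7049-17b2f2a8b634"]
}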
A volume object representing the volume with UUID uuid
. The
state
property of the volume object is either creating
or failed
.
If the state property of the newly created volume is creating, sending GetVolume requests periodically can be used to determine when the volume is either ready to use (state === 'ready') or when it failed to be created (state === 'failed').
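As a sketch of that polling pattern, and assuming the sdc-volapi command installed in the global zone behaves like the other sdc-* API wrappers (an assumption, together with the hypothetical UUID below), an operator could check a volume's state like this:

$ sdc-volapi /volumes/e435d72a-2498-8d49-a042-87b222a8b63f | json -H state
ready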
Param | Type | Description |
---|---|---|
owner_uuid | String | The UUID of the volume's owner. |
uuid | String | The uuid of the volume object |
force | Boolean | If true, the volume can be deleted even if there are still non-deleted containers that reference it. |
If force
is not specified or false
, deletion of a shared volume is not
allowed if it has at least one "active user". If force
is true, a shared
volume can be deleted even if it has active users.
See the section "Deletion and usage semantics" for more information.
The output is empty and the status code is 204 if the deletion was scheduled successfully.
A volume is always deleted asynchronously. In order to determine when the volume
is actually deleted, users need to poll the volume's state
property.
If resources are using the volume to be deleted, the request results in an error and the error contains a list of resources that are using the volume.
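The exact shape of that error is not prescribed by this document; purely for illustration, it could look similar to the following (the error code name, message and UUIDs are hypothetical):

{
  "code": "VolumeInUse",
  "message": "Volume e435d72a-2498-8d49-a042-87b222a8b63f is used by the following VMs: a495d72a-2498-8d49-a042-87b222a8b63c"
}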
The UpdateVolume endpoint can be used to update the following properties of a shared volume:
- name, to rename a volume. See the section on renaming volumes for further details.
- tags, to add/remove tags for a given volume
Param | Type | Description |
---|---|---|
owner_uuid | String | The UUID of the volume's owner. |
uuid | String | The uuid of the volume object |
name | String | The new name of the volume with uuid uuid |
tags (mvp milestone) | Array of string | The new tags for the volume with uuid uuid |
The response is empty, and the HTTP status code is 204. This allows the
implementation to not have to reload the updated volume, and thus minimizes
latency. If users need to get an updated representation of the volume, they can
send a GetVolume
request.
The ListVolumeSizes
endpoint can be used to determine in what sizes volumes of
a certain type are available.
Param | Type | Description |
---|---|---|
type | String | the type of the volume (e.g tritonnfs ). Default value is tritonnfs |
Sending any other input parameter will result in an error.
The response is an array of objects having two properties:
- size: a number in mebibytes that represents the size of a volume
- type: the type of volume for which the size is available
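For illustration, a response for tritonnfs volumes could look like the following (the actual set of sizes depends on the packages configured in a given deployment):

[
  { "size": 10240, "type": "tritonnfs" },
  { "size": 20480, "type": "tritonnfs" },
  { "size": 102400, "type": "tritonnfs" }
]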
GetVolumeReferences
can be used to list VMs that are using the volume with
UUID uuid
.
A list of VM UUIDs that are using the volume with UUID uuid
:
[
"a495d72a-2498-8d49-a042-87b222a8b63c",
"b135a72a-1438-2829-aa42-17b231a6b63e"
]
AttachVolumeToNetwork
can be used to make a volume reachable on a given
network.
Param | Type | Description |
---|---|---|
owner_uuid | String | The UUID of the volume's owner. |
uuid | String | The uuid of the volume object |
network_uuid | String | The uuid of the network to which the volume with uuid uuid should be attached |
A volume object representing the volume with UUID uuid
.
DetachVolumeFromNetwork
can be used to make a volume that used to be reachable
on a given network not reachable on that network anymore.
Param | Type | Description |
---|---|---|
owner_uuid | String | The UUID of the volume's owner. |
uuid | String | The uuid of the volume object |
network_uuid | String | The uuid of the network from which the volume with uuid uuid should be detached |
A volume object representing the volume with UUID uuid
.
Volume references represent a relation of usage between VMs and volumes. A VM is
considered to "use" a volume when it mounts it on startup. A VM can be made to
mount a volume on startup by using the volumes input parameter of VMAPI's CreateVm endpoint.
References are represented in volume objects by a refs
property. It is an
array of VM UUIDs. All VM UUIDs in this array are said to reference the volume
object.
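For illustration, a volume referenced by two VMs would carry a refs property of the following form (all UUIDs are hypothetical, and most other volume properties are omitted):

{
  "uuid": "e435d72a-2498-8d49-a042-87b222a8b63f",
  "name": "my-volume",
  "state": "ready",
  "refs": [
    "a495d72a-2498-8d49-a042-87b222a8b63c",
    "b135a72a-1438-2829-aa42-17b231a6b63e"
  ]
}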
When a volume is referenced by at least one VM, it cannot be deleted, unless the
force
parameter of the DeleteVolume
API is set to true
.
When a VM that references a volume becomes inactive, its reference to that volume is automatically removed. If it becomes active again, it is automatically added.
Volume references are useful to represent a "usage" relationship between existing VMs and volumes. However, sometimes there's a need to represent a future usage relationship between volumes and VMs that do not exist yet.
When volumes are linked to the VM that mounts them at creation time, the volumes are created before that VM is created.
Indeed, since the existence of the VM is tied to the existence of the volumes it mounts, it wouldn't make sense to create it before all of its volumes are ready to be used.
However, having the volumes created before the VM that mounts them means that there is a window of time during which the volumes are not referenced by any active VM. As such, they could be deleted before the provisioning workflow job of the VM that mounts them completes and the VM becomes active.
Volume reservations are the abstraction that allows a VM that does not exist yet to reference one or more volumes, and prevents those volumes from being deleted until the provisioning job fails or the VM becomes inactive.
Volume reservations are composed of the following attributes:
- uuid: the UUID of the reservation object
- vm_uuid: the UUID of the VM being created
- job_uuid: the UUID of the job that creates the VM
- owner_uuid: the UUID of the owner of the VM and the volumes
- volume_name: the name of the volume being created
- create_timestamp: the time at which the volume reservation was created
The workflow of volume reservations can be described as follows:
- The VM provisioning workflow determines that the VM being provisioned mounts one or more volumes, so it creates those volumes
- Once all the volumes mounted by the VM being provisioned are created, the provisioning workflow creates a separate volume reservation for each volume mounted
- The VM starts being provisioned
Once volume reservations are created, it is not possible to delete the reserved volumes unless one of these conditions holds:
- the force flag is passed to the DeleteVolume request
- the VM provisioning job that reserved the volumes completed its execution and failed
- the VM mounting the volumes became inactive
Volume reservations are cleaned up periodically by VOLAPI so that stalled VM provisioning workflows do not hold volume reservations forever.
Volume reservations are also deleted when a reference from the same VM to the same volume is created. This happens when:
- the corresponding provisioning workflow job completes successfully
- the VM that mounts reserved volumes becomes active
Param | Type | Description |
---|---|---|
volume_name | String | The name of the volume being reserved |
job_uuid | UUID | UUID of the job provisioning the VM that mounts the volume |
owner_uuid | UUID | UUID for the owner of the VM with UUID vm_uuid and the volume with name volume_name |
vm_uuid | UUID | UUID of the VM being provisioned that mounts the volume with name "volume_name" |
A volume reservation object of the following form:
{
"uuid": "1360ef7d-e831-4351-867a-ea350049a934",
"volume_name": "input-name",
"job_uuid": "1db9b975-bd8b-4ed5-9878-fa2a8e45a821",
"owner_uuid": "725624f8-53a9-4f0b-8f4f-3de8922fc4c8",
"vm_uuid": "e7bc54f4-00ea-42c3-90c6-c78ee541572d",
"create_timestamp": "2017-09-07T16:05:17.776Z"
}
Param | Type | Description |
---|---|---|
uuid | String | The uuid of the volume reservation being deleted |
owner_uuid | String | The UUID of the owner associated to that volume reservation |
Empty 204 HTTP response.
Param | Type | Description |
---|---|---|
volume_name | String | The name of the volume being reserved |
job_uuid | UUID | UUID of the job provisioning the VM that mounts the volume |
owner_uuid | UUID | UUID for the owner of the VM with UUID vm_uuid and the volume with name volume_name |
An array of volume reservation objects.
A snapshot represents the state of a volume's storage at a given point in time. Volume snapshot objects are stored within volume objects because the relationship between a volume and a snapshot is one of composition. A volume object is composed, among other things, of zero or more snapshots. When a volume object is deleted, its associated snapshots are irrelevant and are also deleted.
A snapshot object has the following properties:
- name: a human readable name.
- create_timestamp: a number that can be converted to the timestamp at which the snapshot was taken.
- state: a value from the following set: creating, failed, created.
Param | Type | Description |
---|---|---|
name | String | The desired name for the snapshot to create. The name must be unique per volume. |
A snapshot object with the state 'creating':
:
{
"name": "input-name",
"state": "creating",
"create_timestamp": "2017-08-22T00:11:44.123Z"
}
Note that creating a snapshot can fail if no space is left in the corresponding ZFS dataset.
Param | Type | Description |
---|---|---|
owner_uuid | String | If present, the owner_uuid passed as a parameter will be checked against the owner_uuid of the volume identified by volume-uuid . If they don't match, the request will result in an error. |
The snapshot object with name snapshot-name
for the
volume with UUID volume-uuid
.
Note that rolling back a NFS shared volume to a given snapshot requires its underlying storage VM to be stopped and restarted.
Param | Type | Description |
---|---|---|
uuid | String | The uuid of the snapshot object that represents the state to which to rollback. |
name | String | The name of the snapshot object that represents the state to which to rollback. |
owner_uuid | String | If present, the owner_uuid passed as a parameter will be checked against the owner_uuid of the volume identified by volume-uuid . If they don't match, the request will result in an error. |
The volume object that represents the volume with UUID volume-uuid
, with its
snapshots
list updated to not list the snapshots that were taken after the
one to which the volume was rolled back.
Param | Type | Description |
---|---|---|
name | String | If present, only snapshot objects whose name matches name will be included in the output. |
owner_uuid | String | If present, the owner_uuid passed as a parameter will be checked against the owner_uuid of the volume identified by volume-uuid . If they don't match, the request will result in an error. |
A list of snapshot objects that were created from the
volume with UUID volume-uuid
.
Param | Type | Description |
---|---|---|
uuid | String | The uuid of the snapshot object to delete. |
owner_uuid | String | If present, the owner_uuid passed as a parameter will be checked against the owner_uuid of the volume identified by volume-uuid . If they don't match, the request will result in an error. |
An object that allows for polling the state of the snapshot being deleted:
{
"job_uuid": "job-uuid",
"volume_uuid": "volume-uuid"
}
Volumes are represented as objects that share a common set of properties:
{
"uuid": "some-uuid",
"owner_uuid": "some-uuid",
"name": "foo",
"type": "tritonnfs",
"create_timestamp": 1462802062480,
"state": "created",
"snapshots": [
{
"name": "my-first-snapshot",
"create_timestamp": "1562802062480",
"state": "created"
},
{
"name": "my-second-snapshot",
"create_timestamp": "1572802062480",
"state": "created"
}
],
"tags": {
"foo": "bar",
"bar": "baz"
}
}
- uuid: the UUID of the volume itself.
- owner_uuid: the UUID of the volume's owner. In the example of a NFS shared volume, the owner is the user who created the volume using the docker volume create command.
- billing_id (volume packages milestone): the UUID of the volume package used when creating the volume.
- name: the volume's name. It must be unique for a given user. This is similar to the alias property of VMAPI's VM objects. It must match the regular expression /^[a-zA-Z0-9][a-zA-Z0-9_\.\-]+$/. There is no limit on the length of a volume's name.
- type: identifies the volume's type. There is currently one possible value for this property: tritonnfs. Additional types can be added in the future, and they can all have different sets of type-specific properties.
- create_timestamp: a timestamp that indicates the time at which the volume was created.
- state: creating, ready, deleting, deleted, failed or rolling_back (snapshots milestone). Indicates in which state the volume currently is. failed volumes are still persisted to Moray for troubleshooting/debugging purposes. See the section "Volumes state machine" for a diagram and further details about the volumes' state machine.
- networks: a list of network UUIDs that represents the networks on which this volume can be reached.
- snapshots (snapshots milestone): a list of snapshot objects.
- tags (mvp milestone): a map of key/value pairs that represent volume tags. Docker volumes' labels are implemented with tags.
Volume names need to be unique per account. As indicated in the "Shared storage implementation" section, several volumes might be on the same zone at some point.
Renaming a volume is not allowed for volumes that are referenced by active Docker containers. The rationale is that Docker volumes are identified by their name for a given owner. Allowing volumes referenced (mounted) by active Docker containers to be renamed would thus mean either:
- Store the reference from Docker containers to volumes as a uuid. A Docker container foo mounting a volume bar would still be able to mount it if that volume's name changed to baz, even though that volume might be used for a completely different purpose. Think of bar and baz as db-primary and db-secondary for a real-world use case.
- Break volume references.
Different volume types may need to store different properties in addition to the properties listed above. For instance, "tritonnfs" volumes have the following extra properties:
- filesystem_path: the path that can be used by a NFS client to mount the NFS remote filesystem in the host's filesystem.
- vm_uuid: the UUID of the Triton VM running the NFS server that exports the actual storage provided by this volume.
- size: a Number representing the storage size available for this volume, in mebibytes.
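Putting the common and type-specific properties together, a tritonnfs volume object might look like the following (all UUIDs, the export path and the size are illustrative):

{
  "uuid": "e435d72a-2498-8d49-a042-87b222a8b63f",
  "owner_uuid": "ae35672a-9498-ed41-b017-82b221a8c63f",
  "name": "my-volume",
  "type": "tritonnfs",
  "create_timestamp": 1462802062480,
  "state": "ready",
  "networks": ["1537d72a-949a-2d89-7049-17b2f2a8b634"],
  "filesystem_path": "192.168.128.17:/exports/data",
  "vm_uuid": "a495d72a-2498-8d49-a042-87b222a8b63c",
  "size": 10240
}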
A volume is considered to be "in use" if the
GetVolumeReferences
endpoint doesn't return an empty list of VM UUIDs. When a container which mounts
shared volumes is created and becomes "active", it is added as a "reference" to
those shared volumes.
A container is considered to be active when it's in any state except failed
and destroyed
-- in other words in any state that can transition to running
.
For instance, even if a stopped container is the only remaining container that references a given shared volume, it won't be possible to delete that volume until that container is deleted.
Deleting a shared volume when there's still at least one active container that references it will result in an error. This is in line with Docker's API's documentation about deleting volumes.
However, a shared volume can be deleted if its only users are not mounting it
via the Triton APIs (e.g by using the mount
command manually from within a
VM), because currently there doesn't seem to be any way to track that usage
cleanly and efficiently.
Volume objects of any type are stored in the same moray bucket named
volapi_volumes
. This avoids the need for searches across volume types to be
performed in different tables, then aggregated and sometimes resorted when a
sorted search is requested.
Indexes are set up for the following searchable properties:
- owner_uuid
- name
- type
- create_timestamp
- state
- tags (mvp milestone)
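As a minimal sketch, assuming the standard Moray bucket configuration format, the volapi_volumes bucket definition could look like the following (the index types and version number are illustrative, and the tags index, which belongs to the mvp milestone, is omitted):

{
  "index": {
    "owner_uuid": { "type": "string" },
    "name": { "type": "string" },
    "type": { "type": "string" },
    "create_timestamp": { "type": "number" },
    "state": { "type": "string" }
  },
  "options": {
    "version": 1
  }
}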
Contrary to VMAPI's VM objects, moray objects are not stored for deleted volumes.
On the other hand, volumes in state failed
are stored in Moray, but they do
not need to be present in persistent storage forever. Not deleting these entries
has an impact on performance. For instance, searches take longer and responses
are larger, which tends to increase response latency. Maintenance is also
impacted. For instance, migrations take longer as there are more objects to
handle.
Eventually, a separate process running in the VOLAPI service's zone might be added to archive these entries after some time. The delay after which a volume object is archived would need to be chosen so that stale volume objects are kept in persistent storage long enough that most debugging tasks that require inspecting them can still take place.
Shared volumes are hosted on zones that need to be operated in a way similar to other user-owned containers. Operators need to be able to migrate, stop, restart, upgrade them, etc.
A new sdc-pkgadm command line tool will be added, initially with support for volume packages only.
sdc-pkgadm volume add --name $volume-pkg-name --type $volume-type --size $volume-pkg-size [--storage-instance-pkg $storage-vm-pkg-uuid] --description '$description'
When creating a tritonnfs
volume package, a matching compute package must be
provided with the --storage-instance-pkg
command line option.
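For illustration, creating one of the tritonnfs volume packages listed earlier could look like the following (the compute package UUID is hypothetical, and the exact format accepted by --size is assumed here, not specified):

$ sdc-pkgadm volume add --name nfs1-10g --type tritonnfs --size 10g \
    --storage-instance-pkg ab21792b-6852-4dab-8c78-6ef899172fab \
    --description 'A shared NFS volume providing 10 GiB of storage'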
sdc-pkgadm volume activate|deactivate $volume-pkg-uuid
This functionality is not required for the MVP.
Triton operators need to be able to perform new operations using a new
sdc-voladm
command.
- Listing all shared volume zones:
sdc-voladm list-volume-zones
- Listing all shared volume zones for a given user:
sdc-voladm list-volume-zones owner-uuid
Shared volumes zones owned by a given user, or with a specific uuid, can be restarted. Specifying both an owner and a shared volume uuid checks that the shared volume is actually owned by the owner.
sdc-voladm restart-volume-zone [--owner owner-uuid] shared-volume-uuid
If active containers reference the shared volumes that need to be restarted,
sdc-voladm
doesn't restart the shared volume zones and instead outputs the
containers' uuids so that the operator knows which containers are still using
them.
Shared volume zones can be updated to the latest version, or any specific version, for a specific user or for a specific shared volume. Specifying both an owner and a shared volume uuid checks that the shared volume is actually owned by the owner.
sdc-voladm update-volume-zone [--owner owner-uuid] --image [shared-volumes@uuid] shared-volume-uuid
If Docker containers use the shared volumes that need to be updated, sdc-voladm doesn't update the shared volume zones and instead outputs the containers' uuids so that the operator knows which containers are still using them.
Shared volume zones owned by a specific user, or with a specific uuid, can be deleted by an operator. Specifying both an owner and a volume uuid checks that the shared volume is actually owned by the owner:
sdc-voladm delete-volume-zone [--owner owner-uuid] shared-volume-uuid
If Docker containers use the shared volumes that need to be deleted, sdc-voladm doesn't delete the shared volume zones and instead outputs the containers' uuids so that the operator knows which containers are still using them.
Development of features and changes presented in this document will happen in milestones so that they can be integrated progressively.
The first milestone represents the set of changes that tackles the major design issues and establishes the technical foundations. It also provides basic features to end users so that they can use NFS volumes in a way that is useful and fulfills the original goal of this RFD.
The MVP milestone builds on the first "master integration" milestone and implements features and changes that are part of the core functionality of NFS volumes but that did not make it into the master integration milestone because they did not represent major design risks.
This milestone groups tickets related to allowing users to mount volumes from non-Docker containers using CloudAPI (and other tools such as node-triton).
The operators milestone groups changes and features aimed at making operating NFS volumes easier.
The "volume packages" milestone adds the ability for users to specify volume package names or UUIDs instead of just a size (or any other attribute that belongs to a volume's configuration) when performing operations on volumes.
The "snapshots" milestone groups changes that allow users to create and manage snapshots of their volumes.
The affinity milestone groups changes that allow users to provision volumes and specify locality constraints against other volumes or VMs.
In order to setup support for NFS volumes in a given DC, operators need to run the following commands:
$ sdcadm post-setup volapi
$ sdcadm experimental nfs-volumes docker
$ sdcadm experimental nfs-volumes docker-automount
$ sdcadm experimental nfs-volumes cloudapi
$ sdcadm experimental nfs-volumes cloudapi-automount (mvp milestone)
sdcadm post-setup volapi
creates a new VOLAPI core service and its associated
zone on the headnode.
sdcadm experimental nfs-volumes docker
sets a flag in SAPI that indicates that
the "NFS volumes" feature is turned on for the Triton docker API, but not for
any other external API. It checks that the core services that have changes that
this feature depends on (VMAPI, workflow, sdc-docker) are upgraded to a version
that ships those changes.
sdcadm experimental nfs-volumes docker-automount
sets a flag in SAPI that
indicates that the "NFS volumes automount" feature is turned on for docker
containers. This means that docker containers that are set to depend on shared
volumes when they're created mount those volumes automatically on startup.
Turning on this feature flag depends on all servers running a platform at
version >= 20160613T123039Z
, which is the platform that ships the required
dockerinit
changes. If this requirement is not met, a warning message is printed, but the feature flag is still enabled.
Turning on sdcadm experimental nfs-volumes docker-automount
requires turning
on sdcadm experimental nfs-volumes docker
, but does not do that automatically.
sdcadm experimental nfs-volumes cloudapi
sets a flag in SAPI that indicates
that the "NFS volumes" feature is turned on for the Triton CloudAPI API, but not
for any other external API. It checks that the other core services that have
changes that this feature depends on (VMAPI, workflow, cloudapi) are upgraded to
a version that ships those changes.
sdcadm experimental nfs-volumes cloudapi-automount
is similar to
sdcadm experimental nfs-volumes docker-automount
. It sets a flag in SAPI that
indicates that the "NFS volumes automount" feature is turned on for non-docker
VMs. This means that non-docker VMs that are set to depend on shared volumes
when they're created mount those volumes automatically on startup. Turning on
this feature flag will depend on all servers running a platform at a version
>= 20170925T211846Z
(see
PUBAPI-1420). If this
requirement is not met, a warning message is printed, but the feature flag is still enabled.
Operators may want to turn any of the experimental "nfs-volumes" SAPI flags off when, for instance, it was enabled but caused issues in a given deployment.
They can do that by running the same commands as in the previous section after
appending a -d
command line option to them:
$ sdcadm experimental nfs-volumes docker -d
$ sdcadm experimental nfs-volumes docker-automount -d
$ sdcadm experimental nfs-volumes cloudapi -d
The first milestone to have its changes merged to the master branches of relevant code repositories is the "master integration" milestone.
The list of repositories with changes that need to be integrated as part of that milestone is:
- joyent/sdc-volapi
- joyent/sdcadm
- joyent/sdc-workflow
- joyent/sdc-vmapi
- joyent/sdc-docker
- joyent/node-sdc-clients
- joyent/sdc-sdc
- joyent/sdc-headnode
The joyent/sdc-volapi repository is different from the others in the sense that it's a new repository. As such, development took place in its "master" branch, and so all relevant changes have already been merged.
For all other repositories, all changes relevant to this RFD are in a branch named "tritonnfs". They can be integrated into their respective master branch in the following sequence:
1. node-sdc-clients, sdc-headnode, sdc-sdc
2. sdc-workflow, sdc-docker, sdc-cloudapi, sdc-vmapi
3. sdcadm
Changes in node-sdc-clients need to be integrated first because several other repositories (sdc-docker and sdc-vmapi) depend on them. Changes in sdc-headnode and sdc-sdc can be integrated independently of other changes. They provide the "sdc-volapi" command in the GZ. If the command is not present when support of volumes is enabled, operability suffers, but end users can use NFS volumes. If the command is present but support of volumes is not enabled, then the command will output a clear error message when it's used.
Then, because sdcadm experimental nfs-volumes
commands check that the core
services mentioned in point 2) with changes in their respective tritonnfs
branches are provisioned with an image that ships those changes, ideally changes
in sdcadm
would be integrated after the changes made to those repositories.
- Does DAPI need to handle over-provisioning differently for volume packages (e.g less aggressive about over-provisioning disk)?
- Do volume packages have to be sized in multiples of disk quota in compute packages, assuming that bin-packing will be more efficient?
In order to allow for both optimal utilization of hardware and good performance for I/O operations on shared volumes, package settings of NFS server zones must be chosen carefully.
The current prototype makes NFS server zones have a cpu_cap of 100, and memory and swap caps of 256 MiB. In ad-hoc, manual performance tests where one Docker container wrote a 4 GiB file to a shared volume while 9 other containers read the same file, CPU utilization and memory footprint did not go over 40% and 115 MiB respectively.
There is definitely a need for an automated way to run benchmarks that characterize a much wider variety of workloads to determine optimal capacity depending on planned usage.
However, it doesn't seem likely that, at least in the short term, we'll be able to use packages with a smaller memory capacity, since the node process that acts as the user-mode NFS server uses around 100 MiB of memory even with one NFS client performing I/O. Thus, the package settings used by the current prototype seem to be the minimal settings that still leave some room for small, unexpected memory footprint growth without degrading performance too much.
It is also unclear whether it is desirable to have different CPU and RAM settings for different volume sizes. It seems that the CPU and RAM values should be chosen based on two different criteria:
- the expected load on a given volume (number of concurrent clients, rate of I/O, etc.), which is not possible to predict
- the ability to perform optimal placement
When NFS server zones or the services they run become severely degraded, usage of associated shared volumes is directly impacted. NFS server zones and their services are considered to be an implementation detail of the infrastructure that is not exposed to end users. As a result, end users have no way to react to such problems and can't bring the service to a working state.
Operators, and potentially Triton components, need to be able to react to NFS server zones' services being in a degraded state. Monitoring only the state of NFS server zones is not sufficient, because remote filesystems are exported by a user-level NFS server: a crash of that process, or any problem that makes it unresponsive to I/O requests, would keep the zone running but would still make the service unavailable.
Examples of such problems include:
- High CPU utilization that prevents most requests from being served.
- Functional bugs in the NFS server that prevent most requests from being served.
- Bugs in the NFS server or the zone's software that prevent the service from being restarted when it crashes.
Thus, it seems that there's a need for a separate monitoring solution that allows operators and Triton components to be notified when a NFS server zone is not able to serve I/O requests for their exported NFS file systems.
This monitoring solution could be based on amon or on RFD 27 and needs to allow operators to receive alerts about NFS shared volumes not working at their expected level of service.
Users may want to change the network to which a shared volume is attached for several reasons:
- The previous network was too small, or too large, and needs to be recreated to be made larger or smaller.
- Containers belonging to several different networks without a route between them need to access the same shared volume.
In order to achieve this, virtual NICs will need to be added to and/or removed from the NFS shared volume's underlying storage VM. These changes currently require a zone reboot, and thus will make the shared volume unavailable for some time.
I'm not sure if there's anything we can do to make network changes not impact shared volumes' availability. If there's nothing we can do, we should probably communicate it in some way (e.g in the API documentation or by making changes to the API/command line tools so that the impact on shared volumes is more obvious).