-
Notifications
You must be signed in to change notification settings - Fork 876
Adds storage driver interface, tests, and three basic implementations #643
Adds storage driver interface, tests, and three basic implementations #643
Conversation
8191555 to
734fda5
Compare
734fda5 to
3f95694
Compare
This is done because libchan/spdystream does not currently support sending serialzied objects of size larger than 16MB See docker-archive-public/docker.libchan#65
1c68bdb to
7c892de
Compare
Preliminary s3 driver implementation
|
I'm a little confused with a lot of the added complexity that this storage driver implementation brings vs. the current registry's storage implementation. There's a lack of documentation here so there's not a lot to go on asides from the discussions in IRC and the NG-related issues, so apologies for my negligence if I missed something: Why do drivers need to be spawned as separate processes? Given a new release of a filesystem driver in a third party package, wouldn't it be easier to just update that specific package and reload the server config to use the newest version? I just don't feel like an IPC system is the solution for zero-downtime registry reloads, and we should be relying on a more package-based approach for drivers. As an example, in the current registry it's as simple as If we were to ignore my above comment, how would installing/starting a new third-party driver look like, both inside and outside of a running container? In the eyes of the core maintainers, how would we maintain and ensure that these drivers are up and running? With this change, configuration is no longer in a central |
|
On Mon, Oct 27, 2014 at 09:29:19PM -0700, Matthew Fisher wrote:
It makes them language-agnostic. Otherwise, folks would have to
All of that is still true with a network-linked storage process.
Using something like my Fig config [1,2], it would look like: $ sed -i 's/image: old-streaming-storage:1.2.3/image: new-streaming-storage:1.0.3/' fig.yml from your host. I don't see a point to being inside a running
I'm not a maintainer at all (just an interested third party), but I'd
I think we can have a central config. You just need good |
storagedriver/storagedriver.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make more sense for List to return a slice of os.FileInfo structures instead for verbosity? Not sure if it'd be compatible with certain key/value store bindings, but it's a thought. http://golang.org/pkg/os/#FileInfo
We should also be pointing out whether List is recursive or non-recursive on child directories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written, the List method is recursive.
In the os.FileInfo interface, we'll likely have Name, Size, ModTime, and IsDir defined with reasonable storage drivers, but Mode is unlikely to provide any reasonable information (Sys is fine returning nil).
Should we create and use our own FileInfo interface/struct with only Name, Size, ModTime and IsDir?
|
@wking all points SGTM (sounds good to me). As I mentioned, there's a whole lot of code here but no documentation on the storage driver endpoints and with the IPC test suite. Is documentation going to be supplied separately? |
storagedriver/storagedriver.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #630, Move will typically involve full copies followed by deletion, save for the local filesystem. I can see the benefit for the local filesystem but would it make sense to remove this to reduce boilerplate between storage drivers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not committed to the Move method being part of this interface, although for specific implementations such as the local filesystem driver, this can be a huge win for these operations. I guess the main question is whether or not the registry has any need to move objects in its file storage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. This is a big win for specific implementations and there's not a large hassle to manage a Move method in downstream drivers. I'm sure drivers would appreciate if a move operation was supported. I don't see a big reason to remove this, so I'm happy with keeping it in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Tue, Oct 28, 2014 at 03:04:32PM -0700, Matthew Fisher wrote:
Right. This is a big win for specific implementations so there's not
a large hassle to manage aMovemethod in downstream drivers. I'm
sure drivers would appreciate if a move operation was supported. I
don't see a big reason to remove this, so I'm happy with keeping it
in.
I'm not following this. What is the big win, who is appreciating
Move, and why?
If a storage-driver author wants to implement a move endpoint, I have
no problem with that. If the docker-registry wants to use such an
endpoint (optionally?), I'd like to understand why we need it. I
don't see a reason why docker-registry would need a move endpoint, so
I'd rather leave it out of the API, and have storage-driver authors
not need to bother with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are going to use move in the registry, a lot. Tarsums / checksums need to be computed, and the file needs to be complete. So, the layers will be moved to their final destination once they are complete. Whether the driver implements a smart "move", or is just smart when copying, should be hidden under the specific method call.
I'm +1 on keeping move.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Tue, Oct 28, 2014 at 03:23:37PM -0700, Olivier Gambier wrote:
We are going to use move in the registry, a lot. Tarsums / checksums
need to be computed, and the file needs to be complete. So, the
layers will be moved to their final destination once they are
complete.
I'd rather offload that implementation to the storage driver itself.
The registry just needs to say “I'm going to stream $COUNT bytes [to
$PATH]” (with the presense of $PATH depending on whether or not the
streaming storage is content-addressable). Then the storage driver
can handle that however it likes, to avoid accidentally serving a
partially-complete file when someone asks for $PATH.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't hear the same thing with "content-adressibility". This is about being able to retrieve a given binary payload from an id that can be computed again from that binary payload. This is what tarsum is for, and I simply fail to see a point in asking driver author to copy the same code over and over again in every implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Tue, Oct 28, 2014 at 03:48:10PM -0700, Olivier Gambier wrote:
Maybe we don't hear the same thing with
"content-adressibility". This is about being able to retrieve a
given binary payload from an id that can be computed again from that
binary payload. This is what tarsum is for, and I simply fail to see
a point in asking driver author to copy the same code over and over
again in every implementation.
Right. To me “an id that can be computed again from that binary
payload” sounds like “we don't need move”. If you want to specify
that id for the initial write, that's fine 1.
|
On Tue, Oct 28, 2014 at 04:00:36AM -0700, Matthew Fisher wrote:
I don't mind if it's separate or not, but I hope we'll get |
Sorry the code is lacking on that front right now. Let me start off by explaining the intent behind the IPC storage driver system. Because go is a compiled language, we can either distribute binaries or require each user to have a fully configured go development environment with the docker-registry repository and all other dependencies to allow for any modifications including the addition of third-party storage drivers. With an IPC system in place, the user only needs to include a new driver executable to add or replace a storage driver. The actual IPC implementation is inspired by the plugin system in mitchellh/packer, and as currently written, the StorageDriverClient creates and manages a child process with a shared unix domain socket. As the driver is running in a separate process and the IPC system is written on top of libchan, the storage driver is not tied to the same language as the registry itself, but admittedly the language implementations of libchan are currently limited. The language separation is primarily a side effect of the IPC implementation, but writing storage drivers in python or other languages is not out of the question. As for swapping out drivers without restarting your registry process, this is not in the current design, and would have to be approached carefully. Switching to a different storage system with different state is not well-defined behavior for the registry, so this would only be supported for retaining the same driver name and parameters as originally defined while launching the registry process itself. We would also need to agree on the signal or action that triggers a restart of the storage driver child process.
Because the registry process is in charge of launching the storage driver, it can forward any driver configuration parameters to the process on startup. In each of the driver mains, the driver is able to read a json-serialized parameters map, just as defined in the configuration file or environment variables on the parent process (see #646). |
Currently any storage driver can use the same test suites than any of the default drivers are using by calling |
|
On Tue, Oct 28, 2014 at 10:10:21AM -0700, Brian Bland wrote:
Obviously, the registry can only serve images that are stored in its Since it's harder to pass around links to unix sockets, I'd probably I don't think we need to preserve the driver name (why would the I'd leave swapping storage containers to the orchestration layer |
|
On Tue, Oct 28, 2014 at 10:27:32AM -0700, Brian Bland wrote:
If I have a running container that exposes an endpoint for the $ test-streaming-storage http://my-storage.example.com/ and have it run through the test suite without my having to dip into the |
|
@bacongobbler @wking @BrianBland Thanks a lot for this. I feel a quick (short) summary can be useful here:
Now, I'm happy if that design let other things to be done, and opens-up creativity, but the bottom-line is (@bacongobbler): make like simple to drivers authors and clean separation between the drivers and the registry. |
|
@BrianBland let's go for another review of this as soon as @BrianBland provides some documentation and drivers authors (@noxiouz @bacongobbler) can kick the philosophy in the tires. |
Custom storage drivers can register a factory to create the driver by name, similar to the database/sql package's Register and Open factory.Create returns an in-process driver if registered or an IPC driver if one can be found, erroring otherwise This standardizes parameter passing for creation of storage drivers Also adds documentation for storagedriver package and children
ba7bafb to
ca0084f
Compare
storagedriver/ipc/server.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
er -> err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is why we have CI...
737b474 to
0e5d41f
Compare
b8f0be3 to
4efdf4d
Compare
4efdf4d to
3e47385
Compare
|
@ahmetalpbalkan registry do need a move operation at the driver level The main reason for that is ids used to store blobs will be content-adressable - hence requiring to first store some given blob in a temporary place, then move it to the final destination. If you don't have a move operation, then no big deal, wrap a copy+delete (as long as your copy is efficient enough). |
|
@dmp42 that works as long as the caller (registry) does not assume that Move operation is atomic (transactional) and if the code crashes in the middle of that, you may end up with 2 copies (since Delete isn't executed yet). |
|
@ahmetalpbalkan fair enough |
libchan now supports io.ReadCloser and io.WriteCloser, so we don't need io.ReadWriteCloser wrapping
storagedriver/storagedriver.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How big do we expect directories to get? Do we have any reasonable bound on the number of directory entries?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the use case for using from and size? How do you see them being used in the registry? If we're concerned about having the prefix at a high enough level being slow, we could make it optionally non-recursive:
Count(prefix string) (int, error)
List(prefix string, recursive bool) ([]string, error)That should make List("/", false) only display ["images/", "repositories/"] and List("/images", false) significantly less time-consuming than recursively listing every directory under /images.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Mon, Nov 03, 2014 at 01:22:40PM -0800, Matthew Fisher wrote:
What would be the use case for using
fromandsize? How do you
see them being used in the registry? If we're concerned about having
the prefix at a high enough level being slow, we could make it
optionally non-recursive:Count(prefix string) (int, error) List(prefix string, recursive bool) ([]string, error)That should make
List("/", false)only display["images", "repositories"]andList("/images", false)significantly less
time-consuming than recursively listing * every* directory under
/images.
And List("/repositories", false) would list all the namespaces? I'd
rather say:
List("namespace/", 0, 100)
for “give me the first hundred namespaces”. I'd avoid recursion by
using namespaced keys (intead of heirarchical keys):
namespace/alice
namespace/bob
namespace/library
…
repository/alice/busybox
repository/bob/debian
repository/library/debian
…
tag/alice/busybox/0.1
tag/bob/debian/latest
tag/library/debian/latest
so you could iterate through namespaces (with a “namespace/” prefix),
or through repositories in a namespace (with a “repositories/alice/”
prefix), or through all repositories (with a “repositories/” prefix),
…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After doing a little bit of research it looks like the from and size arguments would cause some issues with certain storage systems including S3.
The S3 list equivalent GET Bucket supports max-keys and a marker argument, which is the last key of then previous result set, or the key before the first that should be returned from the call. Implementing the interface with List(prefix string, size int, from int) with this model seems like it would require loading all results up until the from argument so that we can determine what the appropriate marker is.
I'm all for namespaced keys, but we should find a good solution for loading all namespaces unless we're fine returning a number of results equal to the number of users who own repositories in the worst case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this more carefully, we thought it might be better to keep this IPC call simple. Without introducing problems with missing or inconsistent directory listings, the variation in the drivers makes getting a reentrant interface right problematic. The other issue here is that whether or not these calls are reentrant, you end up allocating the same amount of memory. Thus, large directories will cause latency in registry requests no matter what.
As an alternative, we can mitigate listing sizes by controlling underlying directory layout. For now, let's keep List(prefix string) ([]sting, error) as the interface method. We should also define toList` to not return recursive results. (@dmp42 Was there a reason to support recursive directory listings?)
We can always revisit this in the future if we've made the wrong decision today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stevvooe We don't need/want recursive directory listing (disclaimer: haven't read all comments, and assume recursive means -r)
And simpler is better :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Mon, Nov 03, 2014 at 07:05:04PM -0800, Stephen Day wrote:
Without introducing problems with missing or inconsistent directory
listings, the variation in the drivers makes getting a reentrant
interface right problematic.
Using either a marker string 1 or an opaque marker type 2 seems to
support all proposed backends. With the marker stored in the
registry, the storage backend doesn't need to keep track of open list
calls, so a reentrant interface is easy. Or am I missing something?
The other issue here is that whether or not these calls are
reentrant, you end up allocating the same amount of memory. Thus,
large directories will cause latency in registry requests no matter
what.
How so? ZRANGEBYSCORE is O(log(N)+M) with N being the number of
elements in the sorted set and M the number of elements being
returned. That scales pretty well. More importantly, it lets you
start streaming results from the registry to the client before you've
finished listing the whole directory.
We can always revisit this in the future if we've made the wrong
decision today.
True, but we're already having trouble with the current layout (#614).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wking Before I address your points, fear not, we intend to address #614, in one way or another. We just need to understand the right way to do it.
Using either a marker string [1] or an opaque marker type [2] seems to
support all proposed backends. With the marker stored in the
registry, the storage backend doesn't need to keep track of open list
calls, so a reentrant interface is easy. Or am I missing something?
Yes, the marker or index will support most backends, but the behavior may be different on updates to that directory. With a marker string, an insert before the marker after the first call will lead to a missed entry. With an index, one may have duplicate entries. The behavior between various backends under these conditions may vary greatly, requiring calling code (ie the registry) to handle that variation. By going with the simpler call, we avoid this variation and leave it up to the driver to present a "consistent" view of the directory state.
How so? ZRANGEBYSCORE is O(log(N)+M) with N being the number of
elements in the sorted set and M the number of elements being
returned. That scales pretty well. More importantly, it lets you
start streaming results from the registry to the client before you've
finished listing the whole directory.
That's not the case I'm worried about. If we want to iterate over the entries of a large directory to support an API operation, the latency of the registry request will still be proportional to the size of the directory, reentrant or not. This should be avoided, if possible.
Let's proceed with this simplified API and table this discussion, for now. We have collected some really good suggestions but we really need to look at how we can avoid this bottleneck. Once the V2 paths are laid out, we can come back and see if we really need the reentrant API call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On Mon, Nov 03, 2014 at 09:07:51PM -0800, Stephen Day wrote:
Using either a marker string [1] or an opaque marker type [2]
seems to support all proposed backends. With the marker stored in
the registry, the storage backend doesn't need to keep track of
open list calls, so a reentrant interface is easy. Or am I
missing something?Yes, the marker or index will support most backends, but the
behavior may be different on updates to that directory. With a
marker string, an insert before the marker after the first call will
lead to a missed entry.
That entry wouldn't have been part of the results from an initial
list-all call anyway. I'm fine skipping over it during an iterative
list.
With an index, one may have duplicate entries.
So just have the registry remember the last entry it got (as well as
the opaque marker), and then skip anything <= that entry when parsing
the next batch of entries.
How so? ZRANGEBYSCORE is O(log(N)+M) with N being the number of
elements in the sorted set and M the number of elements being
returned. That scales pretty well. More importantly, it lets you
start streaming results from the registry to the client before
you've finished listing the whole directory.That's not the case I'm worried about. If we want to iterate over
the entries of a large directory to support an API operation, the
latency of the registry request will still be proportional to the
size of the directory, reentrant or not.
No, if the request is “give me a list of tags and their associated
image IDs so I can pull them” (#614), you can start writing back to
the client immediately, without waiting for the tag list to complete.
That's just a:
storage → registry → client
pipe, and there's no reason to block the registry → client leg.
Let's proceed with this simplified API and table this discussion,
for now. We have collected some really good suggestions but we
really need to look at how we can avoid this bottleneck. Once the V2
paths are laid out, we can come back and see if we really need the
reentrant API call.
Well, the API is versioned, so it shouldn't be too hard to adopt this
later. Still, it doesn't seem that complicated to me, and I think
we'll need to adopt something that lets us stream results. This is
what other storage engines already support for streamed listings, so I
expect it's a reasonable choice for how to structure this ;).
1a14efa to
43716a2
Compare
|
On Tue, Oct 21, 2014 at 03:06:43PM -0700, Brian Bland wrote:
Very minor, and feel free to ignore, but I like “memory” more than Random thoughts on the docs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure that this comment belongs with this commit, but it might be reasonable to use an integration test build tag to separate the unit tests from the integration tests.
|
TL; DR LGTM! I think this change is looking good. I'm putting together the specification for the V2 API tonight/tomorrow morning and will have a better idea of the underlying storage layout. We may make some changes going forward with information there but let's go ahead and get this PR merged. |
|
LGTM This has been under scrutiny for two weeks, long enough I guess. Merging. |
Adds storage driver interface, tests, and three basic implementations
Re-opening of #630 with the new orphan next-generation branch.
@dmp42 @wking @bacongobbler @noxiouz