Specification for cloud-native URL pipelines that can reference zarr arrays and other resources within cloud storage and nested formats like zip, OCDBT, icechunk

jbms/url-pipeline

URL pipeline specification

Warning

This specification is still in draft stage. Incompatible changes to both the specification as a whole and individual schemes may still be made until it advances to version 1.

Abstract

This specification defines a new URL pipeline syntax for specifying how to locate data resources within possibly nested container formats.

For example:

  • s3://bucket/path/to/archive.zip|zip:path/within/zip.zarr/|zarr3:

    Refers to a Zarr v3 node at the directory named path/within/zip.zarr/ within a ZIP file at the path path/to/archive.zip within the AWS S3 bucket named bucket.

  • file:///tmp/dataset.ocdbt/|ocdbt://2025-01-01T01:23:45.678Z/path/within/database

    Refers to a file at path/within/database as of commit time 2025-01-01T01:23:45.678Z within an OCDBT database stored at the local filesystem path of /tmp/dataset.ocdbt/.

  • s3+https://example.com/path/to/database.icechunk/|icechunk://tag.v5/path/to/node/|zarr3:

    Refers to the Zarr v3 node at path/to/node/ in tag v5 in the Icechunk database at /path/to/database.icechunk on the S3-compatible server https://example.com.

Motivation

Existing well-established URL schemes like file: or http:, and existing but less-standard URL schemes like s3: and gs:, are sufficient for specifying the location of a resource that is directly available as a local file, from an HTTP server, or on cloud storage.

However, for specifying a resource within nested storage mechanisms like a ZIP archive, OCDBT database, Icechunk repository or Zarr hierarchy, or for specifying a data format explicitly, there is no established URL syntax.

Specification

This specification defines a new URL pipeline syntax intended to allow the locations of data resources of various kinds, within arbitrarily nested container formats, to be specified in a uniform way.

An absolute URL pipeline consists of a sequence of sub-URLs separated by | (vertical bar) characters. The first sub-URL has the semantics of a conventional URL and must have a root scheme that specifies the location of a data resource. Subsequent sub-URLs must have adapter schemes that transform the base data resource specified by the sequence of prior sub-URLs (called the base URL) into a new adapted data resource.
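
As a minimal sketch of this structure, a pipeline can be split on | into its sub-URLs, taking the scheme of each as the text before the first colon. The helper name split_pipeline is hypothetical and not part of this specification, and the sketch assumes that | never occurs inside a sub-URL.

```python
def split_pipeline(pipeline: str) -> list[tuple[str, str]]:
    """Split a URL pipeline into (scheme, sub_url) pairs, root sub-URL first.

    Hypothetical helper for illustration; it does not validate the
    scheme-specific part of each sub-URL.
    """
    parts = []
    for sub_url in pipeline.split("|"):
        scheme, sep, _ = sub_url.partition(":")
        if not sep or not scheme:
            raise ValueError(f"sub-URL has no scheme: {sub_url!r}")
        parts.append((scheme, sub_url))
    return parts


parts = split_pipeline(
    "s3://bucket/path/to/archive.zip|zip:path/within/zip.zarr/|zarr3:"
)
root_scheme = parts[0][0]                          # "s3"
adapter_schemes = [scheme for scheme, _ in parts[1:]]  # ["zip", "zarr3"]
```

The first pair corresponds to the root sub-URL; every later pair is an adapter sub-URL applied to everything before it.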

Note

The URL pipeline syntax defined by this specification is explicitly an extension of the standard URI syntax defined by RFC 3986. A pipeline consisting of just a single sub-URL conforms to the syntax and semantics of a standard URI, but a pipeline with more than one sub-URL does not.

Relative URL pipelines are not currently supported by this specification, but are expected to be added in a future version.

Data resource kinds

A URL pipeline may refer to several different kinds of data resources:

  • file: Single raw file (sequence of bytes), no specific data format.

  • directory: Single directory hierarchy within a key-value store, no specific data format.

  • array: Multi-dimensional array with a defined format, e.g. a Zarr array or HDF5 array.

  • array-group: Hierarchy of multi-dimensional arrays, e.g. a Zarr group or HDF5 group.

  • other: Other types of data resources for which this specification does not yet give a precise categorization, such as Neuroglancer precomputed mesh, skeleton, or annotation datasets.

Root schemes specify the kinds of data resources to which they may refer.

Adapter schemes specify constraints on their source data resources and how the adapted resource kind relates to the source resource kind.

In some cases the resource kind can be determined syntactically from the URL alone, while in other cases it can only be resolved by actually accessing the resource.
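
One way an implementation might model these kinds is sketched below. The enum values are the five kinds listed above; the ADAPTER_SOURCE_KINDS table and the accepts_source function are hypothetical, and the two entries are only inferred from the examples in the abstract. The authoritative constraints are those given in the specification of each scheme.

```python
from enum import Enum


class ResourceKind(Enum):
    FILE = "file"
    DIRECTORY = "directory"
    ARRAY = "array"
    ARRAY_GROUP = "array-group"
    OTHER = "other"


# Illustrative assumption only: which source kinds an adapter accepts.
# Inferred from the examples (a ZIP archive is a file; a Zarr v3 node
# lives at a directory), not taken from the scheme specifications.
ADAPTER_SOURCE_KINDS = {
    "zip": {ResourceKind.FILE},
    "zarr3": {ResourceKind.DIRECTORY},
}


def accepts_source(adapter_scheme: str, source_kind: ResourceKind) -> bool:
    """Return True if the adapter may be applied to a source of this kind."""
    allowed = ADAPTER_SOURCE_KINDS.get(adapter_scheme.lower())
    if allowed is None:
        raise KeyError(f"unknown adapter scheme: {adapter_scheme!r}")
    return source_kind in allowed
```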

Absolute URL syntax

An absolute URL pipeline is defined by the following ABNF grammar:

absolute-url-pipeline = root-sub-url *( "|" adapter-sub-url )
root-sub-url          = nonstandard-sub-url
adapter-sub-url       = nonstandard-sub-url
; Additional alternatives for <root-sub-url> and <adapter-sub-url> are
; defined in specifications of individual schemes.

The <root-sub-url> specifies an absolute resource location using a defined standard root scheme or a nonstandard scheme. For example, file:/path/to/local/file refers to a file or directory on the local filesystem.

An <adapter-sub-url> specifies an adapted or transformed resource using a defined standard adapter scheme or a nonstandard scheme. For example, zip:path/within/zip specifies a file or directory within a ZIP archive.

The <root-sub-url> and each of the <adapter-sub-url> portions are considered sub-URLs. For a given <adapter-sub-url>, the sequence of prior sub-URLs is called the base URL.

Note

As in RFC 3986, the scheme portion of a sub-URL is case-insensitive, but the canonical representation is lowercase.

Individual sub-URLs may support query strings (?query); the interpretation of the query string depends on the scheme.

Fragment syntax (#fragment) is not currently supported but may be supported by a future version of this specification.
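
The case rule can be illustrated with a small normalization sketch. The function name canonicalize_schemes is hypothetical; it lowercases only the scheme of each sub-URL and leaves the rest untouched, since paths and query strings may be case-sensitive.

```python
def canonicalize_schemes(pipeline: str) -> str:
    """Lowercase the scheme of every sub-URL in a pipeline.

    Hypothetical helper: everything after the first ":" of each sub-URL
    is preserved exactly as written.
    """
    canonical = []
    for sub_url in pipeline.split("|"):
        scheme, sep, rest = sub_url.partition(":")
        canonical.append(scheme.lower() + sep + rest)
    return "|".join(canonical)
```

For example, canonicalize_schemes("S3://Bucket/Key.ZIP|Zip:Path/|ZARR3:") yields s3://Bucket/Key.ZIP|zip:Path/|zarr3:.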

Root schemes

Adapter schemes

Container formats:

Array data formats:

Compression:

Image formats:

Nonstandard schemes

Implementations of this URL pipeline specification may support additional URL schemes not defined by this specification. To avoid a naming conflict with schemes that may be defined by future versions of this specification, any such non-standard URL schemes must conform to the following ABNF grammar:

nonstandard-sub-url  = nonstandard-scheme ":" scheme-specific-part
nonstandard-scheme   = vendor "." scheme
vendor               = ALPHA *( ALPHA / DIGIT / "-" )
scheme-specific-part = authority-and-path [ "?" query ]
authority-and-path   = *( pchar / "/" / "[" / "]" )

This is based on the following definitions from RFC 3986:

scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
query       = *( pchar / "/" / "?" )
pct-encoded = "%" HEXDIG HEXDIG
pchar       = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
            / "*" / "+" / "," / ";" / "="
host        = ipv4address
            / ipv6address
            / 1*( unreserved / pct-encoded )
; For simplicity, a simplified (more permissive) definition of <host> is used.
ipv4address = ipv4seg "." ipv4seg "." ipv4seg "." ipv4seg
ipv4seg     = 1*3DIGIT
ipv6address = "[" 3*( HEXDIG / ":" / "." ) "]"
port        = *DIGIT

The vendor must be some identifier that is unambiguously associated with the author of the scheme, such as tensorstore or zarr-python.

For example:

  • vendor1-2.custom-1+proto.ext://[authority]/path?query/part?x

  • http://example.com/file|vendor1-2.custom+adapter:/path/within/adapter

  • a.b:?
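
The grammar above can be transcribed into a regular expression, as in the following sketch. The function name is_nonstandard_sub_url is hypothetical; the character classes follow the ABNF rules quoted above, and the check applies to a single sub-URL, so a full pipeline must first be split on |.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
VENDOR = r"[A-Za-z][A-Za-z0-9\-]*"       # ALPHA *( ALPHA / DIGIT / "-" )
SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"     # RFC 3986 scheme
AUTHORITY_AND_PATH = rf"(?:{PCHAR}|[/\[\]])*"
QUERY = rf"(?:{PCHAR}|[/?])*"

NONSTANDARD_SUB_URL = re.compile(
    rf"(?P<vendor>{VENDOR})\.(?P<scheme>{SCHEME}):"
    rf"(?P<rest>{AUTHORITY_AND_PATH})(?:\?(?P<query>{QUERY}))?"
)


def is_nonstandard_sub_url(sub_url: str) -> bool:
    """Return True if sub_url matches the nonstandard-sub-url grammar."""
    return NONSTANDARD_SUB_URL.fullmatch(sub_url) is not None


m = NONSTANDARD_SUB_URL.fullmatch(
    "vendor1-2.custom-1+proto.ext://[authority]/path?query/part?x"
)
# m["vendor"] is "vendor1-2" and m["scheme"] is "custom-1+proto.ext"
```

Because the vendor may not contain a period, the first period always separates the vendor from the scheme, even though the scheme itself may contain further periods.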

Format auto-detection

Implementations MAY support format auto-detection for certain adapter schemes.

For a given base URL specifying a file or directory resource, and a desired target resource kind, the implementation determines a set of matching adapter URLs:

  • For a base file resource, this is typically done by reading a prefix and/or suffix of the file in order to match expected signatures;

  • For a base directory resource, this is typically done by checking for the presence of certain files.

Given a base URL specifying a file or directory resource, to obtain a desired target resource kind (e.g. array) using format auto-detection, the implementation:

  1. Determines the set of matching adapter URLs for the current base URL. If there is exactly one match, appends the matching adapter to the current base URL to obtain a new base URL. Otherwise, returns an error.

  2. If the new base URL has one of the desired target resource kinds, returns the new base URL as the successful format auto-detection result. Otherwise, repeats from step 1 with the new base URL as the current base URL.
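
The two steps above can be sketched as a loop. Everything named here is an assumption for illustration: auto_detect, the matching_adapters callback (which stands in for signature or file-presence checks), the kind_of callback, and the max_steps guard, which this specification does not require. The toy tables exist only to exercise the loop.

```python
from typing import Callable, Collection, Sequence


def auto_detect(
    base_url: str,
    target_kinds: Collection[str],
    matching_adapters: Callable[[str], Sequence[str]],
    kind_of: Callable[[str], str],
    max_steps: int = 16,  # assumption: guard against endless chains
) -> str:
    """Append uniquely matching adapters until a target kind is reached."""
    current = base_url
    for _ in range(max_steps):
        matches = matching_adapters(current)
        if len(matches) != 1:
            # Step 1: zero or multiple matches is an error.
            raise ValueError(
                f"expected exactly one adapter for {current!r}, "
                f"found {len(matches)}"
            )
        current = f"{current}|{matches[0]}"
        # Step 2: stop once the new base URL has a desired kind.
        if kind_of(current) in target_kinds:
            return current
    raise ValueError(f"no target kind reached after {max_steps} steps")


# Toy stand-ins for real detection, keyed by base URL.
TOY_MATCHES = {
    "file:///tmp/a.zip": ["zip:"],
    "file:///tmp/a.zip|zip:": ["zarr3:"],
}
TOY_KINDS = {
    "file:///tmp/a.zip|zip:": "directory",
    "file:///tmp/a.zip|zip:|zarr3:": "array",
}

resolved = auto_detect(
    "file:///tmp/a.zip",
    {"array"},
    lambda url: TOY_MATCHES.get(url, []),
    lambda url: TOY_KINDS[url],
)
```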

Context-dependent URL pipeline interpretation

For user convenience, implementations MAY interpret URL pipelines in a context-dependent way. For example, consider the following hypothetical APIs (which may not all be part of the same software):

  • open_array: opens an arbitrary array from a URL

    If passed a URL that resolves to a file or directory resource, performs format auto-detection to obtain an array resource.

    If format auto-detection fails or the resultant resource is not an array, fails with an error.

    Otherwise, opens the resolved URL as an array.

  • open_zarr_array: opens a Zarr array from a URL with format auto-detection

    Same as open_array, except that if the resolved array resource is not a Zarr array, fails.

  • open_zarr_array_without_auto_detection: opens a Zarr array without format auto-detection

    If passed a URL that resolves to a file resource, fails with an error.

    If passed a URL that resolves to a directory resource, appends the zarr: adapter and opens it.

    If passed a URL that resolves to an array resource, opens it and fails if it is not in Zarr format.

  • open_kvstore: opens a key-value store file or directory from a URL

    If passed a URL that resolves to a file or directory resource, opens it.

    Otherwise, returns an error.

  • open_file: opens a file from a URL

    If passed a URL that resolves to a file resource, opens it.

    Otherwise, returns an error.
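
As an illustration of one of these hypothetical APIs, the URL resolution performed by open_zarr_array_without_auto_detection might look like the sketch below. The function name and the string-valued kind argument are assumptions, and rejecting kinds other than directory and array is a choice of this sketch; the description above only covers file, directory, and array resources.

```python
def resolve_zarr_without_auto_detection(url: str, kind: str) -> str:
    """Return the pipeline to open, given the resource kind url resolves to.

    Hypothetical helper: the caller is assumed to have determined `kind`
    already, and must still verify that an array resource is in Zarr format.
    """
    if kind == "directory":
        # A directory is interpreted as Zarr by appending the adapter.
        return url + "|zarr:"
    if kind == "array":
        # Already an array; opened as-is, format checked by the caller.
        return url
    raise ValueError(f"cannot open a {kind} resource as a Zarr array: {url!r}")
```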

Important

For interoperability with other software, implementations that perform format auto-detection SHOULD report back the fully-resolved URL pipeline rather than the original URL pipeline through APIs and user interfaces when possible. For example:

  • In interactive applications URL pipelines entered into a text entry field could be replaced with the fully-resolved URL pipeline after format auto-detection completes.
  • In a Python library, after any format auto-detection completes, the repr of an open handle to a resource should report the fully-resolved URL pipeline rather than the original URL pipeline.
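
For the Python case, a handle type might keep both URLs and expose the resolved one in its repr, as in this sketch. The class ArrayHandle and its attribute names are hypothetical and do not correspond to any particular library.

```python
class ArrayHandle:
    """Hypothetical open handle that remembers how its URL was resolved."""

    def __init__(self, requested_url: str, resolved_url: str) -> None:
        self.requested_url = requested_url  # what the user typed
        self.resolved_url = resolved_url    # after format auto-detection

    def __repr__(self) -> str:
        # Report the fully-resolved pipeline so it can be reused elsewhere
        # without repeating auto-detection.
        return f"ArrayHandle({self.resolved_url!r})"


handle = ArrayHandle("file:///tmp/a.zip", "file:///tmp/a.zip|zip:|zarr3:")
```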

Rationale

This specification takes into account several key requirements on the URL syntax:

  • Must support specifying a specific Zarr array or group, contained within some other storage format.

  • Must support nested storage formats, like one or more layers of a ZIP archive within some other storage system.

  • Must be compatible with interactive completion as the user types.

  • Must be extensible for use with non-Zarr formats.

The use of outer-to-inner order for the sub-URLs enables completion of both paths and sub-URL schemes as the user types.

The sub-URL delimiter of | was chosen because it is not a valid URL character, and therefore does not have any existing valid interpretation within URLs, and also is evocative of POSIX shell pipe syntax.

Implementations

Related Work

fsspec

The fsspec library is widely used with the zarr-python library to access a variety of storage systems, and includes support for ZIP files and other nested stores.

Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs separated by a delimiter, but differs as follows:

  • fsspec uses a delimiter of ::, while this specification uses |.
  • fsspec orders sub-URLs from innermost to outermost, while this specification orders from outermost to innermost.

The use of :: as a delimiter of the sub-URLs means that fsspec URLs may conform to the syntax of a normal URL, because :: is permitted within the path, query, and fragment components of a URL. This has both advantages and disadvantages:

  • An fsspec URL may be accepted by existing URL parsers/matchers not specifically designed for fsspec.
  • Because the interpretation of the :: delimiter within an fsspec URL differs from the normal interpretation within a URL, operations such as relative path resolution designed to operate on URLs generically may execute without errors on an fsspec URL but produce an incorrect result. In contrast, the use of | within this proposal ensures that the resultant syntax will not be confused with a valid regular URL, because | is not a permitted character within URLs.

The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible with the usual operation of text completion as the user types. It is also opposite to the outer-to-inner order used for specifying paths within URLs.

Apache Commons VFS

The Apache Commons VFS is a Java library that provides capabilities similar to those of the fsspec Python library.

The Apache Commons VFS URL syntax specifies the base scheme and all of the sub-schemes, in inner-to-outer order, delimited by :, followed by the paths for each scheme, in outer-to-inner order, delimited by !.

For example:

  • http://somehost/downloads/somefile.zip|zip: under this specification is equivalent to the Apache Commons VFS URI zip:http://somehost/downloads/somefile.zip.

As with the fsspec syntax, this URL syntax conforms to the standard URL syntax but has a different interpretation, which has both advantages and disadvantages.

Separating the adapter scheme from the adapter path makes the association of adapter and path less obvious, particularly if there is more than one adapter.

While the outer-to-inner order of the nested paths makes text completion of the paths feasible, the URL syntax is not readily compatible with completion of the nested schemes.

Java Jar URLs

Java Jar URLs provide a way to refer to a resource within a JAR (Java Archive) file.

The syntax is jar:<url>!/{entry}, where <url> is the URL of the JAR file, and {entry} is the path within the JAR.

For example:

  • http://example.com/archive.jar|zip:path/to/file.txt under this specification is equivalent to the Java Jar URL jar:http://example.com/archive.jar!/path/to/file.txt.

Like Apache Commons VFS, the Java Jar URL syntax uses a ! delimiter and embeds the full URL of the container, but it is specific to the JAR format (which is based on ZIP) and does not generalize to arbitrary nested containers or transformations in the same uniform way as this specification.
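
The correspondence in the example above can be sketched as a conversion function. The name jar_url_to_pipeline is hypothetical; the sketch handles only a single level of nesting and splits at the last !/ separator, which is an assumption of this example rather than a statement about how Java parses such URLs.

```python
def jar_url_to_pipeline(jar_url: str) -> str:
    """Convert jar:<url>!/{entry} into the pipeline <url>|zip:{entry}."""
    prefix = "jar:"
    if not jar_url.startswith(prefix):
        raise ValueError(f"not a jar URL: {jar_url!r}")
    container, sep, entry = jar_url[len(prefix):].rpartition("!/")
    if not sep:
        raise ValueError(f"missing '!/' separator: {jar_url!r}")
    return f"{container}|zip:{entry}"
```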

GDAL Virtual File Systems

GDAL's virtual file systems (https://gdal.org/user/virtual_file_systems.html) use a path syntax rather than a URL syntax. Nested schemes are supported using { and } as delimiters.

For example:

  • https://host/archive.zip|zip:path/in/outer.zip|zip:path/in/inner under this specification is equivalent to the GDAL path /vsizip/{/vsizip/{/vsicurl/https://host/archive.zip}/path/in/outer.zip}/path/in/inner.

GVfs (GNOME Virtual File System)

The GVfs archive backend supports accessing the contents of archive files using the archive: scheme.

The syntax is archive://<container-uri>/<path>, where <container-uri> is the URL-encoded URI of the container file.

For example:

  • file:///path/to/archive.zip|zip:path/within/archive under this specification is equivalent to archive://file%3A%2F%2F%2Fpath%2Fto%2Farchive.zip/path/within/archive.

This approach requires the container URL to be URL-encoded, which reduces readability and makes manual entry difficult compared to the pipeline syntax.
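
The encoding step can be reproduced with the Python standard library, as in this sketch. The function name gvfs_archive_uri is hypothetical, and the sketch simply mirrors the single level of percent-encoding shown in the example above.

```python
from urllib.parse import quote


def gvfs_archive_uri(container_uri: str, path: str) -> str:
    """Build archive://<encoded container URI>/<path> as in the example.

    safe="" makes quote() percent-encode ":" and "/" as well, so the whole
    container URI fits into the authority position.
    """
    return "archive://" + quote(container_uri, safe="") + "/" + path


uri = gvfs_archive_uri("file:///path/to/archive.zip", "path/within/archive")
```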

Versioning

This specification as a whole is assigned an integer version number, as are the specifications for individual root and adapter schemes.

Any future versions of this specification will satisfy the following guarantees:

  1. Any URL pipeline that is valid under the current specification will remain valid, with the same interpretation, under any future version of the specification. The only exception is minor corrections that are believed to be highly unlikely to impact any actual usage.

  2. Future versions of the specification may extend the allowed syntax of URL pipelines. The version number of the specification as a whole will be incremented in such cases.

  3. Future versions of the specification of individual schemes may extend the allowed syntax for the scheme. The version number of the scheme will be incremented in such cases.

  4. The version number of the specification as a whole, or of individual schemes, may not be incremented for purely editorial changes.

Contributing

To make changes to this specification, and in particular to propose a new scheme, refer to CONTRIBUTING.md for detailed guidelines.

Copyright

© 2025 The URL Pipeline Specification Authors.

This specification is licensed under CC BY 4.0.
