- Version: 0 (draft)
- Editor: Jeremy Maitin-Shepard <jeremy@jeremyms.com>
Warning
This specification is still in the draft stage. Incompatible changes, both to the specification as a whole and to individual schemes, may still be made until it advances to version 1.
This specification defines a new URL pipeline syntax for specifying how to locate data resources within possibly nested container formats.
For example:
- s3://bucket/path/to/archive.zip|zip:path/within/zip.zarr/|zarr3:

  Refers to a Zarr v3 node at the directory named path/within/zip.zarr/ within a ZIP file at the path path/to/archive.zip within the AWS S3 bucket named bucket.
- file:///tmp/dataset.ocdbt/|ocdbt://2025-01-01T01:23:45.678Z/path/within/database

  Refers to a file at path/within/database as of commit time 2025-01-01T01:23:45.678Z within an OCDBT database stored at the local filesystem path of /tmp/dataset.ocdbt/.
- s3+https://example.com/path/to/database.icechunk/|icechunk://tag.v5/path/to/node/|zarr3:

  Refers to the Zarr v3 node at path/to/node/ in tag v5 in the Icechunk database at /path/to/database.icechunk on the S3-compatible server https://example.com.
Existing well-established URL schemes like file: or http:, and existing but
less-standard URL schemes like s3: and gs:, are sufficient for specifying
the location of a resource that is directly available as a local file, from an
HTTP server, or on cloud storage.
However, for specifying a resource within nested storage mechanisms like a ZIP archive, OCDBT database, Icechunk repository or Zarr hierarchy, or for specifying a data format explicitly, there is no established URL syntax.
This specification defines a new URL pipeline syntax intended to allow the locations of data resources of various kinds, within arbitrarily nested container formats, to be specified in a uniform way.
An absolute URL pipeline consists of a sequence of
sub-URLs separated by | (vertical bar) characters. The first sub-URL has the
semantics of a conventional URL and must have a root scheme
that specifies the location of a data resource.
Subsequent sub-URLs must have adapter schemes that transform
the base data resource specified by the sequence of prior sub-URLs (called the
base URL) into a new adapted data resource.
Note
The URL pipeline syntax defined by this specification is explicitly an extension of the standard URI syntax defined by RFC 3986. A pipeline consisting of just a single sub-URL conforms to the syntax and semantics of a standard URI, but a pipeline with more than one sub-URL does not.
Relative URL pipelines are not currently supported by this specification, but are expected to be added in a future version.
A URL pipeline may refer to several different kinds of data resources:
- file: Single raw file (sequence of bytes), no specific data format.
- directory: Single directory hierarchy within a key-value store, no specific data format.
- array: Multi-dimensional array with a defined format, e.g. a Zarr array or HDF5 array.
- array-group: Hierarchy of multi-dimensional arrays, e.g. a Zarr group or HDF5 group.
- other: Other types of data resources for which this specification does not yet give a precise categorization, such as Neuroglancer precomputed mesh, skeleton, or annotation datasets.
Root schemes specify the kinds of data resources to which they may refer.
Adapter schemes specify constraints on their source data resources and how the adapted resource kind relates to the source resource kind.
In some cases the resource kind can be determined syntactically from the URL alone, while in other cases it can only be resolved by actually accessing the resource.
An absolute URL pipeline is defined by the following ABNF grammar:
absolute-url-pipeline = root-sub-url *( "|" adapter-sub-url )
root-sub-url = nonstandard-sub-url
adapter-sub-url = nonstandard-sub-url
; Additional alternatives for <root-sub-url> and <adapter-sub-url> are
; defined in specifications of individual schemes.

The <root-sub-url> specifies an absolute resource location using a defined
standard root scheme or a
nonstandard scheme. For example,
file:/path/to/local/file refers to a file or directory on the local
filesystem.
An <adapter-sub-url> specifies an adapted or transformed resource using a
defined standard adapter scheme or a
nonstandard scheme. For example, zip:path/within/zip
specifies a file or directory within a ZIP archive.
The <root-sub-url> and each of the <adapter-sub-url> portions are considered
sub-URLs. For a given <adapter-sub-url>, the sequence of prior sub-URLs is
called the base URL.
Note
As in RFC 3986, the scheme portion of a sub-URL is case-insensitive, but the canonical representation is lowercase.
Individual sub-URLs may support query strings (?query); the interpretation of
the query string depends on the scheme.
Fragment syntax (#fragment) is not currently supported but may be supported by
a future version of this specification.
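The top-level grammar above can be illustrated with a minimal Python sketch that splits an absolute URL pipeline into its sub-URLs. The names `SubUrl` and `split_pipeline` are hypothetical and not part of this specification, and the sketch does not validate the scheme-specific part against any individual scheme's grammar.

```python
from typing import List, NamedTuple, Optional


class SubUrl(NamedTuple):
    scheme: str           # lowercased, since schemes are case-insensitive
    rest: str             # scheme-specific part, excluding any query string
    query: Optional[str]  # text after the first "?", or None if absent


def split_pipeline(pipeline: str) -> List[SubUrl]:
    """Split an absolute URL pipeline on "|" into (scheme, rest, query) parts."""
    sub_urls = []
    for part in pipeline.split("|"):
        # The scheme is everything before the first ":" of the sub-URL.
        scheme, colon, rest = part.partition(":")
        if not colon or not scheme:
            raise ValueError(f"sub-URL {part!r} has no scheme")
        # The query string, if any, starts at the first "?".
        rest, qmark, query = rest.partition("?")
        sub_urls.append(SubUrl(scheme.lower(), rest, query if qmark else None))
    return sub_urls


parts = split_pipeline("s3://bucket/path/to/archive.zip|zip:path/within/zip.zarr/|zarr3:")
print([p.scheme for p in parts])  # ['s3', 'zip', 'zarr3']
```

In this sketch the first element corresponds to the root sub-URL, and the base URL of the adapter at index i is the `"|"`-joined text of the elements before it.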
- file: for local files
- gs: for Google Cloud Storage (GCS)
- http: and https: for HTTP servers
- memory: for ephemeral in-memory storage
- s3: for AWS S3
- s3+http: and s3+https: for S3-compatible endpoints
Container formats:
- byte-range: for specifying a byte range within a file
- icechunk: for Icechunk repositories
- ocdbt: for OCDBT databases
- zip: for ZIP archives
Array data formats:
- hdf5: for HDF5 files
- json: for accessing a value within a JSON document
- n5: for N5 arrays and groups
- neuroglancer-precomputed: for Neuroglancer precomputed datasets
- zarr:, zarr2:, and zarr3: for Zarr arrays and groups
Compression:
- gzip: for Gzip-compressed files
- zstd: for Zstd-compressed files
Image formats:
- avif: for AVIF images
- bmp: for BMP images
- jpeg: for JPEG images
- png: for PNG images
- tiff: for TIFF images
- webp: for WebP images
Implementations of this URL pipeline specification may support additional URL schemes not defined by this specification. To avoid a naming conflict with schemes that may be defined by future versions of this specification, any such non-standard URL schemes must conform to the following ABNF grammar:
nonstandard-sub-url = nonstandard-scheme ":" scheme-specific-part
nonstandard-scheme = vendor "." scheme
vendor = ALPHA *( ALPHA / DIGIT / "-" )
scheme-specific-part = authority-and-path [ "?" query ]
authority-and-path = *( pchar / "/" / "[" / "]" )

This is based on the following definitions from RFC 3986:
scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
query = *( pchar / "/" / "?" )
pct-encoded = "%" HEXDIG HEXDIG
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
host = ipv4address
/ ipv6address
/ 1*( unreserved / pct-encoded )
; For simplicity, a simplified (more permissive) definition of <host> is used.
ipv4address = ipv4seg "." ipv4seg "." ipv4seg "." ipv4seg
ipv4seg = 1*3DIGIT
ipv6address = "[" 3*( HEXDIG / ":" / "." ) "]"
port = *DIGIT

The vendor must be some identifier that is unambiguously associated with the
author of the scheme, such as tensorstore or zarr-python.
For example:
- vendor1-2.custom-1+proto.ext://[authority]/path?query/part?x
- http://example.com/file|vendor1-2.custom+adapter:/path/within/adapter
- a.b:?
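The `nonstandard-sub-url` grammar can be transcribed into a regular expression. The following Python sketch is one possible transcription of the ABNF rules above. The names `NONSTANDARD_SUB_URL` and `parse_nonstandard_sub_url` are hypothetical, and the pattern checks a single sub-URL only, not a whole pipeline.

```python
import re

# pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
_PCHAR = r"(?:[A-Za-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-Fa-f]{2})"
# vendor = ALPHA *( ALPHA / DIGIT / "-" )
_VENDOR = r"[A-Za-z][A-Za-z0-9\-]*"
# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
_SCHEME = r"[A-Za-z][A-Za-z0-9+\-.]*"
# authority-and-path = *( pchar / "/" / "[" / "]" )
_AUTHORITY_AND_PATH = rf"(?:{_PCHAR}|[/\[\]])*"
# query = *( pchar / "/" / "?" )
_QUERY = rf"(?:{_PCHAR}|[/?])*"

NONSTANDARD_SUB_URL = re.compile(
    rf"(?P<vendor>{_VENDOR})\.(?P<scheme>{_SCHEME}):"
    rf"(?P<path>{_AUTHORITY_AND_PATH})(?:\?(?P<query>{_QUERY}))?"
)


def parse_nonstandard_sub_url(sub_url: str):
    """Return the named parts of a nonstandard sub-URL, or None if it does not match."""
    match = NONSTANDARD_SUB_URL.fullmatch(sub_url)
    return match.groupdict() if match else None


print(parse_nonstandard_sub_url("a.b:?"))
# {'vendor': 'a', 'scheme': 'b', 'path': '', 'query': ''}
print(parse_nonstandard_sub_url("zip:path/within/zip"))
# None
```

Because a vendor cannot contain `.`, the first `.` in the scheme always separates the vendor from the rest of the scheme, so the match is unambiguous.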
Implementations MAY support format auto-detection for certain adapter schemes.
For a given base URL specifying a file or directory resource, and a desired
target resource kind, the implementation determines a set of matching adapter
URLs:
- For a base file resource, this is typically done by reading a prefix and/or suffix of the file in order to match expected signatures;
- For a base directory resource, this is typically done by checking for the presence of certain files.
Given a base URL specifying a file or directory resource, to obtain a
desired target resource kind (e.g. array) using format auto-detection, the
implementation:
1. Determines the set of matching adapter URLs for the current base URL. If there is exactly one match, appends the matching adapter to the current base URL to obtain a new base URL. Otherwise, returns an error.
2. If the new base URL has one of the desired target resource kinds, returns the new base URL as the successful format auto-detection result. Otherwise, continues from step 1 with the new base URL as the current base URL.
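The auto-detection procedure above can be sketched as a loop in Python. Everything here is illustrative: `auto_detect`, `find_matches`, and `kind_of` are hypothetical names, the two stub callbacks stand in for real signature and marker-file checks, and the `max_steps` bound is a safeguard added for this sketch rather than something this specification requires.

```python
from typing import Callable, Iterable, List


def auto_detect(
    base_url: str,
    target_kinds: Iterable[str],
    find_matches: Callable[[str], List[str]],  # base URL -> matching adapter sub-URLs
    kind_of: Callable[[str], str],             # URL pipeline -> resource kind
    max_steps: int = 16,                       # assumption: not part of the specification
) -> str:
    targets = set(target_kinds)
    current = base_url
    for _ in range(max_steps):
        # Step 1: exactly one adapter must match the current base URL.
        matches = find_matches(current)
        if len(matches) != 1:
            raise ValueError(
                f"expected exactly one matching adapter for {current!r}, "
                f"found {len(matches)}"
            )
        current = f"{current}|{matches[0]}"
        # Step 2: stop once the adapted resource has a desired kind.
        if kind_of(current) in targets:
            return current
    raise ValueError(f"no target resource kind reached after {max_steps} steps")


# Stub callbacks for demonstration only; they inspect the URL text instead of data.
def fake_find_matches(url: str) -> List[str]:
    if url.endswith(".zip"):
        return ["zip:"]
    if url.endswith("|zip:"):
        return ["zarr3:"]
    return []


def fake_kind_of(url: str) -> str:
    if url.endswith("|zarr3:"):
        return "array"
    if url.endswith("|zip:"):
        return "directory"
    return "file"


print(auto_detect("file:///tmp/data.zip", {"array"}, fake_find_matches, fake_kind_of))
# file:///tmp/data.zip|zip:|zarr3:
```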
For user convenience, implementations MAY interpret URL pipelines in a context-dependent way. For example, consider the following hypothetical APIs (which may not all be part of the same software):
- open_array: opens an arbitrary array from a URL

  If passed a URL that resolves to a file or directory resource, performs format auto-detection to obtain an array resource. If format auto-detection fails or the resultant resource is not an array, fails with an error. Otherwise, opens the resolved URL as an array.
- open_zarr_array: opens a Zarr array from a URL with format auto-detection

  Same as open_array, except that if the resolved array resource is not a Zarr array, fails.
- open_zarr_array_without_auto_detection: opens a Zarr array without format auto-detection

  If passed a URL that resolves to a file resource, fails with an error. If passed a URL that resolves to a directory resource, appends the zarr: adapter and opens it. If passed a URL that resolves to an array resource, opens it and fails if it is not in Zarr format.
- open_kvstore: opens a key-value store file or directory from a URL

  If passed a URL that resolves to a file or directory resource, opens it. Otherwise, returns an error.
- open_file: opens a file from a URL

  If passed a URL that resolves to a file resource, opens it. Otherwise, returns an error.
Important
For interoperability with other software, implementations that perform format auto-detection SHOULD report back the fully-resolved URL pipeline rather than the original URL pipeline through APIs and user interfaces when possible. For example:
- In interactive applications, URL pipelines entered into a text entry field could be replaced with the fully-resolved URL pipeline after format auto-detection completes.
- In a Python library, after any format auto-detection completes, the repr of an open handle to a resource should report the fully-resolved URL pipeline rather than the original URL pipeline.
This specification takes into account several key requirements on the URL syntax:
-
Must support specifying a specific Zarr array or group, contained within some other storage format.
-
Must support nested storage formats, like one or more layers of a ZIP archive within some other storage system.
-
Must be compatible with interactive completion as the user types.
-
Must be extensible for use with non-Zarr formats.
The use of outer-to-inner order for the sub-URLs enables completion of both paths and sub-URL schemes as the user types.
The sub-URL delimiter of | was chosen because it is not a valid URL character,
and therefore does not have any existing valid interpretation within URLs, and
also is evocative of POSIX shell pipe syntax.
- TensorStore (https://google.github.io/tensorstore/spec.html#json-TensorStoreUrl)

  Format auto-detection is also implemented.
- Neuroglancer (https://neuroglancer-docs.web.app/datasource/index.html#url-syntax)

  Format auto-detection is also implemented.
- zarr-python (zarr-developers/zarr-python#3369)
The fsspec library is widely used with the zarr-python library to access a variety of storage systems, and includes support for ZIP files and other nested stores.
Like this proposal, the fsspec URL syntax consists of a sequence of sub-URLs separated by a delimiter, but differs as follows:
- fsspec uses a delimiter of ::, while this specification uses |.
- fsspec orders sub-URLs from innermost to outermost, while this specification orders from outermost to innermost.
The use of :: as a delimiter of the sub-URLs means that fsspec URLs may
conform to the syntax of a normal URL, because :: is permitted within the
path, query, and fragment components of a URL. This has both advantages and
disadvantages:
- An fsspec URL may be accepted by existing URL parsers/matchers not specifically designed for fsspec.
- Because the interpretation of the :: delimiter within an fsspec URL differs from the normal interpretation within a URL, operations such as relative path resolution designed to operate on URLs generically may execute without errors on an fsspec URL but produce an incorrect result. In contrast, the use of | within this proposal ensures that the resultant syntax will not be confused with a valid regular URL, because | is not a permitted character within URLs.
The inner-to-outer order of sub-URLs in the fsspec URL syntax is not compatible with the usual operation of text completion as the user types. It is also opposite to the outer-to-inner order used for specifying paths within URLs.
The Apache Commons VFS is a Java library that provides capabilities similar to those of the fsspec Python library.
The Apache Commons VFS URL syntax specifies the base scheme and all of the
sub-schemes, in inner-to-outer order, delimited by :, followed by the paths
for each scheme, in outer-to-inner order, delimited by !.
For example:
http://somehost/downloads/somefile.zip|zip: under this specification is equivalent to the Apache Commons VFS URI zip:http://somehost/downloads/somefile.zip.
As with the fsspec syntax, this URL syntax conforms to the standard URL syntax but has a different interpretation, which has both advantages and disadvantages.
Separating the adapter scheme from the adapter path makes the association of adapter and path less obvious, particularly if there is more than one adapter.
While the outer-to-inner order of the nested paths makes text completion of the paths feasible, the URL syntax is not readily compatible with completion of the nested schemes.
Java Jar URLs provide a way to refer to a resource within a JAR (Java Archive) file.
The syntax is jar:<url>!/{entry}, where <url> is the URL of the JAR file,
and {entry} is the path within the JAR.
For example:
http://example.com/archive.jar|zip:path/to/file.txt under this specification is equivalent to the Java Jar URL jar:http://example.com/archive.jar!/path/to/file.txt.
Like Apache Commons VFS, the Java Jar URL syntax uses a ! delimiter and embeds
the full URL of the container, but it is specific to the JAR format (which is
based on ZIP) and does not generalize to arbitrary nested containers or
transformations in the same uniform way as this specification.
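The correspondence with Java Jar URLs described above can be sketched in Python as a small conversion. The function name `to_jar_url` is hypothetical, and the sketch only handles a single-sub-URL base followed by one zip: adapter; it rejects anything else rather than guessing at a mapping.

```python
def to_jar_url(pipeline: str) -> str:
    """Convert "<url>|zip:<entry>" to the Java Jar URL "jar:<url>!/<entry>"."""
    base, bar, adapter = pipeline.rpartition("|")
    scheme, colon, entry = adapter.partition(":")
    # Schemes are case-insensitive, so compare in lowercase.
    if not bar or not colon or scheme.lower() != "zip":
        raise ValueError("expected a pipeline ending in a zip: sub-URL")
    if "|" in base:
        # Limitation of this sketch: only a plain URL is accepted as the base.
        raise ValueError("nested base pipelines are not handled by this sketch")
    return f"jar:{base}!/{entry}"


print(to_jar_url("http://example.com/archive.jar|zip:path/to/file.txt"))
# jar:http://example.com/archive.jar!/path/to/file.txt
```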
GDAL virtual file systems (https://gdal.org/user/virtual_file_systems.html) use a path syntax rather than a URL syntax. Nested schemes are supported
using { and } as delimiters.
For example:
https://host/archive.zip|zip:path/in/outer.zip|zip:path/in/inner under this specification is equivalent to the GDAL path /vsizip/{/vsizip/{/vsicurl/https://host/archive.zip}/path/in/outer.zip}/path/in/inner.
The GVfs archive backend supports
accessing the contents of archive files using the archive: scheme.
The syntax is archive://<container-uri>/<path>, where <container-uri> is the
URL-encoded URI of the container file.
For example:
file:///path/to/archive.zip|zip:path/within/archive under this specification is equivalent to archive://file%3A%2F%2F%2Fpath%2Fto%2Farchive.zip/path/within/archive.
This approach requires the container URL to be URL-encoded, which reduces readability and makes manual entry difficult compared to the pipeline syntax.
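The encoding step that the archive: syntax requires can be made concrete with a short Python sketch using the standard library's `urllib.parse.quote`. The function name `to_gvfs_archive_uri` is hypothetical, and the sketch assumes a single-sub-URL base followed by one zip: adapter, matching the example above.

```python
from urllib.parse import quote


def to_gvfs_archive_uri(pipeline: str) -> str:
    """Convert "<url>|zip:<path>" to "archive://<percent-encoded url>/<path>"."""
    base, bar, adapter = pipeline.rpartition("|")
    scheme, colon, path = adapter.partition(":")
    if not bar or not colon or scheme.lower() != "zip":
        raise ValueError("expected a pipeline ending in a zip: sub-URL")
    if "|" in base:
        # Limitation of this sketch: only a plain URL is accepted as the base.
        raise ValueError("nested base pipelines are not handled by this sketch")
    # safe="" percent-encodes ":" and "/" as well, so the whole container URI
    # fits in the authority position of the archive: URI.
    return f"archive://{quote(base, safe='')}/{path}"


print(to_gvfs_archive_uri("file:///path/to/archive.zip|zip:path/within/archive"))
# archive://file%3A%2F%2F%2Fpath%2Fto%2Farchive.zip/path/within/archive
```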
This specification as a whole is assigned an integer version number, as are the specifications for individual root and adapter schemes.
Any future versions of this specification will satisfy the following guarantees:
- Any URL pipeline that is valid under the current specification will remain valid, with the same interpretation, under any future version of the specification. The only exception is minor corrections that are believed to be highly unlikely to impact any actual usage.
- Future versions of the specification may extend the allowed syntax of URL pipelines. The version number of the specification as a whole will be incremented in such cases.
- Future versions of the specification of individual schemes may extend the allowed syntax for the scheme. The version number of the scheme will be incremented in such cases.
- The version number of the specification as a whole, or of individual schemes, may not be incremented for purely editorial changes.
To make changes to this specification, and in particular to propose a new scheme, refer to CONTRIBUTING.md for detailed guidelines.
© 2025 The URL Pipeline Specification Authors.
This specification is licensed under CC BY 4.0.