spec: describe components of EBNF grammar

To clarify the grammar definitions, we define the subset of EBNF used by this specification to specify various field formats.

Signed-off-by: Stephen J Day <stephen.day@docker.com>
opencontainers · Jun 21, 2017 · b692dee
1 parent 6772079

Showing 3 changed files with 115 additions and 11 deletions.
diff --git a/annotations.md b/annotations.md
@@ -31,12 +31,12 @@ This specification defines the following annotation keys, intended for but not l
 * **org.opencontainers.image.ref.name** Name of the reference for a target (string).
   * SHOULD only be considered valid when on descriptors on `index.json` within [image layout](image-layout.md).
   * Character set of the value SHOULD conform to alphanum of `A-Za-z0-9` and separator set of `-._:@/+`
-  * An EBNF'esque grammar + regular expression like:
+  * The reference must match the following [grammar](considerations.md#ebnf):
     ```
-    ref := component ["/" component]*
-    component := alphanum [separator alphanum]*
-    alphanum := /[A-Za-z0-9]+/
-    separator := /[-._:@+]/ | "--"
+    ref       ::= component ("/" component)*
+    component ::= alphanum (separator alphanum)*
+    alphanum  ::= [A-Za-z0-9]+
+    separator ::= [-._:@+] | "--"
     ```
 * **org.opencontainers.image.title** Human-readable title of the image (string)
 * **org.opencontainers.image.description** Human-readable description of the software packaged in the image (string)
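
Review note: the revised `ref` grammar in this hunk is regular, so it can be checked with one regular expression. The following is a minimal sketch, assuming a hand transcription of the four rules into Python's `re` syntax. The names `REF` and `is_valid_ref_name` are hypothetical and do not come from the specification.

```python
import re

# Hand transcription of the grammar in the hunk above (illustrative only):
#   ref       ::= component ("/" component)*
#   component ::= alphanum (separator alphanum)*
#   alphanum  ::= [A-Za-z0-9]+
#   separator ::= [-._:@+] | "--"
ALPHANUM = r"[A-Za-z0-9]+"
SEPARATOR = r"(?:[-._:@+]|--)"
COMPONENT = rf"{ALPHANUM}(?:{SEPARATOR}{ALPHANUM})*"
REF = re.compile(rf"{COMPONENT}(?:/{COMPONENT})*")


def is_valid_ref_name(value: str) -> bool:
    """Return True if the whole string matches the `ref` production."""
    # fullmatch anchors the pattern at both ends of the input.
    return REF.fullmatch(value) is not None
```

Under this transcription, a separator must sit between two alphanumeric runs, so leading, trailing, or tripled separators are rejected while a doubled hyphen is accepted.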

diff --git a/considerations.md b/considerations.md
@@ -24,3 +24,107 @@ Implementations:
 [github.com/docker/go]: https://github.com/docker/go/
 [Go]: https://golang.org/
 [JSON]: http://json.org/
+
+# EBNF
+
+For field formats described in this specification, we use a limited subset of [Extended Backus-Naur Form][ebnf], similar to that used by the [XML specification][xmlebnf].
+Grammars present in the OCI specification are regular and can be converted to a single regular expression.
+However, regular expressions are avoided to limit ambiguity between regular expression syntaxes.
+By defining the subset of EBNF used here, the possibility of variation, misunderstanding or ambiguities from linking to a larger specification can be avoided.
+
+Grammars are made up of rules in the following form:
+
+```
+symbol ::= expression
+```
+
+We can say we have the production identified by symbol if the input is matched by the expression.
+Whitespace is completely ignored in rule definitions.
+
+## Expressions
+
+The simplest expression is the literal, surrounded by quotes:
+
+```
+literal ::= "matchthis"
+```
+
+The above expression defines a symbol, "literal", that matches the exact input of "matchthis".
+Character classes are delineated by brackets (`[]`), describing either a set, a range or multiple ranges of characters:
+
+```
+set ::= [abc]
+range ::= [A-Z]
+```
+
+The above symbol "set" would match one character of either "a", "b" or "c".
+The symbol "range" would match any character, "A" to "Z", inclusive.
+Currently, only matching for 7-bit ASCII literals and character classes is defined, as that is all that is required by this specification.
+
+Expressions can be made up of one or more expressions, such that one must be followed by the other.
+This is known as an implicit concatenation operator.
+For example, both `A` and `B` must be matched, in that order, to satisfy the following rule:
+
+```
+symbol ::= A B
+```
+
+Each expression must be matched once and only once, `A` followed by `B`.
+To support the description of repetition and optional match criteria, the postfix operators `*` and `+` are defined.
+`*` indicates that the preceding expression can be matched zero or more times.
+`+` indicates that the preceding expression must be matched one or more times.
+These appear in the following form:
+
+```
+zeroormore ::= expression*
+oneormore ::= expression+
+```
+
+Parentheses are used to group expressions into a larger expression:
+
+```
+group ::= (A B)
+```
+
+As with the simpler expressions above, operators can be applied to groups as well.
+To allow for alternates, we also define the infix operator `|`.
+
+```
+oneof ::= A | B
+```
+
+The above indicates that the expression should match one of the expressions, `A` or `B`.
+
+## Precedence
+
+The operator precedence is in the following order:
+
+- Terminals (literals and character classes)
+- Grouping `()`
+- Unary operators `+*`
+- Concatenation
+- Alternates `|`
+
+The precedence can be better described using grouping to show equivalents.
+Concatenation has higher precedence than alternates, such that `A B | C D` is equivalent to `(A B) | (C D)`.
+Unary operators have higher precedence than alternates and concatenation, such that `A+ | B+` is equivalent to `(A+) | (B+)`.
+
+## Examples
+
+The following combines the previous definitions to match a simple, relative path name, describing the individual components:
+
+```
+path      ::= component ("/" component)*
+component ::= [a-z]+
+```
+
+The production "component" is one or more lowercase letters.
+A "path" is then one component followed by zero or more slash-component pairs.
+The above can be converted into the following regular expression:
+
+```
+[a-z]+(?:/[a-z]+)*
+```
+
+[ebnf]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
+[xmlebnf]: https://www.w3.org/TR/REC-xml/#sec-notation
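
Review note: the conversion described in the Examples section, and the precedence rule for concatenation versus alternates, can be exercised with a short script. This is a sketch under stated assumptions: it uses Python's `re` module, whose operator precedence happens to follow the same order as the list in the Precedence section, and the names `PATH`, `is_path`, `AB_OR_CD` and `A_BC_D` are made up for illustration.

```python
import re

# path      ::= component ("/" component)*
# component ::= [a-z]+
COMPONENT = r"[a-z]+"
PATH = re.compile(rf"{COMPONENT}(?:/{COMPONENT})*")


def is_path(value: str) -> bool:
    """Return True if the whole string matches the `path` production."""
    return PATH.fullmatch(value) is not None


# Precedence check, with single letters standing in for A, B, C and D.
# `A B | C D` reads as `(A B) | (C D)`: concatenation binds before `|`.
AB_OR_CD = re.compile(r"ab|cd")
# The other possible reading, `A (B | C) D`, needs explicit grouping.
A_BC_D = re.compile(r"a(?:b|c)d")
```

Composing the pattern from the `component` rule yields the same expression as the one given in the Examples section, which can be confirmed by comparing `PATH.pattern` against it.
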
diff --git a/descriptor.md b/descriptor.md
@@ -66,14 +66,14 @@ If the _digest_ can be communicated in a secure manner, one can verify content f
 The value of the `digest` property is a string consisting of an _algorithm_ portion and an _encoded_ portion.
 The _algorithm_ specifies the cryptographic hash function and encoding used for the digest; the _encoded_ portion contains the encoded result of the hash function.
 
-A digest string MUST match the following grammar:
+A digest string MUST match the following [grammar](considerations.md#ebnf):
 
 ```
-digest                := algorithm ":" encoded
-algorithm             := algorithm-component [algorithm-separator algorithm-component]*
-algorithm-component   := /[a-z0-9]+/
-algorithm-separator   := /[+._-]/
-encoded               := /[a-zA-Z0-9=_-]+/
+digest                ::= algorithm ":" encoded
+algorithm             ::= algorithm-component (algorithm-separator algorithm-component)*
+algorithm-component   ::= [a-z0-9]+
+algorithm-separator   ::= [+._-]
+encoded               ::= [a-zA-Z0-9=_-]+
 ```
 
 Note that _algorithm_ MAY impose algorithm-specific restriction on the grammar of the _encoded_ portion.
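
Review note: the revised digest grammar can be checked the same way. The sketch below assumes a direct transcription of the five rules into one Python regular expression with named groups. `DIGEST` and `split_digest` are hypothetical names. The check is syntactic only and does not apply any algorithm-specific restriction on the _encoded_ portion.

```python
import re
from typing import Tuple

# Hand transcription of the grammar in the hunk above (illustrative only):
#   digest              ::= algorithm ":" encoded
#   algorithm           ::= algorithm-component (algorithm-separator algorithm-component)*
#   algorithm-component ::= [a-z0-9]+
#   algorithm-separator ::= [+._-]
#   encoded             ::= [a-zA-Z0-9=_-]+
ALGORITHM_COMPONENT = r"[a-z0-9]+"
ALGORITHM_SEPARATOR = r"[+._-]"
ALGORITHM = rf"{ALGORITHM_COMPONENT}(?:{ALGORITHM_SEPARATOR}{ALGORITHM_COMPONENT})*"
ENCODED = r"[a-zA-Z0-9=_-]+"
DIGEST = re.compile(rf"(?P<algorithm>{ALGORITHM}):(?P<encoded>{ENCODED})")


def split_digest(value: str) -> Tuple[str, str]:
    """Split a digest string into (algorithm, encoded).

    Raises ValueError if the string does not match the `digest` production.
    """
    match = DIGEST.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid digest string: {value!r}")
    return match.group("algorithm"), match.group("encoded")
```

Because neither `algorithm` nor `encoded` can contain `:`, a string with more than one colon is rejected rather than split at an arbitrary position.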