Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposed formal syntax for edn #56

Open
bpsm opened this issue Jul 8, 2013 · 9 comments
Open

Proposed formal syntax for edn #56

bpsm opened this issue Jul 8, 2013 · 9 comments

Comments

@bpsm
Copy link

bpsm commented Jul 8, 2013

My purpose in writing this is to get a discussion started which will hopefully lead to hashing out a formal syntax for edn.

(This grammar is written in slightly extended version of Wirth Syntax Notation. A description is appended to the end of this document. I'm not married to using WSN. I chose it because it's familiar to me, is easily explained and is compact, but translating these productions to some other notation would be no great hardship once errors and ambiguities have been inroned out.)

Points to Consider

This syntax considers code point NUL (U+0) unacceptable in valid edn input.

This syntax forbids supurfluous leading zeros in numeric literals. (Unlike current implementations.)

This syntax does not allow unicode characters outside of the US-ASCII range to appear in symbol or keyword names.

This syntax does not allow < and > to appear in symbol or keyword names.

This syntax supports \backspace and \formfeed, "\b" and "\f", as do current implementations.

Elements

START = elements.

elements = { s element }.

element = nil | boolean | symbol | keyword | number | character
      | string
      | "{" s { element s element s } "}"
      | "#{" elements "}"
      | "[" elements "]"
      | "(" elements ")"
      | "#" tagSymbol s element.

White space is allowed between elements, but not always required. For example: {}{} parses as two empty maps. "MeaningOfLife"42 parses as a String followed by an integer. However, a:b parses as a single symbol, not as the symbol a followed by the keyword :b.

White Space

For parsing we treat comments and discarded elements to be as we do whitespace. This production describes some (possibly zero-length) run of text that is ignored by the parser:

s = { whitespace | comment | discardedElement }.

Edn, like Clojure, considers the comma (,) to be white space.

whitespace = " " | HT | LF | CR | ",".

(The syntax is written to be quite narrow here. Perhaps we should just accept "," and anything in the range U+1 … " " as white space.)

Issue#31 states that only \newline (i.e. U+A) terminates comments.

comment = ";" { commentContent } LF.

Comments can contain any code points, except LF (which terminates the comment). This is compatible with both unix (U+A) and windows (U+D U+A) style line breaks. It's incompatible with the line break style of classic Mac OS (U+D). Presumably that's no great loss.

commentContent = U+1 | … | U+9 | U+B | … | MaxCodePoint.
    (* all characters except NUL, LF. *)

discardedElement = "#_" s element.

HT  = U+9.
LF  = U+A.
CR  = U+D.

MaxCodePoint = U+10FFFF.

Symbols

symbol = "/" | [name  "/"]  name.

Issue#30 states that "tag symbols must begin with an alphabetic character", so we introduce a production for this:

tagSymbol = letter [ name "/" ] name.

Issue#32 says that :/ is not a valid keyword. Otherwise keywords adhere the naming rules of symbols following the initial :.

keyword = ":" [name "/"] name.

Names beginning with a nameStart2 must continue with a letter if they continue at all. This avoids possible ambiguity with numeric literals.

README.md, does not allow < and > to appear in symbol (or keyword) names. clojure.edn/read and clojure.core/read both accept < and > as members of nameStart1. Is this deliberate?

name = nameStart1 { nameConstituent }
     | nameStart2 [ letter { nameConstituent } ].

nameStart1 = "!" | "*" | "?" | "_" | "$" | "%" | "&" | "=" | letter.

nameStart2 = "." | "-" | "+".

nameConstituent = nameStart1 | nameStart2 | digit | "#" | ":".

letter = "a" | … | "z" | "A" | … | "Z".

digit = "0" | nonZeroDigit.

nonZeroDigit = "1" | … | "9".

The words true, false and nil look syntactically like symbols, but are not parsed as such. I've given them their own productions to hint at this:

boolean = "true" | "false".

nil = "nil".

(But, that makes this grammar ambiguous since there are now two ways of parsing "true", "false" and "nil".)

Numbers

number = integer | float.

This syntax does not allow supurfluous leading zeros in integers.

README.md (the current informal standard) does allow leading zeros.

clojure.core/read and clojure.edn/read both allow leading zeros here but interpret the remaining digits in base 8! See also Issue#33.

edn-java allows leading zeros, but gives them no special meaning.

integer = [ "+" | "-" ] cardinal [ "N" ].

The float syntax disagrees with the formal syntax from README.md, but does so in order to comply with "In addition, a floating-point number may have the suffix M to indicate that exact precision is desired."

The grammar in README.md does not allow leading zeros in the integer and exponent portions of a float. clojure.core/read, clojure.edn/read and edn-java all accept leading zeros in these cases, but given them no special interpretation.

float = [ "+" | "-" ] cardinal
        ((frac [exp] ["M"]) | (exp ["M"]) |  "M").

The fractional portion can consist of only a "." (not followed by any digits). This is consistent with the current spec and with the behavior of clojure.core/read and clojure.edn/read

frac =   "." { digit }.

exp = ("E" | "e") ["+" | "-"] cardinal.

cardinal = digit | nonZeroDigit { digit }.

Characters

character = "\" (characterName | printableCharacter).

Certain characters can be refered to by names. In their literal form these characters would display as whitespace, making the resulting edn quite confusing for a person to read or edit.

The specification only mentions the first four of these explicitly; backspace and formfeed are included for symmentry with string and because clojure.edn/read and clojure.core/read both support them.

characterName = "newline" | "space" | "tab" | "return"
              | "backspace" | "formfeed".

Characters literals for other code points are written by simply including the literal character immediate following the "".

printableCharacter = "!" | … | "~" | U+A1 | … | MaxCodePoint.

This definition of printableCharacter is sloppy. There are probably other Unicode code points that we don't want to use in character literals because they have no printed representation or perform some control function.

Strings

string = """" {stringChar | "\" stringEscape} """".

stringChar = U+1 | … | "!" | "#" | … | "[" | "]" | … | MaxCodePoint.
    (* all code points except NUL, " and \ *)

README.md only mentions "Standard C/Java escape characters \t \r \n are supported", but clearly \\ and \" must be included. \b \f are included because they are supported by clojure.core/read and clojure.edn/read. \' is excluded despite being supported by Java because clojure.core/read and clojure.edn/read reject \'.

stringEscape = """" | "b" | "t" | "n" | "f" | "r" | "\".

Wirth Syntax Notation

Wikipedia provide as a brief description of WSN. The formal Syntax of WSN in WSN follows:

SYNTAX     = { PRODUCTION } .
PRODUCTION = IDENTIFIER "=" EXPRESSION "." .
EXPRESSION = TERM { "|" TERM } .
TERM       = FACTOR { FACTOR } .
FACTOR     = IDENTIFIER
           | LITERAL
           | "[" EXPRESSION "]"
           | "(" EXPRESSION ")"
           | "{" EXPRESSION "}" .
IDENTIFIER = letter { letter } .
LITERAL    = """" character { character } """" .

The only oddity for modern readers is that string literals do not provide anything like the \ escaping convention we have grown used to C-like languages. One embeds a double quote in a string by doubling it. All other characters in the string literal are taken literally, as written.

An Extension to WSN to Represent Unicode Codepoints

Edn's syntax is specified in terms of unicode code points, the first 128 of which are identical to US-ASCII. (Edn is always serialized as UTF-8, which can represent the full set of unicode code points.)

For the purposes of this grammar, we'll use the following extension to represent a unicode code point:

HexDigit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
         | "A" | "B" | "C" | "D" | "E" | "F".

CodePoint = "U+" HexDigit { HexDigit }.

For example: NUL is written as U+0, " " can be written as U+20 and "~" can be written as U+7F.

To represent large contiguous subsets of the unicode codepoints, we use an elipsis as follows:

HexDigit = "0" | … | "9" | "A" | … | "F".
@bpsm
Copy link
Author

bpsm commented Jul 8, 2013

See also https://gist.github.com/bpsm/5951638 which is just the plain WSN with comments.

@zaiquirii
Copy link

This syntax disagrees with the specification with regards to whitespace before and after { } as well as "s. As far as I can tell it does not allow opening brackets to be directly followed by a space, Nor does your example of "MeaningOfLife"42 actually work.
I know you state that whitespace isn't always necessary in the comment, but is there a way to remove all ambiguity. I tried myself, using the [ ] in WSN means optional. but I was always ignoring one of the cases.

changing elements to be defined as

elements = element { s element }.

would take care of spaces around { } [ ( ), leaving just the spaces between strings and other elements and maps

@bpsm
Copy link
Author

bpsm commented Jul 11, 2013

@hubrys Thanks for taking the time to review the syntax. I'll try to respond to your objections one at at time:

it does not allow opening brackets to be directly followed by a space

Let's take the production for vectors as an example:

"[" elements "]".

If we substitute in the definition of elements we get:

"[" { s element } "]".

The production for s, allows any amount of white space (including no white space, since {x} means 0 or more repetitions of x in WSN):

"[" { { whitespace | comment | discardedElement } element } "]".

Here are a few examples that would match this production: [ nil], [nil] and [ #_ nil].

But, there is a bug, the following don't match: [] and [nil ].

I believe this fixes it by allowing s following elements in collections.

element = nil | boolean | symbol | keyword | number | character
      | string
      | "{" s { element s element s } "}"
      | "#{" elements s "}"
      | "[" elements s "]"
      | "(" elements s ")"
      | "#" tagSymbol s element.

With this change [] and [nil ] are both legal.

Nor does your example of "MeaningOfLife"42 actually work.

I don't follow, "MeaningOfLife"42 parses as:

elements
    s
        <empty>
    element
        string
            "MeaningOfLife"
    s
        <empty>
    element
        number
            integer
                "42"

Am I missing something?

@zaiquirii
Copy link

@bpsm Ahhhh, makes sense now. For some reason I was under the impression that the { } brackets implied that the item inside of them would have to be there at least once, not 0 or more times. That actually clears up and answers all the questions I had. Thanks.

@tomjakubowski
Copy link

A few things I've discovered tools.reader.edn accepts, which disagree with this grammar and which disagree with the README or are not specified there.

  • Keyword names starting with . - or + need not be followed by a letter, unlike symbol names. For example: :-1 is accepted.
  • The above also applies to the second name component of prefixed symbols. For example: foo/-1 is accepted as a single symbol foo/-1 (not as two values foo/- and 1). The README specifies that this name component "should follow the first-character restrictions for symbols as a whole."

@aadrian
Copy link

aadrian commented Sep 28, 2016

How about an ANTLR https://github.com/antlr/antlr4 grammar for EDN of this to be even more precise?

There seems to be already one for Clojure https://github.com/antlr/grammars-v4/tree/master/clojure

Thank you.

@Marti2203
Copy link

Hello, I just made a grammar for the EDN format. If you are still interested, please check it at antlr/grammars-v4#1831

@aadrian
Copy link

aadrian commented Aug 27, 2020

@Marti2203 Thank you for the grammar file !

@Marti2203
Copy link

:) With pleasure, please check it and give a review as I had never heard of EDN before seeing the issue in the grammar repo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants