forked from codequarterly/cq-challenge-markup
markup-spec.txt
-*- mode: markup; -*-
* Markup Specification
Markup is a text markup language primarily useful for prose documents
such as books and articles. It is designed to be editable in a plain
text editor\note{At least to the extent that Emacs can be considered a
plain text editor.} and to allow for arbitrary logical markup. The
grammar of a Markup file is defined in terms of a mapping to an
abstract syntax tree which can then be rendered into a number of
formats, e.g. HTML, PDF, TeX, RTF, etc.
** Syntax
Markup files consist of Unicode text encoded in UTF-8. Lines can be
terminated with carriage return (U+000D), carriage-return/line-feed
(U+000D U+000A), or line feed (U+000A). Tab characters (U+0009) are
equivalent to eight spaces. A blank line (which has a syntactic
meaning to be described later) is defined as two consecutive
end-of-line sequences possibly with white space between
them.\note{Trailing white space has no meaning in Markup and does not
need to be preserved by a Markup processor.} The basic syntax is
similar to Markdown and reStructuredText with a bit of TeX thrown in
for good measure.
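The line-level rules above can be sketched as a small preprocessing step. This is a minimal illustration, not part of the specification: the function names \code{split_lines} and \code{is_blank} are hypothetical, and each tab is simply replaced by eight spaces rather than aligned to a tab stop, which is one reading of "equivalent to eight spaces".

```python
import re


def split_lines(text: str) -> list[str]:
    """Split text on CR, CRLF, or LF, expand tabs, and drop trailing white space.

    Hypothetical helper: the specification names no such function.
    """
    # CRLF must be tried before a lone CR so it counts as one terminator.
    lines = re.split(r"\r\n|\r|\n", text)
    # Assumption: a tab is replaced by exactly eight spaces.
    return [line.replace("\t", " " * 8).rstrip() for line in lines]


def is_blank(line: str) -> bool:
    """A line holding only white space separates paragraphs."""
    return line.strip() == ""
```

After this step, two end-of-line sequences with only white space between them show up as an empty string in the list, so blank-line detection reduces to \code{is_blank}.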
** Grammar
As mentioned above, the Markup grammar defines a mapping between a
Markup document and an abstract syntax tree. The tree is built out of
tagged elements and strings.\note{Markup was originally developed in
Lisp where the obvious representation for a Markup document is as
s-expressions, with each tree represented by a list whose first
element is a symbol indicating the tree’s tag.
   (:body
     (:h1 "This is a header")
     (:p "This is a paragraph")
     (:p "This is another paragraph with some " (:i "italic") " text in it."))
This kind of tree structure also has an obvious representation in XML
or HTML:
   <body>
     <h1>This is a header</h1>
     <p>This is a paragraph</p>
     <p>This is another paragraph with some <i>italic</i> text in it.</p>
   </body>
} The abstract syntax tree is rooted in a single element whose tag is
\code{body}. Its children are the elements described below.\note{The
element names were, as will be obvious to anyone who knows HTML,
chosen so that a trivial mapping from Markup to HTML gives a useful
result but other than that pleasant coincidence, Markup defines no
particular semantics for Markup documents.}
*** Normal paragraphs
Normal paragraphs are simply blocks of text separated by one or more
blank lines. They can contain single line breaks, which are converted
to spaces during parsing. The body of a paragraph can contain tagged
markup as discussed below. The tag of a paragraph node is \code{p}.
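As a rough sketch of the paragraph rule, the following groups lines into \code{p} nodes. It assumes the input is a list of already-split lines with no indented sections, represents each node as a tuple of tag followed by children (an assumption made for illustration), and leaves tagged markup inside the text unparsed. The name \code{paragraphs} is hypothetical.

```python
def paragraphs(lines: list[str]) -> list[tuple]:
    """Group non-blank lines into ("p", text) nodes.

    Single line breaks inside a paragraph become spaces; one or more
    blank lines end the current paragraph.
    """
    result, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.strip())
        elif current:
            # A blank line closes the paragraph collected so far.
            result.append(("p", " ".join(current)))
            current = []
    if current:
        result.append(("p", " ".join(current)))
    return result
```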
*** Headers
Headers are paragraphs marked as in Emacs outline-mode, with leading
\code{*}’s followed by a single space. The more stars the lower in the
hierarchy the header. The content of the header is everything after
the \code{*} and the space and is otherwise parsed just like a
paragraph. Header nodes are tagged with \code{h\i{n}} where \i{n} is
the number of stars.
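Header recognition might look like the sketch below. The function name \code{parse_header} is hypothetical, the node is represented as a tuple of tag and raw text (an assumption), and the header content is returned unparsed even though a full processor would go on to parse it like a paragraph.

```python
import re

# One or more stars, exactly one space, then the header content.
HEADER = re.compile(r"(\*+) (.*)", re.DOTALL)


def parse_header(paragraph: str):
    """Return ("hN", content) if the paragraph is a header, else None."""
    match = HEADER.fullmatch(paragraph)
    if match is None:
        return None
    stars, content = match.groups()
    return (f"h{len(stars)}", content)
```

A paragraph that begins with an escaped star, such as \code{\\* text}, does not match because its first character is a backslash.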
*** Block quotes
Block quotes are one of three kinds of “sections” indicated by
indentation. A section ends at the end of the file or at the first
less-indented non-blank line. Sections can also be
nested. A block quote is demarcated by two spaces of indentation
relative to the enclosing section and can contain its own
paragraphs, headers, lists, and verbatim sections. Block
quote nodes are tagged with \code{blockquote}.
*** Verbatim sections
Verbatim sections are indented three spaces relative to the enclosing
section. Within a verbatim section all text is captured exactly as is.
Verbatim sections are tagged with \code{pre}.
*** Lists
Lists are demarcated by two spaces of indentation followed by a list
marker, either ‘\code{#}’ for an ordered (i.e. numbered) list or
‘\code{-}’ for an unordered (i.e. bulleted) list. An ordered list is
tagged with \code{ol} and an unordered list with \code{ul}.
The list marker must be followed by a space and then the text of the
first list item. List items are tagged with \code{li} and can contain
multiple paragraphs, the contents of which are indented to line up
under the first character of the list item's text. Subsequent
items are marked with another list marker in the same column as the
original list marker and another space. For example:
   This is a regular paragraph.

     # This is the first item of a list consisting of one paragraph
       that spans a couple lines.

     # This is the second item.

     # This is the third item.

       This is another paragraph in the third item.

   This is another paragraph.
This could be rendered in HTML as:
   <p>This is a regular paragraph.</p>
   <ol>
     <li>
       <p>This is the first item of a list consisting of one
       paragraph that spans a couple lines.</p>
     </li>
     <li>
       <p>This is the second item.</p>
     </li>
     <li>
       <p>This is the third item.</p>
       <p>This is another paragraph in the third item.</p>
     </li>
   </ol>
   <p>This is another paragraph.</p>
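The three indentation-based sections (block quotes, verbatim sections, and lists) can be told apart from the first line of a paragraph. The sketch below is one possible way to do that; \code{classify_indent} is a hypothetical name, the \code{"end"} result is an invented signal rather than a Markup tag, and an extra indent of exactly one space, which the specification does not cover, is arbitrarily treated like no extra indent.

```python
def classify_indent(line: str, enclosing_indent: int = 0) -> str:
    """Guess which node the first line of a paragraph opens.

    `enclosing_indent` is the indentation of the enclosing section.
    """
    indent = len(line) - len(line.lstrip(" "))
    extra = indent - enclosing_indent
    rest = line[indent:]
    if extra < 0:
        return "end"  # less indented: the enclosing section is over
    if extra >= 3:
        return "pre"  # verbatim: three spaces (deeper lines stay verbatim)
    if extra == 2:
        if rest.startswith("# "):
            return "ol"  # ordered list marker followed by a space
        if rest.startswith("- "):
            return "ul"  # unordered list marker followed by a space
        return "blockquote"
    return "p"  # assumption: extra indent of 0 or 1 is a normal paragraph
```

Because an escaped marker such as \code{\\#} starts with a backslash, a two-space-indented paragraph beginning with it falls through to \code{blockquote}.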
*** Links
A Markup processor can optionally support a few bits of syntax to make
it more convenient to add hyperlinks to a document. Within normal text
(i.e. anywhere but a verbatim section) a link can be indicated by
enclosing the text to act as the hyperlink with \code{\[]}s. This maps
to an element tagged \code{link}. If the text between the \code{\[]}s
includes a \code{|}, the text after the \code{|} is wrapped in a
\code{key} element.
A paragraph consisting solely of text in \code{\[]}s followed by zero
or more spaces followed by text enclosed in \code{<>}s is parsed as an
element tagged \code{link_def} whose two children are a \code{link}
element comprising the text between the \code{\[]}s and a \code{url}
element comprising the text between the \code{<>}s.\note{The idea is
that a Markup backend would render all the in-text \code{link}
elements as hyperlinks with the \code{link} text linking to the URL
given in the corresponding \code{link_def} element.} A given Markup
processor can choose to implement the link syntax or not and, if it
does, may provide a way to indicate whether or not it should be used
when parsing a given document.
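A minimal sketch of the optional link syntax follows. The names \code{parse_link} and \code{parse_link_def} are hypothetical, nodes are represented as tuples of tag followed by children (an assumption), backslash escapes are ignored for brevity, and the text in a \code{link_def} is kept as a single string.

```python
import re

# [text], zero or more spaces, then <url>, making up the whole paragraph.
LINK_DEF = re.compile(r"\[([^\]]*)\] *<([^>]*)>")


def parse_link(inner: str) -> tuple:
    """Build a link node from the text found between [ and ]."""
    text, bar, key = inner.partition("|")
    if bar:
        # Text after the | is wrapped in a key element.
        return ("link", text, ("key", key))
    return ("link", inner)


def parse_link_def(paragraph: str):
    """Return a link_def node, or None if the paragraph is not one."""
    match = LINK_DEF.fullmatch(paragraph.strip())
    if match is None:
        return None
    return ("link_def", ("link", match.group(1)), ("url", match.group(2)))
```

A backend could then pair each in-text \code{link} with the \code{link_def} whose link text matches its key, or its own text when no key is given; that lookup policy is an assumption, not something the text above prescribes.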
*** Tagged markup
For all other markup, Markup uses the TeX-like notation
\code{\\\i{tagname}\{\i{stuff}\}}. Tag names can consist of letters,
numbers, ‘\code{-}’, ‘\code{.}’, and ‘\code{+}’. Tagged markup can
nest so you can have:
   \i{italic with \b{some bold added} and back to just italic}
An element created from tagged markup is tagged with the tagname.
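Nested tagged markup lends itself to a small recursive parser. The sketch below is illustrative only: \code{parse_inline} is a hypothetical name, nodes are tuples of tag name followed by children, "letters" and "numbers" are taken to mean ASCII (an assumption), and a backslash before any non-tag-name character is treated as an escape for that character. Sub-documents and links are not handled.

```python
import string

# Assumption: tag names use ASCII letters and digits plus '-', '.', '+'.
TAG_CHARS = set(string.ascii_letters + string.digits + "-.+")


def parse_inline(text: str, pos: int = 0, nested: bool = False):
    """Parse text from pos; return (children, position after the parse)."""
    children, buf = [], []

    def flush():
        # Emit accumulated plain text as a single string child.
        if buf:
            children.append("".join(buf))
            buf.clear()

    while pos < len(text):
        ch = text[pos]
        if ch == "}" and nested:
            flush()
            return children, pos + 1
        if ch == "\\" and pos + 1 < len(text):
            nxt = text[pos + 1]
            if nxt in TAG_CHARS:
                end = pos + 1
                while end < len(text) and text[end] in TAG_CHARS:
                    end += 1
                if end == len(text) or text[end] != "{":
                    raise ValueError(f"expected '{{' after tag name at {end}")
                name = text[pos + 1:end]
                flush()
                body, pos = parse_inline(text, end + 1, nested=True)
                children.append((name, *body))
                continue
            # Escaped character: keep it literally, drop the backslash.
            buf.append(nxt)
            pos += 2
            continue
        buf.append(ch)
        pos += 1
    if nested:
        raise ValueError("unclosed tagged markup")
    flush()
    return children, pos
```

For example, \code{parse_inline} applied to the nested italic and bold text shown above yields one \code{i} node whose children are a string, a \code{b} node, and another string.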
Certain tag names can be used to mark sub-documents which are parsed
differently than simple spans of text. The content of a
sub-document—between the opening and closing \{\}s—is parsed like a
document so it will contain at least one paragraph and can contain
headers, block quotes, lists, verbatim sections, and even nested
sub-documents. Footnotes, for example, are commonly set up to be parsed
as sub-documents. For example:
   This is an example paragraph.\note{This is a footnote whose
   reference will appear right after the period after ‘paragraph’.

   This is a second paragraph of the footnote.} Now back to the main
   paragraph.
Note that the blank line separating the paragraphs of the sub-document
has no effect on the enclosing paragraph.
If a sub-document is embedded in a paragraph that is part of an
indented section (i.e. a block quote or a list) then subsequent lines
of the sub-document should be indented the same as the enclosing
paragraph:
   This is a regular paragraph.

     This is a block quote.\note{This is a footnote within the
     block quote.

     This is a second paragraph in the footnote.} Back to the
     block quote paragraph.
A Markup processor will need to provide some more or less convenient
way to specify that certain tag names should be parsed as
sub-documents rather than character markup.
*** Escapes
Outside of verbatim sections, a backslash can escape any character
that is not a legal tag name character, stripping it of its syntactic
significance. The characters \code{\\}, \code{\{}, and \code{\}} must
be escaped whenever they appear outside a verbatim section if they are
to be part of the text. Other non-tag-name characters may be escaped
anytime, but it is only necessary when they would otherwise have
syntactic significance. For example, \code{*} does not need to be
escaped except at the beginning of a paragraph, where it would
otherwise mark the paragraph as a header.
   * This is a header

   \* This is a paragraph that starts with * (note no escape here)
   that contains a backslash: \\, an open brace: \{, and a close
   brace: \}

     \# This is a block quote paragraph starting with #, not a list.
*** One last (optional) convenience
In the real world, Markup documents are often (usually) edited in
Emacs. Emacs has a mechanism whereby a line starting with \code{-*-}
indicates a \i{mode line} which tells Emacs how to edit the
file. For instance, in the Markup sources of this specification, the
first line is:
   -*- mode: markup; -*-
A Markup parser can choose to strip such mode lines at the top level of
a document to save having to strip them out later in processing.
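Stripping the mode line could be as simple as the sketch below. The name \code{strip_modeline} is hypothetical, and the sketch assumes that line endings have already been normalized to line feeds and that only the first line of the document is checked.

```python
def strip_modeline(text: str) -> str:
    """Drop a leading Emacs mode line (a first line starting with -*-)."""
    first, _newline, rest = text.partition("\n")
    if first.startswith("-*-"):
        return rest
    return text
```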
** Trivial XML backend
A complete Markup system consists of a parser that can parse a text
file in Markup syntax into a data structure representing the resulting
abstract syntax tree and one or more back-ends that can render such a
tree into some other form. For the purposes of testing we specify a
trivial mapping from a Markup abstract syntax tree to well-formed XML: each
Markup element is mapped to an XML element with the same name, its
element children are mapped to XML in the same way, and its string
children become text. Thus:
   * Header 1

   ** Header 2

   Regular paragraph. With \i{italic} text.
maps to (indentation for clarity):
   <body>
     <h1>Header 1</h1>
     <h2>Header 2</h2>
     <p>Regular paragraph. With <i>italic</i> text.</p>
   </body>
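The trivial XML mapping is short enough to sketch in full. The name \code{to_xml} is hypothetical, and the tree is assumed to be built from tuples of tag followed by children, with plain strings as text. The sketch emits no indentation, and it does not check that a tag name is a legal XML name (a name containing \code{+}, for instance, is not).

```python
from xml.sax.saxutils import escape


def to_xml(node) -> str:
    """Render a (tag, *children) tuple tree as XML with no added white space."""
    if isinstance(node, str):
        # String children become text; &, <, and > are escaped.
        return escape(node)
    tag, *children = node
    inner = "".join(to_xml(child) for child in children)
    return f"<{tag}>{inner}</{tag}>"
```

Applied to the tree for the example above, a \code{body} node holding an \code{h1}, an \code{h2}, and a \code{p} with a nested \code{i}, this produces the same elements as the XML shown, just without the indentation.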