Update lexer development guide #1145

Merged
merged 6 commits into from
Jun 21, 2019
124 changes: 72 additions & 52 deletions docs/LexerDevelopment.md
@@ -30,11 +30,11 @@ This guide assumes a familiarity with git. If you're new to git, GitHub has
Rouge automatically loads lexers saved in the `lib/rouge/lexers/` directory and
so if you're submitting a new lexer, that's the right place to put it.

Your lexer needs to be a subclass of the {Rouge::Lexer Lexer} abstract class.
Most lexers are in fact subclassed from {Rouge::RegexLexer RegexLexer} as the
simplest way to define the states of a lexer is to use rules consisting of
regular expressions. The remainder of this guide assumes your lexer is
subclassed from {Rouge::RegexLexer RegexLexer}.
Your lexer needs to be a subclass of the {Rouge::Lexer} abstract class. Most
lexers are in fact subclassed from {Rouge::RegexLexer} as the simplest way to
define the states of a lexer is to use rules consisting of regular expressions.
The remainder of this guide assumes your lexer is subclassed from
{Rouge::RegexLexer}.

You can learn a lot by reading through some of the existing lexers. A good
example that's not too long is [the JSON lexer][json-lexer].
@@ -68,31 +68,30 @@ To be usable by Rouge, a lexer should declare a **title**, a **description**, a
title "JSON"
```

The title of the lexer. It is declared using the {Rouge::Lexer.title
Lexer.title} method.
The title of the lexer. It is declared using the {Rouge::Lexer.title} method.

Note: As a subclass of {Rouge::RegexLexer RegexLexer}, the JSON lexer inherits
this method (and its inherited methods) into its namespace and can call those
methods without needing to prefix each with `Rouge::Lexer`. This is the case
with all of the property defining methods.
Note: As a subclass of {Rouge::RegexLexer}, the JSON lexer inherits this method
(and its inherited methods) into its namespace and can call those methods
without needing to prefix each with `Rouge::Lexer`. This is the case with all
of the property defining methods.

#### Description

```rb
desc "JavaScript Object Notation (json.org)"
```

The description of the lexer. It is declared using the {Rouge::Lexer.desc
Lexer.desc} method.
The description of the lexer. It is declared using the {Rouge::Lexer.desc}
method.

#### Tag

```rb
tag "json"
```

The tag associated with the lexer. It is declared using the {Rouge::Lexer.tag
Lexer.tag} method.
The tag associated with the lexer. It is declared using the {Rouge::Lexer.tag}
method.

A tag provides a way to specify the lexer that should apply to text within a
given code block. In various flavours of Markdown, it's used after the opening
@@ -110,8 +109,8 @@ https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/lexers/ruby.rb
#### Aliases

The aliases associated with a lexer. These are declared using the
{Rouge::Lexer.aliases Lexer.aliases} method. Aliases are alternative ways that
the lexer can be identified.
{Rouge::Lexer.aliases} method. Aliases are alternative ways that the lexer can
be identified.

The JSON lexer does not define any aliases but [the Ruby one][ruby-lexer] does.
We can see how it could be used by looking at another example in Markdown. This
@@ -129,7 +128,7 @@ filenames "*.json"
```

The filename(s) associated with a lexer. These are declared using the
{Rouge::Lexer.filenames Lexer.filenames} method.
{Rouge::Lexer.filenames} method.

Filenames are declared as "globs" that will match a particular pattern. A
"glob" may be merely the specific name of a file (eg. `Rakefile`) or it could
@@ -142,25 +141,25 @@ mimetypes "application/json", "application/vnd.api+json", "application/hal+json"
```

The mimetype(s) associated with a lexer. These are declared using the
{Rouge::Lexer.mimetypes Lexer.mimetypes} method.
{Rouge::Lexer.mimetypes} method.

### Lexer States

The other major element of a lexer is the collection of one or more states.
For lexers that subclass {Rouge::RegexLexer RegexLexer}, a state will consist
For lexers that subclass {Rouge::RegexLexer}, a state will consist
of one or more rules with a rule consisting of a regular expression and an
action. The action yields tokens and manipulates the _state stack_.

#### The State Stack

The state stack represents the series of states through which the lexer has
passed. States are added and removed from the "top" of the stack. The oldest
state is on the bottom of the stack and the newest state is on the top.
The state stack represents an ordered sequence of states the lexer is currently
processing. States are added and removed from the "top" of the stack. The
oldest state is on the bottom of the stack and the newest state is on the top.

The initial (and therefore bottommost) state is the `:root` state. The lexer
works by looking at the rules that are in the state that is on top of the
stack. These are tried _in order_ until a match is found. At this point, the
action defined in the rule is run, the match is removed from the input stream
action defined in the rule is run, the head of the input stream is advanced
and the process is repeated with the state that is now on top of the stack.
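
The loop described above can be sketched in plain Ruby. This is a toy model for illustration only and is not how Rouge is implemented: the `toy_states` and `toy_lex` names, the token symbols and the rule format are all assumptions made for this example. It shows rules being tried in order for the state on top of the stack, the input head advancing past each match, and states being pushed and popped.

```rb
require 'strscan'

# Hypothetical rule table: each state maps to an ordered list of
# [regex, token_name, stack_action] rules. Not Rouge's real data structure.
def toy_states
  {
    root: [
      [/"/,   :string_delim, :string], # push the :string state
      [/\s+/, :text,         nil],
      [/\w+/, :name,         nil],
    ],
    string: [
      [/"/,     :string_delim, :pop],  # pop back to the previous state
      [/[^"]+/, :string,       nil],
    ],
  }
end

def toy_lex(source)
  scanner = StringScanner.new(source)
  states = toy_states
  stack = [:root] # :root is the bottommost state
  tokens = []

  until scanner.eos?
    # Try the rules of the state on top of the stack, in order.
    rule = states[stack.last].find do |regex, _token, _action|
      scanner.match?(regex)
    end

    if rule.nil?
      # Nothing matched: emit an error token and skip one character.
      tokens << [:error, scanner.getch]
      next
    end

    regex, token, action = rule
    tokens << [token, scanner.scan(regex)] # advances the input head

    case action
    when :pop then stack.pop
    when Symbol then stack.push(action)
    end
  end

  tokens
end
```

Calling `toy_lex('say "hi"')` yields a `:name` token, a `:text` token, and then a `:string` token wrapped in two `:string_delim` tokens, because the opening quote pushes `:string` and the closing quote pops back to `:root`.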

Now that we've explained the concepts, let's look at how you actually define
@@ -174,14 +173,14 @@ state :root do
end
```

A state is defined using the {Rouge::RegexLexer.state RegexLexer.state} method.
A state is defined using the {Rouge::RegexLexer.state} method.
The method consists of the name of the state as a `Symbol` and a block
specifying the rules that Rouge will try to match as it parses the text.

#### Rules

A rule is defined using the {Rouge::RegexLexer::StateDSL#rule StateDSL#rule}
method. The `rule` method can define either "simple" rules or "complex" rules.
A rule is defined using the {Rouge::RegexLexer::StateDSL#rule} method. The
`rule` method can define either "simple" rules or "complex" rules.

*Simple Rules*

@@ -232,9 +231,9 @@ The block called can take one argument, usually written as `m`, that contains
the regular expression match object.

These kinds of rules allow for more fine-grained control of the state stack.
Inside a complex rule's block, it's possible to {Rouge::RegexLexer#push push},
{Rouge::RegexLexer#pop! pop}, {Rouge::RegexLexer#token yield a token} and
{Rouge::RegexLexer#delegate delegate to another lexer}.
Inside a complex rule's block, it's possible to call {Rouge::RegexLexer#push},
{Rouge::RegexLexer#pop!}, {Rouge::RegexLexer#token} and
{Rouge::RegexLexer#delegate}.

You can see an example of these more complex rules in [the Ruby
lexer][ruby-lexer].
@@ -256,19 +255,22 @@ Rouge will attempt to guess the appropriate lexer if it is not otherwise clear.
If Rouge is unable to do this on the basis of any tag, associated filename or
associated mimetype, it will try to detect the appropriate lexer on the basis of
the text itself (the source). This is done by calling `self.detect?` on the
possible lexer (a default `self.detect?` method is defined in {Rouge::Lexer
Lexer} and simply returns `false`).
possible lexer (a default `self.detect?` method is defined in {Rouge::Lexer}
and simply returns `false`).

A lexer can implement its own `self.detect?` method that takes as a parameter a
{Rouge::TextAnalyzer TextAnalyzer} object. If the `self.detect?` method returns
true, the lexer will be selected as the appropriate lexer.
{Rouge::TextAnalyzer} object. If the `self.detect?` method returns true, the
lexer will be selected as the appropriate lexer.

The `self.detect?` method is intended to work by looking at the shebang or
doctype that identifies a piece of text. To make this easier, Rouge provides
the {Rouge::TextAnalyzer#shebang TextAnalyzer#shebang} method and the
{Rouge::TextAnalyzer#doctype TextAnalyzer#doctype} method. For more general
disambiguation between different lexers, see [Conflicting Filename
Globs][conflict-globs] below.
It is important to note that `self.detect?` should _only_ return `true` if it
is 100% sure that the language is detected. The most common ways for source
code to identify the language it's written in are a shebang and a doctype.
Rouge provides the {Rouge::TextAnalyzer#shebang} method and the
{Rouge::TextAnalyzer#doctype} method specifically for use with `self.detect?`
to make these checks easy to perform.
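
To make the idea of a shebang check concrete, here is a plain-Ruby sketch of what such a check involves. The `shebang_interpreter` and `looks_like_ruby?` helpers are hypothetical stand-ins written for this example; they are not part of Rouge, and a real lexer would rely on the TextAnalyzer methods mentioned above rather than parsing the first line itself.

```rb
# Hypothetical helper: returns the interpreter named in a shebang line,
# or nil if the source does not start with one. Not Rouge's implementation.
def shebang_interpreter(source)
  first_line = source.lines.first.to_s
  return nil unless first_line.start_with?('#!')

  # Split "#!/usr/bin/env ruby -w" into whitespace-separated parts.
  parts = first_line[2..-1].strip.split(/\s+/)
  return nil if parts.empty?

  command = File.basename(parts.first)
  # With "/usr/bin/env ruby", the interpreter is the argument after env.
  command == 'env' ? parts[1] : command
end

# Hypothetical detection check: true only on an exact interpreter match,
# in keeping with the advice to return true only when certain.
def looks_like_ruby?(source)
  shebang_interpreter(source) == 'ruby'
end
```

The check is deliberately conservative: source without a shebang, or with a shebang naming a different interpreter, is never claimed.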

For more general disambiguation between different lexers, see [Conflicting
Filename Globs][conflict-globs] below.

[conflict-globs]: #Conflicting_Filename_Globs

@@ -280,7 +282,7 @@ for these words easier, many lexers will put the applicable keywords in an
array and make them available in a particular way (be it as a local variable,
an instance variable or what have you).

We recommend lexers use a class method:
For performance and safety, we strongly recommend lexers use a class method:

```rb
module Rouge
@@ -297,10 +299,24 @@ module Rouge
end
```

These keywords can then be included in a regular expression like so:
These keywords can then be used like so:

```rb
rule /(#{keywords.join('|')})\b/, Keyword
rule /\w+/ do |m|
  if self.class.keywords.include?(m[0])
    token Keyword
  else
    token Name
  end
end
```

In some cases, you may want to interpolate your keywords into a regular
expression. If you do, be careful to use the `\b` anchor to avoid inadvertently
matching part of a longer word (eg. `if` matching `iff`):

```rb
rule /\b(#{keywords.join('|')})\b/, Keyword
```
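
The effect of the `\b` anchors can be checked in plain Ruby without Rouge. The `keyword_pattern` helper below is hypothetical and exists only for this illustration; it also uses `Regexp.union`, which escapes any regex metacharacters in the keywords before they are interpolated.

```rb
# Hypothetical helper for illustration; not part of Rouge.
def keyword_pattern(keywords, anchored: true)
  # Regexp.union builds an escaped alternation such as "if|else".
  alternation = Regexp.union(keywords).source
  anchored ? /\b(?:#{alternation})\b/ : /(?:#{alternation})/
end

keyword_pattern(%w(if else), anchored: false).match?("iff") # => true
keyword_pattern(%w(if else)).match?("iff")                  # => false
keyword_pattern(%w(if else)).match?("if x")                 # => true
```

Without the anchors, the `if` inside `iff` is matched; with them, only the standalone word matches.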

#### Startup
@@ -312,16 +328,16 @@ start do
end
```

The {Rouge::RegexLexer.start RegexLexer.start} method can take a block that
The {Rouge::RegexLexer.start} method can take a block that
will be called when the lexer commences lexing. This provides a way to enter
into a special state "before" entering into the `:root` state (the `:root`
state is still the bottommost state in the state stack; the state pushed by
`start` sits "on top" but is the state in which the lexer begins).

Why would you want to do this? In some languages, there may be language
structures that can appear at the beginning of a file. {Rouge::RegexLexer.start
RegexLexer.start} provides a way to parse these structures. An example is a
preprocessor directive in C. You can see how these are lexed in [the C
structures that can appear at the beginning of a file.
{Rouge::RegexLexer.start} provides a way to parse these structures. An example
is a preprocessor directive in C. You can see how these are lexed in [the C
lexer][c-lexer].

[c-lexer]: https://github.com/rouge-ruby/rouge/blob/master/lib/rouge/lexers/c.rb
@@ -340,13 +356,12 @@ lexer][cpp-lexer] and [the JSX lexer][jsx-lexer] for examples.
#### Conflicting Filename Globs

If two or more lexers define the same filename glob, this will cause an
{Rouge::Guesser::Ambiguous Ambiguous} error to be raised by certain guessing
methods (including the one used by the `assert_guess` method used in your
spec).
{Rouge::Guesser::Ambiguous} error to be raised by certain guessing methods
(including the one used by the `assert_guess` method used in your spec).

The solution to this is to define a disambiguation procedure in the
{Rouge::Guessers::Disambiguation Disambiguation} class. Here's the procedure
for the `*.pl` filename glob as an example:
{Rouge::Guessers::Disambiguation} class. Here's the procedure for the `*.pl`
filename glob as an example:

```rb
disambiguate "*.pl" do
@@ -431,6 +446,11 @@ returns true should be tested.
The demo file is tested automatically as part of Rouge's test suite. The file
should be able to be parsed without producing any `Error` tokens.

The demo is also used on [rouge.jneen.net][hp] as the default text to display
when a lexer is chosen. It should be short (less than 20 lines if possible).

[hp]: http://rouge.jneen.net/

### Visual Samples

While the visual sample is tested by the testing suite to ensure that it does