Description
I'm wondering if it wouldn't be a good idea to redesign how Elixir turns identifiers into links in formatted source code. The current approach is kind of hacky, and some of the bugs found in it seem rather hard to fix.
How filters work right now
- tokenize-file is a (bash + Perl regex) tokenizer that splits a source file into lines: every other line is (supposed to be) a potential identifier, and the rest are supposed to be comments, operators, and other generally uninteresting text.
- The result of the tokenizer is consumed by query.py. Every other line is looked up in the definitions database. If a match is found, unprintable markings are added at the beginning and end of the line.
- Lines are then concatenated back into a string
- Filters are used to find all interesting pieces of code, like identifiers and includes, and replace them with links. Filters consist of two parts: the first part runs on unformatted code, the second part runs on the HTML produced by Pygments. Here, the first part is run, and in most cases it replaces matching identifiers with special identifiers generated by the filters (see the sketch after this list).
- The string is formatted into HTML by Pygments
- The second part of the filters is run on the formatted HTML, replacing the special identifiers substituted into the original source with HTML links.
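To make the two-part filter scheme concrete, here is a heavily simplified, hypothetical illustration for quoted includes - it is not the actual Elixir filter code, and the URL scheme is made up:

```python
import re

# Hypothetical, heavily simplified two-part filter for quoted includes.
# The real Elixir filters are structured differently; this only shows the
# "replace with a placeholder, format, replace the placeholder" idea.

def pre_format(code, stash):
    """Part 1, before Pygments: swap each include path for a unique
    placeholder that (hopefully) survives formatting in one piece."""
    def repl(m):
        stash.append(m.group(2))
        return f'{m.group(1)}"__elixir_incl_{len(stash) - 1}__"'
    return re.sub(r'(#include\s+)"([^"]+)"', repl, code)

def post_format(html, stash):
    """Part 2, after Pygments: replace the placeholder inside the
    generated HTML with an actual link."""
    def repl(m):
        path = stash[int(m.group(1))]
        return f'<a href="/source/{path}">{path}</a>'  # made-up URL scheme
    return re.sub(r'__elixir_incl_(\d+)__', repl, html)
```

Both parts are plain regex substitutions, which is where much of the fragility described below comes from: part 2 only works if Pygments emits the placeholder unchanged and in one piece.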
Some more notes:
- The current filter design only allows them to replace sections of source that can be matched by a regex (this applies to both parts of a filter)
- The ident filter is used to find and replace identifiers marked by the file cmd of query.py in most files. In Kconfig, marked identifiers are replaced by kconfigidents; in DTS, by dtscompc
- tokenize-file is also used in update.py.
Problems with this approach
- Filters have to make sure to not break formatting.
- Filters have to trust Pygments to not split their special identifiers into unregexable parts.
- A single regex (and some bash) is used to tokenize source code for most languages, although there is a special tokenizer regex just for DTS. If we want to fix "Comments in assembler files for some architectures are indexed as identifiers" (#291), we would probably have to add another lexer there, making the script aware of another language. "Tokenizer gets confused by escaped quotes" (#306) also needs to be fixed there.
- Elixir has to be aware of the programming language in more places than necessary: not only in the script and the filters, but also in another hard-coded, language-dependent exception in the Python part of the tokenizer.
While of course some of this could be fixed while leaving the filter design mostly the same, I started wondering whether there is a better way to design this whole part.
Alternative design with Pygments
First, something that I want to be sure you noticed - the responsibility of the filters seems to be to work around missing functionality in Pygments. Pygments currently does not allow adding links to selected tokens (unless you generate a ctags file, but that's another hack - please, let's not do that).
My idea is simple:
Pygments is split into lexers and formatters - the lexer turns source into a sequence of tokens. The formatter then takes the tokens and turns them into formatted code. Pygments also supports filters - functions that take a sequence of tokens and transform it in some way.
This could be leveraged; unfortunately, Pygments currently does not allow filters to add link information to tokens.
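For reference, this is what the existing filter API already allows - a filter is just a transformation of the token stream. The snippet below is only a toy (the "definitions database" is a hard-coded set), but it runs against current Pygments:

```python
from pygments import highlight
from pygments.filter import Filter
from pygments.formatters import HtmlFormatter
from pygments.lexers import CLexer
from pygments.token import Name

class DefinitionLookup(Filter):
    """Toy filter: look Name tokens up in a fake definitions database.

    A filter can change a token's type or value, but today there is no
    way to attach a link target that HtmlFormatter would turn into <a>.
    """
    def __init__(self, definitions, **options):
        super().__init__(**options)
        self.definitions = definitions

    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Name and value in self.definitions:
                yield Name.Function, value  # all we can do: tweak the type
            else:
                yield ttype, value

lexer = CLexer()
lexer.add_filter(DefinitionLookup({'do_work'}))  # stand-in for a DB lookup
print(highlight('int main(void) { return do_work(); }', lexer, HtmlFormatter()))
```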
What I would like to do:
- Try to implement and upstream functionality in Pygments that would allow filters to attach certain formatter-dependent information to tokens. People have been asking for this already ("Adding anchors and links in HTML output", pygments/pygments#1930); the maintainer agrees that this would be useful, and I think I have a design that could lead to something that gets merged.
- If my changes get merged, reimplement the current double search-and-replace filters as Pygments filters. Tokenizing would then be done by Pygments, and adding link information to tokens (including the database lookup) would be contained entirely in the filters (see the sketch below).
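To show roughly what I have in mind, here is a purely hypothetical sketch of such a filter. The extra per-token annotation (the third element of the yielded tuple) is exactly the feature that does not exist yet, so this does not run against current Pygments, and the database interface and URL scheme are made-up stand-ins for Elixir's:

```python
# HYPOTHETICAL sketch - the three-element yield is the proposed (not yet
# existing) mechanism for attaching formatter-dependent data to a token.
# 'db' and the URL layout are stand-ins, not Elixir's real interfaces.
from pygments.filter import Filter
from pygments.token import Name

class IdentLinkFilter(Filter):
    def __init__(self, db, version, **options):
        super().__init__(**options)
        self.db = db            # assumed interface: db.exists(ident) -> bool
        self.version = version

    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Name and self.db.exists(value):
                # The HTML formatter would turn this annotation into a link.
                yield ttype, value, {'link': f'/ident/{self.version}/{value}'}
            else:
                yield ttype, value
```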
This should lead to less and simpler code - for example, filters won't have to track what they replaced in the unformatted source. Filters will also receive more information, which in the future could allow them to, for example, track context and supply links with more information. And less regex (in Elixir, at least).
Possible problems
A problem I see so far is that Pygments treats each macro line as a single token. Macros are easily recognizable because they use a separate token type, so I think this could be handled by a filter that would lex all macros into smaller tokens and supply them with identifier information. I realize that this runs into some of the points I made above, but I still think this approach is better - most code would be handled by the cleaner codepath. Also, support for identifiers in macros is already very hacky at best - maybe this would even allow us to improve something, since the job of recognizing macros would now be done for us. A rough sketch of such a filter is below.
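This is a minimal sketch of that idea, using only the existing filter API (so still without links): it re-lexes the Comment.Preproc chunks that the C lexers emit for macro lines, so identifier-shaped pieces become Name tokens. A real version would also skip the directive itself and C keywords, and would do the definitions-database lookup:

```python
import re

from pygments import highlight
from pygments.filter import Filter
from pygments.formatters import HtmlFormatter
from pygments.lexers import CLexer
from pygments.token import Comment, Name

IDENT_RE = re.compile(r'[A-Za-z_]\w*')

class MacroSublexer(Filter):
    """Sketch: re-lex preprocessor chunks into finer-grained tokens."""
    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype not in Comment.Preproc:
                yield ttype, value
                continue
            # Split the preprocessor chunk into identifier and
            # non-identifier pieces, keeping everything else as-is.
            pos = 0
            for m in IDENT_RE.finditer(value):
                if m.start() > pos:
                    yield Comment.Preproc, value[pos:m.start()]
                yield Name, m.group(0)
                pos = m.end()
            if pos < len(value):
                yield Comment.Preproc, value[pos:]

lexer = CLexer()
lexer.add_filter(MacroSublexer())
print(highlight('#define DO_WORK(x) compute(x)\n', lexer, HtmlFormatter()))
```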