Description
I'm wondering if it wouldn't be a good idea to redesign how Elixir turns identifiers into links in formatted source code. The current approach is kind of hacky, and some of the bugs found in it seem rather hard to fix.
How filters work right now
- tokenize-file is a (bash + Perl regex) tokenizer that splits a source file into lines: every other line is (supposed to be) a potential identifier, and the rest are supposed to be comments, operators, and other generally uninteresting text.
- The result of the tokenizer is consumed by query.py. Every other line is looked up in the definitions database. If a match is found, unprintable markings are added at the beginning and end of the line.
- Lines are then concatenated back into a string
- Filters are used to find all interesting pieces of code, like identifiers and includes, and replace them with links. Filters consist of two parts: the first part runs on unformatted code, the second part runs on the HTML produced by Pygments. Here, the first part is run, and in most cases it replaces matching identifiers with special identifiers generated by the filters (see the sketch after this list).
- The string is formatted into HTML by Pygments
- The second part of the filters is run on the formatted HTML, replacing the special identifiers substituted into the original source with HTML links.
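To make the two-part filter scheme concrete, here is a heavily simplified, hypothetical illustration for quoted includes - it is not the actual Elixir filter code, and the URL scheme is made up:

```python
import re

# Hypothetical, heavily simplified two-part filter for quoted includes.
# The real Elixir filters are structured differently; this only shows the
# "replace with a placeholder, format, replace the placeholder" idea.

def pre_format(code, stash):
    """Part 1, before Pygments: swap each include path for a unique
    placeholder that (hopefully) survives formatting in one piece."""
    def repl(m):
        stash.append(m.group(2))
        return f'{m.group(1)}"__elixir_incl_{len(stash) - 1}__"'
    return re.sub(r'(#include\s+)"([^"]+)"', repl, code)

def post_format(html, stash):
    """Part 2, after Pygments: replace the placeholder inside the
    generated HTML with an actual link."""
    def repl(m):
        path = stash[int(m.group(1))]
        return f'<a href="/source/{path}">{path}</a>'  # made-up URL scheme
    return re.sub(r'__elixir_incl_(\d+)__', repl, html)
```

Both parts are plain regex substitutions, which is where much of the fragility described below comes from: part 2 only works if Pygments emits the placeholder unchanged and in one piece.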
Some more notes:
- The current filter design only allows them to replace sections of source that can be matched by a regex (this applies to both parts of a filter)
- The ident filter is used to find and replace identifiers marked by the file cmd of query.py in most files. In Kconfig, marked identifiers are replaced by kconfigidents; in DTS, by dtscompc
- tokenize-file is also used in update.py.
Problems with this approach
- Filters have to make sure to not break formatting.
- Filters have to trust Pygments to not split their special identifiers into unregexable parts.
- A single regex (and some bash) is used to tokenize source code for most languages, although there is a special tokenizer regex just for DTS. If we want to fix "Comments in assembler files for some architectures are indexed as identifiers" (#291), we would probably have to add another lexer there, making the script aware of another language. "Tokenizer gets confused by escaped quotes" (#306) also needs to be fixed there.
- Elixir has to be aware of the programming language in more places than necessary: not only in the script and the filters, but also in another hard-coded, language-dependent exception in the Python part of the tokenizer.
While of course some of this could be fixed while leaving the filter design mostly the same, I started wondering whether there is a better way to design this whole part.
Alternative design with Pygments
First, something that I want to be sure you noticed - the responsibility of the filters seems to be to work around missing functionality in Pygments. Pygments currently does not allow adding links to selected tokens (unless you generate a ctags file, but that's another hack - please, let's not do that).
My idea is simple:
Pygments is split into lexers and formatters - the lexer turns source into a sequence of tokens. The formatter then takes the tokens and turns them into formatted code. Pygments also supports filters - functions that take a sequence of tokens and transform it in some way.
This could be leveraged; unfortunately, Pygments currently does not allow filters to add link information to tokens.
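For reference, this is what the existing filter API already allows - a filter is just a transformation of the token stream. The snippet below is only a toy (the "definitions database" is a hard-coded set), but it runs against current Pygments:

```python
from pygments import highlight
from pygments.filter import Filter
from pygments.formatters import HtmlFormatter
from pygments.lexers import CLexer
from pygments.token import Name

class DefinitionLookup(Filter):
    """Toy filter: look Name tokens up in a fake definitions database.

    A filter can change a token's type or value, but today there is no
    way to attach a link target that HtmlFormatter would turn into <a>.
    """
    def __init__(self, definitions, **options):
        super().__init__(**options)
        self.definitions = definitions

    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Name and value in self.definitions:
                yield Name.Function, value  # all we can do: tweak the type
            else:
                yield ttype, value

lexer = CLexer()
lexer.add_filter(DefinitionLookup({'do_work'}))  # stand-in for a DB lookup
print(highlight('int main(void) { return do_work(); }', lexer, HtmlFormatter()))
```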
What I would like to do:
- Try to implement and upstream functionality in Pygments that would allow filters to attach certain formatter-dependent information to tokens. People have been asking for this already ("Adding anchors and links in HTML output", pygments/pygments#1930); the maintainer agrees that this would be useful, and I think I have a design that could lead to something that gets merged.
- If my changes get merged, reimplement the current double search-and-replace filters as Pygments filters. Tokenizing would then be done by Pygments, and adding link information to tokens (including the database lookup) would be contained entirely in the filters (see the sketch below).
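To show roughly what I have in mind, here is a purely hypothetical sketch of such a filter. The extra per-token annotation (the third element of the yielded tuple) is exactly the feature that does not exist yet, so this does not run against current Pygments, and the database interface and URL scheme are made-up stand-ins for Elixir's:

```python
# HYPOTHETICAL sketch - the three-element yield is the proposed (not yet
# existing) mechanism for attaching formatter-dependent data to a token.
# 'db' and the URL layout are stand-ins, not Elixir's real interfaces.
from pygments.filter import Filter
from pygments.token import Name

class IdentLinkFilter(Filter):
    def __init__(self, db, version, **options):
        super().__init__(**options)
        self.db = db            # assumed interface: db.exists(ident) -> bool
        self.version = version

    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Name and self.db.exists(value):
                # The HTML formatter would turn this annotation into a link.
                yield ttype, value, {'link': f'/ident/{self.version}/{value}'}
            else:
                yield ttype, value
```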
This should lead to less and simpler code - for example, filters won't have to track what they replaced in the unformatted source. Filters will also receive more information, which in the future could allow them to, for example, track context and supply links with more information. And less regex (in Elixir, at least).
Possible problems
A problem I see so far is that Pygments treats each macro line as a single token. Macros are easily recognizable because they use a separate token type, so I think this could be handled by a filter that would lex all macros into smaller tokens and supply them with identifier information. I realize that this runs into some of the points I made above, but I still think this approach is better - most code would be handled by the cleaner codepath. Also, support for identifiers in macros is already very hacky at best - maybe this would even allow us to improve something, since the job of recognizing macros would now be done for us. A rough sketch of such a filter is below.
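This is a minimal sketch of that idea, using only the existing filter API (so still without links): it re-lexes the Comment.Preproc chunks that the C lexers emit for macro lines, so identifier-shaped pieces become Name tokens. A real version would also skip the directive itself and C keywords, and would do the definitions-database lookup:

```python
import re

from pygments import highlight
from pygments.filter import Filter
from pygments.formatters import HtmlFormatter
from pygments.lexers import CLexer
from pygments.token import Comment, Name

IDENT_RE = re.compile(r'[A-Za-z_]\w*')

class MacroSublexer(Filter):
    """Sketch: re-lex preprocessor chunks into finer-grained tokens."""
    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype not in Comment.Preproc:
                yield ttype, value
                continue
            # Split the preprocessor chunk into identifier and
            # non-identifier pieces, keeping everything else as-is.
            pos = 0
            for m in IDENT_RE.finditer(value):
                if m.start() > pos:
                    yield Comment.Preproc, value[pos:m.start()]
                yield Name, m.group(0)
                pos = m.end()
            if pos < len(value):
                yield Comment.Preproc, value[pos:]

lexer = CLexer()
lexer.add_filter(MacroSublexer())
print(highlight('#define DO_WORK(x) compute(x)\n', lexer, HtmlFormatter()))
```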