
Proposal - redesign how Elixir adds links to source code formatted by Pygments  #307

@fstachura


I'm wondering if it wouldn't be a good idea to redesign how Elixir turns identifiers into links in formatted source code. The current approach is kind of hacky, and some of the bugs found in it seem rather hard to fix.

How filters work right now

  1. tokenize-file is a (bash + Perl regex) tokenizer that splits a source file into tokens, one per line of output. Every other line is (supposed to be) a potential identifier; the rest are supposed to be comments, operators, and other uninteresting tokens.
  2. The output of the tokenizer is consumed by query.py. Every other line is looked up in the definitions database. If a match is found, unprintable markers are added at the beginning and end of the line.
  3. The lines are then concatenated back into a single string.
  4. Filters are used to find all interesting pieces of code, like identifiers and includes, and replace them with links. Each filter consists of two parts: the first runs on the unformatted code, the second on the HTML produced by Pygments. Here, the first part is run, and in most cases it replaces matching identifiers with special placeholder identifiers generated by the filter (see the sketch after this list).
  5. The string is formatted into HTML by Pygments.
  6. The second part of each filter is run on the formatted HTML, replacing the placeholders left in the source with HTML links.
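
For illustration, a minimal sketch of this double search-and-replace; the marker scheme and helper names here are made up, not the ones Elixir's filters actually use:

```python
import re
from pygments import highlight
from pygments.lexers import CLexer
from pygments.formatters import HtmlFormatter

def pre_filter(code, known_idents):
    # Pass 1, on the raw source: wrap known identifiers in placeholder
    # names that Pygments will carry through formatting unchanged.
    return re.sub(
        r'\b\w+\b',
        lambda m: f'__elixir_{m.group(0)}__' if m.group(0) in known_idents
                  else m.group(0),
        code)

def post_filter(html):
    # Pass 2, on the formatted HTML: turn the placeholders into links.
    return re.sub(r'__elixir_(\w+)__', r'<a href="/ident/\1">\1</a>', html)

source = 'int main(void) { printk("hello"); }'
print(post_filter(highlight(pre_filter(source, {'printk'}),
                            CLexer(), HtmlFormatter())))
```

Note how the two passes only communicate through the placeholder names embedded in the text - this is exactly the part that gets fragile.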


Problems with this approach

While some of this could of course be fixed while leaving the filter design mostly the same, I started wondering whether there is a better way to design this whole part.

Alternative design with Pygments

First, something that I want to be sure you noticed - the responsibility of the filters seems to be to work around missing functionality in Pygments. Pygments currently does not allow adding links to selected tokens (unless you generate a ctags file, but that's another hack; please, let's not do that).

My idea is simple:
Pygments is split into lexers and formatters - the lexer turns source code into a sequence of tokens, and the formatter then takes the tokens and turns them into formatted code. Pygments also supports filters - functions that take a sequence of tokens and transform it in some way.
This could be leveraged; unfortunately, Pygments currently does not allow filters to attach link information to tokens.
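
For reference, this is what the lexer/filter/formatter split looks like with today's Pygments API. The toy filter below can rewrite token types and values, but there is no channel for link information:

```python
from pygments import lex, format
from pygments.lexers import CLexer
from pygments.formatters import HtmlFormatter
from pygments.filter import Filter
from pygments.token import Name

class UppercaseNames(Filter):
    # Toy filter: rewrites the value of every Name token. A filter can
    # change (ttype, value) pairs, but cannot attach anything extra.
    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Name:
                value = value.upper()
            yield ttype, value

lexer = CLexer()
lexer.add_filter(UppercaseNames())
print(format(lex('int foo(void);', lexer), HtmlFormatter()))
```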

What I would like to do:

  1. Try to implement and upstream functionality in Pygments that would allow filters to attach certain, formatter-dependent, information to tokens. People have been asking for this already (Adding anchors and links in HTML output pygments/pygments#1930), the maintainer agrees that it would be useful, and I think I have a design that could lead to something that gets merged.
  2. If my changes get merged, reimplement the current double search-and-replace filters as Pygments filters. Tokenizing would then be done by Pygments, and adding link information to tokens (including the database lookup) would be contained entirely in the filters.

This should lead to less and simpler code - for example, filters won't have to track what they replaced in the unformatted source. Filters will also receive more information, which could in the future allow them to, for example, track context and supply links with more information. And, less regex (in Elixir at least).
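
To make the intent concrete, here is a rough sketch of what such a filter might look like. Everything about the third tuple element is an assumption about the not-yet-existing Pygments API (it would break today's formatters), and `query` is a hypothetical stand-in for Elixir's definitions-database lookup:

```python
from pygments.filter import Filter
from pygments.token import Name

class IdentLinkFilter(Filter):
    def __init__(self, query, **options):
        super().__init__(**options)
        self.query = query  # hypothetical: wraps the definitions database

    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Name and self.query.ident_exists(value):
                # Proposed extension: attach formatter-dependent metadata
                # that an extended HtmlFormatter would render as
                # <a href="..."> around the token.
                yield ttype, value, {'href': f'/ident/{value}'}
            else:
                yield ttype, value
```

The database lookup, the replacement, and the link rendering would all live in one place, instead of being smeared across a pre-pass, a post-pass, and the text in between.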

Possible problems

A problem I see so far is that Pygments treats each macro line as a single token. Macros are easily recognizable because they use a separate token type, so I think this could be handled by a filter that lexes macros into smaller tokens and supplies them with identifier information. I realize that this cuts into some of my earlier points, but I still think this approach is better - most code would be handled by the cleaner codepath. Also, support for identifiers in macros is already very hacky at best - maybe this would even allow us to improve something, since the job of recognizing macros would now be done for us.
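
A sketch of that idea, assuming the C lexer emits macro bodies as Comment.Preproc tokens; the splitting here is deliberately crude, and a real filter would have to handle strings, comments, and keywords inside the macro:

```python
import re
from pygments.filter import Filter
from pygments.token import Comment, Name

IDENT = re.compile(r'([A-Za-z_]\w*)')

class MacroRelexFilter(Filter):
    # Re-split the Comment.Preproc chunks emitted for macro lines into
    # identifier and non-identifier pieces, so that a later filter
    # (like the link filter above) gets to see the identifiers.
    def filter(self, lexer, stream):
        for ttype, value in stream:
            if ttype in Comment.Preproc:
                for piece in IDENT.split(value):
                    if not piece:
                        continue
                    # Keep the original token type for non-identifier
                    # pieces so the preprocessor highlighting survives.
                    yield (Name if IDENT.fullmatch(piece) else ttype), piece
            else:
                yield ttype, value
```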
