Unable to generate single (raw) backslash #133

Open
sorbits opened this issue Jul 17, 2018 · 3 comments

@sorbits

sorbits commented Jul 17, 2018

I am unable to output a single backslash in raw mode.

On the left is input and right shows output:

`\`    =>   <p>``</p>
`\\`   =>   <p><code>\\</code></p>

Using v6.3.2 installed via homebrew.

@fletcher
Owner

Allan -- I hope you are doing well!!

This is basically an effect of precedence order between two quasi-conflicting rules:

  1. A backslash "escapes" the following punctuation character, removing its usual Markdown meaning

  2. Backslashes don't escape inside code spans

So:

  • If (1), then `\` isn't a code span, because the second backtick is escaped.
  • If (2), then the second backtick is not escaped, since the backslash is in a code span
  • But we can't know that (2) applies until we determine if we're in a code span, but if the backtick is escaped, it's not a code span....

First, I agree that:

  1. It intuitively appears that `\` should be a code span containing a single backslash character
  2. Other variants interpret it that way.

But I'm not absolutely 100% entirely certain that is correct.... (Don't get me wrong -- I have no strong objection to the common interpretation being considered right, but I think it does require some hand-picking of how to follow the rules in order to get there.)

First, there is a workaround:

` \ ` => <p><code>\ </code></p>
`\ ` => <p><code>\ </code></p>

(The first is more "symmetric", but since it's only the trailing space that matters the second one is also correct.)

As for a solution....

Option 1

A simple possible solution is to prevent \` from triggering an escaped character, since backticks aren't "meaningful" in HTML like & or <. Whether it is escaped or not doesn't really matter.

However, unlike other Markdown variants, MultiMarkdown creates more than just HTML and a backtick may be important in current/future formats. So just disabling it is not without consequence.

Also, it seems strange to single out ` as the one punctuation character where the escape rule doesn't apply.

So I don't think simply disabling escaped-backticks is a good solution.

Option 2

Another option is to consider `\` as a single special token so that the normal escape rule doesn't apply? I worry about this causing problems with other edge cases, but those edge cases would be quite rare and could probably be worked around?? re2c allows lookaheads, so I could easily limit this to situations where there was whitespace following the new special token. re2c doesn't have lookbehinds, however, so I can't require a space in front of it without considering that space to be part of the token. Which provides more edge cases to consider.

All of which means I think this option is only realistic if used in the "dumb" sense where `\` is a special token regardless of what is around it.

Conclusion

  1. This is a rare circumstance (IIRC I've only discussed it once before)
  2. There is a workaround that requires no code change, no negative effects on other users, and only a single space character to implement
  3. A code fix will likely create some additional edge case issues, though these may be even more vanishingly rare than the current issue. I'll have to think about them some more and welcome input from others.

@sorbits
Author

sorbits commented Jul 25, 2018

Thanks for your detailed reply, and your well wishes :)

As for your conclusion: adding a space can have a visible effect when the inline raw is followed by more content, because the space is then rendered in the code style. For example, this snippet:

<style>code { background: yellow; }</style>
<p>Foo <code>\ </code> Bar</p>

Will render as:

[image: screenshot of the rendered paragraph]

As for this being rare, I believe that a lot of technical documentation will need to render literal escape characters, in fact, your own README.md uses both a single escape character in inline raw, and an escape character followed by a space (and expects them to render differently).

I think having raw inline strings that escape everything is crucial! Knowing that whatever content I have, I can wrap it in backticks and not have to worry about triggering some special mode/character, is not only valuable, but it also makes for nicer to read content.

A rule that lets a backslash escape the backtick right after an opening backtick, and thus cancel the inline raw, seems like a can of worms to me. Take a string like this: It can show as `\` or `<ESC>`. If escapes were allowed after the opening backtick, the inline raw would include the “or” part of the sentence.

As for a practical solution, can you make re2c recognize the full inline raw string as a single token? I see that you allow them to span multiple lines, so this could lead to excess backtracking in the lexer for non-paired backticks. I think this would be negligible in practice, though: only non-paired backticks would trigger the backtracking, and it would extend at most to the next blank line (AFAIK).

@fletcher
Owner

fletcher commented Jul 30, 2018

Forgive the length -- this is partially a response and partially a way for me to track my thoughts on this issue for the future.

MMD (version 6) is written such that it doesn't perform backtracking at all (unless there is a small edge case I am forgetting.) Paired tokens (e.g. [...], "..", etc.) are identified individually, and then paired in a single left-to-right pass. MMD (versions 3-5) were written with a PEG that did allow backtracking and therefore used a different algorithm for paired tokens that was more flexible, but often slower (and potentially much slower if attention was not paid to edge cases.) Eliminating backtracking is one of the reasons for the huge performance gains in v6 compared to earlier versions.

re2c, I believe, prevents backtracking and does not offer a "lookback" functionality, so it is limited in the complexity of regular expression it can handle.

Additional Background

It's probably useful to explain what happens when processing with MultiMarkdown v6.

  1. The text is broken into tokens (e.g. [, ], some text, etc.)

  2. The tokens are broken into lines, which are grouped into blocks (e.g. paragraphs, block quotes, list items, etc.)

  3. Within each block (where appropriate), tokens are paired to form certain spans (e.g. [ is matched to ])

  4. The block/span token tree is then processed to generate the output (e.g. HTML, LaTeX, etc.)

Escaped characters are treated as a single token, since they don't participate in matching (e.g. \[ won't be used to define a link such as [foo]).

Because \` is treated as a single token in step 1, the backtick character is not available for matching to form code spans in step 3.

Another way of thinking about this is that escaped characters are treated as having higher precedence than pairing backticks to form code spans. (I'm not arguing that this is intrinsically "right", just stating what currently happens.)
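The effect of steps 1 and 3 can be illustrated with a minimal Python sketch. This is not MultiMarkdown's actual C/re2c code: `tokenize` and `code_spans` are hypothetical names, the set of escapable characters (ASCII punctuation plus space) is an assumption, and the pairing loop is simplified.

```python
import string

def tokenize(text):
    """Step 1 (simplified): split text into (kind, value) tokens.

    Assumption: a backslash followed by ASCII punctuation or a space
    becomes ONE "escape" token, so the escaped character is no longer
    visible to later steps as a character of its own.
    """
    tokens, i = [], 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "\\" and nxt and (nxt in string.punctuation or nxt == " "):
            tokens.append(("escape", ch + nxt))
            i += 2
        elif ch == "`":
            j = i
            while j < len(text) and text[j] == "`":
                j += 1
            tokens.append(("ticks", text[i:j]))  # a run of backticks
            i = j
        else:
            tokens.append(("text", ch))
            i += 1
    return tokens

def code_spans(text):
    """Step 3 (simplified): pair backtick runs of equal length, left to right."""
    tokens = tokenize(text)
    spans, i = [], 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "ticks":
            for j in range(i + 1, len(tokens)):
                if tokens[j] == ("ticks", value):
                    # Inside a code span, token text is emitted verbatim.
                    spans.append("".join(v for _, v in tokens[i + 1:j]))
                    i = j
                    break
        i += 1
    return spans

print(tokenize("`\\`"))     # input is `\`  : a backtick run, then one escape token
print(code_spans("`\\`"))   # no closing backtick run is left, so no code span
print(code_spans("`\\ `"))  # input is `\ ` : the escape consumes the space instead
```

In this model the input `\` produces only two tokens, an opening backtick run and an escape token, so there is nothing left to close the span. With `\ ` the backslash consumes the space, and the final backtick stays available as a closer.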

To work around this, there are a few options:

  • Don't treat escaped characters as a single token. This means that we will have to add further logic to steps 2 and 3 to look for tokens that should not be allowed to participate.

  • Treat \` as a special token, and add special rules to handle it. This leads to some difficulty with edge cases, though the importance of these may be debatable for the vast majority of users.

  • Use another workaround, e.g. `\ `

Raw String as Single Token

I don't think that treating the entire raw string as a single token is the right way to go, as MMD would be unable to accurately reflect possible edge cases, nor to identify tokens within that string that must be treated differently for the desired output format (e.g. using & in the raw string would still need to be converted to &amp;).

More importantly, I'm not sure there is a way to configure this in re2c so that it properly matches the following examples:

`foo `` bar` (ignore `` in middle)
``foo ` bar`` (ignore ` in middle)

(If you can prove me wrong, please let me know!)
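For comparison, a backtracking regex engine can match both examples by using a backreference to require a closing run of the same length as the opening run. The sketch below uses Python's `re` module; `CODE_SPAN` is a hypothetical name. It says nothing about what re2c can do: backreferences are not part of classical regular expressions, and as far as I know re2c does not support them, which is consistent with the doubt expressed above. The pattern also ignores escapes entirely.

```python
import re

# (`+)   : opening run of N backticks, not adjacent to another backtick
# (.+?)  : shortest possible content
# \1     : closing run of exactly the same N backticks (a backreference)
CODE_SPAN = re.compile(r"(?<!`)(`+)(?!`)(.+?)(?<!`)\1(?!`)", re.DOTALL)

def first_code_span(text):
    """Return the content of the first code span, or None if there is none."""
    match = CODE_SPAN.search(text)
    return match.group(2) if match else None

print(first_code_span("`foo `` bar`"))    # the `` in the middle is ignored
print(first_code_span("``foo ` bar``"))   # the ` in the middle is ignored
```

Note that this relies on exactly the kind of backtracking that MMD v6 is designed to avoid.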

Option 2 (from above)

Even treating `\` as a single token can lead to tricky edge cases. For instance, this (admittedly contrived) example doesn't work:

Using `foo`\`bar` to

Normally, because the third backtick is escaped, this becomes (in almost all variants):

<p>Using <code>foo</code>`bar` to</p>

If instead the middle was treated as a single token, it would become:

<p>Using <code>foo`\`bar</code> to</p>

This is because the first backtick would start a code span, but a matched "closer" would not be found until the 4th backtick, since the middle 2 are wrapped up as a single token.
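The two outcomes can be reproduced with a small Python sketch. It is only a model of the behavior described here, not MultiMarkdown's implementation: `tokenize`, `render`, and the `special_token` flag are hypothetical, the escape rule is simplified to "backslash plus any non-word, non-space character", and no HTML escaping is done.

```python
import re

def tokenize(text, special_token):
    # Alternatives are tried in order: escaped punctuation, backtick run,
    # plain text, lone backslash.
    parts = [r"\\[^\w\s]", r"`+", r"[^`\\]+", r"\\"]
    if special_token:
        # Option 2: treat the three characters `\` as one token,
        # tried before anything else.
        parts.insert(0, r"`\\`")
    return re.findall("|".join(parts), text)

def render(text, special_token=False):
    toks = tokenize(text, special_token)
    out, i = [], 0
    while i < len(toks):
        tok = toks[i]
        is_ticks = set(tok) == {"`"}  # token consists only of backticks
        if is_ticks and tok in toks[i + 1:]:
            closer = toks.index(tok, i + 1)  # next run of the same length
            out.append("<code>" + "".join(toks[i + 1:closer]) + "</code>")
            i = closer + 1
        elif len(tok) == 2 and tok[0] == "\\":
            out.append(tok[1])  # escaped character, backslash dropped
            i += 1
        else:
            out.append(tok)
            i += 1
    return "<p>" + "".join(out) + "</p>"

sample = "Using `foo`\\`bar` to"            # i.e. Using `foo`\`bar` to
print(render(sample))                        # escape rule wins
print(render(sample, special_token=True))    # `\` swallowed as one token
```

With `special_token=True`, the second and third backticks disappear into the special token, so the first backtick can only pair with the fourth.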

To work around this, `\` could be added as a "closer" for `. If they are paired together, then the `\` token would need to be split into two tokens -- the closing backtick, and an escaped backtick (which would not start a new code span). This would result in the same behavior as other variants.

However, this would not work "properly" here:

Using ``foo``\``bar``

In this case, the opening double-backtick would not match with a single backtick followed by the new token.

Option 1 (from above)

I experimented with removing \` from the set of escaped characters. This allowed for more flexibility in the "closers", but led to a different problem...

\`foo`
`foo\`

This works ok if we prevent a backtick from opening if preceded by a backslash.

However, this does not work:

\``foo`

Because now the entire group of backticks is blocked from opening.

This would lead to the need to further analyze backtick tokens and modify them based on surrounding text, some of which won't be understood until after later processing steps.

Option 3

A tweak to the previous "workaround" would be to check for an escaped space at the end of the raw string, and then ignore it in that case:

`\ ` -> <code>\</code>

This actually brings MultiMarkdown closer in line with regular Markdown, as the trailing space would be ignored normally, since Markdown doesn't have a concept of escaped spaces (MultiMarkdown uses them to trigger a non-breaking space when desired.)

I have pushed this change into the develop branch and it will be in the next release, provided there are no major issues discovered.
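As a rough illustration of Option 3, the following Python sketch drops a trailing escaped space when rendering the content of a code span. `render_code_span` is a hypothetical helper, not the function used in the develop branch. Counting the backslashes to decide whether the space is really escaped is my assumption about how `\\ ` should behave.

```python
import html

def render_code_span(raw: str) -> str:
    """Render the text found between the backticks as an HTML code span."""
    if raw.endswith(" "):
        body = raw[:-1]
        # Number of consecutive backslashes directly before the final space.
        backslashes = len(body) - len(body.rstrip("\\"))
        # An odd count means the last backslash escapes the space, so the
        # space was only there to protect the closing backtick: drop it.
        if backslashes % 2 == 1:
            raw = body
    # Characters such as & and < still need converting for HTML output.
    return "<code>" + html.escape(raw, quote=False) + "</code>"

print(render_code_span("\\ "))     # content of `\ `
print(render_code_span("\\\\ "))   # content of `\\ ` : space is not escaped
print(render_code_span("a & b"))
```

The `html.escape` call reflects the earlier point that characters such as & inside a raw string still have to be converted for the output format.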

Other Comments

> As for this being rare, I believe that a lot of technical documentation will need to render literal escape characters, in fact, your own README.md uses both a single escape character in inline raw, and an escape character followed by a space (and expects them to render differently).

Saying that it's not rare when it's used isn't really a logically helpful statement.... ;)

Amongst all MultiMarkdown users, having a backslash as the final character in a code span is sufficiently rare that I believe you are only the second person to contact me about it. I suspect that it is not rare for you, but it's still rare for most others.

> I think having raw inline strings that escape everything is crucial! Knowing that whatever content I have, I can wrap it in backticks and not have to worry about triggering some special mode/character, is not only valuable, but it also makes for nicer to read content.

I agree, however as I mentioned before here we have a situation where two rules contradict each other. The escape syntax \[something] allows you to prevent something from being interpreted as a special character. In this case, however, you want the first backtick to trigger a code span, thereby disabling the escape syntax.

I am not inherently opposed to `\` being treated as <code>\</code>, but implementing this is not as straightforward as it sounds (see above), at least not without adding one or more additional processing phases with the accompanying performance hit.

Conclusion

  1. I updated the `\ ` workaround as above to remove the trailing space. I think this means that MultiMarkdown-6 can be used for any of the code span edge cases discussed here with a simple change to the source text.

  2. This change should also degrade gracefully in regular Markdown, so that the same source doesn't break when used in other Markdown programs.

  3. I am open to additional changes, but will need to find a way that doesn't significantly affect performance, and preferably one that doesn't require rewriting large chunks of MultiMarkdown.
