Pattern that doesn't consume a set of terminal characters #83

venkatd · 2020-10-12T15:32:14Z

We are currently using a parser for a chat message composer in our Flutter app. Among other tokens, we need to parse URLs to display previews and give feedback in the UI.

I've followed the spec as closely as possible, but we don't want to consume the chars .,? at the end of a URL because our users often terminate links with punctuation.

Here is what our test and its failing output look like:

import "package:test/test.dart";
import 'url.dart';
import 'package:petitparser/petitparser.dart';

void main() {
  group('url', () {
    test('does not consume ending punctuation', () {
      expect(url.matchesSkipping('hi please use https://acme.com next time.'),
          ['https://acme.com']);

      expect(url.matchesSkipping('hi please use https://acme.com.'),
          ['https://acme.com']);
    });
  });
}

Test results:

00:01 +1 -1: url does not consume ending punctuation [E]
  Expected: ['https://acme.com']
    Actual: ['https://acme.com.']
     Which: was 'https://acme.com.' instead of 'https://acme.com' at location [0]

And our corresponding code:

import 'package:petitparser/petitparser.dart';

Parser<String> _url() {
  final scheme = string('https') | string('http');
  final query =
      (char('?') & pattern('0-9A-Za-z+&@/%=~_|!:,;-').star()).flatten();

  final safe = anyOf('\$-_@.&+-'); // includes '.', so a segment can swallow a trailing period
  final extra = anyOf('!*"\'(),');
  final port = (char(':') & digit().plus()).flatten();

  final hex = pattern('0-9A-Fa-f');
  final escape = (char('%') & hex & hex).flatten();

  final segmentChar = (word() | safe | extra | escape);
  final segment = segmentChar.plus().flatten();

  // in reality, these grammars differ slightly but this won't matter for our case
  final hostSegment = segment;
  final pathSegment = segment;

  final host = hostSegment.separatedBy(char('.'));
  final hostPort = host & port.optional();
  final path = (char('/') & pathSegment.separatedBy(char('/'))).flatten();

  // https://stackoverflow.com/a/26119120/67655
  // chars allowed in a fragment
  final fragmentChars = pattern('0-9a-zA-Z?/:@._~!\$&\'()*+,;=-');
  final fragment = (char('#') & fragmentChars.star()).flatten();

  return (scheme &
          string('://') &
          hostPort &
          path.optional() &
          query.optional() &
          fragment.optional())
      .flatten();
}

final url = _url();

Is there a clean way to prevent consuming the ending chars of anyOf(',.?') without special casing the entire grammar?

renggli (Maintainer) · 2020-10-12T18:59:31Z

Probably the easiest way is to crop it off at the end, something along the lines of:

return (...)
  .flatten()
  .map((value) => /* cut off the last character, if it ends with punctuation */);
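
As a rough illustration of the cropping step, here is a minimal sketch in plain Dart. The helper name stripTrailingPunctuation and the choice of . , ? as the punctuation set are assumptions made for this example; neither is part of petitparser.

```dart
/// Hypothetical helper: drops a single trailing '.', ',' or '?' from [value].
String stripTrailingPunctuation(String value) {
  const punctuation = '.,?';
  if (value.isNotEmpty && punctuation.contains(value[value.length - 1])) {
    return value.substring(0, value.length - 1);
  }
  return value;
}

void main() {
  print(stripTrailingPunctuation('https://acme.com.')); // https://acme.com
  print(stripTrailingPunctuation('https://acme.com')); // https://acme.com
}
```

Something like `.flatten().map(stripTrailingPunctuation)` would plug it into the chain above. This only changes the returned value; the parser has still consumed the punctuation character.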

venkatd (Author) · 2020-10-13T16:09:43Z

Thanks! The problem is that I would like these characters to still be consumed by the other parts of the grammar, because the URL parser will be composed with other tokens. For example:

    test('matches skipping', () {
      final input =
          'hi <file:abc> and <file:123456><datetime:2020-10-08T21:57:50.118523Z> @Venkat @Vlad Lokshin https://starkindustries.com/projects/iron-man check it';

      final mtb = MessageBodyParser(
        users: [
          User(id: 'ven', name: 'Venkat'),
          User(id: 'vlad', name: 'Vlad Lokshin'),
        ],
      );

      final tokens = mtb.parse(input);

      final serializedTokens = tokens.map(mtb.tokenToMarkup).toList();
      expect(serializedTokens, [
        'hi ',
        '<file:abc>',
        ' and ',
        '<file:123456>',
        '<datetime:2020-10-08T21:57:50.118523Z>',
        ' ',
        '<user:ven>',
        ' ',
        '<user:vlad>',
        ' ',
        'https://starkindustries.com/projects/iron-man',
        ' check it',
      ]);
    });

If I were to add a period after iron-man, the test would fail because the . would be dropped. I would want it to be part of '. check it'.

The tokens are defined as:

  final specialToken = users.isEmpty
      ? entityToken | dateTimeToken | linkToken
      : buildMentionedUserToken(users) |
          entityToken |
          dateTimeToken |
          linkToken;

  final stringToken = any()
      .plusLazy(specialToken | endOfInput())
      .flatten()
      .map((v) => StringObject(string: v));

  return (specialToken | stringToken).cast();

renggli (Maintainer) · 2020-10-13T19:11:52Z

I see.

The clean way would be to rewrite the URL grammar so that it does not consume the separator characters at the end. There are various ways of doing that (separatedBy, and, not), but none of them is particularly simple. Also you wrote you don't want to change the existing grammar, so I assume this is not a solution.
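
To give an idea of what such a rewrite might look like, here is a minimal sketch using the and() lookahead predicate. It uses a deliberately simplified stand-in for the URL grammar, not the full grammar from the question. The names buildUrl, plain, punctuation and inner are made up for this example.

```dart
import 'package:petitparser/petitparser.dart';

// Simplified stand-in for the URL grammar (an assumption for illustration).
Parser<String> buildUrl() {
  final scheme = string('https') | string('http');
  // Characters that may appear anywhere, including at the very end.
  final plain = word() | anyOf(':/-');
  // Punctuation that only counts as part of the URL when more URL follows.
  final punctuation = anyOf('.,?');
  // A run of punctuation is accepted only if a plain character comes next;
  // and() checks this without consuming the lookahead character.
  final inner = punctuation.plus() & plain.and();
  return (scheme & string('://') & (plain | inner).plus()).flatten();
}

void main() {
  final url = buildUrl();
  print(url.parse('https://acme.com.').value); // https://acme.com
  print(url.parse('https://acme.com/a.b?c, ok').value); // https://acme.com/a.b?c
}
```

Because the trailing punctuation is never consumed, it stays in the input for whichever parser runs next.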

Another approach is to use what I originally proposed, but do the transformation in the continuation parser (callCC) and resume the parsing one character earlier. Something along the lines of:

return (...)
  .token() // get token with start and stop position in input
  .callCC((continuation, context) {
    final result = continuation(context);
    if (result.isFailure) return result;
    final token = result.value;
    if (RegExp(r'[.,?]$').hasMatch(token.input)) { // consumed text ends with one of . , ?
      final newToken = Token(token.value, token.buffer, token.start, token.stop - 1); // shorten token by 1 char
      return result.success(newToken, result.position - 1); // continue 1 char earlier
    } else {
      return result;  // just continue
    }
  })
  .map((token) => token.input);  // replace token with consumed input

venkatd (Author) · 2020-10-14T02:25:32Z

Works perfectly, thanks!

I didn't want to modify the grammar because I copied the official grammar as closely as possible and wanted to keep it intact. (It looks like URLs are technically allowed to end in periods based on the official grammar, but 99% of the time, when a URL ends with a period or comma, the intent is punctuation.)
