-
We are currently using a parser for a chat message composer in our Flutter app. Among other tokens, we need to parse URLs to display previews and give feedback in the UI. I've followed the spec as closely as possible, but we don't want the URL parser to consume trailing punctuation. Here is what our failing test looks like:

import 'package:test/test.dart';
import 'url.dart';
import 'package:petitparser/petitparser.dart';

void main() {
  group('url', () {
    test('does not consume ending punctuation', () {
      expect(url.matchesSkipping('hi please use https://acme.com next time.'),
          ['https://acme.com']);
      expect(url.matchesSkipping('hi please use https://acme.com.'),
          ['https://acme.com']);
    });
  });
}
And our corresponding code:

import 'package:petitparser/petitparser.dart';

Parser<String> _url() {
  final scheme = string('https') | string('http');
  final query =
      (char('?') & pattern('0-9A-Za-z+&@/%=~_|!:,;-').star()).flatten();
  // note: '.' is part of this set, so a segment can swallow a trailing period
  final safe = anyOf('\$-_@.&+-');
  final extra = anyOf('!*"\'(),');
  final port = (char(':') & digit().plus()).flatten();
  final hex = pattern('0-9A-Fa-f');
  final escape = (char('%') & hex & hex).flatten();
  final segmentChar = (word() | safe | extra | escape);
  final segment = segmentChar.plus().flatten();
  // in reality, these grammars differ slightly but this won't matter for our case
  final hostSegment = segment;
  final pathSegment = segment;
  final host = hostSegment.separatedBy(char('.'));
  final hostPort = host & port.optional();
  final path = (char('/') & pathSegment.separatedBy(char('/'))).flatten();
  // https://stackoverflow.com/a/26119120/67655
  // chars allowed in a fragment
  final fragmentChars = pattern('0-9a-zA-Z?/:@._~!\$&\'()*+,;=-');
  final fragment = (char('#') & fragmentChars.star()).flatten();
  return (scheme &
          string('://') &
          hostPort &
          path.optional() &
          query.optional() &
          fragment.optional())
      .flatten();
}

final url = _url();

Is there a clean way to prevent consuming the ending chars?
Replies: 4 comments
-
Probably the easiest is to crop it off at the end, something along these lines:

return (...)
    .flatten()
    .map((value) => /* cut off the last character, if it ends with punctuation */);
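
The callback passed to `.map` above could be a small helper like the following. This is only a sketch: the name `trimTrailingPunctuation` and the set of characters treated as punctuation are assumptions, not something taken from the grammar in the question.

```dart
/// Hypothetical helper: drops a single trailing punctuation character,
/// if there is one. The character set is an assumption; adjust as needed.
String trimTrailingPunctuation(String value) {
  const punctuation = '.,;:!?';
  if (value.isNotEmpty && punctuation.contains(value[value.length - 1])) {
    return value.substring(0, value.length - 1);
  }
  return value;
}

void main() {
  print(trimTrailingPunctuation('https://acme.com.')); // https://acme.com
  print(trimTrailingPunctuation('https://acme.com')); // https://acme.com
}
```

It would be plugged in as `.flatten().map(trimTrailingPunctuation)`. Note that this only changes the returned value; the parser has still consumed the trimmed character.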
-
Thanks! The problem is that I would like these characters to still be consumed by the other parts of the grammar, because the URL parser is composed with other tokens. For example:

test('matches skipping', () {
  final input =
      'hi <file:abc> and <file:123456><datetime:2020-10-08T21:57:50.118523Z> @Venkat @Vlad Lokshin https://starkindustries.com/projects/iron-man check it';
  final mtb = MessageBodyParser(
    users: [
      User(id: 'ven', name: 'Venkat'),
      User(id: 'vlad', name: 'Vlad Lokshin'),
    ],
  );
  final tokens = mtb.parse(input);
  final serializedTokens = tokens.map(mtb.tokenToMarkup).toList();
  expect(serializedTokens, [
    'hi ',
    '<file:abc>',
    ' and ',
    '<file:123456>',
    '<datetime:2020-10-08T21:57:50.118523Z>',
    ' ',
    '<user:ven>',
    ' ',
    '<user:vlad>',
    ' ',
    'https://starkindustries.com/projects/iron-man',
    ' check it',
  ]);
});

If I were to add a period after the URL, I would want it to end up in the following string token rather than be dropped.

The tokens are defined as:

final specialToken = users.isEmpty
    ? entityToken | dateTimeToken | linkToken
    : buildMentionedUserToken(users) |
        entityToken |
        dateTimeToken |
        linkToken;
final stringToken = any()
    .plusLazy(specialToken | endOfInput())
    .flatten()
    .map((v) => StringObject(string: v));
return (specialToken | stringToken).cast();
-
I see.

The clean way would be to rewrite the URL grammar so that it does not consume the separator characters at the end. There are various ways of doing that (`separatedBy`, `and`, `not`), but none of them is particularly simple. Also, you wrote that you don't want to change the existing grammar, so I assume this is not a solution.

Another approach is to use what I originally proposed, but do the transformation in the continuation parser (`callCC`) and resume the parsing one character earlier. Something along the lines of:

return (...)
    .token() // get token with start and stop position in input
    .callCC((continuation, context) {
      final result = continuation(context);
      if (result.isFailure) return result;
      final token = result.value;
      // endsWithPunctuation is a placeholder check, not a built-in String getter
      if (token.input.endsWithPunctuation) {
        // shorten token by 1 char
        final newToken =
            Token(token.value, token.buffer, token.start, token.stop - 1);
        // continue 1 char earlier
        return result.success(newToken, result.position - 1);
      } else {
        return result; // just continue
      }
    })
    .map((token) => token.input); // replace token with consumed input
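
The `endsWithPunctuation` check in the snippet above is not something Dart's `String` provides, so it has to be defined somewhere. One possible way, sketched here under the assumption that only `. , ; : ! ?` count as punctuation, is a small extension; the extension name `TrailingPunctuation` is made up for this illustration.

```dart
/// Hypothetical extension providing the `endsWithPunctuation` getter.
/// Which characters count as punctuation is an assumption.
extension TrailingPunctuation on String {
  static final RegExp _trailing = RegExp(r'[.,;:!?]$');

  /// True if the last character of this string is a punctuation character.
  bool get endsWithPunctuation => _trailing.hasMatch(this);
}

void main() {
  print('https://acme.com.'.endsWithPunctuation); // true
  print('https://acme.com'.endsWithPunctuation); // false
}
```

Keeping the check in one place makes it easy to widen or narrow the punctuation set later without touching the parser code.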
-
Works perfectly, thanks! I didn't want to modify the grammar because I copied the official grammar as closely as possible and wanted to keep it intact. (It looks like URLs are technically allowed to end in periods based on the official grammar, but 99% of the time, when a URL ends with a period or comma, the intent is punctuation.)