-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved tokenizing of context sensitive keywords #3484
Conversation
2a9833f
to
1f2580d
Compare
1f2580d
to
a51355f
Compare
@gsherwood Please, this PR should be merged before #3478 and #3483. It should really help in the future. I've already simplified the support for |
Left a couple of comments for minor stuff, but looks good to merge otherwise. |
a51355f
to
5383484
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one!, This would be a great step forward, but there's still quite a lot more to do in this regards (though not necessarily all in this PR). Also see: #3336
Note: I locally made a start on a PR to handle this when I opened that issue, but got side-tracked before I could finish it as it is hard to get that one right, especially when trying to account for all situations, including multi-use, group use etc.
Happy to share the code I have though.
Regarding this PR:
I wonder if there aren't more chunks of code in the tokenizer after this new block, which could now be removed, or simplified ?
In particular, I'm looking at the "string-like token after a function keyword" block round line 1740 and the "special case for PHP 5.6 use function and use const" block round line 2168.
More extensive unit tests would also help catch edge cases.
$preserveKeyword = true; | ||
} | ||
|
||
// `namespace\` should be preserved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about namespace ... ;
, i.e declaration ? Every part of the name should be tokenized as T_STRING
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not what I meant (please unresolve this).
Think:
namespace foreach;
namespace my\class\extensions;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added test case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted. That test case still only safeguards the first of the above examples though, the current tokenizer would still fail on the second.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added one more test :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Would be wonderful though if the other parts of the namespace name would also be checked. The test currently only checks the first part, i.e. my
.
What I'm trying to point out, is that on PHP < 8.0, this snippet:
<?php
namespace my\class\extensions;
namespace my\foreach\yield;
... would be tokenized as - take note of class
, yield
and foreach
:
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [5]: <?php
1 | L2 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [0]:
2 | L3 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
4 | L3 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
6 | L3 | C 14 | CC 0 | ( 0) | T_CLASS | [5]: class
7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
8 | L3 | C 20 | CC 0 | ( 0) | T_STRING | [10]: extensions
9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE | [0]:
11 | L4 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
13 | L4 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
15 | L4 | C 14 | CC 0 | ( 0) | T_FOREACH | [7]: foreach
16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
17 | L4 | C 22 | CC 0 | ( 0) | T_YIELD | [5]: yield
18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE | [0]:
... while with your change, it will be tokenized (better), like so:
Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [5]: <?php
1 | L2 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [0]:
2 | L3 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
4 | L3 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
6 | L3 | C 14 | CC 0 | ( 0) | T_STRING | [5]: class
7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
8 | L3 | C 20 | CC 0 | ( 0) | T_STRING | [10]: extensions
9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE | [0]:
11 | L4 | C 1 | CC 0 | ( 0) | T_NAMESPACE | [9]: namespace
12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE | [1]:
13 | L4 | C 11 | CC 0 | ( 0) | T_STRING | [2]: my
14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
15 | L4 | C 14 | CC 0 | ( 0) | T_STRING | [7]: foreach
16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR | [1]: \
17 | L4 | C 22 | CC 0 | ( 0) | T_STRING | [5]: yield
18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON | [1]: ;
19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE | [0]:
There is no difference on PHP 8.0+ as the "retokenization of namespaced names" already takes care of it in that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, now I finally get it :) Added.
@@ -1113,6 +1165,7 @@ protected function tokenize($string) | |||
&& $tokenIsArray === true | |||
&& $token[0] === T_STRING | |||
&& strtolower($token[1]) === 'yield' | |||
&& isset($this->tstringContexts[$finalTokens[$lastNotEmptyToken]['code']]) === false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should a similar condition be added for the match
retokenization round line 1439 ? And in that case, could the "final check" in PHP::processAdditional()
be possibly removed/simplified ?
Similar question for the fn
condition round line 1721.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T_MATCH
can probably simplified but I'm not sure about this test case:
/* testLiveCoding */
// Intentional parse error. This has to be the last test in the file.
echo match
The token is currently tokenized as T_STRING
but I'm not sure if it's right.
src/Util/Tokens.php
Outdated
T_MATCH => T_MATCH, | ||
T_NAMESPACE => T_NAMESPACE, | ||
T_NEW => T_NEW, | ||
T_PARENT => T_PARENT, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
T_PARENT
is PHPCS native token and doesn't exist in PHP itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test if failing when I remove the T_PARENT
token.
1) PHP_CodeSniffer\Tests\Core\Tokenizer\ContextSensitiveKeywordsTest::testKeywords with data set #19 ('/* testParentIsKeyword */', 'T_PARENT')
Failed to find test target token for comment string: /* testParentIsKeyword */
Failed asserting that true is false.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kukulich That's because it can't find token after the comment (as the custom token is no longer in the list), not because the tokenization is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jrfnl Yes, you're right. However I'm not sure if the tokens should be removed here or not...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose that depends on where else this token list will be used, but for the current use case, those tokens aren't needed in the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed for now.
2be5165
to
6ce46dc
Compare
"string-like token after a function keyword" is gone. "special case for PHP 5.6 use function and use const" cannot be removed (without other work) but "This is a special case for the PHP 5.5 classname::class syntax" is gone. |
Nice find! Yeah, of course, with the |
e4b1a34
to
8a71b96
Compare
8a71b96
to
c73f456
Compare
Just an update that I'm trying to merge this at the moment. I'm needing to resolve a lot of conflicts in the 4.0 branch first, and adjust the namespace changes over there as well. |
Finally got this merged and tested after Github had issues today, but all working. Appreciate all the work getting this done. Keen to have anyone else run their eyes over things as well to make sure I haven't missed something obvious. |
If you like, I'll try to have a look at the 4.x branch later in the week ? (on that note: there are still a few PRs open for that branch) |
Other tokens should be added after these PR are merged:
enum
keyword #3478readonly
keyword #3480T_ENUM_CASE
#3483