Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved tokenizing of context sensitive keywords #3484

Merged
merged 2 commits into from
Jan 12, 2022

Conversation

kukulich
Copy link
Contributor

@kukulich kukulich commented Nov 21, 2021

@kukulich kukulich force-pushed the context-sensitive branch 3 times, most recently from 2a9833f to 1f2580d Compare November 21, 2021 12:29
@kukulich kukulich marked this pull request as ready for review November 21, 2021 12:31
@kukulich
Copy link
Contributor Author

@gsherwood Please, this PR should be merged before #3478 and #3483.

It should really help in the future. I've already simplified the support for T_READONLY. I also cannot write some tests for T_ENUM_CASE without this PR.

@gsherwood gsherwood added this to the 3.7.0 milestone Dec 17, 2021
src/Tokenizers/PHP.php Outdated Show resolved Hide resolved
src/Tokenizers/PHP.php Outdated Show resolved Hide resolved
@gsherwood
Copy link
Member

Left a couple of comments for minor stuff, but looks good to merge otherwise.

Copy link
Contributor

@jrfnl jrfnl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one!, This would be a great step forward, but there's still quite a lot more to do in this regards (though not necessarily all in this PR). Also see: #3336

Note: I locally made a start on a PR to handle this when I opened that issue, but got side-tracked before I could finish it as it is hard to get that one right, especially when trying to account for all situations, including multi-use, group use etc.
Happy to share the code I have though.


Regarding this PR:

I wonder if there aren't more chunks of code in the tokenizer after this new block, which could now be removed, or simplified ?

In particular, I'm looking at the "string-like token after a function keyword" block round line 1740 and the "special case for PHP 5.6 use function and use const" block round line 2168.

More extensive unit tests would also help catch edge cases.

$preserveKeyword = true;
}

// `namespace\` should be preserved
Copy link
Contributor

@jrfnl jrfnl Nov 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about namespace ... ;, i.e declaration ? Every part of the name should be tokenized as T_STRING.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's not what I meant (please unresolve this).

Think:

namespace foreach;
namespace my\class\extensions;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added test case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. That test case still only safeguards the first of the above examples though, the current tokenizer would still fail on the second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added one more test :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Would be wonderful though if the other parts of the namespace name would also be checked. The test currently only checks the first part, i.e. my.

What I'm trying to point out, is that on PHP < 8.0, this snippet:

<?php

namespace my\class\extensions;
namespace my\foreach\yield;

... would be tokenized as - take note of class, yield and foreach:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [5]: <?php

  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [0]:

  2 | L3 | C  1 | CC 0 | ( 0) | T_NAMESPACE                | [9]: namespace
  3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE               | [1]:
  4 | L3 | C 11 | CC 0 | ( 0) | T_STRING                   | [2]: my
  5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
  6 | L3 | C 14 | CC 0 | ( 0) | T_CLASS                    | [5]: class
  7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
  8 | L3 | C 20 | CC 0 | ( 0) | T_STRING                   | [10]: extensions
  9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON                | [1]: ;
 10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE               | [0]:

 11 | L4 | C  1 | CC 0 | ( 0) | T_NAMESPACE                | [9]: namespace
 12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE               | [1]:
 13 | L4 | C 11 | CC 0 | ( 0) | T_STRING                   | [2]: my
 14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
 15 | L4 | C 14 | CC 0 | ( 0) | T_FOREACH                  | [7]: foreach
 16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
 17 | L4 | C 22 | CC 0 | ( 0) | T_YIELD                    | [5]: yield
 18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON                | [1]: ;
 19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE               | [0]:

... while with your change, it will be tokenized (better), like so:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [5]: <?php

  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [0]:

  2 | L3 | C  1 | CC 0 | ( 0) | T_NAMESPACE                | [9]: namespace
  3 | L3 | C 10 | CC 0 | ( 0) | T_WHITESPACE               | [1]:
  4 | L3 | C 11 | CC 0 | ( 0) | T_STRING                   | [2]: my
  5 | L3 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
  6 | L3 | C 14 | CC 0 | ( 0) | T_STRING                   | [5]: class
  7 | L3 | C 19 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
  8 | L3 | C 20 | CC 0 | ( 0) | T_STRING                   | [10]: extensions
  9 | L3 | C 30 | CC 0 | ( 0) | T_SEMICOLON                | [1]: ;
 10 | L3 | C 31 | CC 0 | ( 0) | T_WHITESPACE               | [0]:

 11 | L4 | C  1 | CC 0 | ( 0) | T_NAMESPACE                | [9]: namespace
 12 | L4 | C 10 | CC 0 | ( 0) | T_WHITESPACE               | [1]:
 13 | L4 | C 11 | CC 0 | ( 0) | T_STRING                   | [2]: my
 14 | L4 | C 13 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
 15 | L4 | C 14 | CC 0 | ( 0) | T_STRING                   | [7]: foreach
 16 | L4 | C 21 | CC 0 | ( 0) | T_NS_SEPARATOR             | [1]: \
 17 | L4 | C 22 | CC 0 | ( 0) | T_STRING                   | [5]: yield
 18 | L4 | C 27 | CC 0 | ( 0) | T_SEMICOLON                | [1]: ;
 19 | L4 | C 28 | CC 0 | ( 0) | T_WHITESPACE               | [0]:

There is no difference on PHP 8.0+ as the "retokenization of namespaced names" already takes care of it in that case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I finally get it :) Added.

src/Tokenizers/PHP.php Show resolved Hide resolved
src/Tokenizers/PHP.php Outdated Show resolved Hide resolved
@@ -1113,6 +1165,7 @@ protected function tokenize($string)
&& $tokenIsArray === true
&& $token[0] === T_STRING
&& strtolower($token[1]) === 'yield'
&& isset($this->tstringContexts[$finalTokens[$lastNotEmptyToken]['code']]) === false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should a similar condition be added for the match retokenization round line 1439 ? And in that case, could the "final check" in PHP::processAdditional() be possibly removed/simplified ?

Similar question for the fn condition round line 1721.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T_MATCH can probably simplified but I'm not sure about this test case:

/* testLiveCoding */
// Intentional parse error. This has to be the last test in the file.
echo match

The token is currently tokenized as T_STRING but I'm not sure if it's right.

src/Util/Tokens.php Show resolved Hide resolved
T_MATCH => T_MATCH,
T_NAMESPACE => T_NAMESPACE,
T_NEW => T_NEW,
T_PARENT => T_PARENT,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

T_PARENT is PHPCS native token and doesn't exist in PHP itself.

Copy link
Contributor Author

@kukulich kukulich Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test if failing when I remove the T_PARENT token.

1) PHP_CodeSniffer\Tests\Core\Tokenizer\ContextSensitiveKeywordsTest::testKeywords with data set #19 ('/* testParentIsKeyword */', 'T_PARENT')
Failed to find test target token for comment string: /* testParentIsKeyword */
Failed asserting that true is false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kukulich That's because it can't find token after the comment (as the custom token is no longer in the list), not because the tokenization is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jrfnl Yes, you're right. However I'm not sure if the tokens should be removed here or not...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that depends on where else this token list will be used, but for the current use case, those tokens aren't needed in the list.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed for now.

src/Util/Tokens.php Outdated Show resolved Hide resolved
tests/Core/Tokenizer/ContextSensitiveKeywordsTest.php Outdated Show resolved Hide resolved
tests/Core/Tokenizer/ContextSensitiveKeywordsTest.php Outdated Show resolved Hide resolved
@kukulich kukulich force-pushed the context-sensitive branch 6 times, most recently from 2be5165 to 6ce46dc Compare December 21, 2021 09:37
@kukulich
Copy link
Contributor Author

@jrfnl

In particular, I'm looking at the "string-like token after a function keyword" block round line 1740 and the "special case for PHP 5.6 use function and use const" block round line 2168.

"string-like token after a function keyword" is gone.

"special case for PHP 5.6 use function and use const" cannot be removed (without other work) but "This is a special case for the PHP 5.5 classname::class syntax" is gone.

@jrfnl
Copy link
Contributor

jrfnl commented Dec 21, 2021

but "This is a special case for the PHP 5.5 classname::class syntax" is gone.

Nice find! Yeah, of course, with the T_PAAMAYIM_NEKUDOTAYIM in the $tstringContexts that would be redundant.

@kukulich kukulich force-pushed the context-sensitive branch 4 times, most recently from e4b1a34 to 8a71b96 Compare December 21, 2021 11:26
@gsherwood
Copy link
Member

Just an update that I'm trying to merge this at the moment. I'm needing to resolve a lot of conflicts in the 4.0 branch first, and adjust the namespace changes over there as well.

@gsherwood gsherwood merged commit cfdc6c9 into squizlabs:master Jan 12, 2022
@gsherwood
Copy link
Member

Finally got this merged and tested after Github had issues today, but all working.

Appreciate all the work getting this done. Keen to have anyone else run their eyes over things as well to make sure I haven't missed something obvious.

@kukulich kukulich deleted the context-sensitive branch January 12, 2022 07:30
@jrfnl
Copy link
Contributor

jrfnl commented Jan 12, 2022

Keen to have anyone else run their eyes over things as well to make sure I haven't missed something obvious.

If you like, I'll try to have a look at the 4.x branch later in the week ? (on that note: there are still a few PRs open for that branch)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants