PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. All the documentation for this project can be found in the wiki.
Add PHP Text Analysis to your project
composer require yooper/php-text-analysis
Documentation for the library resides in the wiki: https://github.com/yooper/php-text-analysis/wiki
$tokens = tokenize($text);
To use a different tokenizer, pass the name of the tokenizer class as the second argument.
$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class. Some tokenizers require parameters to be set upon instantiation.
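The idea of selecting a tokenizer by class name can be sketched in plain PHP. The example below is only an illustration under stated assumptions: `WhitespaceTokenizerSketch`, `WordTokenizerSketch`, and `tokenize_sketch` are hypothetical names that are not part of the library, and the library's own tokenizers may behave differently.

```php
<?php
// Hypothetical classes for illustration only; these are not the library's tokenizers.
class WhitespaceTokenizerSketch
{
    public function tokenize(string $text): array
    {
        // Split on runs of whitespace and drop empty pieces.
        return preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
}

class WordTokenizerSketch
{
    public function tokenize(string $text): array
    {
        // Keep runs of letters and digits; punctuation is discarded.
        preg_match_all('/[\p{L}\p{N}]+/u', $text, $matches);
        return $matches[0];
    }
}

// Hypothetical helper: the second argument is a class name, with a default.
function tokenize_sketch(string $text, string $tokenizerClass = WhitespaceTokenizerSketch::class): array
{
    // Instantiate the class by name, then delegate to it.
    $tokenizer = new $tokenizerClass();
    return $tokenizer->tokenize($text);
}

$text = 'Hello, world! PHP is fun.';
$default = tokenize_sketch($text);
$words = tokenize_sketch($text, WordTokenizerSketch::class);

print_r($default);
print_r($words);
```

Because this sketch calls the constructor with no arguments, it would not cover tokenizers that need parameters at instantiation; those would have to be constructed directly.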
By default, normalize_tokens lowercases every token with strtolower. To customize normalization, pass a second argument, either a function name as a string or an anonymous function; it is applied to each token by array_map.
$normalizedTokens = normalize_tokens($tokens);
$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
$normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
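To make the array_map mechanism concrete, here is a minimal plain-PHP sketch. The helper `normalize_sketch` is a hypothetical stand-in, not the library's implementation; it only assumes that the supplied callable is applied to each token in turn.

```php
<?php
// Hypothetical stand-in for illustration only; not the library's normalize_tokens().
function normalize_sketch(array $tokens, $normalizer = 'strtolower'): array
{
    // array_map accepts any callable: a function name string or a closure.
    return array_map($normalizer, $tokens);
}

$tokens = ['The', 'Quick', 'Brown', 'Fox'];

// Default: lowercase every token.
$lower = normalize_sketch($tokens);

// Custom: pass a closure instead of a function name.
$upper = normalize_sketch($tokens, function ($token) {
    return strtoupper($token);
});

print_r($lower);
print_r($upper);
```

For text that may contain multibyte characters, a multibyte-aware function such as mb_strtolower is usually the safer choice, provided the mbstring extension is installed.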
The call to freq_dist returns a FreqDist instance.
$freqDist = freq_dist(tokenize($text));
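Conceptually, a frequency distribution maps each distinct token to the number of times it occurs. The plain-PHP sketch below shows that idea with built-in array functions only; it does not use the library's FreqDist class, and it makes no claim about that class's methods.

```php
<?php
// Plain-PHP sketch of a frequency distribution; the library's FreqDist is not used here.
$tokens = ['to', 'be', 'or', 'not', 'to', 'be'];

// Count how many times each distinct token occurs.
$counts = array_count_values($tokens);

// Sort by count, highest first, keeping the token keys.
arsort($counts);

// Total number of tokens and number of distinct tokens.
$total = array_sum($counts);
$unique = count($counts);

print_r($counts);
echo "total: $total, unique: $unique\n";
```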
By default, ngrams generates bigrams.
$bigrams = ngrams($tokens);
Customize the ngrams
// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens, 3, '|');
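An n-gram list can be understood as a window of n tokens sliding across the input. The sketch below illustrates this in plain PHP. The helper `ngrams_sketch` is hypothetical and is not the library's ngrams() implementation; in particular, the single-space default separator is an assumption.

```php
<?php
// Hypothetical helper for illustration only; not the library's ngrams().
function ngrams_sketch(array $tokens, int $n = 2, string $separator = ' '): array
{
    $tokens = array_values($tokens); // ensure consecutive integer keys
    $result = [];
    // Slide a window of size $n across the token list.
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $result[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $result;
}

$tokens = ['natural', 'language', 'processing', 'in', 'php'];

// Bigrams with the assumed default separator.
$bigrams = ngrams_sketch($tokens);

// Trigrams joined with a pipe.
$trigrams = ngrams_sketch($tokens, 3, '|');

print_r($bigrams);
print_r($trigrams);
```

A list of k tokens yields k - n + 1 n-grams, so five tokens give four bigrams and three trigrams here.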
To do