PHP Text Analysis is a library for performing Information Retrieval (IR) and Natural Language Processing (NLP) tasks using the PHP language. There are tools in this library that can perform:
- document classification
- sentiment analysis
- document comparison
- frequency analysis
- tokenization
- stemming
- collocations with Pointwise Mutual Information
- lexical diversity
- corpus analysis
- text summarization
All the documentation for this project can be found in the book and wiki.
The book is still in the works and your contributions are needed. You can find it at https://github.com/yooper/php-text-analysis-book
Documentation for the library also resides in the wiki: https://github.com/yooper/php-text-analysis/wiki
Add PHP Text Analysis to your project
composer require yooper/php-text-analysis
$tokens = tokenize($text);
You can choose which tokenizer is used by passing in the name of the tokenizer class
$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);
The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class. Some tokenizers require parameters to be set upon instantiation.
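The "pass a class name" pattern described above can be sketched in plain PHP. This is only an illustration of the idea, not the library's code: SketchWhitespaceTokenizer and sketch_tokenize are hypothetical names, and the whitespace splitting rule is an assumption chosen for brevity.

```php
<?php
// Hypothetical class for illustration; not one of the library's tokenizers.
class SketchWhitespaceTokenizer
{
    public function tokenize(string $text): array
    {
        // Split on runs of whitespace; PREG_SPLIT_NO_EMPTY drops empty strings
        // that leading or trailing whitespace would otherwise produce.
        return preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    }
}

// Hypothetical helper: builds a tokenizer from its class name, then uses it.
function sketch_tokenize(string $text, string $tokenizerClass = SketchWhitespaceTokenizer::class): array
{
    $tokenizer = new $tokenizerClass(); // instantiate from the class-name string
    return $tokenizer->tokenize($text);
}

$tokens = sketch_tokenize("  The quick  brown fox ");
```

A helper built this way can only construct classes whose constructors take no arguments, so a tokenizer that needs constructor parameters would have to be instantiated directly.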
By default, normalize_tokens uses the function strtolower to lowercase all the tokens. To customize the normalize function, pass in either a function or a string to be used by array_map.
$normalizedTokens = normalize_tokens($tokens);
$normalizedTokens = normalize_tokens($tokens, 'mb_strtolower');
$normalizedTokens = normalize_tokens($tokens, function($token){ return mb_strtoupper($token); });
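To make the array_map behaviour concrete, here is a minimal sketch of how a normalizer given as a function name or as a closure could be applied to every token. sketch_normalize_tokens is a hypothetical stand-in, not the library's normalize_tokens.

```php
<?php
// Hypothetical helper for illustration. The second argument is left untyped
// because PHP does not allow a string default on a `callable` parameter.
function sketch_normalize_tokens(array $tokens, $normalizer = 'strtolower'): array
{
    // array_map accepts any callable: a function name string or a closure.
    return array_map($normalizer, $tokens);
}

$tokens = ['The', 'Quick', 'FOX'];

// Default: lowercase every token.
$lower = sketch_normalize_tokens($tokens);

// Function name passed as a string.
$upper = sketch_normalize_tokens($tokens, 'strtoupper');

// Closure that lowercases and strips surrounding punctuation.
$clean = sketch_normalize_tokens(['Hello,', 'World!'], function ($token) {
    return trim(strtolower($token), ',.!?');
});
```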
The call to freq_dist returns a FreqDist instance.
$freqDist = freq_dist(tokenize($text));
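For intuition about what a frequency distribution holds, the core counting step can be approximated with PHP built-ins. This sketch returns a plain array rather than a FreqDist instance, and sketch_frequencies is a hypothetical name; the methods of the library's FreqDist class are not shown here.

```php
<?php
// Hypothetical helper for illustration; returns token => count, most frequent first.
function sketch_frequencies(array $tokens): array
{
    $counts = array_count_values($tokens); // count occurrences of each token
    arsort($counts);                       // sort by count, descending, keeping keys
    return $counts;
}

$counts = sketch_frequencies(['a', 'b', 'a', 'c', 'a', 'b']);
```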
By default, bigrams are generated.
$bigrams = ngrams($tokens);
Customize the ngram size and delimiter
// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens, 3, '|');
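The sliding-window idea behind ngram generation can be sketched as follows. sketch_ngrams is a hypothetical function, and the single-space default delimiter is an assumption made for this example rather than something confirmed by the library.

```php
<?php
// Hypothetical helper for illustration; not the library's ngrams function.
function sketch_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    $tokens = array_values($tokens); // reindex so offsets are 0..count-1
    $ngrams = [];
    $lastStart = count($tokens) - $n; // last index where a full window fits
    for ($i = 0; $i <= $lastStart; $i++) {
        // Take $n consecutive tokens and join them with the separator.
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$bigrams  = sketch_ngrams(['a', 'b', 'c']);
$trigrams = sketch_ngrams(['a', 'b', 'c', 'd'], 3, '|');
```

When there are fewer tokens than the window size, the loop never runs and an empty array is returned.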
By default, the stem function uses the Porter Stemmer.
$stemmedTokens = stem($tokens);
You can choose which stemmer is used by passing in the name of the stemmer class
$stemmedTokens = stem($tokens, \TextAnalysis\Stemmers\MorphStemmer::class);
There is a shortcut function for using the Rake algorithm. You will need to clean your data before using it. The second parameter is the ngram size of the keywords to extract.
$rake = rake($tokens, 3);
$results = $rake->getKeywordScores();
Need sentiment analysis with PHP? Use Vader, https://github.com/cjhutto/vaderSentiment . The PHP implementation can be invoked easily. Just normalize your data beforehand.
$sentimentScores = vader($tokens);
Need to do some document classification with PHP? Try using the Naive Bayes implementation. An example of classifying movie reviews can be found in the unit tests.
$nb = naive_bayes();
$nb->train('mexican', tokenize('taco nacho enchilada burrito'));
$nb->train('american', tokenize('hamburger burger fries pop'));
$nb->predict(tokenize('my favorite food is a burrito'));
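To show what training and prediction involve, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is not the library's implementation: SketchNaiveBayes is a hypothetical class, the tokens are given as literal arrays instead of calling tokenize, and returning label => log score (best first) is an assumption about a convenient result shape.

```php
<?php
// Hypothetical class for illustration; not the library's Naive Bayes classifier.
class SketchNaiveBayes
{
    private $labelDocs = [];   // label => number of training documents
    private $tokenCounts = []; // label => [token => count]
    private $vocabulary = [];  // token => true, for every token seen in training

    public function train(string $label, array $tokens): void
    {
        $this->labelDocs[$label] = ($this->labelDocs[$label] ?? 0) + 1;
        foreach ($tokens as $token) {
            $this->tokenCounts[$label][$token] = ($this->tokenCounts[$label][$token] ?? 0) + 1;
            $this->vocabulary[$token] = true;
        }
    }

    // Returns label => log-probability score, highest score first.
    public function predict(array $tokens): array
    {
        $totalDocs = array_sum($this->labelDocs);
        $vocabSize = count($this->vocabulary);
        $scores = [];
        foreach ($this->labelDocs as $label => $docCount) {
            $labelTotal = array_sum($this->tokenCounts[$label] ?? []);
            $score = log($docCount / $totalDocs); // log prior of the label
            foreach ($tokens as $token) {
                if (!isset($this->vocabulary[$token])) {
                    continue; // skip words never seen during training
                }
                $count = $this->tokenCounts[$label][$token] ?? 0;
                // Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                $score += log(($count + 1) / ($labelTotal + $vocabSize));
            }
            $scores[$label] = $score;
        }
        arsort($scores);
        return $scores;
    }
}

$nb = new SketchNaiveBayes();
$nb->train('mexican', ['taco', 'nacho', 'enchilada', 'burrito']);
$nb->train('american', ['hamburger', 'burger', 'fries', 'pop']);
$scores = $nb->predict(['my', 'favorite', 'food', 'is', 'a', 'burrito']);
```

In this sketch only "burrito" is in the training vocabulary, so it alone decides the ranking, and the 'mexican' label comes out on top.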