Home
yooper edited this page Oct 6, 2017 · 14 revisions
Want to process text using PHP? Well, you picked the right library for the task.
PHP Text Analysis provides a variety of tools:
- Analysis
  - Date Analysis - extracts dates from a given corpus
  - Frequency Distribution - provides the basic tools for simple analysis and serves as a base for many other algorithms
  - Rapid Automatic Keyword Extraction (RAKE) - extracts keywords from a text using the RAKE algorithm
- Collections - data structures for managing documents during analysis
- Collocation - helps you find terms that co-occur more often than would be expected by chance.
- Comparisons - Algorithms for comparing text and text documents
- Console - a command line interface for performing base indexing and text mining analysis with PHP
- Entity Extraction - helps you find entities such as people, places and dates
- Downloaders - Downloads 3rd party data files from the web
- Filters - A set of tools for normalizing the terms and tokens before data analysis begins
- Phonetics - Phonetic algorithms for fixing data. Helpful when you need to perform record linkage tasks with PHP
- Ngrams - PHP code for generating NGrams from a given set of tokens or terms
- Stemmers - Several stemmers are available for normalizing the data sets prior to further analysis
- Tokenizers - A common set of tokenizers is available for breaking up the corpus into tokens or sentences
- Utilities - helper utilities for manipulating text data
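To make two of the ideas above concrete, here is a minimal sketch of a frequency distribution and n-gram generation in plain PHP. It does not use the library's own classes; the `sketch_*` function names are hypothetical and exist only for illustration, so consult the library's pages for its actual API.

```php
<?php
// Illustrative only: plain PHP, not the PHP Text Analysis API.
// All sketch_* names are hypothetical helpers invented for this example.

/** Split text into tokens on runs of whitespace. */
function sketch_tokenize(string $text): array
{
    return preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
}

/** Count how often each token occurs, most frequent first. */
function sketch_freq_dist(array $tokens): array
{
    $counts = array_count_values($tokens);
    arsort($counts);
    return $counts;
}

/** Build n-grams by joining each window of $n consecutive tokens. */
function sketch_ngrams(array $tokens, int $n = 2, string $separator = ' '): array
{
    $ngrams = [];
    for ($i = 0; $i + $n <= count($tokens); $i++) {
        $ngrams[] = implode($separator, array_slice($tokens, $i, $n));
    }
    return $ngrams;
}

$tokens  = sketch_tokenize(strtolower('The quick fox and the lazy dog'));
$freq    = sketch_freq_dist($tokens);   // token => count
$bigrams = sketch_ngrams($tokens, 2);   // list of two-token strings
```

Lowercasing before counting means "The" and "the" are treated as the same term, which is the kind of normalization the filter classes are meant to handle.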
PHP Text Analysis is a lightweight Information Retrieval and NLP library built using PHP. In addition to its analysis tools, PHP Text Analysis can be used to create a search engine that supports simple and advanced query types. This is especially useful when your data models contain raw text that must be searchable.
- Adapters
- Engines
- Indexes
- Query
Performance is always a challenge. Here are a couple of suggestions for speeding up your code.
- Use the whitespace tokenizer; it is faster than the general tokenizer
- Apply the filter classes to the whole text/corpus and avoid applyTransformation method calls within the TokenDoc class. Those calls are best reserved for cases where each token must be validated or transformed individually. Many of the filter classes have been rewritten to better support the whole-text approach
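The difference between the two approaches can be sketched in plain PHP. This is not the library's filter API: both functions below are hypothetical, and the normalization is deliberately simple (lowercasing and ASCII punctuation removal only). The first cleans the full string in one pass before splitting on whitespace; the second splits first and then cleans every token separately, which costs one regex call per token.

```php
<?php
// Illustrative only: hypothetical helpers, not the library's filter classes.
// Assumes ASCII input; real filters would need to handle UTF-8.

/** Normalize the whole text in one pass, then split on whitespace. */
function sketch_normalize_whole_text(string $text): array
{
    $clean = preg_replace('/[^a-z0-9\s]+/', '', strtolower($text));
    return preg_split('/\s+/', trim($clean), -1, PREG_SPLIT_NO_EMPTY);
}

/** Split first, then normalize each token individually. */
function sketch_normalize_per_token(string $text): array
{
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $out = [];
    foreach ($tokens as $token) {
        $clean = preg_replace('/[^a-z0-9]+/', '', strtolower($token));
        if ($clean !== '') {
            $out[] = $clean; // drop tokens that were pure punctuation
        }
    }
    return $out;
}

$text     = "Hello, World! It's PHP.";
$whole    = sketch_normalize_whole_text($text);
$perToken = sketch_normalize_per_token($text);
```

Both functions produce the same tokens for this input. The per-token form is still the one to reach for when a rule depends on the individual token, such as rejecting tokens below a minimum length.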
To run all the unit tests successfully, you must have Java installed. Here is the command used to run all the tests:
JAVA_HOME=/opt/jdk1.8.0_111/bin/java ./vendor/bin/phpunit