Entity Extraction with PHP
Entity extraction is performed by a third-party library, the Stanford Named Entity Recognizer (NER). It requires the Stanford NER Java jar files and a working Java installation.
There are two ways to install the Stanford NER jar files:
- Download and install the latest jar files from Stanford, which provides them as a downloadable zip. If you install manually, you must set the path to the jar and the classifier.
- Install the files by running the following command:
  php textconsole pta:package:install stanford_ner_tagger
  If you install with this method, the jar and classifier are detected automatically.
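For a manual install, it can help to confirm that both files are where you think they are before handing their paths to the tagger. The following is a minimal sketch, not part of the library: `checkNerFiles` is a hypothetical helper, the two paths are placeholders, and the commented-out constructor call assumes that `StanfordNerTagger` accepts the jar path and classifier path as its first two arguments, which this page does not confirm.

```php
<?php
// Hypothetical helper (not part of the library): lists problems with a
// manually installed jar and classifier before they are handed to the tagger.
function checkNerFiles(string $jarPath, string $classifierPath): array
{
    $problems = [];
    foreach (['jar' => $jarPath, 'classifier' => $classifierPath] as $label => $path) {
        if (!is_file($path) || !is_readable($path)) {
            $problems[] = "$label not found or not readable: $path";
        }
    }
    return $problems;
}

// Placeholder paths; replace them with the location of your extracted download.
$jarPath = '/path/to/stanford-ner.jar';
$classifierPath = '/path/to/classifier.ser.gz';

foreach (checkNerFiles($jarPath, $classifierPath) as $problem) {
    echo $problem, PHP_EOL;
}

// Assumption: the constructor takes the two paths as its first arguments.
// $tagger = new \TextAnalysis\Taggers\StanfordNerTagger($jarPath, $classifierPath);
```

An empty result means both files are present and readable. It does not prove that the jar and the classifier are compatible with each other.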
You must set the JAVA_HOME environment variable. The following example runs the unit tests for the Stanford NER tagger; run it from the root directory of the project. In this example, JAVA_HOME is set to the path of the java executable itself.
JAVA_HOME=/opt/jdk1.8.0_111/bin/java ./vendor/bin/phpunit tests/TextAnalysis/Taggers/StanfordNerTaggerTest.php
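If the tagger test fails, a quick check of the environment can rule out a missing or mistyped JAVA_HOME. The sketch below uses only built-in PHP functions. `javaHomeProblem` is a hypothetical helper, and it assumes, as in the command above, that JAVA_HOME should point at the java executable rather than at a JDK directory.

```php
<?php
// Hypothetical helper (not part of the library): returns null when the value
// looks usable, otherwise a short description of the problem.
function javaHomeProblem(?string $javaHome): ?string
{
    if ($javaHome === null || $javaHome === '') {
        return 'JAVA_HOME is not set';
    }
    // Assumption: JAVA_HOME names the java binary itself, so a directory fails.
    if (!is_file($javaHome) || !is_executable($javaHome)) {
        return "JAVA_HOME is not an executable file: $javaHome";
    }
    return null;
}

// getenv() returns false when the variable is missing.
$value = getenv('JAVA_HOME');
$problem = javaHomeProblem($value === false ? null : $value);

echo $problem ?? 'JAVA_HOME looks usable', PHP_EOL;
```

Run it with the same JAVA_HOME prefix you use for phpunit, so that it sees the same environment as the tests.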
The following unit test shows how to tag a document:
use TextAnalysis\Taggers\StanfordNerTagger;
use TextAnalysis\Tokenizers\WhitespaceTokenizer;
use TextAnalysis\Documents\TokensDocument;

class EntityExtractionTest extends \PHPUnit_Framework_TestCase
{
    protected $text = "Marquette County is a county located in the Upper Peninsula of the US state of Michigan. As of the 2010 census, the population was 67,077.";

    public function testStanfordNer()
    {
        // Split the text on whitespace and wrap the tokens in a document.
        $document = new TokensDocument((new WhitespaceTokenizer())->tokenize($this->text));
        $tagger = new StanfordNerTagger();
        $output = $tagger->tag($document->getDocumentData());

        // The tagger is expected to leave a temporary file on disk.
        $this->assertFileExists($tagger->getTmpFilePath());
        $this->assertEquals(138, filesize($tagger->getTmpFilePath()));

        // The entry at index 15 is expected to be Michigan tagged as a LOCATION.
        $this->assertEquals(['LOCATION','Michigan'], $output[15], "Did you set JAVA_HOME env variable?");
    }