
Entity Extraction with PHP

yooper edited this page Nov 9, 2017 · 5 revisions

Entity extraction is performed by a third-party library. It requires the Stanford Named Entity Recognizer (NER) jar files and a working Java installation.

There are two ways to install the Stanford NER jar files:

  1. Download and install the latest jar files from Stanford, which distributes them as a downloadable zip. If you install manually, you must set the path to the jar and the classifier yourself.

  2. Install the files by running the following command:

php textconsole pta:package:install stanford_ner_tagger

If you install with this method, the jar and classifier will automatically be detected.
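For a manual install, it can help to confirm that both files are readable before handing their paths to the tagger. The following is a minimal sketch only: `findMissingNerFiles` is a hypothetical helper that is not part of the library, and the two paths are placeholders for wherever you unpacked the zip.

```php
<?php
// Hypothetical helper, not part of the library: returns the labels and paths
// of the Stanford NER files that are missing or unreadable.
function findMissingNerFiles(string $jarPath, string $classifierPath): array
{
    $missing = [];
    foreach (['jar' => $jarPath, 'classifier' => $classifierPath] as $label => $path) {
        if (!is_file($path) || !is_readable($path)) {
            $missing[$label] = $path;
        }
    }
    return $missing;
}

// Placeholder paths; substitute the locations of your own jar and classifier.
$missing = findMissingNerFiles('/path/to/stanford-ner.jar', '/path/to/classifier.ser.gz');
foreach ($missing as $label => $path) {
    echo "Missing {$label}: {$path}\n";
}
```

An empty result means both files were found and can be read.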

JAVA_HOME

You must set the JAVA_HOME environment variable. The following example runs the unit tests for the Stanford NER tagger; note that JAVA_HOME here points to the java executable itself. Run it from the root directory of the project.

JAVA_HOME=/opt/jdk1.8.0_111/bin/java ./vendor/bin/phpunit tests/TextAnalysis/Taggers/StanfordNerTaggerTest.php
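A quick pre-flight check can catch a missing or wrong JAVA_HOME before the tagger runs. The sketch below assumes, as in the command above, that JAVA_HOME holds the path to the java executable rather than to the JDK directory. `javaHomeProblem` is a hypothetical helper, not part of the library.

```php
<?php
// Hypothetical pre-flight check, not part of the library.
// Assumption: JAVA_HOME is the path to the java executable itself.
// Returns a description of the problem, or null when the value looks usable.
function javaHomeProblem($javaHome): ?string
{
    // getenv() returns false when the variable is not set.
    if ($javaHome === false || $javaHome === '') {
        return 'JAVA_HOME is not set';
    }
    // Directories can also report as executable, so rule them out explicitly.
    if (is_dir($javaHome) || !is_executable($javaHome)) {
        return "JAVA_HOME does not point to an executable file: {$javaHome}";
    }
    return null;
}

$problem = javaHomeProblem(getenv('JAVA_HOME'));
echo $problem === null ? "JAVA_HOME looks usable\n" : $problem . "\n";
```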

Example Usage

use TextAnalysis\Taggers\StanfordNerTagger;
use TextAnalysis\Tokenizers\WhitespaceTokenizer;
use TextAnalysis\Documents\TokensDocument;

class EntityExtractionTest extends \PHPUnit_Framework_TestCase
{
    protected $text = "Marquette County is a county located in the Upper Peninsula of the US state of Michigan. As of the 2010 census, the population was 67,077.";

    public function testStanfordNer()
    {
        // Tokenize the text on whitespace and wrap the tokens in a document.
        $document = new TokensDocument((new WhitespaceTokenizer())->tokenize($this->text));
        $tagger = new StanfordNerTagger();
        $output = $tagger->tag($document->getDocumentData());

        // Check the tagger's temporary file exists and has the expected size.
        $this->assertFileExists($tagger->getTmpFilePath());
        $this->assertEquals(138, filesize($tagger->getTmpFilePath()));
        // Each output element pairs an entity tag with its token.
        $this->assertEquals(['LOCATION','Michigan'], $output[15], "Did you set JAVA_HOME env variable?");
    }
}
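Once the tagger has run, the tagged tokens usually need to be collected by entity type. The sketch below assumes each output element is a `[tag, token]` pair, as the last assertion above suggests, and that non-entity tokens carry the tag `O`, which is Stanford NER's usual convention. `groupEntitiesByTag` is a hypothetical helper, not part of the library, and the sample array is illustrative data rather than real tagger output.

```php
<?php
// Hypothetical helper, not part of the library.
// Assumption: each element is a [tag, token] pair and "O" marks non-entities.
function groupEntitiesByTag(array $taggedTokens): array
{
    $entities = [];
    foreach ($taggedTokens as [$tag, $token]) {
        if ($tag === 'O') {
            continue; // skip tokens that are not part of any entity
        }
        $entities[$tag][] = $token;
    }
    return $entities;
}

// Illustrative data only, not real tagger output.
$sample = [
    ['LOCATION', 'Marquette'],
    ['LOCATION', 'County'],
    ['O', 'is'],
    ['LOCATION', 'Michigan'],
];
print_r(groupEntitiesByTag($sample));
```

Grouping by tag keeps the tokens in document order within each entity type, which makes it easy to list, for example, every location mentioned in a text.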