GitHub - shazinahmed/opennlp-word-finder

Approximate word search using Apache OpenNLP

System requirements

Solution overview

Train a model with Apache OpenNLP to recognize company names in news articles. The model was trained on a set of articles, which is available as training.TXT in the project. This step is done only once.
Extract data from the CSV and XML files. OpenCSV and JAXB are used, respectively.
Put the company names (used as the word to find) from the CSV in a PatriciaTrie.
Using the trained model, extract the possible company names from each article and check for their presence in the Trie.
On my machine, the application runs in around 500 seconds with the given data.
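
The lookup in steps 3 and 4 can be sketched with the standard library alone. This is an illustration, not the project's code: a `TreeMap` range query stands in for the PatriciaTrie prefix search, and the class name `CompanyLookupSketch`, its methods, and the lower-casing of names are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for the Trie lookup: a sorted map queried by key range.
class CompanyLookupSketch {
    // Key: normalized name, value: the name as it would appear in the CSV.
    private final NavigableMap<String, String> index = new TreeMap<>();

    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    void add(String companyName) {
        index.put(normalize(companyName), companyName);
    }

    // Returns every stored name that starts with the candidate text.
    List<String> prefixMatches(String candidate) {
        String prefix = normalize(candidate);
        if (prefix.isEmpty()) {
            return new ArrayList<>();
        }
        // All keys beginning with prefix sort inside [prefix, prefix + '\uffff').
        String upperBound = prefix + Character.MAX_VALUE;
        return new ArrayList<>(index.subMap(prefix, true, upperBound, false).values());
    }

    public static void main(String[] args) {
        CompanyLookupSketch lookup = new CompanyLookupSketch();
        lookup.add("xyz Ltd");
        lookup.add("xyz BV");
        lookup.add("Acme Holdings");
        // A candidate extracted by the model is checked against the index.
        System.out.println(lookup.prefixMatches("Acme")); // [Acme Holdings]
        System.out.println(lookup.prefixMatches("xyz"));  // [xyz BV, xyz Ltd]
    }
}
```

The second lookup returns two names for one candidate, so a prefix search alone cannot decide which company an article means.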

TODO

Currently, the company name is looked up in the Trie with a prefix search. This does not find former names, and it might pick up the wrong company (for example, xyz Ltd instead of xyz BV). Instead, a new method should be implemented in the Trie to do a partial String match.
The model should be trained with more data to minimize false positives. Currently there is a fair share of false positives picked up by the model.
Modify the program to accept the file path and file name as input. Currently, they are hard-coded.
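
One possible reading of the partial String match in the first item is a token-overlap comparison. The sketch below is an assumption about what that could look like, not the planned implementation: `PartialMatchSketch`, `tokens`, and `bestMatches` are hypothetical names, and it scans a plain collection of names instead of extending the Trie.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical partial matcher: ranks company names by shared word tokens.
class PartialMatchSketch {
    // Splits a name into lower-case word tokens, dropping punctuation.
    static Set<String> tokens(String s) {
        Set<String> out = new LinkedHashSet<>();
        for (String t : s.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Returns the names sharing the most tokens with the candidate.
    // The result is empty when no name shares any token; ties are all returned.
    static List<String> bestMatches(String candidate, Collection<String> names) {
        Set<String> wanted = tokens(candidate);
        int best = 0;
        List<String> result = new ArrayList<>();
        for (String name : names) {
            Set<String> shared = tokens(name);
            shared.retainAll(wanted);
            if (shared.size() > best) {
                best = shared.size();
                result.clear();
            }
            if (best > 0 && shared.size() == best) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = List.of("xyz Ltd", "xyz BV", "Acme Holdings");
        System.out.println(bestMatches("xyz BV", names)); // [xyz BV]
        System.out.println(bestMatches("xyz", names));    // [xyz Ltd, xyz BV]
    }
}
```

When the article text includes the legal suffix, the matcher picks xyz BV over xyz Ltd. A bare "xyz" stays ambiguous and returns both, which the caller would have to resolve. The scan is linear in the number of names, and the sketch does not handle former names on its own.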

.settings
bin/com/demo/rep
sample
src/com/demo/rep
target
test/com/demo/rep
.classpath
.project
en-ner-company.bin
en-token.bin
pom.xml
readme.md
training.TXT