System requirements
- Java 8
- Maven 3.x.x
- Apache OpenNLP
Solution overview
- Train a model with Apache OpenNLP to recognize company names in news articles. The model was trained on a set of articles, available as training.TXT in the project. This step is done only once.
- Extract the data from the CSV and XML files, using OpenCSV and JAXB respectively.
- Put the company names from the CSV (the terms to search for) into a PatriciaTrie.
- Using the trained model, extract the candidate company names from each article and check whether they are present in the Trie.
- On my machine, the application runs in around 500 seconds with the given data.
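
The Trie lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it uses `java.util.TreeMap` as a dependency-free stand-in for the PatriciaTrie, and the class name `CompanyPrefixIndex`, its methods, and the company ids are all hypothetical. If the project's PatriciaTrie is the one from Apache Commons Collections, `prefixMap(prefix)` would presumably play the role of the `subMap` call here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in for the PatriciaTrie: a sorted map supports the same
// "all keys starting with this prefix" query through a key range.
final class CompanyPrefixIndex {
    // normalized company name -> company id (ids are made up for illustration)
    private final TreeMap<String, String> names = new TreeMap<>();

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    // Register one company name taken from the CSV.
    void add(String companyName, String companyId) {
        names.put(normalize(companyName), companyId);
    }

    // Return the ids of all companies whose name starts with the candidate
    // extracted from an article, in key order.
    List<String> findByPrefix(String candidate) {
        String prefix = normalize(candidate);
        if (prefix.isEmpty()) {
            return new ArrayList<>(); // an empty prefix would match every company
        }
        // Keys in [prefix, prefix + '\uffff') are exactly those starting with prefix.
        SortedMap<String, String> hits =
                names.subMap(prefix, prefix + Character.MAX_VALUE);
        return new ArrayList<>(hits.values());
    }
}
```

With "xyz Ltd" and "xyz BV" both registered, `findByPrefix("xyz")` returns both ids, so a plain prefix search cannot tell the two companies apart on its own.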
TODO
- Currently, the company name is looked up in the Trie with a prefix search. This does not find former names, and it might pick up the wrong company (for example, xyz Ltd instead of xyz BV). Instead, a new method should be implemented in the Trie to do a partial String match.
- The model should be trained with more data to reduce false positives. Currently, the model picks up a fair share of false positives.
- Modify the program to accept the file path and file name as input. Currently, they are hard-coded.
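
For the partial String match mentioned in the first TODO item, one possible approach is to index every word-aligned suffix of each name, including former names, so that a prefix search can start at any word. The sketch below is only an assumption about how this could look: it again uses `java.util.TreeMap` instead of the PatriciaTrie, and `PartialNameIndex`, its methods, and the sample names in the usage note are hypothetical.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: partial matching by indexing word-aligned suffixes.
final class PartialNameIndex {
    // normalized suffix of a name -> current names of the companies it belongs to
    private final TreeMap<String, Set<String>> suffixes = new TreeMap<>();

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Register a company under its current name and any former names.
    void add(String currentName, String... formerNames) {
        indexSuffixes(currentName, currentName);
        for (String former : formerNames) {
            indexSuffixes(former, currentName);
        }
    }

    // Store the name starting at each word: "a b c" -> "a b c", "b c", "c".
    private void indexSuffixes(String name, String currentName) {
        String text = normalize(name);
        int start = 0;
        while (start >= 0 && start < text.length()) {
            suffixes.computeIfAbsent(text.substring(start), k -> new TreeSet<>())
                    .add(currentName);
            int space = text.indexOf(' ', start);
            start = (space < 0) ? -1 : space + 1;
        }
    }

    // Return the current names of all companies whose current or former name
    // contains the candidate starting at a word boundary.
    Set<String> find(String candidate) {
        Set<String> result = new TreeSet<>();
        String query = normalize(candidate);
        if (query.isEmpty()) {
            return result;
        }
        for (Set<String> hit
                : suffixes.subMap(query, query + Character.MAX_VALUE).values()) {
            result.addAll(hit);
        }
        return result;
    }
}
```

After `add("Acme Holdings BV", "Acme Trading Ltd")`, both `find("holdings")` and `find("Trading Ltd")` return the current name "Acme Holdings BV". The sketch returns every matching company, so choosing between, say, xyz Ltd and xyz BV would still need a separate disambiguation step, and the extra suffix entries cost memory.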