An implementation of the text classifier task which was given here.
To build the project, run the following command from the project root:
./gradlew clean build
The uber JAR will be placed in build/libs/textclassifier.jar.
To run the Java program:
java -jar build/libs/textclassifier.jar --config <path_to_json_configuration> --scan <path_to_file_or_directory>
For additional help, run:
java -jar build/libs/textclassifier.jar --help
A few points about the design:
- The classification rules are loaded via ClassificationRulesLoader, whose implementation loads JSON from any Reader into a POJO.
- Since text files can get very big and may not fit into memory, we stream the tokens. TokenizerStreamer is an abstraction over StreamTokenizer from java.io that exposes a stream() API. This lets the tokenizer be more declarative about how tokens are filtered and transformed before they are used.
- Each token is normalized, so it does not matter whether the incoming data is a raw file or a CSV.
- The Classifier is the component responsible for classifying the tokens into tags (domains).
- To store the classification rules compactly and to search for each indicator in massive texts, I built a Trie-like data structure that stores a token in each node. Each indicator is split into a sequence of tokens, and the leaf nodes hold a Set of domains (singleton tags). To cover cases such as partially combined indicators, we keep a window of tokens. The window is never longer than the longest indicator, so we can be sure it fits into memory, as the requirements demand.
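
The streaming idea behind TokenizerStreamer can be illustrated with a minimal sketch. This is not the project's actual class: the name TokenStreamSketch, the choice of word characters, and the lowercase normalization are all assumptions made for illustration. It wraps java.io.StreamTokenizer in an Iterator and exposes it as a java.util.stream.Stream, so the input is never loaded into memory as a whole.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.Locale;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical sketch, not the project's TokenizerStreamer.
final class TokenStreamSketch {

    // Exposes the word tokens of a Reader as an ordered, sequential Stream.
    static Stream<String> stream(Reader reader) {
        StreamTokenizer tokenizer = new StreamTokenizer(reader);
        // Assumption: only ASCII letters and digits (plus any char >= 256,
        // which StreamTokenizer always treats as alphabetic) form tokens.
        // Commas, quotes, whitespace and the like become separators, so raw
        // text and CSV yield the same kind of tokens.
        tokenizer.resetSyntax();
        tokenizer.wordChars('a', 'z');
        tokenizer.wordChars('A', 'Z');
        tokenizer.wordChars('0', '9');

        Iterator<String> iterator = new Iterator<String>() {
            // Holds one token of lookahead; null means end of input.
            private String next = advance();

            private String advance() {
                try {
                    int type;
                    while ((type = tokenizer.nextToken()) != StreamTokenizer.TT_EOF) {
                        if (type == StreamTokenizer.TT_WORD) {
                            // Assumed normalization: lowercase only.
                            return tokenizer.sval.toLowerCase(Locale.ROOT);
                        }
                        // Any other token type is a separator; skip it.
                    }
                    return null;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            @Override
            public boolean hasNext() {
                return next != null;
            }

            @Override
            public String next() {
                if (next == null) {
                    throw new NoSuchElementException();
                }
                String current = next;
                next = advance();
                return current;
            }
        };

        return StreamSupport.stream(
                Spliterators.spliteratorUnknownSize(
                        iterator, Spliterator.ORDERED | Spliterator.NONNULL),
                false);
    }

    public static void main(String[] args) {
        // Filtering and transforming tokens becomes a declarative pipeline.
        stream(new StringReader("Credit-Card, \"VISA\"\n4111"))
                .filter(token -> token.length() > 1)
                .forEach(System.out::println);
    }
}
```

Because the stream is backed by an iterator with a single token of lookahead, memory use stays constant regardless of the input size.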

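
The Trie and token window described above could look roughly like the following sketch. All names here (TokenTrieSketch, add, classify) are hypothetical, and the matching strategy is an assumption rather than the project's actual Classifier: after each new token, every suffix of the window is walked through the trie, and the domains stored at the node where an indicator ends are collected. For simplicity the sketch stores domains at the end node of an indicator even when that node is not a leaf.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not the project's Classifier.
final class TokenTrieSketch {

    // One trie node per token; domains are set where an indicator ends.
    private static final class Node {
        final Map<String, Node> children = new HashMap<>();
        final Set<String> domains = new HashSet<>();
    }

    private final Node root = new Node();
    private int longestIndicator = 0;

    // Registers an indicator (already split into normalized tokens) for a domain.
    void add(List<String> indicatorTokens, String domain) {
        Node node = root;
        for (String token : indicatorTokens) {
            node = node.children.computeIfAbsent(token, t -> new Node());
        }
        node.domains.add(domain);
        longestIndicator = Math.max(longestIndicator, indicatorTokens.size());
    }

    // Consumes tokens one by one, keeping at most longestIndicator of them.
    Set<String> classify(Iterator<String> tokens) {
        Set<String> found = new TreeSet<>();
        Deque<String> window = new ArrayDeque<>();
        while (tokens.hasNext()) {
            window.addLast(tokens.next());
            if (window.size() > longestIndicator) {
                window.removeFirst();
            }
            String[] current = window.toArray(new String[0]);
            // Try every suffix of the window that ends at the newest token.
            for (int start = 0; start < current.length; start++) {
                Node node = root;
                for (int i = start; i < current.length && node != null; i++) {
                    node = node.children.get(current[i]);
                }
                if (node != null) {
                    found.addAll(node.domains);
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        TokenTrieSketch trie = new TokenTrieSketch();
        trie.add(Arrays.asList("credit", "card"), "finance");
        trie.add(Arrays.asList("blood", "pressure", "monitor"), "health");
        List<String> text = Arrays.asList("my", "credit", "card", "and", "blood", "pressure");
        System.out.println(trie.classify(text.iterator()));
    }
}
```

The window is bounded by the longest indicator, so the per-token work and the memory footprint depend only on the rules, not on the size of the scanned text.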