SimpleTokeniser

The goal of the SimpleTokeniser is simple: for some given text build a "concordance". A concordance is an alphabetical list of all words in the text, with the frequency they appeared followed by citations on which sentence each word appears (word {frequency:citations})

For example, here's the output for: "There are only two hard things in Computer Science. Cache invalidation and naming things."

and                       {1:1}
are                       {1:0}
cache                     {1:1}
computer                  {1:0}
hard                      {1:0}
in                        {1:0}
invalidation              {1:1}
naming                    {1:1}
only                      {1:0}
science                   {1:0}
there                     {1:0}
things                    {2:0,1}
two                       {1:0}

The code is relatively short and -hopefully- self explanatory. A custom "TokenMap" class was implemented to store the concordance, that provides some convenience methods. The "EarlyTokeniser.java" is an earlier attempt with different logic that doesn't match the requirements.

The bulk of the solution is in "Tokeniser.java". You can look at the first commit of that file for an early version of the algorithm before it was refactored to its current form.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
src/net/mharry/tokeniser		src/net/mharry/tokeniser
test/net/mharry/tokeniser		test/net/mharry/tokeniser
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SimpleTokeniser

About

Uh oh!

Releases

Packages

Languages

cmargiol/SimpleTokeniser

Folders and files

Latest commit

History

Repository files navigation

SimpleTokeniser

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages