angeloskath
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
Lines changed: 140 additions & 0 deletions
diff --git a/README.markdown b/README.markdown
Lines changed: 16 additions & 3 deletions
diff --git a/src/NlpTools/Analysis/FreqDist.php b/src/NlpTools/Analysis/FreqDist.php
Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,140 @@
+Contribution guidelines
+===================
+
+This document contains guidelines for contributing to NlpTools.
+
+Coding style
+------------------
+
+NlpTools adheres to the [psr-2][1] standard. It also follows the convention of
+appending the word *Interface* to any interface.
+
+To enforce the psr-2 style, the [php-cs-fixer][2] tool is suggested. While
+you're at it, why not enforce some more styles as well? The fixers used are the
+**default** set (which covers more than the psr-2 level does); they are listed
+explicitly here in case the defaults change in the future.
+
+* indentation
+* linefeed
+* trailing_spaces
+* unused_use
+* phpdoc_params
+* visibility
+* return
+* short_tag
+* braces
+* include
+* php_closing_tag
+* extra_empty_lines
+* psr0
+* control_spaces
+* elseif
+* eof_ending
+
+The above fixers are the default.
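+
+As a minimal sketch of the two conventions above (psr-2 layout and the
+*Interface* suffix), here is a small example. The namespace, interface and
+class names are hypothetical and are not part of NlpTools.
+
+``` php
+<?php
+namespace Example\Text;
+
+// Hypothetical interface: note the "Interface" suffix.
+interface ShortenerInterface
+{
+    /**
+     * @param  string $word The word to shorten
+     * @return string The shortened word
+     */
+    public function shorten($word);
+}
+
+// Hypothetical class: braces on their own line for classes and methods,
+// on the same line for control structures, explicit visibility everywhere.
+class PrefixShortener implements ShortenerInterface
+{
+    protected $length;
+
+    public function __construct($length = 4)
+    {
+        $this->length = $length;
+    }
+
+    public function shorten($word)
+    {
+        if (strlen($word) <= $this->length) {
+            return $word;
+        }
+
+        return substr($word, 0, $this->length);
+    }
+}
+```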
+
+Commenting Style
+--------------------------
+
+Every public method must have comments that follow the PHPDoc convention.
+@param and @return annotations are mandatory. The comments should be
+explanatory, not simply a restatement of the method's name as a sentence. If
+the method is too simple, or its name explains its actions sufficiently, then
+just add the @param and @return annotations.
+
+An example of bad commenting currently in the develop branch:
+
+``` php
+/**
+ * Calls internal functions to handle data processing
+ * @param type $string
+ */
+public function tokenize($str)
+{
+    ......
+}
+```
+
+It should be something along the lines of:
+
+``` php
+/**
+ * Splits $str to smaller strings according to Penn Treebank tokenization rules.
+ *
+ * You can see the regexes in function initPatternAndReplacement()
+ * @param  string $str The string to be tokenized
+ * @return array  An array of smaller strings (the tokens)
+ */
+....
+```
+
+Class comments are equally necessary. The class comment should explain what
+the class does from a high-level point of view. Pointers to online resources
+such as Wikipedia are welcome. A good example that also contains a reference to
+an external resource is the following:
+
+``` php
+/**
+ * Implement a gradient descent algorithm that maximizes the conditional
+ * log likelihood of the training data.
+ *
+ * See page 24 - 28 of http://nlp.stanford.edu/pubs/maxent-tutorial-slides.pdf
+ * @see NlpTools\Models\Maxent
+ */
+class MaxentGradientDescent extends GradientDescentOptimizer implements MaxentOptimizerInterface
+```
+
+Pull Requests
+--------------------
+
+### Find something to work on ###
+
+If it is your first contribution, try to find something straightforward and
+concise to implement, something that calls for development decisions rather
+than many design decisions. You could first submit an issue, if you like, and
+state that you intend to fix it yourself.
+
+### Branch off ###
+
+When you've found something to develop, create a new branch off of the develop
+branch. Make your changes, add your tests (see below for testing) and then make
+a pull request. Always keep your develop branch in sync with the remote and,
+before you create a pull request, **rebase** your local branch onto develop.
+
+If you rebased but a change has been pushed since, you don't have to remove
+the pull request, rebase and recreate it. I will pull your changes, rebase
+them, merge them and then close the pull request. As a result, some merged
+pull requests will show up as simply closed, but that is worth it to keep the
+commit history clean.
+
+In two short sentences: always create a new branch to develop on. Always
+rebase before making a pull request.
+
+### Tests ###
+
+If you are implementing a new feature, always include tests in your pull request.
+
+Contributions that consist only of tests are also extremely welcome.
+
+Testing
+-----------
+
+A bit of information can be found in the README file in the tests folder.
+
+Tests should test the implementation thoroughly. You can test your
+implementation like a black box, based only on the outputs given some inputs,
+or you can test every small part for how it works. Either is acceptable. I will
+make my point clear with an example.
+
+The PorterStemmer implementation has 5 steps, and some of them have substeps.
+One way to write the test would be to expose those steps (maybe by extending
+the PorterStemmer class) and write tests for each one. Another way would be to
+simply take a big list of English words and their stems according to the
+canonical implementation and check if your code produces the same results.
+
+While the second is a lot easier to implement, it gives very little
+information about the cause of the error when a test fails. Both are acceptable
+(in the case of the example, the second is implemented).
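+
+The black box approach can be sketched roughly as follows. This is only an
+illustration in plain PHP so that it runs on its own: the `$stem` closure is a
+stand-in, not the PorterStemmer, and the word list is made up. A real test
+would call the actual stemmer from within the project's test suite and load a
+large list produced by the canonical implementation.
+
+``` php
+<?php
+// Stand-in for a real stemmer; NOT the library's implementation.
+$stem = function ($word) {
+    return preg_replace('/(ing|s)$/', '', $word);
+};
+
+// Hypothetical word => expected stem pairs.
+$expected = array(
+    'cats'    => 'cat',
+    'walking' => 'walk',
+    'tree'    => 'tree',
+);
+
+// Compare outputs only; nothing about the inner steps is inspected.
+$failures = array();
+foreach ($expected as $word => $stemmed) {
+    $actual = $stem($word);
+    if ($actual !== $stemmed) {
+        $failures[] = sprintf('%s: expected %s, got %s', $word, $stemmed, $actual);
+    }
+}
+
+echo empty($failures) ? "all stems match\n" : implode("\n", $failures) . "\n";
+```
+
+A failure here only names the word and the wrong stem, which shows why this
+style says little about which step went wrong.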
+
+[1]: https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md
+[2]: http://cs.sensiolabs.org/
@@ -25,15 +25,19 @@ Lda is still experimental and quite slow but it works. [See an example](http://p
 
 ### Clustering ###
 
-Hierarchical and Expectations Maximization are coming soon.
-
 1. [K-Means](http://php-nlp-tools.com/documentation/clustering.html)
+2. [Hierarchical Agglomerative Clustering](http://php-nlp-tools.com/documentation/clustering.html)
+   * SingleLink
+   * CompleteLink
+   * GroupAverage
 
 ### Tokenizers ###
 
 1. [WhitespaceTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/WhitespaceTokenizer)
 2. [WhitespaceAndPunctuationTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/WhitespaceAndPunctuationTokenizer)
-3. [ClassifierBasedTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/ClassifierBasedTokenizer)
+3. [PennTreebankTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/PennTreebankTokenizer)
+4. [RegexTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/RegexTokenizer)
+5. [ClassifierBasedTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/ClassifierBasedTokenizer)
    This tokenizer allows us to build a lot more complex tokenizers
    than the previous ones
 
@@ -67,6 +71,8 @@ Hierarchical and Expectations Maximization are coming soon.
 
 1. [PorterStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/PorterStemmer)
 2. [RegexStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/RegexStemmer)
+3. [LancasterStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/LancasterStemmer)
+4. [GreekStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/GreekStemmer)
 
 ### Optimizers (MaxEnt only) ###
 
@@ -79,3 +85,10 @@ Hierarchical and Expectations Maximization are coming soon.
    resides in another [repo](https://github.com/angeloskath/nlp-maxent-optimizer),
    it is used via the [external optimizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Optimizers/ExternalMaxentOptimizer).
    TODO: At least write a readme for the optimizer written in Go.
+
+### Other ###
+
+1. Idf (inverse document frequency)
+2. Stop words
+3. Language based normalizers
+4. Classifier based transformation for creating flexible preprocessing pipelines
@@ -0,0 +1,133 @@
+<?php
+namespace NlpTools\Analysis;
+
+/**
+ * Extracts the frequency distribution of a set of tokens
+ * @author Dan Cardin
+ */
+class FreqDist
+{
+
+    /**
+     * An associative array that holds all the frequencies per token
+     * @var array
+     */
+    protected $keyValues = array();
+
+    /**
+     * The total number of tokens originally passed into FreqDist
+     * @var int
+     */
+    protected $totalTokens = null;
+
+    /**
+     * Counts and sorts the tokens right away so that frequency
+     * distribution data can be extracted.
+     * @param array $tokens
+     */
+    public function __construct(array $tokens)
+    {
+        $this->preCompute($tokens);
+        $this->totalTokens = count($tokens);
+    }
+
+    /**
+     * Get the total number of tokens in this tokensDocument
+     * @return int
+     */
+    public function getTotalTokens()
+    {
+        return $this->totalTokens;
+    }
+
+    /**
+     * Internal function for summarizing all the data into a key value store
+     * @param array $tokens The set of tokens passed into the constructor
+     */
+    protected function preCompute(array &$tokens)
+    {
+        //count all the tokens up and put them in a key value store
+        $this->keyValues = array_count_values($tokens);
+        arsort($this->keyValues);
+    }
+
+    /**
+     * Return the weight of a single token
+     * @return float
+     */
+    public function getWeightPerToken()
+    {
+        return 1 / $this->getTotalTokens();
+    }
+
+    /**
+     * Return the total number of unique tokens
+     * @return int
+     */
+    public function getTotalUniqueTokens()
+    {
+        return count($this->keyValues);
+    }
+
+    /**
+     * Return the keys (tokens) sorted by descending frequency
+     * @return array
+     */
+    public function getKeys()
+    {
+        return array_keys($this->keyValues);
+    }
+
+    /**
+     * Return the values (counts) sorted by descending frequency
+     * @return array
+     */
+    public function getValues()
+    {
+        return array_values($this->keyValues);
+    }
+
+    /**
+     * Return the full key value store
+     * @return array
+     */
+    public function getKeyValues()
+    {
+        return $this->keyValues;
+    }
+
+    /**
+     * Returns an array of tokens that occurred exactly once (hapaxes)
+     * @todo This is an inefficient approach
+     * @return array
+     */
+    public function getHapaxes()
+    {
+        $hapaxes = array();
+
+        // keyValues is sorted by descending frequency, so any hapaxes sit at
+        // the tail: walk backwards from the end while the count is still 1.
+        // current() returns false on an empty array or past the first
+        // element, which also ends the loop.
+        for (end($this->keyValues); current($this->keyValues) === 1; prev($this->keyValues)) {
+            $hapaxes[] = key($this->keyValues);
+        }
+
+        //reset the internal pointer in the array
+        reset($this->keyValues);
+
+        return $hapaxes;
+    }
+
+}
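
For reference, the quantities FreqDist exposes can be reproduced with plain PHP arrays. The sketch below does not load the class itself; it mirrors the counting step in `preCompute()` on a made-up token list, and the variable names are illustrative only.

```php
<?php
// Illustrative token list (an assumption, not taken from the library's tests).
$tokens = array('the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end');

// Same counting step as FreqDist::preCompute(): count each token, then sort
// by descending frequency while keeping the token => count association.
$counts = array_count_values($tokens);
arsort($counts);

$totalTokens    = count($tokens);   // corresponds to getTotalTokens()
$uniqueTokens   = count($counts);   // corresponds to getTotalUniqueTokens()
$weightPerToken = 1 / $totalTokens; // corresponds to getWeightPerToken()

// Tokens that occur exactly once (hapaxes), found with a strict search.
$hapaxes = array_keys($counts, 1, true);

// The first key after sorting is the most frequent token.
reset($counts);
$mostFrequent = key($counts);

printf("%d tokens, %d unique, most frequent: %s\n", $totalTokens, $uniqueTokens, $mostFrequent);
// prints "8 tokens, 6 unique, most frequent: the"
```

The order of the hapaxes among themselves is not something to rely on, since they all share the same count.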