Commit d92524c

Merge branch 'develop' and advance to v0.1
2 parents e3a5c28 + e3e7be5 commit d92524c

89 files changed

+2946
-195
lines changed

CONTRIBUTING.md

Lines changed: 140 additions & 0 deletions
@@ -0,0 +1,140 @@
1+
Contribution guidelines
2+
===================
3+
4+
This document contains guidelines for contributing to NlpTools.
5+
6+
Coding style
7+
------------------
8+
9+
NlpTools adheres to the [PSR-2][1] standard. It also follows the convention of
10+
appending the word *Interface* to any interface.
11+
12+
To enforce the PSR-2 style, we suggest using the [php-cs-fixer][2] tool.
13+
While you're at it, why not enforce some more styles as well? The fixers used
14+
are the **default** ones (which cover more than the PSR-2 level), but they will be
15+
explicitly listed here in case they change in the future.
16+
17+
* indentation
18+
* linefeed
19+
* trailing_spaces
20+
* unused_use
21+
* phpdoc_params
22+
* visibility
23+
* return
24+
* short_tag
25+
* braces
26+
* include
27+
* php_closing_tag
28+
* extra_empty_lines
29+
* psr0
30+
* control_spaces
31+
* elseif
32+
* eof_ending
33+
34+
The above fixers are the default.
35+
36+
Commenting Style
37+
--------------------------
38+
39+
Every public method must have a comment that follows the PHPDoc convention.
40+
@param and @return annotations are mandatory. The comments should be
41+
explanatory, not simply a restatement of the method's name as a sentence. If the method
42+
is too simple or the name explains the actions sufficiently then just add the
43+
@param and @return annotations.
44+
45+
An example of bad commenting currently in the develop branch:
46+
47+
``` php
48+
/**
49+
* Calls internal functions to handle data processing
50+
* @param type $string
51+
*/
52+
public function tokenize($str)
53+
{
54+
......
55+
}
56+
```
57+
58+
It should be something along the lines of:
59+
60+
``` php
61+
/**
62+
* Splits $str to smaller strings according to Penn Treebank tokenization rules.
63+
*
64+
* You can see the regexes in function initPatternAndReplacement()
65+
* @param string $str The string to be tokenized
66+
* @return array An array of smaller strings (the tokens)
67+
*/
68+
....
69+
```
70+
71+
Equally necessary are class comments. The class comment should explain
72+
what the class does from a high-level point of view. Pointers to online resources
73+
like Wikipedia are welcome. A good example that also contains a reference to an
74+
external resource is the following:
75+
76+
``` php
77+
/**
78+
* Implement a gradient descent algorithm that maximizes the conditional
79+
* log likelihood of the training data.
80+
*
81+
* See page 24 - 28 of http://nlp.stanford.edu/pubs/maxent-tutorial-slides.pdf
82+
* @see NlpTools\Models\Maxent
83+
*/
84+
class MaxentGradientDescent extends GradientDescentOptimizer implements MaxentOptimizerInterface
85+
```
86+
87+
Pull Requests
88+
--------------------
89+
90+
### Find something to work on ###
91+
92+
If it is your first contribution, try to find something straightforward and
93+
concise to implement, something that involves fewer design decisions than development
94+
decisions. You could first submit an issue, if you like, and state your intention to
95+
correct this issue yourself.
96+
97+
### Branch off ###
98+
99+
When you've found something to develop, create a new branch off the develop
100+
branch. Make your changes, add your tests (see below for testing) and then make
101+
a pull request. Always keep your develop branch in sync with the remote and
102+
before you create a pull request, **rebase** your local branch onto develop.
103+
104+
If you rebased but there has been a change pushed since, you don't have to
105+
close the pull request, rebase and recreate it. I will pull your changes,
106+
rebase them, merge them and then close the pull request. This will have the
107+
effect of showing some merged pull requests as simply closed, but it is worth
108+
keeping the commit history clean.
109+
110+
So, in two short sentences: Always create a new branch to develop on. Always
111+
rebase before making a pull request.
112+
113+
### Tests ###
114+
115+
If you are implementing a new feature, always include tests in your pull request.
116+
117+
Contributions that consist only of tests are also extremely welcome.
118+
119+
Testing
120+
-----------
121+
122+
Some information can be found in the README file in the tests folder.
123+
124+
Tests should test the implementation thoroughly. You can test your
125+
implementation like a black box, based only on the outputs given some inputs,
126+
or you can test every small part for how it works. Either is acceptable. I will
127+
make my point clear with an example.
128+
129+
The PorterStemmer implementation has 5 steps, and some even have substeps. One
130+
way to write the test would be to expose those steps (maybe by extending the
131+
PorterStemmer class) and write tests for each one. Another way would be to
132+
simply take a big list of English words and their stems according to the
133+
canonical implementation and check if your code produces the same results.
134+
135+
While the second is a lot easier to implement, in case of failure, it gives
136+
very little information regarding the cause of the error. Both are acceptable
137+
(in the case of the example the second is implemented).
138+
139+
[1]: https://github.com/php-fig/fig-standards/blob/master/accepted/PSR-2-coding-style-guide.md
140+
[2]: http://cs.sensiolabs.org/
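
The two testing approaches described in the Testing section of CONTRIBUTING.md above can be sketched roughly as follows. This is only an illustration: `ToyStemmer`, `ExposedToyStemmer` and `check()` are hypothetical names invented for this sketch, the two suffix rules are made up, and nothing here is the Porter algorithm or the NlpTools test suite.

``` php
<?php

/**
 * Toy two-step stemmer used only to illustrate the testing approaches.
 */
class ToyStemmer
{
    /**
     * @param  string $word The word to stem
     * @return string The word after both steps have been applied
     */
    public function stem($word)
    {
        return $this->stepTwo($this->stepOne($word));
    }

    // Step 1: strip a plural "s" but leave a trailing "ss" alone
    protected function stepOne($word)
    {
        if (substr($word, -1) === 's' && substr($word, -2) !== 'ss') {
            return substr($word, 0, -1);
        }

        return $word;
    }

    // Step 2: strip a trailing "ing" from words longer than 5 characters
    protected function stepTwo($word)
    {
        if (strlen($word) > 5 && substr($word, -3) === 'ing') {
            return substr($word, 0, -3);
        }

        return $word;
    }
}

// Approach 1: expose the protected steps by extending the class
class ExposedToyStemmer extends ToyStemmer
{
    public function publicStepOne($word)
    {
        return $this->stepOne($word);
    }

    public function publicStepTwo($word)
    {
        return $this->stepTwo($word);
    }
}

// Minimal stand-in for a test framework assertion
function check($expected, $actual, $label)
{
    if ($expected !== $actual) {
        throw new RuntimeException("$label: expected '$expected', got '$actual'");
    }
}

$exposed = new ExposedToyStemmer();
check('cat', $exposed->publicStepOne('cats'), 'step one');
check('glass', $exposed->publicStepOne('glass'), 'step one');
check('walk', $exposed->publicStepTwo('walking'), 'step two');

// Approach 2: black box, compare final stems against a reference list
$stemmer = new ToyStemmer();
$reference = array('cats' => 'cat', 'walking' => 'walk', 'sing' => 'sing');
foreach ($reference as $word => $stem) {
    check($stem, $stemmer->stem($word), "stem($word)");
}
```

A failing per-step check names the step that misbehaved, while a failing black-box check only names the word, which is the trade-off the guidelines describe.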

README.markdown

Lines changed: 16 additions & 3 deletions
@@ -25,15 +25,19 @@ Lda is still experimental and quite slow but it works. [See an example](http://p
2525

2626
### Clustering ###
2727

28-
Hierarchical and Expectations Maximization are coming soon.
29-
3028
1. [K-Means](http://php-nlp-tools.com/documentation/clustering.html)
29+
2. [Hierarchical Agglomerative Clustering](http://php-nlp-tools.com/documentation/clustering.html)
30+
* SingleLink
31+
* CompleteLink
32+
* GroupAverage
3133

3234
### Tokenizers ###
3335

3436
1. [WhitespaceTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/WhitespaceTokenizer)
3537
2. [WhitespaceAndPunctuationTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/WhitespaceAndPunctuationTokenizer)
36-
3. [ClassifierBasedTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/ClassifierBasedTokenizer)
38+
3. [PennTreebankTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/PennTreebankTokenizer)
39+
4. [RegexTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools\Tokenizers\RegexTokenizer)
40+
5. [ClassifierBasedTokenizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Tokenizers/ClassifierBasedTokenizer)
3741
This tokenizer allows us to build a lot more complex tokenizers
3842
than the previous ones
3943

@@ -67,6 +71,8 @@ Hierarchical and Expectations Maximization are coming soon.
6771

6872
1. [PorterStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/PorterStemmer)
6973
2. [RegexStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/RegexStemmer)
74+
3. [LancasterStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/LancasterStemmer)
75+
4. [GreekStemmer](http://php-nlp-tools.com/documentation/api/#NlpTools/Stemmers/GreekStemmer)
7076

7177
### Optimizers (MaxEnt only) ###
7278

@@ -79,3 +85,10 @@ Hierarchical and Expectations Maximization are coming soon.
7985
resides in another [repo](https://github.com/angeloskath/nlp-maxent-optimizer),
8086
it is used via the [external optimizer](http://php-nlp-tools.com/documentation/api/#NlpTools/Optimizers/ExternalMaxentOptimizer).
8187
TODO: At least write a readme for the optimizer written in Go.
88+
89+
### Other ###
90+
91+
1. Idf (inverse document frequency)
92+
2. Stop words
93+
3. Language based normalizers
94+
4. Classifier based transformation for creating flexible preprocessing pipelines

src/NlpTools/Analysis/FreqDist.php

Lines changed: 133 additions & 0 deletions
@@ -0,0 +1,133 @@
1+
<?php
2+
namespace NlpTools\Analysis;
3+
4+
use NlpTools\Documents\TokensDocument;
5+
6+
/**
7+
* Extract the Frequency distribution of keywords
8+
* @author Dan Cardin
9+
*/
10+
class FreqDist
11+
{
12+
13+
/**
14+
* An associative array that holds all the frequencies per token
15+
* @var array
16+
*/
17+
protected $keyValues = array();
18+
19+
/**
20+
* The total number of tokens originally passed into FreqDist
21+
* @var int
22+
*/
23+
protected $totalTokens = null;
24+
25+
/**
26+
* This sorts the token meta data collection right away so that
27+
* frequency distribution data can be extracted.
28+
* @param array $tokens
29+
*/
30+
public function __construct(array $tokens)
31+
{
32+
$this->preCompute($tokens);
33+
$this->totalTokens = count($tokens);
34+
}
35+
36+
/**
37+
* Get the total number of tokens in this tokensDocument
38+
* @return int
39+
*/
40+
public function getTotalTokens()
41+
{
42+
return $this->totalTokens;
43+
}
44+
45+
/**
46+
* Internal function for summarizing all the data into a key value store
47+
* @param array $tokens The set of tokens passed into the constructor
48+
*/
49+
protected function preCompute(array &$tokens)
50+
{
51+
//count all the tokens up and put them in a key value store
52+
$this->keyValues = array_count_values($tokens);
53+
arsort($this->keyValues);
54+
}
55+
56+
/**
57+
* Return the weight of a single token
58+
* @return float
59+
*/
60+
public function getWeightPerToken()
61+
{
62+
return 1 / $this->getTotalTokens();
63+
}
64+
65+
/**
66+
* Return the total number of unique tokens
67+
* @return int
68+
*/
69+
public function getTotalUniqueTokens()
70+
{
71+
return count($this->keyValues);
72+
}
73+
74+
/**
75+
* Return the sorted keys by frequency desc
76+
* @return array
77+
*/
78+
public function getKeys()
79+
{
80+
return array_keys($this->keyValues);
81+
}
82+
83+
/**
84+
* Return the sorted values by frequency desc
85+
* @return array
86+
*/
87+
public function getValues()
88+
{
89+
return array_values($this->keyValues);
90+
}
91+
92+
/**
93+
* Return the full key value store
94+
* @return array
95+
*/
96+
public function getKeyValues()
97+
{
98+
return $this->keyValues;
99+
}
100+
101+
/**
102+
*
103+
* Returns an array of tokens that occurred once
104+
* @todo This is an inefficient approach
105+
* @return array
106+
*/
107+
public function getHapaxes()
108+
{
109+
$hapaxes = array();
110+
111+
//get the head key
112+
$head = key($this->keyValues);
113+
114+
//get the tail value and set the internal pointer to the tail
115+
$tail = end($this->keyValues);
116+
// no hapaxes available
117+
if ($tail === false || $tail > 1) {
118+
return array();
119+
}
120+
121+
do {
122+
$hapaxes[] = key($this->keyValues);
123+
prev($this->keyValues);
124+
125+
} while (current($this->keyValues) == 1 && key($this->keyValues) !== null);
126+
127+
//reset the internal pointer in the array
128+
reset($this->keyValues);
129+
130+
return $hapaxes;
131+
}
132+
133+
}
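
As a rough, self-contained illustration of what `FreqDist` computes, the sketch below counts tokens with `array_count_values()`, sorts them with `arsort()` as `preCompute()` does, and collects the hapaxes with a plain loop instead of the internal array pointer. `findHapaxes()` is a hypothetical helper written for this example; it is not part of NlpTools and does not claim to match the class's exact behavior or ordering.

``` php
<?php

/**
 * Returns the tokens that occur exactly once.
 *
 * Hypothetical helper for illustration; it is not part of NlpTools.
 *
 * @param  array $tokens A flat list of string tokens
 * @return array The tokens whose count is 1
 */
function findHapaxes(array $tokens)
{
    // token => count, sorted by count descending
    $counts = array_count_values($tokens);
    arsort($counts);

    $hapaxes = array();
    foreach ($counts as $token => $count) {
        if ($count === 1) {
            $hapaxes[] = $token;
        }
    }

    return $hapaxes;
}

$tokens = array('the', 'cat', 'saw', 'the', 'dog');

$hapaxes = findHapaxes($tokens);
sort($hapaxes); // the order among tokens with equal counts is not relied upon

print_r($hapaxes);             // contains cat, dog, saw
echo 1 / count($tokens), "\n"; // weight per token: 0.2
```

The loop visits every distinct token once, so it also handles an empty token list and the case where every token is a hapax.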
