Commit 0567c01

ibnesayeed authored and Ch4s3 committed
Ability to add custom stopwords at classifier initialization (#129)
* Ability to add custom stopwords at classifier initialization
* Downcased custom test stopwords
* Documented and improved custom stopwords handling
* Added test cases for custom stopwords and empty trainings, #125 and #130
* Added documentation for auto-categorization and custom stopwords
1 parent 006d31a commit 0567c01

6 files changed, 168 additions and 16 deletions

docs/bayes.md

Lines changed: 50 additions & 5 deletions
@@ -69,7 +69,7 @@ classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend
 
 ## Beyond the Basics
 
-Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trival applications.
+Beyond the basic example, the constructor and trainer can be used in a more flexible way to accommodate non-trivial applications.
 Consider the following program.
 
 ```ruby
@@ -80,7 +80,7 @@ require 'classifier-reborn'
 training_set = DATA.read.split("\n")
 categories = training_set.shift.split(',').map{|c| c.strip}
 
-# pass :auto_categorize option to allow feeding previously unknown categories
+# Pass :auto_categorize option to allow feeding previously unknown categories
 classifier = ClassifierReborn::Bayes.new categories, auto_categorize: true
 
 training_set.each do |a_line|
@@ -90,7 +90,7 @@ training_set.each do |a_line|
 end
 
 puts classifier.classify "I hate bad words and you" #=> 'Uninteresting'
-puts classifier.classify "I hate javascript" #=> 'Uninteresting'
+puts classifier.classify "I hate JavaScript" #=> 'Uninteresting'
 puts classifier.classify "JavaScript is bad" #=> 'Uninteresting'
 
 puts classifier.classify "All you need is ruby" #=> 'Interesting'
@@ -107,7 +107,7 @@ interesting: The love boat, soon we will be taking another ride
 interesting: Ruby don't take your love to town
 
 uninteresting: Here are some bad words, I hate you
-uninteresting: Bad bad leroy brown badest man in the darn town
+uninteresting: Bad bad Leroy Brown badest man in the darn town
 uninteresting: The good the bad and the ugly
 uninteresting: Java, JavaScript, CSS front-end HTML
 #
@@ -119,12 +119,57 @@ dog: A good hunting dog is a fine thing
 dog: Man my dogs are tired
 dog: Dogs are better than cats in soooo many ways
 
-cat: The fuzz ball spilt the milk
+cat: The fuzz ball spilled the milk
 cat: Got rats or mice get a cat to kill them
 cat: Cats never come when you call them
 cat: That dang cat keeps scratching the furniture
 ```
 
+If no categories are specified at initialization then `:auto_categorize` is set to `true` by default.
+However, dynamic methods like `train_some_category` or `untrain_some_category` will not work unless corresponding categories exist.
+
+```ruby
+require 'classifier-reborn'
+
+classifier = ClassifierReborn::Bayes.new
+classifier.train("cat", "I can has cat")
+# The above method will work, but the following will throw an error
+# classifier.train_cat "I can has cat"
+```
+
+## Custom Stopwords
+
+The library ships with stopword files in various languages.
+However, in certain situations a custom stopwords list is desired for domain-specific classifiers.
+Custom stopwords can be specified at classifier initialization by supplying an array of stopwords or the path of a stopwords file.
+These stopwords will only be applied for the language of the classifier instance.
+To disable stopwords completely, pass an empty string (`""`) or empty array (`[]`) as the value of the `:stopwords` parameter.
+
+```ruby
+require 'classifier-reborn'
+
+custom_stopwords = ["custom", "stop", "words"]
+classifier = ClassifierReborn::Bayes.new stopwords: custom_stopwords
+# Or from a file
+classifier = ClassifierReborn::Bayes.new stopwords: "/path/to/custom/stopwords/file"
+# Or to disable stopwords
+classifier = ClassifierReborn::Bayes.new stopwords: ""
+# Alternatively, to disable stopwords
+classifier = ClassifierReborn::Bayes.new stopwords: []
+```
+
+Training and untraining with empty strings or strings that consist of only stopwords will be skipped.
+An attempt to classify such strings will return `nil` or a category with score `Infinity` (depending on whether the threshold is enabled).
+
+The above method of custom stopwords will overwrite the existing stopwords for the language of the classifier instance.
+However, to supplement the existing set of stopwords, more directory paths containing stopwords files can be added.
+In this case, each stopwords file name needs to be the same as the corresponding language code, such as `en` for English or `ar` for Arabic.
+
+
+```ruby
+ClassifierReborn::Hasher.add_custom_stopword_path("/path/to/additional/stopwords/directory")
+```
+
 ## Knowing the Score
 
 When you ask a Bayesian classifier to classify text against a set of trained categories it does so by generating a score (as a Float) for each possible category.
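
A note on the directory convention documented above: each supplementary directory is expected to hold one file per language, named with the bare language code. The Ruby sketch below illustrates that layout with a hypothetical helper, `stopwords_from_directories`, which is not part of classifier-reborn. It returns the words from the first directory that has a matching file; how the library itself orders or combines directories is not shown in this diff, so that choice is an assumption.

```ruby
require 'set'
require 'tmpdir'

# Hypothetical helper, not part of classifier-reborn: return the stopwords
# found in the first directory that contains a file named after the language
# code, or an empty Set when no directory has one.
def stopwords_from_directories(language, directories)
  directories.each do |dir|
    path = File.join(dir, language)
    return Set.new(File.read(path, encoding: 'utf-8').split) if File.file?(path)
  end
  Set.new
end

Dir.mktmpdir do |custom_dir|
  # File names are bare language codes ("en", "ar"), with no extension.
  File.write(File.join(custom_dir, 'en'), "widget\ngadget\n")
  File.write(File.join(custom_dir, 'ar'), "placeholder\n")

  p stopwords_from_directories('en', [custom_dir]).to_a.sort # prints ["gadget", "widget"]
  p stopwords_from_directories('fr', [custom_dir]).empty?    # prints true
end
```

A file called `en.txt` or `english` would not be found by a lookup of this shape, which is why the documentation insists on the language code as the file name.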

lib/classifier-reborn/bayes.rb

Lines changed: 24 additions & 1 deletion
@@ -2,6 +2,8 @@
 # Copyright:: Copyright (c) 2005 Lucas Carlson
 # License:: LGPL
 
+require 'set'
+
 require_relative 'category_namer'
 require_relative 'backends/bayes_memory_backend'
 require_relative 'backends/bayes_redis_backend'
@@ -16,10 +18,11 @@ class Bayes
     #
     # Options available are:
     #   language:         'en'    Used to select language specific stop words
-    #   auto_categorize:  false   When true, enables ability to dynamically declare a category
+    #   auto_categorize:  false   When true, enables ability to dynamically declare a category; the default is true if no initial categories are provided
     #   enable_threshold: false   When true, enables a threshold requirement for classification
     #   threshold:        0.0     Default threshold, only used when enabled
     #   enable_stemmer:   true    When false, disables word stemming
+    #   stopwords:        nil     Accepts path to a text file or an array of words, when supplied, overwrites the default stopwords; assign empty string or array to disable stopwords
     #   backend:          BayesMemoryBackend.new   Alternatively, BayesRedisBackend.new for persistent storage
     def initialize(*args)
       initial_categories = []
@@ -51,6 +54,10 @@ def initialize(*args)
       initial_categories.each do |c|
         add_category(c)
       end
+
+      if options.key?(:stopwords)
+        custom_stopwords options[:stopwords]
+      end
     end
 
     # Provides a general training method for all categories specified in Bayes#new
@@ -236,5 +243,21 @@ def add_category(category)
     end
 
     alias_method :append_category, :add_category
+
+    private
+
+    # Overwrites the default stopwords for current language with supplied list of stopwords or file
+    def custom_stopwords(stopwords)
+      unless stopwords.is_a?(Enumerable)
+        if stopwords.strip.empty?
+          stopwords = []
+        elsif File.exist?(stopwords)
+          stopwords = File.read(stopwords).force_encoding("utf-8").split
+        else
+          return # Do not overwrite the default
+        end
+      end
+      Hasher::STOPWORDS[@language] = Set.new stopwords
+    end
   end
 end
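
The resolution rules in `custom_stopwords` can be exercised without the gem. The sketch below restates them as a pure function: an Enumerable is used as given, a blank string disables stopwords, an existing file is read and split on whitespace, and any other string leaves the defaults alone. `resolve_stopwords` is a hypothetical name, and returning `nil` to mean "keep the defaults" is an assumption of this sketch. It also treats an explicit `nil` as "keep the defaults", whereas the method in the diff, as written, would call `strip` on `nil`.

```ruby
require 'set'
require 'tempfile'

# Hypothetical stand-alone restatement of the rules in the diff above.
# Returns the Set to install, or nil when the defaults should stay in place.
def resolve_stopwords(value)
  return nil if value.nil?                       # assumption: nil keeps defaults
  return Set.new(value) if value.is_a?(Enumerable)

  text = value.to_s
  return Set.new if text.strip.empty?            # "" disables stopwords
  return nil unless File.exist?(text)            # unknown path: keep defaults

  Set.new(File.read(text, encoding: 'utf-8').split)
end

p resolve_stopwords(%w[custom stop words]).size   # prints 3
p resolve_stopwords('').empty?                    # prints true
p resolve_stopwords('/no/such/stopwords/file')    # prints nil

Tempfile.create('stopwords') do |file|
  file.write("these are custom stopwords\n")
  file.flush
  p resolve_stopwords(file.path).to_a.sort        # prints ["are", "custom", "stopwords", "these"]
end
```

Because a String is not Enumerable in Ruby, the first branch only catches arrays, sets and similar collections, which is what lets a single `:stopwords` option accept either a word list or a path.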

test/bayes/bayesian_common_tests.rb

Lines changed: 47 additions & 2 deletions
@@ -36,14 +36,14 @@ def test_add_category
 
   def test_dynamic_category_succeeds_with_auto_categorize
     classifier = auto_categorize_classifier
-    classifier.train('Ruby', 'I really sweet language')
+    classifier.train('Ruby', 'A really sweet language')
     assert classifier.categories.include?('Ruby')
   end
 
   def test_dynamic_category_succeeds_with_empty_categories
     classifier = empty_classifier
     assert classifier.categories.empty?
-    classifier.train('Ruby', 'I really sweet language')
+    classifier.train('Ruby', 'A really sweet language')
     assert classifier.categories.include?('Ruby')
     assert_equal 1, classifier.categories.size
   end
@@ -133,4 +133,49 @@ def test_untrain
     classification_after_untrain = @classifier.classify 'seven'
     refute_equal classification_of_bad_data, classification_after_untrain
   end
+
+  def test_skip_empty_training_and_classification
+    classifier = empty_classifier
+    classifier.train('Ruby', '')
+    assert classifier.categories.empty?
+    classifier.train('Ruby', 'To be or not to be')
+    assert classifier.categories.empty?
+    classifier.train('Ruby', 'A really sweet language')
+    refute classifier.categories.empty?
+    assert_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
+  end
+
+  def test_empty_string_stopwords
+    classifier = empty_string_stopwords_classifier
+    classifier.train('Stopwords', 'To be or not to be')
+    refute classifier.categories.empty?
+    refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
+  end
+
+  def test_empty_array_stopwords
+    classifier = empty_array_stopwords_classifier
+    classifier.train('Stopwords', 'To be or not to be')
+    refute classifier.categories.empty?
+    refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
+  end
+
+  def test_custom_array_stopwords
+    classifier = array_stopwords_classifier
+    classifier.train('Stopwords', 'Custom stopwords')
+    assert classifier.categories.empty?
+    classifier.train('Stopwords', 'To be or not to be')
+    refute classifier.categories.empty?
+    assert_equal Float::INFINITY, classifier.classify_with_score('These stopwords')[1]
+    refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
+  end
+
+  def test_custom_file_stopwords
+    classifier = file_stopwords_classifier
+    classifier.train('Stopwords', 'Custom stopwords')
+    assert classifier.categories.empty?
+    classifier.train('Stopwords', 'To be or not to be')
+    refute classifier.categories.empty?
+    assert_equal Float::INFINITY, classifier.classify_with_score('These stopwords')[1]
+    refute_equal Float::INFINITY, classifier.classify_with_score('To be or not to be')[1]
+  end
 end
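
The first new test relies on training being a no-op when nothing but stopwords survives tokenization. A minimal sketch of that rule follows. It is not the library's implementation: the tokenizer, the four-word stopword list and the names `significant_words` and `train_sketch` are all assumptions made for illustration.

```ruby
require 'set'

# Assumed, tiny stopword list; the real English list is much longer.
SKETCH_STOPWORDS = Set['to', 'be', 'or', 'not'].freeze

# Lower-case the text, split it into alphanumeric tokens, drop stopwords.
def significant_words(text, stopwords = SKETCH_STOPWORDS)
  text.downcase.scan(/[[:alnum:]]+/).reject { |word| stopwords.include?(word) }
end

# Record word counts under the category, but only when some word survives.
# A skipped training call therefore never creates the category.
def train_sketch(categories, category, text)
  words = significant_words(text)
  return categories if words.empty?

  counts = (categories[category] ||= Hash.new(0))
  words.each { |word| counts[word] += 1 }
  categories
end

categories = {}
train_sketch(categories, 'Ruby', '')
train_sketch(categories, 'Ruby', 'To be or not to be')
p categories.empty?   # prints true
train_sketch(categories, 'Ruby', 'A really sweet language')
p categories.keys     # prints ["Ruby"]
```

With an empty stopword set the same call on 'To be or not to be' does create the category, which is the behaviour the two "empty stopwords" tests check for.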

test/bayes/bayesian_memory_test.rb

Lines changed: 21 additions & 0 deletions
@@ -8,6 +8,11 @@ class BayesianMemoryTest < Minitest::Test
 
   def setup
     @classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting'
+    @old_stopwords = Hasher::STOPWORDS['en']
+  end
+
+  def teardown
+    Hasher::STOPWORDS['en'] = @old_stopwords
   end
 
   def another_classifier
@@ -29,4 +34,20 @@ def empty_classifier
   def useless_classifier
     ClassifierReborn::Bayes.new auto_categorize: false
   end
+
+  def empty_string_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: ""
+  end
+
+  def empty_array_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: []
+  end
+
+  def array_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: ["these", "are", "custom", "stopwords"]
+  end
+
+  def file_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: File.dirname(__FILE__) + '/../data/stopwords/en'
+  end
 end
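
The `setup`/`teardown` pair above saves and restores a shared stopword entry so that one test's custom list does not leak into the next. The sketch below shows the same pattern in isolation. `REGISTRY` is a hypothetical stand-in for a process-wide hash such as `Hasher::STOPWORDS`, and `with_restored_entry` is an illustrative helper, not something the test suite defines.

```ruby
require 'set'

# Hypothetical stand-in for a process-wide registry keyed by language code.
REGISTRY = { 'en' => Set['the', 'and'] }

# Save the current entry, run the block, and always put the entry back.
def with_restored_entry(registry, key)
  saved = registry[key]   # a reference to the current Set, not a copy
  yield
ensure
  registry[key] = saved
end

with_restored_entry(REGISTRY, 'en') do
  # Assigning a new Set replaces the entry and leaves the saved one untouched.
  REGISTRY['en'] = Set.new(%w[custom stop words])
end

p REGISTRY['en'].to_a.sort   # prints ["and", "the"]
```

Saving only a reference is enough here because the custom list is installed by assignment. If the code under test mutated the existing Set in place, the saved reference would change with it and a `dup` would be needed before the test runs.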

test/bayes/bayesian_redis_test.rb

Lines changed: 24 additions & 6 deletions
@@ -12,12 +12,14 @@ def setup
       @redis_backend.instance_variable_get(:@redis).config(:set, "save", "")
       @alternate_redis_backend = ClassifierReborn::BayesRedisBackend.new(db: 1)
       @classifier = ClassifierReborn::Bayes.new 'Interesting', 'Uninteresting', backend: @redis_backend
+      @old_stopwords = Hasher::STOPWORDS['en']
     rescue Redis::CannotConnectError => e
       skip(e)
     end
   end
 
   def teardown
+    Hasher::STOPWORDS['en'] = @old_stopwords
     @redis_backend.instance_variable_get(:@redis).flushdb
     @alternate_redis_backend.instance_variable_get(:@redis).flushdb
   end
@@ -34,11 +36,27 @@ def threshold_classifier(category)
     ClassifierReborn::Bayes.new category, backend: @alternate_redis_backend
   end
 
-  def empty_classifier
-    ClassifierReborn::Bayes.new backend: @alternate_redis_backend
-  end
+  def empty_classifier
+    ClassifierReborn::Bayes.new backend: @alternate_redis_backend
+  end
 
-  def useless_classifier
-    ClassifierReborn::Bayes.new auto_categorize: false, backend: @alternate_redis_backend
-  end
+  def useless_classifier
+    ClassifierReborn::Bayes.new auto_categorize: false, backend: @alternate_redis_backend
+  end
+
+  def empty_string_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: "", backend: @alternate_redis_backend
+  end
+
+  def empty_array_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: [], backend: @alternate_redis_backend
+  end
+
+  def array_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: ["these", "are", "custom", "stopwords"], backend: @alternate_redis_backend
+  end
+
+  def file_stopwords_classifier
+    ClassifierReborn::Bayes.new stopwords: File.dirname(__FILE__) + '/../data/stopwords/en', backend: @alternate_redis_backend
+  end
 end

test/data/stopwords/en

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-These
+these
 are
 custom
-stopwords
+stopwords
