Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add critbit support #20

Open
rbotafogo opened this issue Sep 17, 2015 · 4 comments
Open

Add critbit support #20

rbotafogo opened this issue Sep 17, 2015 · 4 comments

Comments

@rbotafogo
Copy link

May I suggest that you also use the critbit gem and add words to critbit? In this way you can also do search by prefix.

@abitdodgy
Copy link
Owner

Hi there. Thanks for the suggestion. But I'm not sure how this would be useful this would. It also seems to be a heavy dependency to introduce the gem. I'm a bit reluctant to have any JRuby dependencies here too. What do you think?

@rbotafogo
Copy link
Author

Hi Mohamad,

Using Critbit would allow the tokenizer to also show all the tokens in
sorted order and do searches by token prefix. It would probably be also
more space efficient, as Critbit only stores a prefix once as the image
below:

[image: Imagem inline 1]

I agree that you should not bring any JRuby dependencies to your Gem. My
idea is for you to allow the user to give you the storage class. By
default, your code should use a Hash, however, if the user is on JRuby and
she has Critbit installed, then when instantiating a WordsCounted, she
could pass the Critbit class. Your code has only to consider that instead
of working with Hash, it could work with any class that has the same Hash
interface.

What do you think?

Cheers,

2015-10-24 19:37 GMT-02:00 Mohamad El-Husseini notifications@github.com:

Hi there. Thanks for the suggestion. But I'm not sure how this would be
useful this would. It also seems to be a heavy dependency to introduce the
gem. I'm a bit reluctant to have any JRuby dependencies here too. What do
you think?


Reply to this email directly or view it on GitHub
#20 (comment)
.

Rodrigo Botafogo
Integrando TI ao seu negócio
21-3010-4802/11-3010-1802

@abitdodgy
Copy link
Owner

Your code has only to consider that instead
of working with Hash, it could work with any class that has the same Hash
interface.

Thanks. I'm having a hard time visualizing what you mean, but that's because I'm not familiar with CritBit. Do you mind sharing a small example of what the interface would look like? Some example, pseudo-code?

I've made a few changes to the gem in the last couple of weeks. Are you aware of those changes? I've decoupled the Tokeniser and the Counter classes.

@rbotafogo
Copy link
Author

Hi Mohamad,

The Hash API should be a subset of the Critbit API. In principle, wherever
you use a Hash you could use a Critbit. The main difference is that Critbit
will keep the data in sorted order while Hash if you do hash.each will
cycle through the data in the order it was inserted. Bellow I show the use
of Critbit as a Hash:

crit = Critbit.new

add some key, value pairs to crit

crit["hello"] = 0
crit["there"] = 1
crit["essa"] = 10
crit["Essa é uma frase para armazenar"] = 100
assert_equal(4, crit.size)
assert_equal(0, crit["hello"])
assert_equal(1, crit["there"])
assert_equal(100, crit["Essa é uma frase para armazenar"])

fetch the key from crit

assert_equal(0, crit.fetch("hello"))

remove a key, value pair from crit. Given the key it will return

the value and

remove the entry

assert_equal(0, crit.delete("hello"))
assert_equal(3, crit.size)
assert_equal(nil, crit["hello"])
crit.delete("not there") { |k| p "#{k} is not there" }

assert_raise ( KeyError ) { crit.fetch("hello") }
assert_equal("NotFound", crit.fetch("hello", "NotFound"))

crit also accepts complex objects

crit["works?"] = [10, 20, 30]
assert_equal([10, 20, 30], crit["works?"])
assert_equal(["works?", [10, 20, 30]], crit.assoc("works?"))

check if keys are stored in crit

assert_equal(true, crit.has_key?("there"))
assert_equal(false, crit.has_key?("Not there"))

crit stores data in sorted order, so we can call min and max on crit

assert_equal(["Essa é uma frase para armazenar", 100], crit.min)
assert_equal(["works?", [10, 20, 30]], crit.max)

crit also allows for checking value containment

assert_equal(true, crit.has_value?(100))
assert_equal(true, crit.has_value?([10, 20, 30]))
assert_equal(false, crit.has_value?("hello"))

method entries returns all entries in the Critbit... same as Hash

assert_equal([["Essa \u00E9 uma frase para armazenar", 100], ["essa", 10],
["there", 1], ["works?", [10, 20, 30]]], crit.entries)

it is possible to change a value for a given key

crit["essa"] = 20
assert_equal([["Essa \u00E9 uma frase para armazenar", 100], ["essa", 20],
["there", 1], ["works?", [10, 20, 30]]], crit.entries)

Critbit also allows to get data by prefix. Lets add some data to a Critbit:

crit = Critbit.new

crit is space efficient and stores prefixes only once and can be used to

find only strings that match a certain prefix

items = ["u", "un", "unh", "uni", "unj", "unim", "unin", "unio",
"uninc", "unind", "unine", "unindd", "uninde", "unindf",
"unindew", "unindex", "unindey", "a", "z"]

add items to the container

items.each do |item|
crit[item] = item
end

Let´s now retrieve only data that has prefix ‘unin’

crit.prefix = "unin"

Does each for all elements in the critbit with prefix 'unin'

print "["
crit.each do |key, value|
print "[#{key}, #{value}] "
end
print "]"

This is the result:

[[unin, unin] [uninc, uninc] [unind, unind] [unindd, unindd] [uninde,
uninde] [unindew, unindew] [unindex, unindex] [unindey, unindey]
[unindf, unindf] [unine, unine] ]

Does that help?

2015-10-27 21:35 GMT-02:00 Mohamad El-Husseini notifications@github.com:

Your code has only to consider that instead
of working with Hash, it could work with any class that has the same Hash
interface.

Thanks. I'm having a hard time visualizing what you mean, but that's
because I'm not familiar with CritBit. Do you mind sharing a small example
of what the interface would look like?

I've made a few changes to the gem in the last couple of weeks. Are you
aware of those changes? I've decoupled the Tokeniser and the Counter
classes.


Reply to this email directly or view it on GitHub
#20 (comment)
.

Rodrigo Botafogo
Integrando TI ao seu negócio
21-3010-4802/11-3010-1802

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants