Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SimHash Document Encoder in C++ with Python bindings #603

Merged
merged 59 commits into from
Aug 22, 2019

Conversation

brev
Copy link

@brev brev commented Aug 3, 2019

SimHash Document Encoder in C++ with Python bindings.
Full passing test suite in both langs.

The SimHashDocumentEncoder encodes a document (array of strings) value into an
array of bits. The output is 0's except for a sparse distribution spray of 1's.
Similar document encodings will share similar representations, and vice versa.
Unicode is supported. No lookup tables are used.

"Similarity" here refers to bitwise similarity (small hamming distance,
high overlap), not semantic similarity (encodings for "apple" and
"computer" will have no relation here.) For document encodings which are
also semantic, please try Cortical.io and their Semantic Folding tech.

Encoding is accomplished using SimHash, a Locality-Sensitive Hashing (LSH)
algorithm from the world of nearest-neighbor document similarity search.
As SDRs are variable-length, we use the SHA3+SHAKE256 hashing algorithm.
We deviate slightly from the standard SimHash algorithm in order to
achieve sparsity.

@breznak @ctrl-z-9000-times This is the first C++ I've written in 20 years, so you might want to use the fine-tooth comb. Thanks for the amazing codebase, I was surprised how quickly I got up and running, it was very easy to follow. I'd heard modern C++ was a much improved beast, they were not kidding.

I only have 1 non-blocking issue for @ctrl-z-9000-times which I'll inline below in the code.

C++ usage:

#include <htm/encoders/SimHashDocumentEncoder.hpp>
#include <htm/encoders/types/Sdr.hpp>

SimHashDocumentEncoderParameters params;
params.size = 400u;
params.activeBits = 21u;

SDR output({ params.size });
SimHashDocumentEncoder encoder(params);
encoder.encode({ "bravo", "delta", "echo" }, output);
encoder.encode("bravo delta echo", output);

Python usage:

from htm.bindings.encoders import SimHashDocumentEncoder
from htm.bindings.encoders import SimHashDocumentEncoderParameters
from htm.bindings.sdr import SDR
    
params = SimHashDocumentEncoderParameters()
params.size = 400
params.activeBits = 21
    
encoder = SimHashDocumentEncoder(params)
    
# call style: output is reference
output = SDR(params.size)
encoder.encode([ "bravo", "delta", "echo" ], output)
encoder.encode("bravo delta echo", output)
    
# call style: output is returned
other = encoder.encode([ "bravo", "delta", "echo" ])
other = encoder.encode("bravo delta echo")

Python help:

python
>>> import htm.bindings.encoders
>>> help(htm.bindings.encoders.SimHashDocumentEncoder)

Python command line tool:

$ python -m htm.examples.encoders.simhash_document_encoder --help

EDIT:

TODO status tracked in #603 (comment)

@brev
Copy link
Author

brev commented Aug 3, 2019

The Travis build failure must be a random fluke, is it possible to re-trigger?

Nevermind, it was me, fixed and retriggered.
067c5b5

@breznak breznak added encoder enhancement New feature or request labels Aug 3, 2019
Copy link
Member

@breznak breznak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brev This is near perfect PR! Well structured, docs, tested, bindings. Love it. 👍 👏

I wrote a couple of questions, suggestions below, none of which are critical.
Sorry, I went a bit harsh with questioning the functionality of the document encoder, but that is that I used to do NLP and I'm interested. This PR is fine almost as is, a generic "document encoder" is not there yet.

Some things I'd like to see:

  • add test for serialization (using save/load)
  • is Eigen needed? -> easier implementation with Eigen
  • real-world example/test
  • a README with doc on this new encoder
  • (opt) separate the example from py encoder file?
  • (opt) split to TopicEncoder & DocumentEncoder?

Thank you very much for your contribution, hope you keep your interest with htm.core and we'll see more work together 👍

Note:
I've reviewed

  • high level functionality of doc encoder
  • code quality, which is fit for the repo
  • I haven't understood yet the inner details how the encoders does its magic

.travis.yml Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
external/README.md Outdated Show resolved Hide resolved
py/htm/encoders/simhash_document_encoder.py Outdated Show resolved Hide resolved
src/htm/encoders/SimHashDocumentEncoder.cpp Outdated Show resolved Hide resolved
src/htm/encoders/SimHashDocumentEncoder.cpp Show resolved Hide resolved
src/test/unit/encoders/SimHashDocumentEncoderTest.cpp Outdated Show resolved Hide resolved
@breznak breznak closed this Aug 3, 2019
@brev
Copy link
Author

brev commented Aug 16, 2019

@brev
Copy link
Author

brev commented Aug 16, 2019

OK, I just realized that I've made a bigger dumb mistake. I'm using a map in an incorrect method in one place. @breznak, this is probably what lead to you requesting the 2 separate classes. Anyway, one of my next commits will be some re-tooling around this. It should be a good simplification. It'll only require 1 class for the encoder, not 2. @dkeeney you can ignore that subclass serialization issue, thanks. more soon.

currently passing because this led to me finding a bigger problem in my
general approach.  I'll retool in the PR branch, and keep this in a separate
branch for now.
- We've got a single class.
- It has params: vocabulary(map), excludes(list), encodeOrphans(bool).
- It has 2 encode calling styles: list, string.
- It all now makes a lot more damn sense. :)
- +tests +bindings

cc @breznak thx!

I'm on to the tokenFrequency min/max next, already mostly done, just need to
merge in.
- C++ and tests, Py bindings and tests.
- This also takes care of de-duplication feature.
- Make new param `encodeOrphans` only work when `vocabulary` is set.
  - Default=false now also
- Fixed alternative `encode(string)` calling across Python binding.
- Improved detail on all serialization tests to include `vocabulary` param,
  and output SDR comparison.
- Switch from `this[style]` of map access to the `more.at(formal)` style,
  in general. (My JS background keeps causing me to use the old style
  incorrectly, with hard to find bug results).
- Fix Python files to match PEP8: indentation style, line length, 4 spaces, etc.
  Sorry for the large reformat. The only changes were adding new params.
- Documentation, etc.
- Re-arranged a few things to get into alphabetical order for easier reading.

cc @breznak
Simhash Token/Char Frequency Floor/Ceiling, and more.
@brev
Copy link
Author

brev commented Aug 19, 2019

OK, I'm ready for a fresh look, I believe this is stable again, including all the new requested features. Thanks for any feedback.

Copy link
Member

@breznak breznak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brev, this looks even better now! The code is now cleaner and API is much more understandable to use, great feat!

  • I have no more conceptual, utility issues with this, it as now a really useful encoder! 👍
  • one bug, I think, with arg parsing
  • a few design questions, but that is only for your consideration

Is there anything else you're planning for the encoder, or are we good to merge?

src/htm/encoders/SimHashDocumentEncoder.cpp Show resolved Hide resolved
src/htm/encoders/SimHashDocumentEncoder.cpp Show resolved Hide resolved
src/htm/encoders/SimHashDocumentEncoder.cpp Show resolved Hide resolved
src/htm/encoders/SimHashDocumentEncoder.cpp Outdated Show resolved Hide resolved
src/htm/encoders/SimHashDocumentEncoder.hpp Outdated Show resolved Hide resolved
…or, now

  we just have frequencyCeil/Floor.)
  - What was "charFrequencyCeil" is now: "frequencyCeil" + "tokenSimilarity".
  - What was "charFrequencyFloor" has been removed as not useful.
- Better checks on class params.sparsity
- Don't die on empty encode([]) or encode(""), return SDR of zeros instead.
@brev
Copy link
Author

brev commented Aug 20, 2019

@breznak thanks again for the great feedback. I think I'm current again, ready for new changes.

I'm all set to go, otherwise. Excited to start messing around with this. :)

Copy link
Member

@breznak breznak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the great work and continued improvements in this PR!!
Let's get this merged, I'm quite excited, the encoder turned out something pretty useful.
I am also satisfied now.

  • So the only blocker is Simhash document encode [DO NOT MERGE] #635 where this fails on OSX in Debug. (is it failing only in OSX, or everywhere in Debug, might be some compatibility bug in Eigen with debug/python3/compiler/... ?)

Copy link
Member

@breznak breznak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix looks good to me.
@brev not yet :(, seems the statistics are a bit off due to the latest change,
https://travis-ci.org/htm-community/htm.core/builds/574695875#L2500

@brev
Copy link
Author

brev commented Aug 21, 2019

fixed, and finished serialization, here: 24299b3 thx

Copy link
Member

@breznak breznak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here we go, all the CI are passing, I think the Document encoder is in a great shape! 💯
Congratulations @brev for the great work and hope to see more research on NLP + HTM using this encoder.
This PR was really a pleasure developing with you, the ideas and the code implementation ✔️ , hope you'd stick around with the development and later join the precious ranks of HTM.core developers 😉
Let's ship it! :shipit:

@breznak breznak merged commit 609ce86 into htm-community:master Aug 22, 2019
@dkeeney
Copy link

dkeeney commented Aug 22, 2019

Very good 🥇

@brev brev deleted the simhash-document-encoder branch August 23, 2019 07:16
@brev
Copy link
Author

brev commented Aug 23, 2019

Awesome, thanks all! 👍 Pleasure working with you.

@Thanh-Binh
Copy link

@brev do you have any sample application of your encoder?

@ctrl-z-9000-times
Copy link
Collaborator

Congratulations Brev,

I look forward to playing with this encoder!

@brev
Copy link
Author

brev commented Sep 3, 2019

Hi @Thanh-Binh! sorry for the slow reply,

Here is the best example currently:
https://github.com/htm-community/htm.core/blob/master/py/htm/examples/encoders/simhash_document_encoder.py

There is an Issue open to come up with a more official demo:
#634

I'll post more on https://discourse.numenta.org/ soon, thanks.

@Thanh-Binh
Copy link

Hi @brev
Thanks. Do you have any c++ sample?

@breznak
Copy link
Member

breznak commented Sep 3, 2019

Do you have any c++ sample?

Please, see
https://github.com/htm-community/htm.core/pull/603/files

and the src/test/ files for this encoder

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
encoder enhancement New feature or request ready
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants