-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SimHash Document Encoder in C++ with Python bindings #603
Conversation
…on with the same general results. A bit tweaky, but I can't figure anything better off the top of my head.
These don't work on my local machine (recent MacBook), but I want to test them against the cloud builds before offically increasing.
This seems cleaner/safer anyway.
Nevermind, it was me, fixed and retriggered. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brev This is near perfect PR! Well structured, docs, tested, bindings. Love it. 👍 👏
I wrote a couple of questions, suggestions below, none of which are critical.
Sorry, I went a bit harsh with questioning the functionality of the document encoder, but that is that I used to do NLP and I'm interested. This PR is fine almost as is, a generic "document encoder" is not there yet.
Some things I'd like to see:
- add test for serialization (using save/load)
- python serialization? -> implemented, but has a problem, moved to RDSE: serialization for py bindings #608
- is Eigen needed? -> easier implementation with Eigen
- real-world example/test
- a README with doc on this new encoder
- (opt) separate the example from py encoder file?
- (opt) split to TopicEncoder & DocumentEncoder?
Thank you very much for your contribution, hope you keep your interest with htm.core and we'll see more work together 👍
Note:
I've reviewed
- high level functionality of doc encoder
- code quality, which is fit for the repo
- I haven't understood yet the inner details how the encoders does its magic
bindings/py/cpp_src/bindings/encoders/py_SimHashDocumentEncoder.cpp
Outdated
Show resolved
Hide resolved
|
OK, I just realized that I've made a bigger dumb mistake. I'm using a map in an incorrect method in one place. @breznak, this is probably what lead to you requesting the 2 separate classes. Anyway, one of my next commits will be some re-tooling around this. It should be a good simplification. It'll only require 1 class for the encoder, not 2. @dkeeney you can ignore that subclass serialization issue, thanks. more soon. |
currently passing because this led to me finding a bigger problem in my general approach. I'll retool in the PR branch, and keep this in a separate branch for now.
- We've got a single class. - It has params: vocabulary(map), excludes(list), encodeOrphans(bool). - It has 2 encode calling styles: list, string. - It all now makes a lot more damn sense. :) - +tests +bindings cc @breznak thx! I'm on to the tokenFrequency min/max next, already mostly done, just need to merge in.
- C++ and tests, Py bindings and tests. - This also takes care of de-duplication feature. - Make new param `encodeOrphans` only work when `vocabulary` is set. - Default=false now also - Fixed alternative `encode(string)` calling across Python binding. - Improved detail on all serialization tests to include `vocabulary` param, and output SDR comparison. - Switch from `this[style]` of map access to the `more.at(formal)` style, in general. (My JS background keeps causing me to use the old style incorrectly, with hard to find bug results). - Fix Python files to match PEP8: indentation style, line length, 4 spaces, etc. Sorry for the large reformat. The only changes were adding new params. - Documentation, etc. - Re-arranged a few things to get into alphabetical order for easier reading. cc @breznak
Simhash Token/Char Frequency Floor/Ceiling, and more.
OK, I'm ready for a fresh look, I believe this is stable again, including all the new requested features. Thanks for any feedback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brev, this looks even better now! The code is now cleaner and API is much more understandable to use, great feat!
- I have no more conceptual, utility issues with this, it as now a really useful encoder! 👍
- one bug, I think, with arg parsing
- a few design questions, but that is only for your consideration
Is there anything else you're planning for the encoder, or are we good to merge?
bindings/py/cpp_src/bindings/encoders/py_SimHashDocumentEncoder.cpp
Outdated
Show resolved
Hide resolved
bindings/py/cpp_src/bindings/encoders/py_SimHashDocumentEncoder.cpp
Outdated
Show resolved
Hide resolved
…or, now we just have frequencyCeil/Floor.) - What was "charFrequencyCeil" is now: "frequencyCeil" + "tokenSimilarity". - What was "charFrequencyFloor" has been removed as not useful. - Better checks on class params.sparsity - Don't die on empty encode([]) or encode(""), return SDR of zeros instead.
@breznak thanks again for the great feedback. I think I'm current again, ready for new changes. I'm all set to go, otherwise. Excited to start messing around with this. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the great work and continued improvements in this PR!!
Let's get this merged, I'm quite excited, the encoder turned out something pretty useful.
I am also satisfied now.
- So the only blocker is Simhash document encode [DO NOT MERGE] #635 where this fails on
OSX in Debug
. (is it failing only in OSX, or everywhere in Debug, might be some compatibility bug in Eigen with debug/python3/compiler/... ?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix looks good to me.
@brev not yet :(, seems the statistics are a bit off due to the latest change,
https://travis-ci.org/htm-community/htm.core/builds/574695875#L2500
fixed, and finished serialization, here: 24299b3 thx |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here we go, all the CI are passing, I think the Document encoder is in a great shape! 💯
Congratulations @brev for the great work and hope to see more research on NLP + HTM using this encoder.
This PR was really a pleasure developing with you, the ideas and the code implementation ✔️ , hope you'd stick around with the development and later join the precious ranks of HTM.core developers 😉
Let's ship it!
Very good 🥇 |
Awesome, thanks all! 👍 Pleasure working with you. |
@brev do you have any sample application of your encoder? |
Congratulations Brev, I look forward to playing with this encoder! |
Hi @Thanh-Binh! sorry for the slow reply, Here is the best example currently: There is an Issue open to come up with a more official demo: I'll post more on https://discourse.numenta.org/ soon, thanks. |
Hi @brev |
Please, see and the |
SimHash Document Encoder in C++ with Python bindings.
Full passing test suite in both langs.
The SimHashDocumentEncoder encodes a document (array of strings) value into an
array of bits. The output is 0's except for a sparse distribution spray of 1's.
Similar document encodings will share similar representations, and vice versa.
Unicode is supported. No lookup tables are used.
"Similarity" here refers to bitwise similarity (small hamming distance,
high overlap), not semantic similarity (encodings for "apple" and
"computer" will have no relation here.) For document encodings which are
also semantic, please try Cortical.io and their Semantic Folding tech.
Encoding is accomplished using SimHash, a Locality-Sensitive Hashing (LSH)
algorithm from the world of nearest-neighbor document similarity search.
As SDRs are variable-length, we use the SHA3+SHAKE256 hashing algorithm.
We deviate slightly from the standard SimHash algorithm in order to
achieve sparsity.
@breznak @ctrl-z-9000-times This is the first C++ I've written in 20 years, so you might want to use the fine-tooth comb. Thanks for the amazing codebase, I was surprised how quickly I got up and running, it was very easy to follow. I'd heard modern C++ was a much improved beast, they were not kidding.
I only have 1 non-blocking issue for @ctrl-z-9000-times which I'll inline below in the code.
C++ usage:
Python usage:
Python help:
Python command line tool:
EDIT:
TODO status tracked in #603 (comment)