SimHash Document Encoder in C++ with Python bindings #603

Merged 59 commits on Aug 22, 2019
Changes from 24 commits
Commits
bf6ac05
simhash document encoder, 1st draft skeleton with test
brev Jul 15, 2019
7e79ba1
Pull in digest lib.
brev Jul 16, 2019
572ce43
Merge branch 'master' into simhash-document-encoder
brev Jul 16, 2019
4b4d804
more progress
brev Jul 19, 2019
5443420
Finish main coding stretch.
brev Jul 24, 2019
199807f
Finishing up. Py bindings 80%.
brev Jul 27, 2019
6340b9e
Merge branch 'master' into simhash-document-encoder
brev Jul 27, 2019
2c5328a
tweak
brev Jul 27, 2019
a2b705f
retrigger cloud build
brev Jul 27, 2019
5e9b7c4
retrigger cloud build
brev Jul 27, 2019
da28df0
fix bug found on cloud build tests for this feature branch
brev Jul 27, 2019
8f8cf05
Finish up py bindings and the rest.
brev Jul 28, 2019
c3f6a53
fix some test constants for small diffs on varying cloud build archit…
brev Jul 28, 2019
fefd54c
more
brev Jul 30, 2019
6fa04f8
Add SDR statistics python test.
brev Jul 30, 2019
8229a82
Merge branch 'master' into simhash-document-encoder
brev Jul 30, 2019
a7d07d9
Finish python example script with stats and chart generation.
brev Jul 31, 2019
3832346
Change the tokenSimilarity algo slightly to get better bit distributi…
brev Jul 31, 2019
ba8de60
Merge branch 'master' into simhash-document-encoder
brev Aug 2, 2019
992f0f8
Merge branch 'master' into simhash-document-encoder
brev Aug 2, 2019
b0d18e0
add python docs on alt calling style
brev Aug 2, 2019
d63339e
Putting original performance test assertions back in place.
brev Aug 2, 2019
5829cd5
Trying to play better with SDRMetrics callbacks to no avail.
brev Aug 3, 2019
067c5b5
Possibly borked the Travis build with 1 change, testing revert.
brev Aug 3, 2019
56f885a
Merge branch 'master' into simhash-document-encoder
brev Aug 8, 2019
c6256c4
"htm.cpp" => "htm.core"
brev Aug 8, 2019
a994004
Add "caseSensitivity" bool flag, functionality, tests, bindings.
brev Aug 9, 2019
647a8bf
Lock down digestpp to specific commit (no versions avail, better than…
brev Aug 9, 2019
0c65611
Move pyplot import to code section, only run plot code if the import …
brev Aug 9, 2019
3751b8e
Better docs around SHA3 SHAKE XOF
brev Aug 9, 2019
c376950
Add basic real-world use-case example in test form.
brev Aug 9, 2019
e40ee2f
Better docs on document length (# of tokens allowed, any > 0).
brev Aug 9, 2019
70fd2c2
Add dox that token order is ignored and doesn't influence output enco…
brev Aug 9, 2019
bda2452
Remove "@since" C++ documentation tag, as we don't really have versio…
brev Aug 9, 2019
6ca3ca9
Move code dox from .cpp to .hpp as per @breznak
brev Aug 9, 2019
ac41dcc
Merge branch 'master' into simhash-document-encoder
brev Aug 9, 2019
3435d51
Add in Python de/serialization in bindings and tests. String works, P…
brev Aug 10, 2019
94315a4
Add definition of terms to dox.
brev Aug 12, 2019
1678651
Merge branch 'master' into simhash-document-encoder
brev Aug 12, 2019
7580708
Add another simple calling signature to `encode()` - you can now call…
brev Aug 12, 2019
b9d391a
Make sure all positive/negative test cases are combined under the sam…
brev Aug 12, 2019
beec09c
- Make new subdir for encoder examples.
brev Aug 13, 2019
e515ee7
Merge branch 'master' into simhash-document-encoder
brev Aug 13, 2019
977b58d
- Add encoder-specific README.
brev Aug 13, 2019
393571d
Add Determinism test for new SimHash Document Encoder.
brev Aug 13, 2019
136706d
Merge branch 'master' into simhash-document-encoder
brev Aug 13, 2019
0f9360e
Improvements to encoder README
brev Aug 13, 2019
53850e5
Fix param check (thx @breznak) with tests.
brev Aug 13, 2019
59487a2
Merge branch 'master' into simhash-document-encoder
brev Aug 15, 2019
1030b34
C++ and Tests for char/token frequency floor/ceilings. Tests are not
brev Aug 16, 2019
ea54ae5
Ok, a bit of a re-working:
brev Aug 17, 2019
3a912d1
Merge branch 'simhash-document-encoder' into simhash-frequency
brev Aug 17, 2019
00c4a89
Token/Char Frequency Floors/Ceilings.
brev Aug 19, 2019
3e0eee3
Merge pull request #1 from brev/simhash-frequency
brev Aug 19, 2019
1d17611
- Merged 4 params down into 2 params (gone are Char/TokenFreqCeil/Flo…
brev Aug 20, 2019
9cf513b
Merge branch 'master' into simhash-document-encoder
brev Aug 20, 2019
de089cf
Fix off-by-1 index error. Eigen quietly takes the blow in non-debug m…
brev Aug 21, 2019
24299b3
All serialization and tests are now working for SimHash Doc encoder.
brev Aug 21, 2019
6d4794e
Merge branch 'master' into simhash-document-encoder
brev Aug 21, 2019
1 change: 1 addition & 0 deletions .travis.yml
@@ -66,6 +66,7 @@ deploy:
on:
tags: true
branch: master
# repo must be "htm.cpp" and not "htm.core"
repo: htm-community/htm.cpp

notifications:
2 changes: 2 additions & 0 deletions README.md
@@ -221,6 +221,7 @@ The installation scripts will automatically download and build the dependencies
* mnist test data
* numpy
* pytest
* [digestpp](https://github.com/kerukuro/digestpp) (for SimHash encoders)

Once these third party components have been downloaded and built they will not be
re-visited again on subsequent builds. So to refresh the third party components
@@ -240,6 +241,7 @@ distribution packages as listed and rename them as indicated. Copy these to
| mnist.zip (*note3) | https://github.com/wichtounet/mnist/archive/master.zip |
| pybind11.tar.gz | https://github.com/pybind/pybind11/archive/v2.2.4.tar.gz |
| cereal.tar.gz | https://github.com/USCiLab/cereal/archive/v1.2.2.tar.gz |
| digestpp.tar.gz | https://github.com/kerukuro/digestpp/archive/master.tar.gz |

* note1: Version 0.6.2 of yaml-cpp is broken so use the master from the repository.
* note2: Boost is not required for Windows (MSVC 2017) or any compiler that supports C++17 with std::filesystem.
1 change: 1 addition & 0 deletions bindings/py/cpp_src/CMakeLists.txt
@@ -55,6 +55,7 @@ set(src_py_encoders_files
bindings/encoders/encoders_module.cpp
bindings/encoders/py_ScalarEncoder.cpp
bindings/encoders/py_RDSE.cpp
bindings/encoders/py_SimHashDocumentEncoder.cpp
)

set(src_py_engine_files
2 changes: 2 additions & 0 deletions bindings/py/cpp_src/bindings/encoders/encoders_module.cpp
@@ -29,6 +29,7 @@ namespace htm_ext
{
void init_ScalarEncoder(py::module&);
void init_RDSE(py::module&);
void init_SimHashDocumentEncoder(py::module&);
}

using namespace htm_ext;
@@ -60,4 +61,5 @@ categories into integers before encoding them. )";

init_ScalarEncoder(m);
init_RDSE(m);
init_SimHashDocumentEncoder(m);
}
189 changes: 189 additions & 0 deletions bindings/py/cpp_src/bindings/encoders/py_SimHashDocumentEncoder.cpp
@@ -0,0 +1,189 @@
/* -----------------------------------------------------------------------------
* HTM Community Edition of NuPIC
* Copyright (C) 2016, Numenta, Inc. https://numenta.com
* 2019, David McDougall
* 2019, Brev Patterson, Lux Rota LLC, https://luxrota.com
*
* This program is free software: you can redistribute it and/or modify it
* under the terms of the GNU Affero Public License version 3 as published by
* the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero Public License for
* more details.
*
* You should have received a copy of the GNU Affero Public License along with
* this program. If not, see http://www.gnu.org/licenses.
* -------------------------------------------------------------------------- */

/** @file
* py_SimHashDocumentEncoder.cpp
* @since 0.2.3
*/

#include <bindings/suppress_register.hpp> //include before pybind11.h
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>

#include <htm/encoders/SimHashDocumentEncoder.hpp>

namespace py = pybind11;

using namespace htm;


namespace htm_ext {

using namespace htm;

void init_SimHashDocumentEncoder(py::module& m)
{
/**
* Parameters
*/
py::class_<SimHashDocumentEncoderParameters>
py_SimHashDocumentEncoderParameters(m, "SimHashDocumentEncoderParameters",
R"(
Parameters for the SimHashDocumentEncoder.
)");

py_SimHashDocumentEncoderParameters.def(py::init<>());

py_SimHashDocumentEncoderParameters.def_readwrite("activeBits",
&SimHashDocumentEncoderParameters::activeBits,
R"(
This is the number of true bits in the encoded output SDR. The output encoding
will have a distribution of this many 1's. Specify only one of: activeBits
or sparsity.
)");

py_SimHashDocumentEncoderParameters.def_readwrite("size",
&SimHashDocumentEncoderParameters::size,
R"(
This is the total number of bits in the encoded output SDR.
)");

py_SimHashDocumentEncoderParameters.def_readwrite("sparsity",
&SimHashDocumentEncoderParameters::sparsity,
R"(
This is an alternate way (percentage) to specify the number of active bits.
Specify only one of: activeBits or sparsity.
)");

py_SimHashDocumentEncoderParameters.def_readwrite("tokenSimilarity",
&SimHashDocumentEncoderParameters::tokenSimilarity,
R"(
This allows similar tokens ("cat", "cats") to also be represented similarly,
at the cost of document similarity accuracy. Default is FALSE (providing better
document-level similarity, at the expense of token-level similarity).

Results are heavily dependent on the content of your input data.

If TRUE: Similar tokens ("cat", "cats") will have similar influence on the
output simhash. This benefit comes with the cost of a reduction in
document-level similarity accuracy.

If FALSE: Similar tokens ("cat", "cats") will have individually unique and
unrelated influence on the output simhash encoding, thus losing token-level
similarity and increasing document-level similarity.
)");


/**
* Class
*/
py::class_<SimHashDocumentEncoder> py_SimHashDocumentEncoder(m,
"SimHashDocumentEncoder",
R"(
Encodes a document text into a distributed spray of 1's.

The SimHashDocumentEncoder encodes a document (array of strings) into an
array of bits. The output is all 0's except for a sparse, distributed spray of
1's. Similar documents will share similar encodings, and dissimilar documents
will not.
Unicode is supported. No lookup tables are used.

"Similarity" here refers to bitwise similarity (small hamming distance,
high overlap), not semantic similarity (encodings for "apple" and
"computer" will have no relation here.) For document encodings which are
also semantic, please try Cortical.io and their Semantic Folding tech.

Encoding is accomplished using SimHash, a Locality-Sensitive Hashing (LSH)
algorithm from the world of nearest-neighbor document similarity search.
Because SDRs can be any size, we hash with SHAKE256, the SHA-3
extendable-output function (XOF), which yields a digest of arbitrary length.
We deviate slightly from the standard SimHash algorithm in order to
achieve sparsity.

To inspect this run:
$ python -m htm.encoders.simhash_document_encoder --help

Python Code Example:
from htm.bindings.encoders import SimHashDocumentEncoder
from htm.bindings.encoders import SimHashDocumentEncoderParameters
from htm.bindings.sdr import SDR

params = SimHashDocumentEncoderParameters()
params.size = 400
params.activeBits = 21

output = SDR(params.size)
encoder = SimHashDocumentEncoder(params)

# call style: output is reference
encoder.encode([ "bravo", "delta", "echo" ], output) # weights 1
encoder.encode({ "bravo": 3, "delta" : 1, "echo" : 2 }, output)

# call style: output is returned
other = encoder.encode([ "bravo", "delta", "echo" ]) # weights 1
other = encoder.encode({ "bravo": 3, "delta" : 1, "echo" : 2 })

)");

py_SimHashDocumentEncoder.def(py::init<SimHashDocumentEncoderParameters&>());

py_SimHashDocumentEncoder.def_property_readonly("parameters",
[](SimHashDocumentEncoder &self) { return self.parameters; },
R"(
Contains the parameter structure which this encoder uses internally. All fields
are filled in automatically.
)");

py_SimHashDocumentEncoder.def_property_readonly("dimensions",
[](SimHashDocumentEncoder &self) { return self.dimensions; });

py_SimHashDocumentEncoder.def_property_readonly("size",
[](SimHashDocumentEncoder &self) { return self.size; });

// Handle case of class method overload + class method override
// https://pybind11.readthedocs.io/en/master/classes.html#overloaded-methods
py_SimHashDocumentEncoder.def("encode",
(void (SimHashDocumentEncoder::*)(std::map<std::string, htm::UInt>, htm::SDR &))
&SimHashDocumentEncoder::encode);
py_SimHashDocumentEncoder.def("encode", // alternate: simple w/o weights
(void (SimHashDocumentEncoder::*)(std::vector<std::string>, htm::SDR &))
&SimHashDocumentEncoder::encode);

py_SimHashDocumentEncoder.def("encode",
[](SimHashDocumentEncoder &self, std::map<std::string, htm::UInt> value) {
auto output = new SDR({ self.size });
self.encode( value, *output );
return output;
},
R"(
Takes input as a Python dict of strings (tokens) => integers (weights).
Ex: { "alpha": 2, "bravo": 1, "delta": 1, "echo": 3 }
)");
py_SimHashDocumentEncoder.def("encode", // alternate: simple w/o weights
[](SimHashDocumentEncoder &self, std::vector<std::string> value) {
auto output = new SDR({ self.size });
self.encode( value, *output );
return output;
},
R"(
Simple alternate calling pattern using only strings, no weights (assumed
to be 1). Takes input as a Python list of strings (tokens).
Ex: [ "alpha", "bravo", "delta", "echo" ]
)");

}
}
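
For readers who want a feel for the technique the class docstring describes, here is a minimal pure-Python sketch of a sparse SimHash. It is not the code in this PR: `token_bits`, `sparse_simhash`, and `overlap` are hypothetical helper names, and the sparsification rule (keep the `active_bits` largest column sums, ties broken by lowest index) is an assumption about what "deviate slightly from the standard SimHash algorithm" could look like. Only `hashlib.shake_256` from the standard library is used.

```python
import hashlib


def token_bits(token, size):
    """Hash one token with SHAKE256 and return a list of `size` bits."""
    n_bytes = (size + 7) // 8  # SHAKE256 is an XOF, so any length is allowed
    digest = hashlib.shake_256(token.encode("utf-8")).digest(n_bytes)
    bits = []
    for byte in digest:
        for shift in range(8):
            bits.append((byte >> shift) & 1)
    return bits[:size]


def sparse_simhash(document, size=400, active_bits=21):
    """Return the set of active bit indices for a document.

    `document` is either a list of tokens (each counted with weight 1)
    or a dict of token -> integer weight. Both forms are assumptions
    modeled on the two `encode()` call styles shown in the docstring.
    """
    if isinstance(document, dict):
        items = list(document.items())
    else:
        items = [(token, 1) for token in document]

    # Classic SimHash step: each token votes +weight or -weight per column.
    sums = [0] * size
    for token, weight in items:
        for i, bit in enumerate(token_bits(token, size)):
            sums[i] += weight if bit else -weight

    # Standard SimHash would set every column with a positive sum (about
    # half of them). Assumed sparsification: keep only the `active_bits`
    # columns with the largest sums, breaking ties by lowest index.
    ranked = sorted(range(size), key=lambda i: (-sums[i], i))
    return set(ranked[:active_bits])


def overlap(a, b):
    """Number of active bits two encodings share."""
    return len(a & b)


if __name__ == "__main__":
    doc_a = sparse_simhash(["bravo", "delta", "echo"])
    doc_b = sparse_simhash(["bravo", "delta", "echo", "foxtrot"])
    doc_c = sparse_simhash({"india": 3, "juliet": 1, "kilo": 2})
    print("active bits:", len(doc_a))
    print("overlap a/b:", overlap(doc_a, doc_b))
    print("overlap a/c:", overlap(doc_a, doc_c))
```

Because the column sums are plain integer additions, token order cannot affect the result in this sketch, which matches the PR's note that token order is ignored. The lowest-index tie-break is a simplification that skews active bits toward low indices; a real encoder would likely want a less biased rule.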