cuNVSM is a C++/CUDA implementation of the state-of-the-art NVSM and LSE representation learning algorithms. It also supports injecting a priori knowledge of document-document similarity, which was the main subject of study in the CIKM 2018 paper on product substitutability.
It integrates conveniently with the Indri search engine: model parameters are estimated directly from indexes created by Indri and are stored in the open HDF5 format. A lightweight Python module, nvsm, provided as part of this toolkit, allows querying the trained models and more.
For more information, see Section 3.3 of the 2018 TOIS paper "Neural Vector Spaces for Unsupervised Information Retrieval".
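At a high level, querying an NVSM-style model amounts to averaging the query's word representations, projecting them into document space, and ranking documents by cosine similarity. The following is a minimal sketch of that scoring scheme using random stand-in matrices; the actual matrix names, shapes, and HDF5 layout of a trained cuNVSM model are assumptions here, not the toolkit's API.

```python
import numpy as np

# Random stand-ins for a trained model's parameters (hypothetical shapes;
# a real cuNVSM model stores such matrices in an HDF5 file).
rng = np.random.default_rng(0)
vocab = {"neural": 0, "vector": 1, "spaces": 2}
word_emb = rng.standard_normal((len(vocab), 8))   # word representations
doc_emb = rng.standard_normal((5, 16))            # document representations
projection = rng.standard_normal((8, 16))         # word-to-document space map

def rank_documents(query):
    # Average the query's word vectors, map the average into document
    # space, then score every document by cosine similarity.
    words = [word_emb[vocab[w]] for w in query.split() if w in vocab]
    q = np.mean(words, axis=0) @ projection
    scores = doc_emb @ q / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(q))
    return np.argsort(-scores)  # document indices, best first

print(rank_documents("neural vector spaces"))
```

With a real model, the stand-in matrices would instead be loaded from the HDF5 file produced by training; the nvsm Python module wraps this for you.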
To build the cuNVSM training binary and manage dependencies, we use CMake (version 3.8 or higher). In addition, we rely on the following libraries for the cuNVSM training binary:
- Boost (>= 1.65.1)
- CUDA (>= 8.0)
- cuDNN (>= 5.1.3)
- Glog (>= 0.3.4)
- HDF5 (>= 1.6.10)
- Indri (>= 5.11)
- gperftools (>= 2.5)
- protobuf (>= 3.5.1)
The cnmem library is used for memory management. The tests are implemented using the googletest and googlemock frameworks. CMake will fetch and compile these libraries automatically as part of the build pipeline. Finally, you need a CUDA-compatible GPU in order to perform any computations.
Dependencies for the nvsm Python (>= 3.5) library, used for loading and querying trained models, can be installed as follows:
pip install -r requirements.txt
Note that the Python library depends on pyndri, which in turn also depends on Indri.
The following instructions should get you started installing cuNVSM. Note that the installation will fail if dependencies cannot be found.
git clone https://github.com/cvangysel/cuNVSM
cd cuNVSM
mkdir build
cd build
cmake ..
make
make install
Please refer to the CMake documentation for advanced options.
cuNVSM also comes with a rich test harness that verifies its implementation; see TESTS for more information.
See TUTORIAL for examples.
Different models can be trained and queried by passing the appropriate flags to the cuNVSMTrainModel and cuNVSMQuery executables.
- For LSE, pass --batch_size 4096, --nonlinearity tanh and --bias_negative_samples to cuNVSMTrainModel.
- For NVSM, pass --batch_size 51200, --nonlinearity hard_tanh and --batch_normalization to cuNVSMTrainModel, and pass --linear to cuNVSMQuery.
For more information, see the train_nvsm function in scripts/functions.sh and the invocation of cuNVSMQuery in rank-cranfield-collection.sh.
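The two flag configurations above can be assembled programmatically, which is convenient when scripting experiments. The sketch below builds the corresponding command lines in Python; only the flags themselves come from the documentation above, while the positional index/output arguments are hypothetical placeholders (consult scripts/functions.sh for the real invocation).

```python
# Flag sets taken from the model descriptions above.
LSE_FLAGS = ["--batch_size", "4096",
             "--nonlinearity", "tanh",
             "--bias_negative_samples"]
NVSM_FLAGS = ["--batch_size", "51200",
              "--nonlinearity", "hard_tanh",
              "--batch_normalization"]

def train_command(model, index_path, output_path):
    # Assemble a cuNVSMTrainModel invocation; the positional argument
    # order here is a placeholder, not the binary's documented interface.
    flags = {"lse": LSE_FLAGS, "nvsm": NVSM_FLAGS}[model]
    return ["cuNVSMTrainModel", *flags, index_path, output_path]

def query_command(model, model_path):
    cmd = ["cuNVSMQuery"]
    if model == "nvsm":
        cmd.append("--linear")  # NVSM is queried with --linear
    return [*cmd, model_path]

print(" ".join(train_command("lse", "my_index", "model.hdf5")))
print(" ".join(query_command("nvsm", "model.hdf5")))
```

Such a wrapper could then hand the argument lists to subprocess.run; the point is only to keep the two configurations in one place.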
If you use cuNVSM to produce results for your scientific publication, please cite our TOIS and CIKM 2018 papers:
@article{VanGysel2018nvsm,
title={Neural Vector Spaces for Unsupervised Information Retrieval},
author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
publisher={ACM},
journal={TOIS},
year={2018},
}
@inproceedings{VanGysel2018substitutability,
title={Mix ’n Match: Integrating Text Matching and Product Substitutability within Product Search},
author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
booktitle={CIKM},
volume={2018},
year={2018},
organization={ACM}
}
The validate/test splits used in the 2018 TOIS paper can be found here. The test collections for the 2018 CIKM paper can be found here.
The toolkit also contains an implementation of the LSE model described in the following CIKM paper:
@inproceedings{VanGysel2016lse,
title={Learning Latent Vector Spaces for Product Search},
author={Van Gysel, Christophe and de Rijke, Maarten and Kanoulas, Evangelos},
booktitle={CIKM},
volume={2016},
pages={165--174},
year={2016},
organization={ACM}
}
cuNVSM is licensed under the MIT license. CUDA is a licensed trademark of NVIDIA. Please note that CUDA and Indri are licensed separately. Some of the CMake scripts in the third_party directory are licensed under BSD-3.
If you modify cuNVSM in any way, please link back to this repository.