CompleteSearch is a fast and interactive search engine for context-sensitive prefix search on a given collection of documents. It provides not only search results, like a regular search engine, but also completions for the last (possibly only partially typed) query word that lead to a hit. This can be used to provide very efficient support for a variety of features: query autocompletion, faceted search, synonym search, error-tolerant search, and semantic search. A list of publications on the techniques behind CompleteSearch and its many applications is provided at the end of this page.
For a demo on various datasets, just check out this repository and follow the instructions below. With a single command line, you get a working demo (you can choose from several datasets, each with a few million documents, so not particularly large, but also not small). CompleteSearch scales to collections with tens or even hundreds of millions of documents without losing its interactivity.
Check out the repository and build the docker image:
git clone https://github.com/ad-freiburg/completesearch
cd completesearch
docker build -t completesearch .
The following command line builds a search index and then starts the search server for the dataset specified via the DB variable (the name of any subdirectory of applications works). Under the specified PORT you then have a generic UI, as well as an API (see Section 4 below).
export DB=movies && PORT=1622 && docker run -it --rm -e DB=${DB} -p ${PORT}:8080 -v $(pwd)/applications:/applications -v $(pwd)/data/:/data -v $(pwd)/ui:/ui --name completesearch.${DB} completesearch -c "make DATA_DIR=/data/${DB} DB=${DB} csv pall start"
This command line downloads and uncompresses the CSV, builds the index, and starts the server, all in one go. If you have already downloaded the CSV, it will not be downloaded again (the Makefile target csv then has no effect). If you have already built the index once, you can omit the Makefile target pall (which stands for precompute all).
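Once the CSV is downloaded and the index is built, a later run only needs the start target. A minimal sketch of that, wrapping the docker command line in a shell function: the function name run_completesearch and its arguments are hypothetical helpers, not part of the repository, while the docker options are copied from the command line above.

```shell
# Hypothetical helper (not part of the repository): runs the container for
# dataset $1 on port $2, with the Makefile targets given as remaining arguments.
run_completesearch() {
  db="$1"; port="$2"; shift 2
  targets="$*"
  docker run -it --rm -e DB="${db}" -p "${port}:8080" \
    -v "$(pwd)/applications:/applications" \
    -v "$(pwd)/data/:/data" \
    -v "$(pwd)/ui:/ui" \
    --name "completesearch.${db}" completesearch \
    -c "make DATA_DIR=/data/${db} DB=${db} ${targets}"
}

# First run: download the CSV, precompute everything, start the server.
#   run_completesearch movies 1622 csv pall start
# Later runs, with the index already built: only start the server.
#   run_completesearch movies 1622 start
```

Passing the targets as arguments keeps the rest of the command identical between the first run and later runs, so only the part that actually changes is visible.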
Read this section if you want to understand in a little more detail what is going on with the fancy command line above. The first step builds a docker image from the code in this repository. So far so good. The command line then runs a docker container, which mounts three volumes, which we briefly explain next:
applications: This folder contains the configuration for each application. Each configuration consists of just two files: a Makefile that specifies how to build the index (this is highly customizable, see below), and a config.js for customizing the generic UI.
data: This folder contains the CSV file with the original data (one record per line, in columns) and the index files. They all have a common prefix. See below for more information on the index.
ui: This folder contains the code for the generic UI. If you just want to use CompleteSearch as backend and build your own UI, you don't have to mount this volume. It's nice, however, to always have a working UI available for testing, without any extra work.
Like all search engines, CompleteSearch builds an index with the help of which it can then answer queries efficiently. It is not an ordinary inverted index, but something more fancy: a half-inverted index or hybrid (HYB) index. You don't have to understand this if you just want to use CompleteSearch. But if you are interested, you can learn more about it in the publications below.
To build the index, CompleteSearch requires two input files, one with suffix .words and one with suffix .docs. The first contains the contents of your documents split into words. The second contains the data that you want to display as search engine hits. The two are usually related, but not exactly the same. The format is very simple and is described by example here.
If you have special wishes, you can build these two input files yourself, from whatever your data is. Then you have full control over what CompleteSearch will and can do for you. However, in most applications, you can use our generic CSV parser. It takes a CSV file (one record per line, with a fixed number of columns per line) as input, and from that produces the .words and the .docs file.
The CSV parser is very powerful and highly customizable. You can see how it is used in the Makefile of the various example applications (in the subdirectories of the directory applications). A subset of the options is described in more detail here. For a complete list, look at the code that parses the options.
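To make the idea of "documents split into words" a bit more concrete, here is a toy sketch that turns CSV records into one line per word. It is not the generic CSV parser and it does not produce the real .words format: the output columns (word, record number, position), the function name csv_to_word_lines, and the naive splitting on commas (no quoted fields) are all assumptions made for illustration only.

```shell
# Toy illustration only: split each CSV record into lowercase words and print
# one line per word as word<TAB>record number<TAB>position within the record.
# The column layout is an assumption, not the actual .words format.
csv_to_word_lines() {
  awk -F',' '{
    pos = 0
    for (i = 1; i <= NF; i++) {
      n = split(tolower($i), words, /[^a-z0-9]+/)
      for (j = 1; j <= n; j++) {
        if (words[j] != "") {
          pos++
          printf "%s\t%d\t%d\n", words[j], NR, pos
        }
      }
    }
  }'
}

# Usage: printf 'The Matrix,1999\n' | csv_to_word_lines
```

The real parser does considerably more than this (see its options), but the basic shape is the same: every word occurrence becomes its own line that points back to the record it came from.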
The binary to start the CompleteSearch engine is called startCompletionServer. It is very powerful and has a lot of options. For some example uses, you can have a look at the Makefile in the directory applications and at the included Makefile of one of the example applications. A detailed documentation of all the options can be found in the README.md in the src directory.
Once started, you can either ask queries using our generic and customizable UI (see above), or you can ask the backend directly, via the HTTP API provided by startCompletionServer. The API is very simple and described at the end of this page. Play around with it for one of the example applications to get a feeling for what it does. You can also look at the (rather simple) JavaScript code of the generic UI to get a feeling for how it works and what it can be used for.
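As a hedged sketch of asking the backend directly, the helper below builds a query URL that can be handed to curl. The parameter names q (query), h (number of hits), c (number of completions) and format are assumptions here, as is the helper name completesearch_url; check them against the API description at the end of this page before relying on them.

```shell
# Hypothetical helper: build a query URL for a running CompleteSearch server.
# The parameter names (q, h, c, format) are assumptions, see the API description.
completesearch_url() {
  host="$1"; port="$2"; query="$3"
  # Minimal URL encoding: only spaces are escaped in this sketch.
  encoded="$(printf '%s' "$query" | sed 's/ /%20/g')"
  printf 'http://%s:%s/?q=%s&h=5&c=10&format=json\n' "$host" "$port" "$encoded"
}

# Usage, assuming a server started as described above is listening on port 1622:
#   curl -s "$(completesearch_url localhost 1622 'the matr')"
```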
To show off your CompleteSearch instance to your friends, you may want it to run under a fancy URL, and not http://my.weird.hostname.somewhere:76154. Let us assume you have an Apache webserver running on your machine. Then you can add the following section to your apache.conf or to a separate config file included by apache.conf. You have to replace the server name (example.cs.uni-freiburg.de in the example) by the fully qualified domain name (FQDN) of the machine on which your Apache webserver is running. You have to replace hostname by the FQDN of the machine on which the CompleteSearch frontend is running. This can be the same machine as the one running Apache, but does not have to be.
<VirtualHost *:80>
ServerName example.cs.uni-freiburg.de
ServerAlias dblp example.cs.uni-freiburg.de
ServerAdmin webmaster@localhost
ProxyPreserveHost On
ProxyRequests Off
ProxyPass / http://<hostname>:5000/
ProxyPassReverse / http://<hostname>:5000/
...
</VirtualHost>
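The ProxyPass directives only work if Apache's mod_proxy and mod_proxy_http modules are loaded. A minimal sketch of enabling them and reloading the configuration, assuming a Debian- or Ubuntu-style installation (the a2enmod tool and a service named apache2); the function name is a hypothetical helper, and on other distributions the module and service handling differs.

```shell
# Assumes a Debian/Ubuntu-style Apache installation (a2enmod, apache2 service).
# Hypothetical helper: enable the proxy modules, check the config, reload.
enable_proxy_and_reload() {
  sudo a2enmod proxy proxy_http &&
    sudo apachectl configtest &&
    sudo systemctl reload apache2
}

# Usage: enable_proxy_and_reload
```

Running the config test before the reload means a typo in the new VirtualHost section is reported without taking the running server down.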
Here are some of the publications explaining the techniques behind CompleteSearch and what it can be used for. This work was done at the Max Planck Institute for Informatics. It's already a while ago, but it turns out that the features and the efficiency provided by CompleteSearch are still very much state of the art.
Type Less, Find More: Fast Autocompletion with a Succinct Index @ SIGIR 2006
The CompleteSearch Engine: Interactive, Efficient, and Towards IR&DB Integration @ CIDR 2007
ESTER: efficient search on text, entities, and relations @ SIGIR 2007
Efficient interactive query expansion with complete search @ CIKM 2007
Output-Sensitive Autocompletion Search @ Information Retrieval 2008
Semantic Full-Text Search with ESTER: Scalable, Easy, Fast @ ICDM 2008