Senzing/elasticsearch

Senzing with ElasticSearch

Overview

This code project demonstrates how the G2 engine may be used with an ElasticSearch indexing engine. ElasticSearch provides enhanced searching capabilities on entity data.

The G2 data repository contains data records and observations about known entities. The G2 engine determines which records match and merge into single resolved entities. These resolved entities can then be indexed by the ElasticSearch engine to make them more searchable.

ElasticSearch stores its indexed entity data in a data repository separate from the one the G2 engine uses. Both must therefore be managed together to keep them in sync.

Preamble

At Senzing, we strive to create GitHub documentation in a "don't make me think" style. For the most part, instructions are copy and paste. Whenever thinking is needed, it's marked with a "thinking" icon 🤔. Whenever customization is needed, it's marked with a "pencil" icon ✏️. If the instructions are not clear, please let us know by opening a new Documentation issue describing where we can improve. Now on with the show...

Legend

  1. 🤔 - A "thinker" icon means that a little extra thinking may be required. Perhaps there are some choices to be made. Perhaps it's an optional step.
  2. ✏️ - A "pencil" icon means that the instructions may need modification before performing.
  3. ⚠️ - A "warning" icon means that something tricky is happening, so pay attention.

Expectations

  • Space: This repository and demonstration require X GB free disk space.

  • Time: Budget 30 minutes to get the demonstration up-and-running, depending on CPU and network speeds.

  • Background knowledge: This repository assumes a working knowledge of:

Prerequisites

  1. Docker
  2. git
  3. maven
  4. java

Demonstration

Load Data

  • 🤔 Data must be loaded into a Senzing project before it can be posted to elasticsearch. If you don't have any data to load, or don't know how to load it, visit our quickstart.

Startup elasticsearch

  • Start an instance of elasticsearch and your favorite elasticsearch UI. Kibana is recommended and is assumed for the remainder of this demonstration. For guidance on getting an instance of elasticsearch and Kibana running, visit our doc on How to Bring Up an ELK Stack.
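
If no instance is running yet, the following is a minimal sketch of one way to start a single-node elasticsearch and Kibana with Docker. The network name senzing-network and the hostname senzing-elasticsearch match the values used later in this demonstration. The image version, the senzing-kibana container name, and the disabled security setting are assumptions suited to a local demo only; they are not taken from the linked guide.

```shell
# Assumed version; pick the release that matches your environment.
ELASTIC_STACK_VERSION="${ELASTIC_STACK_VERSION:-8.13.4}"

start_elastic_stack() {
  # Create the shared network only if it does not exist yet.
  sudo docker network inspect senzing-network >/dev/null 2>&1 \
    || sudo docker network create senzing-network

  # Single-node elasticsearch with security disabled (local demo only).
  sudo docker run --detach \
    --name senzing-elasticsearch \
    --network senzing-network \
    --publish 9200:9200 \
    --env discovery.type=single-node \
    --env xpack.security.enabled=false \
    "docker.elastic.co/elasticsearch/elasticsearch:${ELASTIC_STACK_VERSION}"

  # Kibana reaches elasticsearch by its container name on the shared network.
  sudo docker run --detach \
    --name senzing-kibana \
    --network senzing-network \
    --publish 5601:5601 \
    --env ELASTICSEARCH_HOSTS=http://senzing-elasticsearch:9200 \
    "docker.elastic.co/kibana/kibana:${ELASTIC_STACK_VERSION}"
}
```

Calling `start_elastic_stack` once should leave elasticsearch on port 9200 and Kibana on port 5601 of the Docker host. Both containers can take a minute or two to become ready.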

Build project

  1. ✏️ Set local environment variables. These variables may be modified, but do not need to be modified. The variables are used throughout the installation procedure.

    export GIT_ACCOUNT=senzing
    export GIT_REPOSITORY=elasticsearch
    export GIT_ACCOUNT_DIR=~/${GIT_ACCOUNT}.git
    export GIT_REPOSITORY_DIR="${GIT_ACCOUNT_DIR}/${GIT_REPOSITORY}"
  2. Clone the repository

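    # Create the parent directory first so that cd does not fail on a fresh machine.
    mkdir -p "${GIT_ACCOUNT_DIR}"
    cd "${GIT_ACCOUNT_DIR}"
    git clone https://github.com/Senzing/elasticsearch.git
    cd "${GIT_REPOSITORY_DIR}"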
  3. 🤔 Make sure the SENZING_ENGINE_CONFIGURATION_JSON environment variable is set and points to the Senzing installation that the data was loaded into earlier.

  4. 🤔 Set elasticsearch local environment variables. The hostname and port must point to the port that the elasticsearch instance exposes. The index name can be anything that conforms to elasticsearch's index naming rules.

    export ELASTIC_HOSTNAME=senzing-elasticsearch
    export ELASTIC_PORT=9200
    export ELASTIC_INDEX_NAME=g2index
  5. Build the Docker image.

    cd ${GIT_REPOSITORY_DIR}
    sudo docker build -t senzing/elasticsearch .
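
Before running the indexer, the values from steps 3 and 4 can be sanity-checked. The sketch below is illustrative only: `check_senzing_config` and `is_valid_index_name` are hypothetical helper names, the configuration check only looks for a CONNECTION key rather than parsing the JSON, and the index check covers the commonly documented elasticsearch naming restrictions (lowercase, no reserved characters, no leading `-`, `_`, or `+`).

```shell
# Succeeds if the Senzing configuration is set and mentions a CONNECTION key.
check_senzing_config() {
  if [ -z "${SENZING_ENGINE_CONFIGURATION_JSON:-}" ]; then
    echo "SENZING_ENGINE_CONFIGURATION_JSON is not set" >&2
    return 1
  fi
  case "${SENZING_ENGINE_CONFIGURATION_JSON}" in
    *'"CONNECTION"'*) return 0 ;;
    *) echo "no CONNECTION key found in configuration" >&2; return 1 ;;
  esac
}

# Succeeds if the argument looks like a legal elasticsearch index name.
is_valid_index_name() {
  name="$1"
  # Must be non-empty and must not be "." or "..".
  case "$name" in ''|.|..) return 1 ;; esac
  # Must not start with -, _ or +.
  case "$name" in -*|_*|+*) return 1 ;; esac
  # Must be lowercase.
  case "$name" in *[[:upper:]]*) return 1 ;; esac
  # Must not contain \ / * ? " < > | space , # or :
  case "$name" in *[\\/\*\?\"\<\>\|\ ,\#:]*) return 1 ;; esac
  # Limit is 255 bytes; this counts characters, which is equivalent for ASCII names.
  [ "${#name}" -le 255 ]
}
```

For example, `is_valid_index_name "${ELASTIC_INDEX_NAME}" || echo "bad index name"` would flag a name such as `G2Index` before the indexer tries to create it.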

Run the indexer

Using a local sqlite Senzing database

  1. We will mount the sqlite database, so make sure the CONNECTION string in the config JSON points to where it is mounted. In this example, CONNECTION must point to the /db directory. The container must also run on the same network as the ELK stack. Example:

    sudo --preserve-env docker run \
      --interactive \
      --rm \
      --tty \
      -e ELASTIC_HOSTNAME \
      -e ELASTIC_PORT \
      -e ELASTIC_INDEX_NAME \
      -e SENZING_ENGINE_CONFIGURATION_JSON \
      --network=senzing-network \
      --volume ~/senzing/var/sqlite:/db \
      senzing/elasticsearch

Using an external Senzing database

  1. Here we don't need to mount a database. Instead, set the CONNECTION string in the config JSON to the location of the external database. Example:

    export SENZING_ENGINE_CONFIGURATION_JSON='{
      "PIPELINE": {
        "CONFIGPATH": "/etc/opt/senzing",
        "RESOURCEPATH": "/opt/senzing/g2/resources",
        "SUPPORTPATH": "/opt/senzing/data"
      },
      "SQL": {
        "CONNECTION": "postgresql://postgres:postgres@senzing-postgres:5432:G2"
      }
    }'
  2. Now we can run the container as part of the network that the ELK-stack is running in so that it can "see" the elasticsearch container. Example:

    sudo --preserve-env docker run \
      --interactive \
      --rm \
      --tty \
      -e ELASTIC_HOSTNAME \
      -e ELASTIC_PORT \
      -e ELASTIC_INDEX_NAME \
      -e SENZING_ENGINE_CONFIGURATION_JSON \
      --network=senzing-network \
      senzing/elasticsearch
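
Once the indexer has finished, it can be useful to confirm that documents actually reached the index. The sketch below uses the standard elasticsearch `_count` and `_cat/indices` endpoints through curl. The helper names and the `ELASTIC_CHECK_HOST` variable are hypothetical; the default of localhost assumes the commands run on the Docker host with port 9200 published, because the senzing-elasticsearch hostname only resolves inside the Docker network.

```shell
# Base URL of elasticsearch as seen from where curl runs.
# ELASTIC_CHECK_HOST is a hypothetical override; localhost is an assumed default.
elastic_base_url() {
  printf 'http://%s:%s' "${ELASTIC_CHECK_HOST:-localhost}" "${ELASTIC_PORT:-9200}"
}

# Print the number of documents currently in the index.
count_indexed_entities() {
  curl --silent --fail \
    "$(elastic_base_url)/${ELASTIC_INDEX_NAME:-g2index}/_count?pretty"
}

# Show health, document count, and size on disk for the index.
show_index_summary() {
  curl --silent --fail \
    "$(elastic_base_url)/_cat/indices/${ELASTIC_INDEX_NAME:-g2index}?v"
}
```

Running `count_indexed_entities` after the container exits should report a non-zero count if entities were posted. A failure from curl usually means the index was never created or the host and port are wrong.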

Search data

  1. Open Kibana in a web browser. The default address is localhost:5601.

  2. Navigate to the Discover tab.

  3. Create Index.

    • If all was done correctly, a new screen with a "Create data view" button should appear.
    • Click the button and, in the index pattern box, type the name of the index that was created. This is the value of the ELASTIC_INDEX_NAME variable set earlier, and it should also appear on the right side of the popup.
    • The Name field can be set but is not required.
  4. Press "Save data view to Kibana" at the bottom of the screen. You can now view the created index and run searches. If fuzzy searches are needed, click on "Saved Query" and switch the language to lucene. There you can view the lucene syntax and see how to do fuzzy searches.
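
As a rough illustration of the lucene fuzzy syntax, a term followed by `~` and an optional edit distance matches near spellings, so `smith~1` also matches `smyth`. The same text can be typed into the Kibana search bar once the language is set to lucene. The sketch below sends the equivalent query through the `q` parameter of the elasticsearch `_search` endpoint. The helper names, the search term, and the `ELASTIC_CHECK_HOST` variable are hypothetical, and the query searches all default fields because the field names in the index are not listed here.

```shell
# Build a lucene fuzzy term; the edit distance defaults to 2.
fuzzy_term() {
  printf '%s~%s' "$1" "${2:-2}"
}

# Run a fuzzy search against the index from the Docker host.
# ELASTIC_CHECK_HOST is a hypothetical override; localhost is an assumed default.
fuzzy_search() {
  curl --silent --fail --get \
    --data-urlencode "q=$(fuzzy_term "$1" "${2:-2}")" \
    --data-urlencode "pretty=true" \
    "http://${ELASTIC_CHECK_HOST:-localhost}:${ELASTIC_PORT:-9200}/${ELASTIC_INDEX_NAME:-g2index}/_search"
}
```

For example, `fuzzy_search smith 1` would return entities whose indexed text contains a term within one edit of "smith".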
