The HBase Indexer allows you to easily and quickly index HBase rows into Solr. It uses HBase replication to copy rows from HBase to the Solr index in real-time.
This project is a fork of the original hbase-indexer project from NGDATA, with these additional features:
-
Support for HBase 0.94, 0.96, 0.98, 1.1.0 and 1.1.2
-
Support for Solr 5.x (Solr 6.x to come soon)
-
Kerberos authentication
-
Output to Lucidworks Fusion pipelines (optional)
Important
|
Due to the changes required for Solr 5.x support, builds from this fork of the main project will not work with Solr 4.x. |
A standalone library for asynchronously processing HBase mutation events by hooking into HBase replication, see the SEP readme.
A standalone utility to monitor HBase replication progress, see the SEP-tools readme.
You can build the full hbase-indexer project as follows:
mvn clean package -DskipTests
The default build is linked to HBase 0.94. In order to build for another version, you should specify the version with the -Dhbase.api
parameter. For example, to build for HBase 0.98, run the following command:
mvn clean package -DskipTests -Dhbase.api=0.98
To build for HBase 1.1.0, run the following command:
mvn clean package -DskipTests -Dhbase.api=1.1.0
The HBase Indexer works as an HBase replication sink and runs as a daemon. When updates are written to HBase region servers, they are replicated to the HBase Indexer processes. These processes then pass the updates to Solr for indexing.
To use the HBase Indexer, there are several configuration files to prepare before using the HBase Indexer. Then a specific configuration for your your HBase table must be supplied to map HBase columns and rows to Solr fields and documents.
If you are already using HBase replication, the HBase Indexer process should not interfere.
Each of the following configuration changes need to be made to use the HBase Indexer.
Note
|
These instructions assume Solr is running and a collection already exists for the data to be indexed. Solr does not need to be restarted to start using the HBase Indexer. |
The first step to configuring the HBase Indexer is to define the ZooKeeper connect string and quorum location in hbase-indexer-site.xml
, found in /opt/${namePackage}/hbase-indexer/conf
.
The properties to define are the hbaseindexer.zookeeper.connectstring
and hbaseindexer.zookeeper.quorum
as in this example:
<configuration>
<property>
<name>hbaseindexer.zookeeper.connectstring</name>
<value>server1:2181,server2:2181,server3:2181</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>server1,server2,server3</value>
</property>
</configuration>
Customize the server names and ports as appropriate for your environment.
The hbase-site.xml
file on each HBase node needs to be modified to include several new properties, as shown below:
<configuration>
<!-- SEP is basically replication, so enable it -->
<property>
<name>hbase.replication</name>
<value>true</value>
</property>
<property>
<name>replication.source.ratio</name>
<value>1.0</value>
</property>
<property>
<name>replication.source.nb.capacity</name>
<value>1000</value>
</property>
<property>
<name>replication.replicationsource.implementation</name>
<value>com.ngdata.sep.impl.SepReplicationSource</value>
</property>
</configuration>
In more detail, these properties are:
-
hbase.replication
: True or false; enables replication. -
replication.source.ratio
: Source ratio of 100% makes sure that each SEP consumer is actually used (otherwise, some can sit idle, especially with small clusters). -
replication.source.ratio
: The ratio of the number of slave cluster regionservers the master cluster regionserver will connect to for updates. This should be set to 1.0 to get updates from all slaves. -
replication.source.nb.capacity
: The maximum number of hlog entries to replicate in one go. If this is large, and a consumer takes a while to process the events, the HBase rpc call will time out. -
replication.replicationsource.implementation
: A custom replication source for indexing content to Solr.
Once these values have been inserted to hbase-site.xml
, HBase will need to be restarted. That is an upcoming step in the configuration.
Tip
|
If you use Ambari to manage your cluster, you can set these properties by going to the HBase Configs screen at Ambari → Configs → Advanced tab → Custom hbase-site. |
Once you have added the new properties to hbase-site.xml
, copy it to the hbase-indexer conf
directory. In many cases, this is from /etc/hbase/conf
to hbase-indexer/conf
.
Copying this file ensures all of the parameters configured for HBase (such as settings for Kerberos and ZooKeeper) are available to the HBase Indexer.
The SEP replication being used by HBase Indexer requires 4 .jar files to be copied from the HBase Indexer distribution to each HBase node.
These .jar files can be found in the hbase-indexer/lib
directory.
They need to be copied to the $HBASE_HOME/lib
directory on each node running HBase. These files are:
-
hbase-sep-api-2.2.2.jar
-
hbase-sep-impl-2.2.2.jar
-
hbase-sep-impl-common-2.2.2.jar
-
hbase-sep-tools-2.2.2.jar
If you want to index content to a Solr cluster that has been secured with Kerberos for internode communication, you will need to apply additional configuration.
A JAAS file configures the authentication properties, and will include a section for a service principal and keytab file for a user who has access to both HBase and Solr. This user should be a different user than the service principal that Solr is using for internode communication.
To configure HBase Indexer to be able to write to Kerberized Solr, you will modify the hbase-indexer
script found in /opt/${namePackage}/hbase-indexer/bin
.
Find the section where two of the properties are commented out by default:
#HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.file="
#HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.appname="
Uncomment those properties to supply the correct values (explained below), and then add another property:
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Djava.security.auth.login.config="
The three together should look similar to this:
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.file="
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Djava.security.auth.login.config="
HBASE_INDEXER_OPTS="$HBASE_INDEXER_OPTS -Dlww.jaas.appname="
Remove the #
to uncomment existing lines, and supply values for each property:
-Dlww.jaas.file
-
The full path to a JAAS configuration file that includes a section to define the keytab location and service principal that will be used to run the HBase Indexer. This user must have access to both HBase and Solr, but should be a different user.
-Djava.security.auth.login.config
-
The path to the JAAS file which includes a section for the HBase user. This can be the same file that contains the definitions for the HBase Indexer user, but must be defined separately.
-Dlww.jaas.appname
-
The name of the section in the JAAS file that includes the service principal and keytab location for the user who will run the HBase Indexer, as shown below. If this is not defined, a default of "Client" will be used.
Here is a sample JAAS file, and the areas that must be changed for your environment:
Client { --(1)
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/data/hbase.keytab" --(2)
storeKey=true
useTicketCache=false
debug=true
principal="hbase@SOLRSERVER.COM"; --(3)
};
SolrClient { --(4)
com.sun.security.auth.module.Krb5LoginModule required
useKeyTab=true
keyTab="/data/solr-indexer.keytab" --(5)
storeKey=true
useTicketCache=false
debug=true
principal="solr-indexer@SOLRSERVER.COM"; --(6)
};
-
The name of the section of the JAAS file for the HBase user. This first section named "Client" should contain the proper credentials for HBase and ZooKeeper.
-
The path to the keyTab file for the HBase user. The user running the HBase Indexer must have access to this file.
-
The service principal name for the HBase user.
-
The name of the section for the user who will run the HBase Indexer. This second section is named "SolrClient" and should contain the proper credentials for the user who will run the HBase Indexer. This section name will be used with the
-Dlww.jaas.appname
parameter as described earlier. -
The path to the keyTab file for the Solr user. The user running the HBase Indexer must have access to this file.
-
The service principal name. This should be a different principal than the one used for Solr, but must have access to both Solr and HBase.
When configuration is complete and HBase has been restarted, you can start the HBase Indexer daemon.
From hbase-indexer/bin
, run:
hbase-indexer server &
The &
portion of this command will run the server in the background. If you would like to run it in the foreground, omit the &
part of the above example.
At this point the HBase Indexer daemon is running, you are ready to configure an indexer to start indexing content.
Once the HBase configuration files have been updated, HBase has been restarted on all nodes, and the daemon started, the next step is to create an indexer to stream data from a specific HBase table to Solr.
Important
|
The HBase table that will be indexed must have the REPLICATION_SCOPE set to "1".
|
If you are not familiar with HBase, the HBase Indexer tutorial provides a good introduction to creating a simple table and indexing it to Solr.
In order to process HBase events, an indexer must be created.
First you need to create an indexer configuration file, which is a simple XML file to tell the HBase Indexer how to map HBase columns to Solr fields, For example:
<?xml version="1.0"?>
<indexer table="indexdemo-user">
<field name="firstname_s" value="info:firstname"/>
<field name="lastname_s" value="info:lastname"/>
<field name="age_i" value="info:age" type="int"/>
</indexer>
Note that this defines the Solr field name, then the HBase column, and optionally, the field type. The indexer-table
value must also reflect the name of the table you intend to index.
More details on the indexer configuration options are available from https://github.com/NGDATA/hbase-indexer/wiki/Indexer-configuration.
Once we have an indexer configuration file, we can then start the indexer itself with the add-indexer
command.
This command takes several properties, as in this example:
./hbase-indexer add-indexer \ -- (1)
-n myindexer \ -- (2)
-c indexer-conf.xml \ -- (3)
-cp solr.zk=server1:3181,server2:3181,server3:3181/solr \ -- (4)
-cp solr.collection=myCollection -- (5)
-z server1:3181,server2:3181,server3:3181 -- (6)
-
The
hbase-indexer
script is found in/opt/${namePackage}/hbase-indexer/bin
directory. Theadd-indexer
command adds the indexer to the running hbase-indexer daemon. -
The
-n
property provides a name for the indexer. -
The
-c
property defines the location of the indexer configuration file. Provide the path as well as the filename if you are launching the -
The
-cp
property allows you to provide a key-value pair. In this case, we use thesolr.zk
property to define the location of the ZooKeeper ensemble used with Solr. The ZooKeeper connect string should include/solr
as the znode path. -
Another
-cp
key-value pair, which defines thesolr.collection
property with a value of a collection name in Solr that the documents should be indexed to. This collection must exist prior to running the indexer. -
The
-z
property defines the location of the ZooKeeper ensemble used with HBase. This may be the same as was defined in item (4) above, but needs to be additionally defined.
More details on the options for add-indexer
are available from https://github.com/NGDATA/hbase-indexer/wiki/CLI-tools.