Hadoop Database Connectors

Alex Bush edited this page Dec 17, 2018 · 14 revisions

The Hadoop Database Connectors are connection objects used by the ParquetDataCommitter object. They are a Waimak abstraction over a Hadoop-based Metastore database connection (e.g. Hive or Impala), and are used to generate correct SQL schema DDLs and to manage the schema information in the Metastore.

Impala Connections

These connection objects are used to connect to an Impala service and are available in the waimak-impala package. The Impala SQL dialect is implemented by the ImpalaDBConnector trait, which can be extended to provide a custom Impala connection implementation.

Impala connection over JDBC

The class ImpalaJDBCConnector can be used to create a connection to an Impala service using JDBC. It has the following definition:

```scala
case class ImpalaJDBCConnector(sparkSession: SparkSession,
                               jdbcString: String,
                               forceRecreateTables: Boolean = false,
                               properties: java.util.Properties = new java.util.Properties(),
                               secureProperties: Map[String, String] = Map.empty) extends ImpalaDBConnector
```

The constructor takes a SparkSession object, a JDBC connection string to the Impala service, and an optional boolean flag forceRecreateTables (default false). If this flag is set to true, all tables committed using this connection object will be dropped and recreated. This is useful when the schema of the tables may have changed (e.g. new columns have been added).
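As an illustration of the required parameters and the forceRecreateTables flag, a connector could be created roughly as follows. This is a minimal sketch intended for a spark-shell or script context: the import package for ImpalaJDBCConnector, the application name, and the JDBC URL are assumptions and should be replaced with the values for your build and cluster.

```scala
import org.apache.spark.sql.SparkSession
// Assumed package name for the waimak-impala module; check it against your Waimak version.
import com.coxautodata.waimak.metastore.ImpalaJDBCConnector

val spark: SparkSession = SparkSession.builder()
  .appName("impala-connector-example") // hypothetical application name
  .getOrCreate()

// Placeholder JDBC URL; the exact format depends on the JDBC driver in use.
val jdbcUrl = "jdbc:impala://impala-host.example.com:21050/default"

// Tables committed through this connector will be dropped and recreated,
// because forceRecreateTables is set to true.
val connector = ImpalaJDBCConnector(
  sparkSession = spark,
  jdbcString = jdbcUrl,
  forceRecreateTables = true
)
```

Leaving forceRecreateTables at its default of false is the more conservative choice when table schemas are stable.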

The constructor also takes an optional Java Properties object as properties and an optional Map[String, String] object as secureProperties. The properties object is passed to the JDBC connection to provide extra connection parameters (e.g. a username). The secureProperties map provides additional JDBC connection properties whose values are read securely from a JCEKS file configured with hadoop.security.credential.provider.path. In each map entry, the key is the name of the parameter in the JCEKS file and the value is the name that parameter should have in the JDBC connection.
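The direction of the secureProperties mapping is easy to get backwards, so the sketch below illustrates it with plain Scala and no Waimak or Hadoop dependency. The resolve helper is hypothetical and is not Waimak's implementation; the lookup function stands in for reading a credential from the JCEKS file, and the property names and alias are illustrative only.

```scala
import java.util.Properties
import scala.collection.JavaConverters._

object SecurePropertiesSketch {

  // Plain connection parameters, as would be passed in `properties`.
  val properties: Properties = {
    val p = new Properties()
    p.setProperty("user", "etl_user") // hypothetical username
    p
  }

  // JCEKS alias -> JDBC property name, as would be passed in `secureProperties`.
  val secureProperties: Map[String, String] =
    Map("impala.jdbc.password" -> "password") // hypothetical alias

  // Hypothetical helper: copies `base`, then adds one JDBC property per mapping
  // entry, taking the value from `lookup` (a stand-in for the credential provider).
  def resolve(base: Properties,
              mapping: Map[String, String],
              lookup: String => Option[String]): Properties = {
    val merged = new Properties()
    base.stringPropertyNames().asScala.foreach { k =>
      merged.setProperty(k, base.getProperty(k))
    }
    mapping.foreach { case (alias, jdbcKey) =>
      val secret = lookup(alias).getOrElse(
        sys.error(s"No credential found for alias '$alias'"))
      merged.setProperty(jdbcKey, secret)
    }
    merged
  }
}
```

With the real connector, the two values above would simply be passed as the properties and secureProperties constructor arguments, and the lookup against the JCEKS file is handled for you.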

Creating connectors, allowing for updatable tables and force recreate flag.
