Skip to content
This repository has been archived by the owner on Jan 27, 2020. It is now read-only.

Google BigQuery Export

Utz Westermann edited this page Jan 25, 2018 · 10 revisions

Summary

You can configure a parallel export of view data to a Google BigQuery dataset by specifying an exportTo() clause with BigQuery in a view. Whenever the view performs its transformation successfully and materializes, Schedoscope triggers a mapreduce job that writes the view's data (i.e., the data of the Hive partition the view represents) to a table within the specified dataset.

Syntax

def BigQuery(
    v: View,
    projectId: String = Schedoscope.settings.bigQueryExportProjectId,
    gcpKey: String = Schedoscope.settings.bigQueryExportGcpKey,
    gcpKeyFile: String = null,
    storageBucket: String = Schedoscope.settings.bigQueryExportStorageBucket,
    storageBucketFolderPrefix: String = Schedoscope.settings.storageBucketFolderPrefix,
    storageBucketRegion: String = Schedoscope.settings.bigQueryExportStorageBucketRegion,
    dataLocation: String = Schedoscope.settings.bigQueryExportDataLocation,
    numReducers: Int = Schedoscope.settings.jdbcExportNumReducers,
    flushInterval: Long = Schedoscope.settings.bigQueryExportFlushInterval,	
    proxyHost: String = Schedoscope.settings.bigQueryExportProxyHost,
    proxyPort: String = Schedoscope.settings.bigQueryExportProxyPort,
    isKerberized: Boolean = !Schedoscope.settings.kerberosPrincipal.isEmpty(),
    kerberosPrincipal: String = Schedoscope.settings.kerberosPrincipal,
    metastoreUri: String = Schedoscope.settings.metastoreUri,
    exportSalt: String = Schedoscope.settings.exportSalt)

Description

For the BigQuery export to work you need a GCP user with at least

  • write access to a GCP Cloud Storage bucket of your choice for storing temporary data blobs;
  • permissions to create BigQuery datasets in a GCP project;
  • permissions to create and drop tables within these datasets.

The BigQuery target dataset and table with carry the same name as the Hive database and table you are exporting, respectively. The export will try to create a BigQuery table schema that closely matches the schema of the original Hive table.

There are, however, limitations in BigQuery schema expressiveness:

  • BigQuery cannot represent generic maps. Field of such type are mapped to string fields formatted like JSON maps;
  • BigQuery cannot represent lists of lists. Field of such type are mapped to string fields formatted like JSON lists;
  • BigQuery cannot represent lists of maps. Field of such type are mapped to string fields formatted like JSON lists with embedded map structures.

All other constructs should map naturally to BigQuery counterparts.

The export tries to do its best when it comes to mapping partition columns of views. BigQuery only supports temporal partitions by day. Hence, monthly and daily partitioned views are mapped to BigQuery daily partitions, choosing the first of a given month for monthly partitioned views.

As for other non-temporal partition columns, their values are mapped to table suffixes (ordered alphabetically according to column name, separated by underscores) so that you can at least make use of table wildcards in your BigTable queries for logical partition selection.

Should a view be exported more than once, the rows of previous exports are overwritten in the target table to avoid duplicates. Each target table gets an additional column _USED_HCAT_FILTER which contains the parameter values of the exported view.

Here is a description the parameters you must or can pass to Jdbc exports:

  • v: the view to export.
  • projectId: GCP project ID under which exported BigQuery dataset will be created. If not set, this is the default GCP project of the current user. Can be globally configured by setting schedoscope.export.bigQuery.projectId.
  • gcpKey: GCP key in JSON format to use for authentication when exporting to BigQuery. If not set, the local gcloud key of the user running Schedoscope is used. Can be globally configured by setting schedoscope.export.bigQuery.gcpKey.
  • gcpKeyFile: An absolute path pointing to the GCP key in JSON format to use for authentication when exporting to BigQuery. If not set, the local gcloud key of the user running Schedoscope is used (or gcpKey).
  • storageBucket: GCP Cloud Storage bucket to use for temporary storage while exporting to BigQuery. Defaults to schedoscope-bigquery-export. Can be globally configured by setting schedoscope.export.bigQuery.storageBucket.
  • storageBucketFolderPrefix: Folder prefix to apply to blobs in the GCP Cloud Storage bucket while exporting to BigQuery. Defaults to "". Can be globally configured by setting schedoscope.export.bigQuery.storageBucketFolderPrefix.
  • storageBucketRegion: GCP Cloud Storage bucket region to use for exporting to BigQuery. Defaults to europe-west3. Can be globally configured by setting schedoscope.export.bigQuery.storageBucketRegion.
  • dataLocation: GCP data storage location of exported data within BigQuery. Defaults to EU. Can be globally configured by setting schedoscope.export.bigQuery.dataLocation.
  • numReducers: Number of reducers to use for BigQuery export. Defines the parallelism. Defaults to 10. Can be globally configured by setting schedoscope.export.bigQuery.numReducers.
  • flushInterval: Number of records to batch before flushing data to GCP Cloud Storage. Defaults to 10000. Can be globally configured by setting schedoscope.export.bigQuery.flushInterval.
  • proxyHost: Host of proxy to use for GCP API access. Set to empty, i.e., no proxy to use. Can be globally configured by setting schedoscope.export.bigQuery.proxyHost.
  • proxyPort: Port of proxy to use for GCP API access. Set to empty, i.e., no proxy to use. Can be globally configured by setting schedoscope.export.bigQuery.proxyPort.
  • isKerberized: Is the cluster kerberized?
  • kerberosPrincipal: Kerberos principal to use. Can be globally configured by setting schedoscope.kerberos.principal.
  • metastoreUri: URI of the metastore. Can be globally configured by setting schedoscope.metastore.metastoreUri.
  • exportSalt: Salt to use for anonymization. Can be globally configured by setting schedoscope.export.salt.

Example

package test.export

case class ClickWithBigQueryExport(
  year: Parameter[String],
  month: Parameter[String],
  day: Parameter[String]) extends View
       with DailyParameterization {

  val id = fieldOf[String]
  val url = fieldOf[String]

  val stage= dependsOn(() => Stage(year, month, day))

  transformVia(
    () => HiveTransformation(
      insertInto(this, s"""SELECT id, url FROM ${stage().tableName} WHERE date_id='${dateId}'""")))

   // This would create a daily partitioned BigQuery table click_with_big_query_export in the dataset
   // ${env}_test_export. It would have the columns id, url, year, month, day, date_id plus the column
   // _USED_HCAT_FILTER.
   //
   // To perform the export, temporary blobs will be created in the default bucket 
   // schedoscope_bigquery_export, creating that bucket if it does not exist.

   exportTo(() => BigQuery(this))

}
Clone this wiki locally