Google BigQuery Export
You can configure a parallel export of view data to a Google BigQuery dataset by specifying an exportTo() clause with BigQuery in a view. Whenever the view performs its transformation successfully and materializes, Schedoscope triggers a MapReduce job that writes the view's data (i.e., the data of the Hive partition the view represents) to a table within the specified dataset.
def BigQuery(
  v: View,
  projectId: String = Schedoscope.settings.bigQueryExportProjectId,
  gcpKey: String = Schedoscope.settings.bigQueryExportGcpKey,
  gcpKeyFile: String = null,
  storageBucket: String = Schedoscope.settings.bigQueryExportStorageBucket,
  storageBucketFolderPrefix: String = Schedoscope.settings.storageBucketFolderPrefix,
  storageBucketRegion: String = Schedoscope.settings.bigQueryExportStorageBucketRegion,
  dataLocation: String = Schedoscope.settings.bigQueryExportDataLocation,
  numReducers: Int = Schedoscope.settings.jdbcExportNumReducers,
  flushInterval: Long = Schedoscope.settings.bigQueryExportFlushInterval,
  proxyHost: String = Schedoscope.settings.bigQueryExportProxyHost,
  proxyPort: String = Schedoscope.settings.bigQueryExportProxyPort,
  isKerberized: Boolean = !Schedoscope.settings.kerberosPrincipal.isEmpty(),
  kerberosPrincipal: String = Schedoscope.settings.kerberosPrincipal,
  metastoreUri: String = Schedoscope.settings.metastoreUri,
  exportSalt: String = Schedoscope.settings.exportSalt)
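Since all parameters except v have defaults, a typical export only overrides what differs from the global configuration. A minimal sketch, assuming hypothetical project and bucket names:

exportTo(() => BigQuery(
  this,
  projectId = "my-gcp-project",       // hypothetical GCP project ID
  storageBucket = "my-export-bucket", // hypothetical Cloud Storage bucket for temporary blobs
  numReducers = 20))                  // raise export parallelism beyond the default of 10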
For the BigQuery export to work, you need a GCP user with at least:
- write access to a GCP Cloud Storage bucket of your choice for storing temporary data blobs;
- permissions to create BigQuery datasets in a GCP project;
- permissions to create and drop tables within these datasets.
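The credentials of such a user are typically supplied as a JSON key, for instance via the gcpKeyFile parameter described below. A minimal sketch, assuming a hypothetical key path:

exportTo(() => BigQuery(
  this,
  gcpKeyFile = "/etc/schedoscope/bigquery-export-key.json")) // hypothetical path to the user's JSON key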
The BigQuery target dataset and table will carry the same names as the Hive database and table you are exporting, respectively. The export will try to create a BigQuery table schema that closely matches the schema of the original Hive table.
There are, however, limitations in BigQuery schema expressiveness:
- BigQuery cannot represent generic maps. Fields of such a type are mapped to string fields formatted like JSON maps;
- BigQuery cannot represent lists of lists. Fields of such a type are mapped to string fields formatted like JSON lists;
- BigQuery cannot represent lists of maps. Fields of such a type are mapped to string fields formatted like JSON lists with embedded map structures.
All other constructs should map naturally to BigQuery counterparts.
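For illustration, here is a hypothetical view sketch showing how fields of such types would surface in the exported table (view and field names are invented; the mappings are the ones listed above):

// A hypothetical view illustrating the type mappings listed above.
case class ClickAttributes() extends View {
  val properties = fieldOf[Map[String, String]]    // exported as a STRING column holding a JSON map
  val tagMatrix = fieldOf[List[List[String]]]      // exported as a STRING column holding a JSON list
  val events = fieldOf[List[Map[String, String]]]  // exported as a STRING column holding a JSON list with embedded maps
  val clicks = fieldOf[Long]                       // maps naturally to an INT64 column
}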
The export does its best to map the partition columns of views. BigQuery only supports temporal partitioning by day. Hence, monthly and daily partitioned views are both mapped to BigQuery daily partitions, with monthly partitioned views assigned to the first day of the given month.
As for other, non-temporal partition columns, their values are mapped to table suffixes (ordered alphabetically by column name, separated by underscores) so that you can at least make use of table wildcards in your BigQuery queries for logical partition selection.
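As an illustration, consider a hypothetical monthly partitioned view with an additional non-temporal parameter (names invented):

// Hypothetical view: shopCode is a non-temporal partition column, year/month are temporal.
case class ClickByShop(
  shopCode: Parameter[String],
  year: Parameter[String],
  month: Parameter[String]) extends View
  with MonthlyParameterization {
  val id = fieldOf[String]
  exportTo(() => BigQuery(this))
}

Exporting, say, shopCode = "EC01" for year = "2024" and month = "05" would then land in a target table suffixed with the shop code value and in the daily partition of 2024-05-01, the first day of that month.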
Should a view be exported more than once, the rows of previous exports are overwritten in the target table to avoid duplicates. Each target table gets an additional column _USED_HCAT_FILTER, which contains the parameter values of the exported view.
Here is a description of the parameters you must or can pass to BigQuery exports:
- v: the view to export.
- projectId: GCP project ID under which the exported BigQuery dataset will be created. If not set, this is the default GCP project of the current user. Can be globally configured by setting schedoscope.export.bigQuery.projectId.
- gcpKey: GCP key in JSON format to use for authentication when exporting to BigQuery. If not set, the local gcloud key of the user running Schedoscope is used. Can be globally configured by setting schedoscope.export.bigQuery.gcpKey.
- gcpKeyFile: an absolute path pointing to the GCP key in JSON format to use for authentication when exporting to BigQuery. If not set, the local gcloud key of the user running Schedoscope (or gcpKey) is used.
- storageBucket: GCP Cloud Storage bucket to use for temporary storage while exporting to BigQuery. Defaults to schedoscope-bigquery-export. Can be globally configured by setting schedoscope.export.bigQuery.storageBucket.
- storageBucketFolderPrefix: folder prefix to apply to blobs in the GCP Cloud Storage bucket while exporting to BigQuery. Defaults to "". Can be globally configured by setting schedoscope.export.bigQuery.storageBucketFolderPrefix.
- storageBucketRegion: GCP Cloud Storage bucket region to use for exporting to BigQuery. Defaults to europe-west3. Can be globally configured by setting schedoscope.export.bigQuery.storageBucketRegion.
- dataLocation: GCP data storage location of the exported data within BigQuery. Defaults to EU. Can be globally configured by setting schedoscope.export.bigQuery.dataLocation.
- numReducers: number of reducers to use for the BigQuery export; defines the parallelism. Defaults to 10. Can be globally configured by setting schedoscope.export.bigQuery.numReducers.
- flushInterval: number of records to batch before flushing data to GCP Cloud Storage. Defaults to 10000. Can be globally configured by setting schedoscope.export.bigQuery.flushInterval.
- proxyHost: host of the proxy to use for GCP API access. Empty by default, i.e., no proxy is used. Can be globally configured by setting schedoscope.export.bigQuery.proxyHost.
- proxyPort: port of the proxy to use for GCP API access. Empty by default, i.e., no proxy is used. Can be globally configured by setting schedoscope.export.bigQuery.proxyPort.
- isKerberized: whether the cluster is kerberized.
- kerberosPrincipal: Kerberos principal to use. Can be globally configured by setting schedoscope.kerberos.principal.
- metastoreUri: URI of the metastore. Can be globally configured by setting schedoscope.metastore.metastoreUri.
- exportSalt: salt to use for anonymization. Can be globally configured by setting schedoscope.export.salt.
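Here is an example of a view with a BigQuery export: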
package test.export

import org.schedoscope.dsl.{ Parameter, View }
import org.schedoscope.dsl.views.DailyParameterization
import org.schedoscope.dsl.transformations.HiveTransformation
import org.schedoscope.dsl.transformations.HiveTransformation.insertInto
// Location of the Export helpers assumed; it may vary between Schedoscope versions.
import org.schedoscope.dsl.transformations.Export.BigQuery

case class ClickWithBigQueryExport(
  year: Parameter[String],
  month: Parameter[String],
  day: Parameter[String]) extends View
  with DailyParameterization {

  val id = fieldOf[String]
  val url = fieldOf[String]

  // Stage is assumed to be another view defined in this package.
  val stage = dependsOn(() => Stage(year, month, day))

  transformVia(
    () => HiveTransformation(
      insertInto(this, s"""SELECT id, url FROM ${stage().tableName} WHERE date_id='${dateId}'""")))

  // This would create a daily partitioned BigQuery table click_with_big_query_export in the dataset
  // ${env}_test_export. It would have the columns id, url, year, month, day, date_id plus the column
  // _USED_HCAT_FILTER.
  //
  // To perform the export, temporary blobs will be created in the default bucket
  // schedoscope_bigquery_export, creating that bucket if it does not exist.
  exportTo(() => BigQuery(this))
}