Scripts to set up LTV on GCP.
To start running LTV on your local machine:
Set up a local environment with virtualenv using Python 2.7 (the version required by the latest Dataflow Python SDK); see the example commands after these steps.
git clone https://github.com/mozilla-it-data/ltv_v3.git
cd ltv_v3/
pip install apache-beam[gcp]
export GOOGLE_APPLICATION_CREDENTIALS=<your json api key file here>
Run the Dataflow job locally using the DirectRunner:
python ltv_beam.py --project imposing-union-227917 --temp_location gs://ltv-dataflow/tmp --staging_location gs://ltv-dataflow/staging
This will run the job using your local machine's CPU and memory to process the LTV data on GCP.
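For the virtualenv step above, a possible setup, assuming virtualenv is installed and a python2.7 interpreter is on your PATH (the environment name ltv-env is arbitrary):
virtualenv -p python2.7 ltv-env
source ltv-env/bin/activate
Run the pip install and the commands that follow from inside the activated environment.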
Create & stage template on GCP:
python ltv_beam.py --runner DataflowRunner --project imposing-union-227917 --staging_location gs://ltv-dataflow/staging --temp_location gs://ltv-dataflow/tmp --template_location gs://ltv-dataflow/templates/ltv-dataflow-template --requirements_file requirements.txt
This packages the Dataflow pipeline defined in ltv_beam.py into an executable template and adds any required libraries to the staging_location. The template job references the packaged pipeline and is defined by the file written to template_location.
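As an optional check (not part of the original steps), you can confirm that the template file and staged dependencies were created:
gsutil ls gs://ltv-dataflow/templates/
gsutil ls gs://ltv-dataflow/staging/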
Execute the template on Dataflow as a batch job:
gcloud dataflow jobs run run-ltv-dataflow-template --gcs-location gs://ltv-dataflow/templates/ltv-dataflow-template
This runs the template. You can view the progress from the Dataflow UI.
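You can also follow the job from the command line, and, if the pipeline defines runtime parameters (the key=value below is a placeholder), pass them with --parameters:
gcloud dataflow jobs list --project imposing-union-227917
gcloud dataflow jobs run run-ltv-dataflow-template --gcs-location gs://ltv-dataflow/templates/ltv-dataflow-template --parameters <key>=<value>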
The schema file used by BigQuery is located in ltv-dataflow/templates/input.
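Assuming this refers to the path under the gs://ltv-dataflow bucket, the schema file can be listed with gsutil:
gsutil ls gs://ltv-dataflow/templates/input/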
The following command will synchronize data between an Amazon S3 bucket and a Cloud Storage bucket:
gsutil rsync -d -r s3://my-aws-bucket gs://example-bucket
(To be set up and tested.)
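For the S3 side of the sync, gsutil reads AWS credentials from the boto configuration file (typically ~/.boto); the values below are placeholders:
[Credentials]
aws_access_key_id = <your AWS access key ID>
aws_secret_access_key = <your AWS secret access key>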
Upload to GCP: make sure you are in the gcf directory containing the index.js and package.json files.
gcloud beta functions deploy triggerDataFlowLTV --stage-bucket ltv-dataflow --trigger-bucket ltv-test-copy
Test that GCF triggers Dataflow for ltv-beam-template: upload a file named _SUCCESS to the ltv-test-copy bucket.
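For example, one way to create and upload the marker file and then check that the function fired (the logs command assumes the function name used above):
touch _SUCCESS
gsutil cp _SUCCESS gs://ltv-test-copy/
gcloud functions logs read triggerDataFlowLTV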
Licensed under ... For details, see the LICENSE file.
