
Commit

Updating examples and spelling (GoogleCloudPlatform#678)
Co-authored-by: Abdel SGHIOUAR <abdelfettah@google.com>
bluPhy and boredabdel authored Aug 13, 2021
1 parent 255883b commit 20a0765
Showing 48 changed files with 131 additions and 128 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -91,8 +91,8 @@ The tools folder contains ready-made utilities which can simplify Google Cloud P
* [CUD Prioritized Attribution](tools/cuds-prioritized-attribution) - A tool that allows GCP customers who purchased Committed Use Discounts (CUDs) to prioritize a specific scope (e.g. project or folder) to attribute CUDs first before letting any unconsumed discount float to other parts of an organization.
* [Custom Role Manager](tools/custom-role-manager) - Manages organization- or project-level custom roles by combining predefined roles and including and removing permissions with wildcards. Can run as Cloud Function or output Terraform resources.
* [DNS Sync](tools/dns-sync) - Sync a Cloud DNS zone with GCE resources. Instances and load balancers are added to the cloud DNS zone as they start from compute_engine_activity log events sent from a pub/sub push subscription. Can sync multiple projects to a single Cloud DNS zone.
-* [Firewall Enforcer](tools/firewal-enforcer) - Automatically watch & remove illegal firewall rules across organization. Firewall rules are monitored by a Cloud Asset Inventory Feed, which trigger a Cloud Function that inspects the firewall rule and deletes it if it fails a test.
-* [GCE Disk Encryption Converter](tools/gce-google-keys-to-cmek) - A tool that converts disks attached to a GCE VM instnace from Google-managed keys to a customer-managed key stored in Cloud KMS.
+* [Firewall Enforcer](tools/firewall-enforcer) - Automatically watch & remove illegal firewall rules across organization. Firewall rules are monitored by a Cloud Asset Inventory Feed, which trigger a Cloud Function that inspects the firewall rule and deletes it if it fails a test.
+* [GCE Disk Encryption Converter](tools/gce-google-keys-to-cmek) - A tool that converts disks attached to a GCE VM instance from Google-managed keys to a customer-managed key stored in Cloud KMS.
* [GCE Quota Sync](tools/gce-quota-sync) - A tool that fetches resource quota usage from the GCE API and synchronizes it to Stackdriver as a custom metric, where it can be used to define automated alerts.
* [GCE Usage Log](tools/gce-usage-log) - Collect GCE instance events into a BigQuery dataset, surfacing your vCPUs, RAM, and Persistent Disk, sliced by project, zone, and labels.
* [GCP Architecture Visualizer](https://github.com/forseti-security/forseti-visualizer) - A tool that takes CSV output from a Forseti Inventory scan and draws out a dynamic hierarchical tree diagram of org -> folders -> projects -> gcp_resources using the D3.js javascript library.
22 changes: 11 additions & 11 deletions examples/bigquery-table-access-pattern-analysis/README.md
@@ -1,5 +1,5 @@
# Table Access Pattern Analysis
-This module consists of deep dive analysis of a BigQuery environment in Google Cloud Platform, according to audt logs - data access data, which can be used to optimise BigQuery usage, and improve time, space and cost of BigQuery.
+This module consists of deep dive analysis of a BigQuery environment in Google Cloud Platform, according to audit logs - data access data, which can be used to optimise BigQuery usage, and improve time, space and cost of BigQuery.

## Pipeline Optimisation

@@ -15,7 +15,7 @@ with its destination table set to return_purchases table. The source table of th

![](assets/pipeline-definition.png)

-In the illustriation above, one of the pipeline involves T1 and T2 as its source tables and T5 as its destination table.
+In the illustration above, one of the pipeline involves T1 and T2 as its source tables and T5 as its destination table.

Given enough historical data from the audit logs, you can group queries which have the same source table(s) and destination table pair, and see when these were executed. Same source table(s) - destination table pair will almost always come from the same query, even if they are a different query, the semantics should be similar, so this assumption is still valid. After grouping the source table(s) - destination table pair, you might be able to see a pattern in their execution history. You might see that this pair is executed hourly, or daily, or even monthly, and when it was last executed.
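
As a rough illustration of that grouping idea, the sketch below buckets a job history by its (source tables, destination table) pair and applies a simple gap-based heuristic to guess the cadence. It is a minimal sketch under assumed field names and thresholds; the real tool expresses this logic in BigQuery routines, so nothing below reflects its actual implementation.

```python
from collections import defaultdict

def infer_schedule(timestamps, tolerance=0.2):
    """Guess a cadence from sorted execution timestamps using the median gap (heuristic only)."""
    if len(timestamps) < 3:
        return "non deterministic"
    gaps = sorted((later - earlier).total_seconds()
                  for earlier, later in zip(timestamps, timestamps[1:]))
    median_gap = gaps[len(gaps) // 2]
    for label, period in (("hourly", 3600), ("daily", 86400), ("monthly", 30 * 86400)):
        if abs(median_gap - period) <= tolerance * period:
            return label
    return "non deterministic"

def group_jobs(jobs):
    """jobs: iterable of dicts with 'sourceTables', 'destinationTable', 'startTime' (assumed shape)."""
    groups = defaultdict(list)
    for job in jobs:
        key = (tuple(sorted(job["sourceTables"])), job["destinationTable"])
        groups[key].append(job["startTime"])
    return {key: infer_schedule(sorted(times)) for key, times in groups.items()}
```

In this framing, a pair that keeps firing on its cadence corresponds to a live pipeline, one that has stopped firing is dead, and one with no detectable cadence is ad-hoc.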

@@ -38,7 +38,7 @@ As can be seen from the GIF, the tool will visualise all the pipelines associate
The tables that are involved in the pipelines associated with a table employs the below logic:

For every query jobs that has the table of interest as one of its source tables or destination table,
-* For every source table(s) of every query job that has the table of itnerest as one of its source table(s), recursively find query jobs that has this source table as its destination table, and get its source table(s).
+* For every source table(s) of every query job that has the table of interest as one of its source table(s), recursively find query jobs that has this source table as its destination table, and get its source table(s).
* For every destination table of every query job that has the table of interest as its destination table, recursively find query jobs that has destination table as its source table, and get its destination table.

As seen from the GIF too, for every tables that are involved in the pipeline of the table of interest, you can toggle to it, and see the details of the job schedule of every query involving this particular table. It will list down all the query jobs that has this table as its source table, and destination table. These query jobs are then grouped by whether they are ad-hoc jobs, live jobs or dead jobs. For each of this job, the counterpart destination table or source table are also noted.
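
A minimal sketch of the recursive rule described above, assuming the job history is available as records with `sourceTables` and `destinationTable` fields (field names are assumed for illustration; the tool itself does this with BigQuery routines):

```python
def upstream_tables(table, jobs, seen=None):
    """Recursively collect tables that feed into `table` through any query job."""
    seen = seen if seen is not None else set()
    for job in jobs:
        if job["destinationTable"] == table:
            for source in job["sourceTables"]:
                if source not in seen:
                    seen.add(source)
                    upstream_tables(source, jobs, seen)
    return seen

def downstream_tables(table, jobs, seen=None):
    """Recursively collect tables that `table` feeds into through any query job."""
    seen = seen if seen is not None else set()
    for job in jobs:
        if table in job["sourceTables"] and job["destinationTable"] not in seen:
            seen.add(job["destinationTable"])
            downstream_tables(job["destinationTable"], jobs, seen)
    return seen

# Tables involved in the pipelines of a table of interest t:
# upstream_tables(t, jobs) | downstream_tables(t, jobs) | {t}
```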
@@ -74,13 +74,13 @@ This directory contains images or other assets that are used in README.md

<li> <b>bq_routines/</b>

-This directory contains all the [JS UDF](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#javascript-udf-structure) functions that will be created in BigQuery upon usage of the tool. These files are not to be run independently in a JS environment, these file contents will be loaded by the Python package, `src/` to be constructed as a funciton creation query to BigQuery.
+This directory contains all the [JS UDF](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions#javascript-udf-structure) functions that will be created in BigQuery upon usage of the tool. These files are not to be run independently in a JS environment, these file contents will be loaded by the Python package, `src/` to be constructed as a function creation query to BigQuery.

For more information about each of the functions, look at this [section](#routines-creation)

<li> <b>pipeline_graph/</b>

-This directory contains the HTML file, which is a webpaget that is used to displa the pipeline visualisation of the pipeline optimisation module.
+This directory contains the HTML file, which is a webpage that is used to display the pipeline visualisation of the pipeline optimisation module.

<li> <b>src/</b>

@@ -114,7 +114,7 @@ This is the file on which environment variables are to be defined and to be load

## Prerequisites
* Your account must have access to read the audit logs - data access table that will be used as a source table for the analysis. For more details regarding different kinds of audit logs, visit this [page](https://cloud.google.com/logging/docs/audit#data-access)
-* The audit logs - data access table that wwill be used as a source table for the analysis should contain BigQuery logs version 1. For more details regarding audit logs version, visit this [page](https://cloud.google.com/bigquery/docs/reference/auditlogs)
+* The audit logs - data access table that will be used as a source table for the analysis should contain BigQuery logs version 1. For more details regarding audit logs version, visit this [page](https://cloud.google.com/bigquery/docs/reference/auditlogs)
* Your account must have access to write to the destination dataset.
* The source and destination dataset must be in the same location

@@ -141,7 +141,7 @@ The details of each of the environment variables are as follows:
<ul>
<li> Definition

-These 3 environment variables should point to the audit logs - data access table that will be the source table of the analysis. The complete path to the audit logs table sourrce will be `INPUT_PROJECT_ID.INPUT_DATASET_ID.INPUT_AUDIT_LOGS_TABLE_ID`. If you want to analyse on a table with a wildcard, include the wildcard in the INPUT_AUDIT_LOGS_TABLE_ID variable as well.
+These 3 environment variables should point to the audit logs - data access table that will be the source table of the analysis. The complete path to the audit logs table source will be `INPUT_PROJECT_ID.INPUT_DATASET_ID.INPUT_AUDIT_LOGS_TABLE_ID`. If you want to analyse on a table with a wildcard, include the wildcard in the INPUT_AUDIT_LOGS_TABLE_ID variable as well.

<li> Example values

@@ -295,7 +295,7 @@ This table stores some of the details of job history that are relevant to pipeli
"name": "sourceTables",
"type": "ARRAY<STRING>",
"mode": "NULLABLE",
"description": "The source tables of this job, in an arra of concatenated 'project.dataset.table' string format"
"description": "The source tables of this job, in an array of concatenated 'project.dataset.table' string format"
}
]
```
@@ -339,7 +339,7 @@ This table stores the information of the different pipeline IDs. Each unique pip
"name": "sourceTables",
"type": "ARRAY<STRING>",
"mode": "NULLABLE",
"description": "The source tables of this pipeline, in an arra of concatenated 'project.dataset.table' string format"
"description": "The source tables of this pipeline, in an array of concatenated 'project.dataset.table' string format"
}
]
```
@@ -400,13 +400,13 @@ This table stores all table pipeline, as destination table and as source table


### Routines Creation
-There are several JavaScript UDFs created in BigQuery upon usage of the tool. These function files are not to be run independently in a JS environment, these file contents will be loaded by the Python package, `src/` to be constructed as a funciton creation query to BigQuery.
+There are several JavaScript UDFs created in BigQuery upon usage of the tool. These function files are not to be run independently in a JS environment, these file contents will be loaded by the Python package, `src/` to be constructed as a function creation query to BigQuery.

<ul>

<li>getPipelineTypeAndSchedule

-This funtion takes in an array of timestamp, and return a struct of the pipeline type and schedule according to the history. There are 3 possible values for pipeline type: live/dead/adhoc, and there are 4 possible values for schedule: non deterministic/hourly/daily/monthly.
+This function takes in an array of timestamp, and return a struct of the pipeline type and schedule according to the history. There are 3 possible values for pipeline type: live/dead/adhoc, and there are 4 possible values for schedule: non deterministic/hourly/daily/monthly.

The routine file content is located in `bq_routines/getPipelineTypeAndSchedule.js`
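
As a rough sketch of how such a routine ends up in BigQuery, the Python package can read the JS file content and wrap it in a function-creation statement. Only the file path and function name come from this repository; the dataset name, argument name, exact signature and return type below are assumptions for illustration.

```python
from google.cloud import bigquery

def create_pipeline_schedule_udf(client: bigquery.Client, dataset: str) -> None:
    """Illustrative only: wrap the JS routine body in a BigQuery function-creation statement."""
    with open("bq_routines/getPipelineTypeAndSchedule.js") as f:
        js_body = f.read()
    ddl = f"""
    CREATE OR REPLACE FUNCTION `{dataset}`.getPipelineTypeAndSchedule(executionTimes ARRAY<TIMESTAMP>)
    RETURNS STRUCT<pipelineType STRING, schedule STRING>
    LANGUAGE js AS r'''{js_body}''';
    """
    client.query(ddl).result()  # blocks until the routine has been created

# Hypothetical usage:
# create_pipeline_schedule_udf(bigquery.Client(project="my-project"), "my-project.my_output_dataset")
```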

2 changes: 1 addition & 1 deletion examples/bigtable-change-key/README.md
@@ -87,7 +87,7 @@ $ ./scripts/copy_schema_to_new_table.sh MY_INPUT_TABLE MY_OUTPUT_TABLE

In addition to the [Dataflow command line
options](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params),
-this pipeline has three addditional required options:
+this pipeline has three additional required options:

* ``--bigtableInstance``, the name of the Bigtable instances where all the
tables are located
10 changes: 5 additions & 5 deletions examples/bq_benchmarks/README.md
@@ -21,8 +21,8 @@ number of columns, column types (such as 100% STRING vs 50% STRING/ 50% NUMERIC)
and the size of files. Once the files are created, they are loaded into BigQuery tables.

#### Benchmark Parameters
-Specifc file parameters are used in this project for performance testing. While
-the list of parameters is growing, the current list of paramters and values
+Specific file parameters are used in this project for performance testing. While
+the list of parameters is growing, the current list of parameters and values
is as follows:

**File Type**:
@@ -294,7 +294,7 @@ Do not store tables for any other purposes in this dataset.

`--benchmark_table_schemas_directory`: Optional argument for the directory where
the schemas for the staging tables are stored. It defaults to
-`json_schemas/benchmark_table_schemas`. If your schemas are elsehwere, provide
+`json_schemas/benchmark_table_schemas`. If your schemas are elsewhere, provide
that directory.

`--dataflow_staging_location`: Staging location for Dataflow on GCS. Include
@@ -439,7 +439,7 @@ python bq_benchmark.py \
--benchmark_dataset_id=<ID of the dataset holding the benchmark tables> \
--bucket_name=<name of bucket to hold files> \
--results_table_name=<Name of results table> \
---results_dataset_id=<Name dataset holding resultst table> \
+--results_dataset_id=<Name dataset holding results table> \
--duplicate_benchmark_tables \
--bq_logs_dataset=<Name of dataset hold BQ logs table>
--include_federated_query_benchmark
@@ -530,7 +530,7 @@ python bq_benchmark.py \
--benchmark_dataset_id=<ID of the dataset holding the benchmark tables> \
--bucket_name=<name of bucket to hold files> \
--results_table_name=<Name of results table> \
---results_dataset_id=<Name dataset holding resultst table> \
+--results_dataset_id=<Name dataset holding results table> \
--bq_logs_dataset=<Name of dataset hold BQ logs table>
```
2 changes: 1 addition & 1 deletion examples/cloud-composer-examples/README.md
@@ -1,5 +1,5 @@
## Cloud Composer Examples:
-This repo constains the following examples of using Cloud Composer, Google Cloud Platform's managed Apache Airflow service:
+This repo contains the following examples of using Cloud Composer, Google Cloud Platform's managed Apache Airflow service:

1. [Composer Dataflow Examples](composer_dataflow_examples/README.md)

2 changes: 1 addition & 1 deletion examples/cloudml-bank-marketing/README.md
@@ -1,6 +1,6 @@
# CloudML Marketing (Classification) Model for Banking

-The goal of this notebook is to create a classification model using CloudML as an alternative to on-premise methods. Along the way you will learn how to store data into BigQuery, fetch and explore that data, understand how to properly partition your dataset, perform feature selection, evalute multiple models at the same time and how to explain these models using [lime](https://github.com/marcotcr/lime).
+The goal of this notebook is to create a classification model using CloudML as an alternative to on-premise methods. Along the way you will learn how to store data into BigQuery, fetch and explore that data, understand how to properly partition your dataset, perform feature selection, evaluate multiple models at the same time and how to explain these models using [lime](https://github.com/marcotcr/lime).

The use case for this notebook is the [UCI Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/bank+marketing). The data is related with direct marketing campaigns of a Portuguese banking institution an The classification goal is to predict if the client will subscribe (yes/no) a term deposit. The data is related to a direct marketing campaigns of a Portuguese banking institution and the classification goal is to predict if the client will subscribe (yes/no) to a term deposit.

2 changes: 1 addition & 1 deletion examples/cloudml-churn-prediction/README.md
@@ -1,7 +1,7 @@
# Churn Prediction with Survival Analysis
This model uses Survival Analysis to classify customers into time-to-churn buckets. The model output can be used to calculate each user's churn score for different durations.

-The same methodology can be used used to predict customers' total lifetime from their "birth" (intital signup, or t = 0) and from the current state (t > 0).
+The same methodology can be used used to predict customers' total lifetime from their "birth" (initial signup, or t = 0) and from the current state (t > 0).

## Why is Survival Analysis Helpful for Churn Prediction?
Survival Analysis is used to predict the time-to-event, when the event in question has not necessarily occurred yet. In this case, the event is a customer churning.
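
As a toy illustration of the censoring idea, and not the model used in this example, a Kaplan-Meier estimator from the `lifelines` package can fit customer lifetimes where customers who have not churned yet are treated as censored observations:

```python
import pandas as pd
from lifelines import KaplanMeierFitter

# Toy data: tenure in days, and whether the churn event has been observed.
# churned == 0 means the customer is still active, i.e. a censored observation.
df = pd.DataFrame({
    "tenure_days": [30, 90, 120, 365, 400],
    "churned":     [1,  1,   1,   0,   0],
})

kmf = KaplanMeierFitter()
kmf.fit(durations=df["tenure_days"], event_observed=df["churned"])
print(kmf.survival_function_)  # estimated P(customer still active at time t)
```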
4 changes: 2 additions & 2 deletions examples/cloudml-collaborative-filtering/README.md
@@ -66,7 +66,7 @@ The steps involved are as follows:
text files. Leave users and items under a set frequency threshold out of the
vocabularies.
3. Filter away user-item pairs where either element is outside of its
-   cooresponding vocabulary.
+   corresponding vocabulary.
4. Split the data into train, validation, and test sets.
5. Write each dataset as TFRecords to GCS.
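
A simplified, local-only sketch of steps 3-5 above (the actual example runs a distributed preprocessing pipeline on GCP; the record layout and helper names here are assumptions):

```python
import random
import tensorflow as tf

def filter_split_write(pairs, user_vocab, item_vocab, out_prefix, val_frac=0.1, test_frac=0.1):
    """Drop pairs outside the vocabularies, split them, and write each split as TFRecords."""
    kept = [(u, i) for u, i in pairs if u in user_vocab and i in item_vocab]
    random.shuffle(kept)
    n_test = int(len(kept) * test_frac)
    n_val = int(len(kept) * val_frac)
    splits = {
        "test": kept[:n_test],
        "validation": kept[n_test:n_test + n_val],
        "train": kept[n_test + n_val:],
    }
    for name, split in splits.items():
        with tf.io.TFRecordWriter(f"{out_prefix}-{name}.tfrecord") as writer:
            for user, item in split:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "user": tf.train.Feature(bytes_list=tf.train.BytesList(value=[user.encode()])),
                    "item": tf.train.Feature(bytes_list=tf.train.BytesList(value=[item.encode()])),
                }))
                writer.write(example.SerializeToString())
```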

@@ -81,7 +81,7 @@ The steps involved are as follows:
## Training
A [Custom Estimator](https://www.tensorflow.org/guide/custom_estimators) is
trained using TensorFlow and [Cloud AI Platform](https://cloud.google.com/ai-platform/)(CAIP).
-The trainng steps are as follows:
+The training steps are as follows:
1. Read TFRecords from GCS and create a `tf.data.Dataset` for each of them that
yields data in batches.
2. Use the TensorFlow Transform output from preprocessing to transform usernames
2 changes: 1 addition & 1 deletion examples/cloudml-energy-price-forecasting/README.md
@@ -11,7 +11,7 @@ The code takes in raw data from BigQuery, transforms and prepares the data, uses
* `dpe-cloud-mle.Energy.MarketPricePT` - Historical hourly energy prices.
* `dpe-cloud-mle.Energy.historical_weather` - Historical hourly weather forecasts.

-Disclaimer: The data for both tables was downloaded from http://complatt.smartwatt.net/. This website hosts a closed competition meant to solve the energy price forecasting problem. The data was not collected or vetted by Google LLC and hence, we can't guarantee the veracity or qualitty of it.
+Disclaimer: The data for both tables was downloaded from http://complatt.smartwatt.net/. This website hosts a closed competition meant to solve the energy price forecasting problem. The data was not collected or vetted by Google LLC and hence, we can't guarantee the veracity or quality of it.

2. Run: ```python -m data_preparation.data_prep``` to generate training/validation/testing data as well as to generate constants needed for normalization. The produced data has the following columns:
* price - FLOAT - Energy price.
6 changes: 3 additions & 3 deletions examples/cloudml-sklearn-pipeline/README.md
@@ -14,7 +14,7 @@ this example has the following additional feature:

1. Support both Classification and Regression, which can be specified in the configuration
2. Support serving for both JSON and List of Value formats
-3. Support additional custome transformation logics besides typical pre-processing provided by scikit-learn
+3. Support additional custom transformation logics besides typical pre-processing provided by scikit-learn

Google Cloud tools used:
- [Google Cloud Platform](https://cloud.google.com/) (GCP) lets you build and
@@ -64,12 +64,12 @@ template
|__ transform_config.py # configuration for transform pipeline construction"
|__ predictor.py # define custom prediction behavior
|__ setup.py # specify necessary dependency for running job on AI Platform
-|__ requirements.txt # specify necessary dependency, helper for setup environemnt for local development
+|__ requirements.txt # specify necessary dependency, helper for setup environment for local development
```

## Using the template
### Step 0. Prerequisites
-Before you follow the instructions below to adapt the tempate to your machine learning job,
+Before you follow the instructions below to adapt the template to your machine learning job,
you need a Google cloud project if you don't have one. You can find detailed instructions
[here](https://cloud.google.com/dataproc/docs/guides/setup-project).

