This SQL script demonstrates a basic workflow for building, evaluating, and explaining a machine learning model for account fraud detection using BigQuery ML. The script uses data from an external Cloud Spanner table.
The script performs the following steps:
- Data Cleansing and Preparation: Creates a new table (
demo_dataset.account_activity_cleansed
) by querying raw account activity data from a Cloud Spanner table (account_activity_raw
) via an external connection (demo-spanner-conn
). During this process, it:- Casts data types for timestamp, transaction amount, successful login, and unusual activity columns.
- Adds a
classifier
column with random values between 0 and 1 to facilitate splitting the data into training, evaluation, and prediction sets.
- Data Split Verification: Includes a query to check the distribution of data across the training (80%), evaluation (10%), and prediction (10%) sets based on the
classifier
column. - Model Training: Creates or replaces a BigQuery ML Logistic Regression model (
demo_dataset.fraud_model
) using the cleansed data. Key options used:model_type='LOGISTIC_REG'
: Specifies the model algorithm.auto_class_weights=TRUE
: Helps handle class imbalance in the target variable (unusual_activity
).data_split_method='NO_SPLIT'
: Informs BQML that the data is already split manually.input_label_cols=['unusual_activity']
: Defines the target variable.- The training uses only the data where
classifier < 0.8
.
- Model Evaluation: Evaluates the trained model's performance using
ML.EVALUATE
on the evaluation dataset (classifier between 0.8 and 0.9
). - Prediction and Explanation: Uses
ML.EXPLAIN_PREDICT
to predict unusual activity on the prediction dataset (classifier > 0.9
) and provides feature attributions (explanations) for the predictions. The results are filtered to show only instances where the prediction differs from the actual label (misclassifications). - (Commented Out) Example Update: Includes a commented-out example
UPDATE
statement showing how to modify data in the cleansed table.
- Access to a Google Cloud project with BigQuery and Cloud Spanner APIs enabled.
- A BigQuery dataset named
demo_dataset
. - A BigQuery connection named
datasherlock.us-central1.demo-spanner-conn
configured to access a Cloud Spanner instance. - A Cloud Spanner table named
account_activity_raw
within the connected Spanner database, containing the necessary columns (transaction_id
,account_id
,timestamp
,location
,device_type
,ip_address
,transaction_amount
,transaction_type
,successful_login
,unusual_activity
).
- Ensure Prerequisites: Verify that all prerequisites listed above are met. Pay close attention to the dataset name (
demo_dataset
) and the connection name (datasherlock.us-central1.demo-spanner-conn
) used in the script; adjust them if your environment uses different names. - Execute the Script: Run the SQL commands sequentially in the BigQuery console, using the
bq
command-line tool, or through a BigQuery client library.
- A BigQuery table named
demo_dataset.account_activity_cleansed
containing the prepared data. - Query results showing the distribution of data into training, evaluation, and prediction sets.
- A BigQuery ML model named
demo_dataset.fraud_model
. - Query results from
ML.EVALUATE
showing model performance metrics (e.g., precision, recall, accuracy, f1-score, roc_auc). - Query results from
ML.EXPLAIN_PREDICT
showing predictions, actual labels, and feature attributions for misclassified instances in the prediction set.