Synthetic Primary Care Data Tutorial

This repository contains an example analysis using a fully synthetic dataset based on patients in the American Family Cohort, generated in collaboration with the American Board of Family Medicine and the Stanford University Center for Population Health Sciences (PHS).

Data citation and location: Gabriela Elise Basel, Malcolm Barrett, Sherri Rose. (2025). AFC CKD SYN (Synthetic) (Version 0.1) [Dataset]. Redivis (DOI:10.71778/V2DW-7A53).

Setting up this repository

Running code/train_classifier.py and rendering synthetic_afc_tutorial.pdf require access to the Nero Google Cloud Platform developed by PHS, Stanford University School of Medicine, and Stanford Research Computing Center. Follow these steps to set up a Nero instance on which files in this repository can be run.

Setting up a Nero instance

NOTE: These instructions are in bash and thus for Mac and Linux users. If you are a Windows user, you'll either need to adapt these instructions for PowerShell or use Windows Subsystem for Linux (WSL).

Create an instance named my-instance on the Nero project my-project, replacing my-project with a Nero project to which you already have access.

# CHANGE THIS TO THE NAME YOU WANT FOR YOUR INSTANCE
INSTANCE_NAME="my-instance"
# CHANGE THIS TO THE NAME OF YOUR NERO PROJECT
PROJECT_ID="my-project"

Additionally, set ZONE, MACHINE_TYPE, DISK_SIZE, IMAGE_NAME, and IMAGE_PROJECT, adjusting any values as needed.

ZONE="us-west1-c"

# see all machine types with:
# gcloud compute machine-types list --zones="$ZONE"
# 8 vCPUs (4 cores) and 30 GB RAM
MACHINE_TYPE="n1-standard-8" 
DISK_SIZE="200" # in GB

# Recommended as the setup script assumes this OS
IMAGE_NAME="ubuntu-2404-noble-amd64-v20241004"
IMAGE_PROJECT="ubuntu-os-cloud"

Then, run the gcloud compute instances create command below:

# Create instance with above specs
gcloud compute instances create "$INSTANCE_NAME" \
  --project="$PROJECT_ID" \
  --zone="$ZONE" \
  --machine-type="$MACHINE_TYPE" \
  --network-interface=network-tier=PREMIUM,stack-type=IPV4_ONLY \
  --maintenance-policy=MIGRATE \
  --provisioning-model=STANDARD \
  --scopes=https://www.googleapis.com/auth/devstorage.read_only,https://www.googleapis.com/auth/logging.write,https://www.googleapis.com/auth/monitoring.write,https://www.googleapis.com/auth/service.management.readonly,https://www.googleapis.com/auth/servicecontrol,https://www.googleapis.com/auth/trace.append,https://www.googleapis.com/auth/bigquery,https://www.googleapis.com/auth/cloud-platform \
  --tags=ssh \
  --create-disk=auto-delete=yes,boot=yes,device-name="$INSTANCE_NAME",image="$IMAGE_NAME",image-project="$IMAGE_PROJECT",mode=rw,size="$DISK_SIZE",type=pd-balanced \
  --no-shielded-secure-boot \
  --shielded-vtpm \
  --shielded-integrity-monitoring \
  --labels=goog-ec-src=vm_add-gcloud \
  --reservation-affinity=any
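
If you want to confirm the instance has finished provisioning before connecting, you can check its status (a small sketch; RUNNING means the VM is up, though sshd may need a few more seconds):

# Prints PROVISIONING, STAGING, or RUNNING
gcloud compute instances describe "$INSTANCE_NAME" \
  --zone="$ZONE" \
  --project="$PROJECT_ID" \
  --format="get(status)"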

Note that it may take a moment for the server to initialize before you can connect. Once it's up, connect to the server via SSH with:

gcloud compute ssh --zone "$ZONE" "$INSTANCE_NAME" --project "$PROJECT_ID"

When you've successfully SSH'd into the server, run the installation script:

curl -fsSL https://github.com/StanfordHPDS/gcp_setup_script/releases/download/v1.0.3/setup.sh | bash

This process will take several minutes to run.
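
If you'd rather inspect the script before executing it (a reasonable precaution when piping curl into bash), download it first:

# Download, review, then run the setup script
curl -fsSL -o setup.sh https://github.com/StanfordHPDS/gcp_setup_script/releases/download/v1.0.3/setup.sh
less setup.sh
bash setup.sh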

After the script has completed, the server will reboot to finish updating the Linux kernel and to refresh the paths for the newly installed software. The reboot will disconnect your SSH session.

It will take a few moments for the server to reboot.

Log back in with the port for VS Code open:

gcloud compute ssh --zone "$ZONE" "$INSTANCE_NAME" --project "$PROJECT_ID" \
  -- -L 8080:localhost:8080
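
Once you're reconnected, you can sanity-check that the tools used in the rest of this README are on your PATH (this assumes the setup script installs git, gh, uv, and quarto; adjust if yours differs):

# Each installed tool prints its path; a missing one prints nothing
command -v git gh uv quarto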

Cloning this repository to your Nero instance

After you've finished setting up your Nero instance, authenticate to GitHub while logged into your instance:

gh auth login
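
gh will walk you through authentication interactively. If you'd rather pre-answer its prompts, it also accepts flags (a sketch; choose the git protocol you actually use):

# Authenticate to github.com over HTTPS using the browser flow
gh auth login --hostname github.com --git-protocol https --web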

And tell git who you are:

git config --global user.name "Jane Doe"
git config --global user.email "jane@example.com"

Clone this repository:

git clone https://github.com/StanfordHPDS/synthetic_afc_tutorial.git

Navigate to the cloned repository and install the project dependencies with uv. See pyproject.toml for the dependencies that will be installed.

uv sync

Access your Nero instance with VS Code (http://localhost:8080/)

Optionally, you can access your Nero instance through VS Code in the browser. When you open http://localhost:8080/, you'll see a startup message that tells you where the credential file is. You can view the password with:

cat /path/to/the/file/code-server/config.yaml

Make sure to replace the path with the path in the startup message.
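
On a default code-server installation, this file typically lives at ~/.config/code-server/config.yaml (an assumption; the path printed in the startup message is authoritative):

cat ~/.config/code-server/config.yaml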

Navigating this repository

├── README.md
├── code
│   ├── params.py
│   └── train_classifier.py
├── pyproject.toml
├── references.bib
├── synthetic_afc_tutorial.pdf
├── synthetic_afc_tutorial.qmd
└── uv.lock

The code directory contains scripts for performing the analysis described in synthetic_afc_tutorial.pdf. Variables in code/params.py can be changed to select different classifier covariates or a different target variable. Executing code/train_classifier.py trains the classifier on the synthetic data and requires access to the synthetic dataset on Nero Google Cloud Platform. The location of the synthetic dataset on Nero can be updated in code/params.py. Run code/train_classifier.py using the following command:

uv run code/train_classifier.py

Package dependencies can be found in pyproject.toml, with additional information in uv.lock.

synthetic_afc_tutorial.pdf provides a description of the American Family Cohort, the synthetic dataset, and an example analysis of the synthetic dataset. Regenerating this file also requires access to the synthetic dataset on Nero Google Cloud Platform.

Regenerate synthetic_afc_tutorial.pdf with the following command:

uv run quarto render synthetic_afc_tutorial.qmd
