1 change: 1 addition & 0 deletions README.md
@@ -6,6 +6,7 @@

- [flux-in-slurm](tutorial/flux-in-slurm): Bring up a Flux instance (in user-space) in a Slurm Allocation - both in Kubernetes ([video](https://youtu.be/8ZkSLV0m7To?si=WqWKCe2jvRuTXvlJ))
- [Flux on AWS](tutorial/aws): Deploy an entire Flux Framework cluster to "bare metal" instances on AWS with (essentially) two `make` commands - one to build with packer, and one to deploy with Terraform ([video](https://youtu.be/LJh-ab6fAqE?si=dIzScA530N7lXs_7))
- [Flux on Azure](tutorial/azure): Deploy Flux Framework on Azure with InfiniBand
- [HPCIC Tutorial 2024](https://youtu.be/Dt4CSZWSEJE?si=b2O7lQrJixcKh-EJ)

## What is this?
3 changes: 3 additions & 0 deletions tutorial/.gitignore
@@ -0,0 +1,3 @@
.terraform.lock.hcl
.env
.terraform
26 changes: 26 additions & 0 deletions tutorial/azure/Makefile
@@ -0,0 +1,26 @@
.PHONY: all
all: init fmt validate apply

.PHONY: init
init:
	terraform init

.PHONY: fmt
fmt:
	terraform fmt

.PHONY: validate
validate:
	terraform validate

.PHONY: apply
apply:
	terraform apply

.PHONY: apply-approved
apply-approved:
	terraform apply --auto-approve

.PHONY: destroy
destroy:
	terraform destroy
121 changes: 121 additions & 0 deletions tutorial/azure/README.md
@@ -0,0 +1,121 @@
# Flux on Azure

## Usage

### 1. Build Images

Note that you should [build](build) the images first. Follow the instructions in the README there.

### 2. Deploy Terraform

Check the [start-script.sh](start-script.sh) and the variables at the top of [main.tf](main.tf). You'll need to export the full image identifier to the environment:

```bash
export TF_VAR_vm_image_storage_reference=/subscriptions/xxxxxxx/resourceGroups/xxxxx/providers/Microsoft.Compute/images/flux-framework
```
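The identifier follows a fixed pattern, so it can be assembled from its parts. This is a minimal sketch, assuming a hypothetical helper named `image_id` (not part of this repository) and the image name `flux-framework` from the example above:

```bash
# Hypothetical helper: build the managed image identifier from its parts.
# Arguments: subscription ID, resource group, image name.
image_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/images/%s\n' "$1" "$2" "$3"
}

# Example usage (placeholder values, substitute your own):
# export TF_VAR_vm_image_storage_reference="$(image_id "<subscription-id>" "<resource-group>" flux-framework)"
```

The resource group here is the one the image was built into, which may differ from the one Terraform deploys to.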

Note that I needed to clone this repository and run it from the Cloud Shell in the Azure portal.

```bash
git clone https://github.com/converged-computing/flux-tutorials
cd flux-tutorials/tutorial/azure
```

and then:

```bash
make
```

The shell can be buggy - if it seems like it's hanging, it's that terraform is waiting for you to enter "yes." You can type it (despite not seeing it) and press enter and it works every time... 50% of the time. :) I added a command to the Makefile to get around this:

```bash
make apply-approved
```

You can also run each command separately:

```bash
# Terraform init
make init

# Terraform validate
make validate

# Create
make apply

# Destroy
make destroy
```

When it's done, save the public and private key to local files:

```bash
# An explicit "." filter makes jq print the raw string on every jq version
terraform output -json public_key | jq -r . > id_azure.pub
terraform output -json private_key | jq -r . > id_azure
chmod 600 id_azure*
```
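As an alternative that avoids `jq`, recent Terraform versions support `terraform output -raw`, which prints a string output without JSON quoting. A sketch, assuming the outputs are named `public_key` and `private_key` as above; `save_keys` is a hypothetical helper name:

```bash
# Hypothetical helper: write both keys to the current directory.
# The body runs in a subshell so the restrictive umask does not leak
# into your interactive session.
save_keys() (
  umask 077
  terraform output -raw public_key > id_azure.pub &&
  terraform output -raw private_key > id_azure &&
  chmod 600 id_azure id_azure.pub
)
```

Run `save_keys` from the directory that holds the Terraform state (here, `tutorial/azure`).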

Then get the instance IP addresses from the command line (or portal), and ssh in!

```bash
ip_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[0].ipAddress')
ssh -i ./id_azure azureuser@${ip_address}
```

To get a different instance, just change the index (e.g., index 1 is the second instance):

```bash
follower_address=$(az vmss list-instance-public-ips -g terraform-testing -n flux | jq -r '.[1].ipAddress')
ssh -i ./id_azure azureuser@${follower_address}
```
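To avoid repeating the lookup for every index, all addresses can be listed in one call. This sketch uses the Azure CLI's global `--query` (JMESPath) and `--output tsv` options instead of `jq`, and assumes the same resource group (`terraform-testing`) and scale set name (`flux`) as above; `print_ssh_commands` is a hypothetical helper name:

```bash
# Hypothetical helper: print one ssh command per instance in the scale set.
# Optional arguments: resource group, scale set name.
print_ssh_commands() {
  az vmss list-instance-public-ips -g "${1:-terraform-testing}" -n "${2:-flux}" \
      --query "[].ipAddress" --output tsv |
  {
    index=0
    while read -r ip; do
      # Skip blank lines (instances without a public address)
      [ -n "$ip" ] || continue
      echo "instance ${index}: ssh -i ./id_azure azureuser@${ip}"
      index=$((index + 1))
    done
  }
}
```

Calling `print_ssh_commands` with no arguments uses the defaults from this tutorial.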

### 3. Checks

Check the cluster status, the overlay status, and try running a job:

```bash
$ flux resource list
$ flux overlay status
```
```bash
$ flux run -N 2 hostname
```

### 4. Cleanup

This should work (but see [debugging](#debugging)).

```bash
make destroy
```

If that fails, you can delete the resource group from the console or from the command line:

```bash
az group delete --name terraform-testing
```

Note that this current build does not have flux-pmix, which might lead to issues with MPI. It's an issue of the VM base being compiled with a libpmix.so that has a different ABI than what flux is expecting. I will be looking into it.

### Debugging

Depending on your environment, terraform (e.g., `make` or `make destroy`) doesn't always work. I get this error from the Azure Cloud Shell:

```console
terraform destroy
random_pet.id: Refreshing state... [id=usable-grouper]
random_string.fqdn: Refreshing state... [id=lhppiw]
│ Error: building account: could not acquire access token to parse claims: running Azure CLI: exit status 1: ERROR: Failed to connect to MSI. Please make sure MSI is configured correctly.
│ Get Token request returned: <Response [400]>
│ with provider["registry.terraform.io/hashicorp/azurerm"],
│ on main.tf line 28, in provider "azurerm":
│ 28: provider "azurerm" {
make: *** [Makefile:22: destroy] Error 1
```

If I open a new cloud shell, it seems to magically go away. You can also fall back to the `az` tool (which does seem to work) or issue commands by clicking directly in the portal.
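
The error message says the provider failed while running the Azure CLI to get a token, so one way to catch it early is to ask the CLI for a token first. This is only a sketch of that idea, not something the Makefile does; `terraform_if_authenticated` is a hypothetical name:

```bash
# Hypothetical wrapper: only run terraform when the Azure CLI can fetch a token.
terraform_if_authenticated() {
  if az account get-access-token --output none >/dev/null 2>&1; then
    terraform "$@"
  else
    echo "Azure CLI could not acquire a token; try a new Cloud Shell or 'az login'." >&2
    return 1
  fi
}
```

For example, `terraform_if_authenticated destroy` would stop with a short message instead of the provider error above.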
18 changes: 18 additions & 0 deletions tutorial/azure/build/Makefile
@@ -0,0 +1,18 @@
.PHONY: all
all: init fmt validate build

.PHONY: init
init:
	packer init .

.PHONY: fmt
fmt:
	packer fmt .

.PHONY: validate
validate:
	packer validate .

.PHONY: build
build:
	packer build flux-build.pkr.hcl
38 changes: 38 additions & 0 deletions tutorial/azure/build/README.md
@@ -0,0 +1,38 @@
# Build Packer Images

Note that I needed to run this build from a cloud shell, so clone the repository and change into the build directory:

```bash
git clone https://github.com/converged-computing/flux-tutorials
cd flux-tutorials/tutorial/azure/build
```

And install packer:

```bash
wget https://releases.hashicorp.com/packer/1.11.2/packer_1.11.2_linux_amd64.zip
unzip packer_1.11.2_linux_amd64.zip
mkdir -p ./bin
mv ./packer ./bin/
export PATH=$(pwd)/bin:$PATH
```

Get your account information for Azure as follows:

```bash
az account show
```

And export variables in the following format. Note that the resource group needs to actually exist - I created mine in the console UI.

```bash
export AZURE_SUBSCRIPTION_ID=xxxxxxxxx
export AZURE_TENANT_ID=xxxxxxxxxxx
export AZURE_RESOURCE_GROUP_NAME=packer-testing
```
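
Since the build depends on all three variables, a quick preflight check can save a failed run. A minimal sketch, assuming only the variable names shown above; `check_build_env` is a hypothetical helper, not part of the repository:

```bash
# Hypothetical preflight check: report any required variable that is unset or empty.
check_build_env() {
  missing=0
  for name in AZURE_SUBSCRIPTION_ID AZURE_TENANT_ID AZURE_RESOURCE_GROUP_NAME; do
    # Indirect lookup of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_build_env && make` only starts the build when everything is set. It does not verify that the resource group exists.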

Then build!

```bash
make
```