update diagnosis scripts to support AKS RP #231

Merged
merged 1 commit into from
Jun 18, 2021
60 changes: 37 additions & 23 deletions diagnosis/README.md
@@ -1,12 +1,20 @@
# Troubleshooting AKS Engine on Azure Stack
# Troubleshooting AKS Cluster Issues on Azure Stack

This short [guide](https://github.com/Azure/aks-engine/blob/master/docs/howto/troubleshooting.md) from Azure's AKS Engine team has a good high level explanation of how AKS Engine interacts with the Azure Resource Manager (ARM) and lists common reasons that can cause AKS Engine commands to fail. That guide applies to Azure Stack as well as it ships with its own ARM instance. If you are facing a problem that is not part of this guide, then you will need extra information to figure out the root cause.
## Introduction
To troubleshoot some AKS cluster issues, you may need to collect logs directly from the cluster nodes. Without this script, you would typically need to connect to each node in the cluster, locate the logs, and download them manually.

Typically, to collect logs from servers you manage, you have to start a remote session using SSH and browse for relevant log files. The scripts in this directory are aim to simplify the collection of relevant logs from your Kubernetes cluster. Just download/unzip the latest [release](https://github.com/msazurestackworkloads/azurestack-gallery/releases/tag/diagnosis-v0.1.2) and execute script `getkuberneteslogs.sh`.
The scripts in this directory aim to simplify the collection of relevant logs from your Kubernetes cluster. The script will automatically create a snapshot of the cluster, and connect to each node to collect logs. In addition, the script can, optionally, upload the collected logs to a storage account.

> Before you execute `getkuberneteslogs.sh`, make sure that you can login to your Azure Stack instance using `Azure CLI`. Follow this [article](https://docs.microsoft.com/azure-stack/user/azure-stack-version-profiles-azurecli2) to learn how to configure Azure CLI to manage your Azure Stack cloud.
This tool is mainly designed for the Microsoft support team to collect comprehensive cluster logs. For self-diagnosis, see the [`az aks kollect`](https://docs.microsoft.com/en-us/cli/azure/aks?view=azure-cli-latest#az_aks_kollect) command and the [aks-periscope](https://github.com/Azure/aks-periscope) application.

The logs retrieved by `getkuberneteslogs.sh` are the following:
## Requirements
- A machine that has access to your Kubernetes cluster, or the same machine you used to deploy your cluster. On a Windows machine, install [Git Bash](https://gitforwindows.org/) in order to run bash scripts.
- `Azure CLI` installed on the machine where the script will run. Make sure that you can log in to your Azure Stack environment using `Azure CLI` from that machine. Follow this [article](https://docs.microsoft.com/azure-stack/user/azure-stack-version-profiles-azurecli2) to learn how to install and configure Azure CLI to manage your Azure Stack cloud.
- Switch to the subscription where the Kubernetes cluster is deployed by using `az account set --subscription <Subscription ID>`.
- Download the latest [release](https://github.com/msazurestackworkloads/azurestack-gallery/releases) of the script to your machine and extract the scripts.
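
The requirements above can be checked before running the script. The following is a minimal sketch, assuming a POSIX-compatible shell; the function names `require_command`, `require_identity_file`, and `preflight` are hypothetical and are not part of `getkuberneteslogs.sh`.

```bash
# Hypothetical pre-flight helpers; not part of getkuberneteslogs.sh.

# Succeeds if the named command is available on PATH.
require_command() {
    command -v "$1" >/dev/null 2>&1
}

# Succeeds if the given identity file exists and is readable.
require_identity_file() {
    [ -r "$1" ]
}

# Prints one line per missing prerequisite and returns non-zero
# if anything is missing.
preflight() {
    identity_file=$1
    missing=0
    if ! require_command az; then
        echo "az (Azure CLI) not found on PATH"
        missing=1
    fi
    if ! require_identity_file "$identity_file"; then
        echo "identity file not readable: $identity_file"
        missing=1
    fi
    return "$missing"
}
```

For example, `preflight ~/.ssh/id_rsa` could be run first, and the log collection started only if it succeeds. The sketch does not verify that you are logged in or that the right subscription is selected.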

## Logs
This script automates the process of gathering the following logs:

- Log files in directory `/var/log/azure/`
- Log files in directory `/var/log/kubeaudit` (kube audit logs)
@@ -18,8 +26,10 @@ The logs retrieved by `getkuberneteslogs.sh` are the following:
- kubelet status and journal
- etcd status and journal
- docker status and journal
- containerd status and journal
- kube-system snapshot
- Azure CNI config files
- kubelet config files

Some additional logs are retrieved for Windows nodes:

@@ -31,21 +41,25 @@ Some additional logs are retrieved for Windows nodes:
- ETW events for Hyper-V
- Azure CNI config files

## Required Parameters

`-u, --user` - The administrator username for the cluster VMs

`-i, --identity-file` - RSA private key tied to the public key used to create the Kubernetes cluster (usually named 'id_rsa')

`-g, --resource-group` - Kubernetes cluster resource group

## Optional Parameters

`--disable-host-key-checking` - Sets SSH's `StrictHostKeyChecking` option to `no` while the script executes. Only use in a safe environment.

`--upload-logs` - Persists retrieved logs in an Azure Stack storage account. Logs can be found in `KubernetesLogs` resource group.

`--api-model` - Persists apimodel.json file in an Azure Stack Storage account.
Upload apimodel.json file to storage account happens when `--upload-logs` parameter is also provided.

`-h, --help` - Print script usage
## Parameters
| Parameter | Description | Required | Example |
|-----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------------------------------------------------|
| -h, --help | Print command usage. | no | |
| -u, --user | The administrator username for the cluster VMs. | yes | azureuser (default value) |
| -i, --identity-file | RSA private key tied to the public key used to create the Kubernetes cluster (sometimes named 'id_rsa'). | yes | /rsa.pem (Putty)<br>~/.ssh/id_rsa (SSH) |
| -g, --resource-group | Kubernetes cluster resource group. For clusters created by the AKS Service, the managed resource group name follows the pattern 'MC_RESOURCEGROUP_CLUSTERNAME_LOCATION'. | yes | k8sresourcegroup<br>MC_AKSRP_k8scluster1_redmond |
| -n, --user-namespace | Collect logs from containers in the specified namespaces. If not specified, logs from ALL namespaces are collected. | no | monitoring |
| --upload-logs | Persists retrieved logs in an Azure Stack Hub storage account. Logs can be found in KubernetesLogs resource group. | no | |
| --api-model | Persists the apimodel.json file in an Azure Stack Hub storage account. The file is uploaded only when the --upload-logs parameter is also provided. | no | ./apimodel.json |
| --disable-host-key-checking | Sets SSH's StrictHostKeyChecking option to "no" while the script executes. Only use in a safe environment. | no | |
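
The managed resource group pattern described for `-g, --resource-group` can be illustrated with a small helper. This is a sketch only: the function name `managed_resource_group` is hypothetical, and the sample values are taken from the example column of the table.

```bash
# Hypothetical helper: builds a managed resource group name from the
# pattern MC_<resource group>_<cluster name>_<location>.
managed_resource_group() {
    printf 'MC_%s_%s_%s\n' "$1" "$2" "$3"
}

managed_resource_group AKSRP k8scluster1 redmond
# prints: MC_AKSRP_k8scluster1_redmond
```

The resulting name is the value to pass to `-g` for a cluster created by the AKS Service.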

## Examples
```bash
az account set --subscription <Subscription ID>
# cd to the directory that contains the scripts.
./getkuberneteslogs.sh -u azureuser -i private.key.1.pem -g k8s-rg
./getkuberneteslogs.sh -u azureuser -i ~/.ssh/id_rsa -g k8s-rg --disable-host-key-checking
./getkuberneteslogs.sh -u azureuser -i ~/.ssh/id_rsa -g k8s-rg -n default -n monitoring
./getkuberneteslogs.sh -u azureuser -i ~/.ssh/id_rsa -g k8s-rg --upload-logs --api-model clusterDefinition.json
./getkuberneteslogs.sh -u azureuser -i ~/.ssh/id_rsa -g k8s-rg --upload-logs
```
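
The examples above pass `-n` more than once. The sketch below shows one way repeated `-n` flags could be accumulated into a list. It is an assumption for illustration, not the parser that `getkuberneteslogs.sh` actually uses, and the names `parse_namespaces` and `NAMESPACES` are hypothetical.

```bash
# Hypothetical sketch; the real parser in getkuberneteslogs.sh may differ.
# Collects every -n/--user-namespace value into NAMESPACES (space-separated).
parse_namespaces() {
    NAMESPACES=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            -n|--user-namespace)
                # The flag needs a value; fail if it is missing.
                [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
                NAMESPACES="${NAMESPACES:+$NAMESPACES }$2"
                shift 2
                ;;
            *)
                # All other arguments are ignored in this sketch.
                shift
                ;;
        esac
    done
}

parse_namespaces -u azureuser -g k8s-rg -n default -n monitoring
echo "$NAMESPACES"
# prints: default monitoring
```

An empty `NAMESPACES` after parsing would correspond to the documented default of collecting logs from all namespaces.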
9 changes: 6 additions & 3 deletions diagnosis/azs-collect-windows-logs.ps1
@@ -1,6 +1,6 @@
$ProgressPreference = "SilentlyContinue"

$lockedFiles = "kubelet.err.log", "kubelet.log", "kubeproxy.log", "kubeproxy.err.log", "azure-vnet-telemetry.log", "azure-vnet.log", "network-interfaces.json", "interfaces.json"
$lockedFiles = "kubelet.err.log", "kubelet.log", "kubeproxy.log", "kubeproxy.err.log", "azure-vnet-telemetry.log", "azure-vnet.log", "network-interfaces.json", "interfaces.json", "azure-vnet-ipam.log", "windowsnodereset.log", "csi-proxy.log", "csi-proxy.err.log"

$timeStamp = get-date -format 'yyyyMMdd-hhmmss'
$zipName = "win_log_$env:computername.zip"
@@ -56,7 +56,10 @@ if (-not (Test-Path 'c:\k\debug\collectlogs.ps1')) {
& 'c:\k\debug\collectlogs.ps1' | write-Host
$netLogs = Get-ChildItem (Get-ChildItem -Path c:\k\debug -Directory | Sort-Object LastWriteTime -Descending | Select-Object -First 1).FullName | Select-Object -ExpandProperty FullName
$paths += $netLogs
$paths += "c:\AzureData\CustomDataSetupScript.log"
$setupLog = "c:\AzureData\CustomDataSetupScript.log"
if (Test-Path $setupLog) {
$paths += $setupLog
}

Write-Host "Collecting containerd hyperv logs"
if ((Test-Path "$Env:ProgramFiles\containerd\diag.ps1") -And (Test-Path "$Env:ProgramFiles\containerd\ContainerPlatform.wprp")) {
@@ -75,5 +78,5 @@ else {
Write-Host "Compressing all logs to $zipName"
$paths | Format-Table FullName, Length -AutoSize
Compress-Archive -LiteralPath $paths -DestinationPath $zipName
Remove-Item -Path $paths
Remove-Item -Path $paths -ErrorAction SilentlyContinue
Get-ChildItem $zipName # this puts a FileInfo on the pipeline so that another script can get it on the pipeline