In this workshop you'll cover using a Process and various Platform components to create a Big Data Cluster for SQL Server solution using Kubernetes that you can deploy on premises, in the cloud, or in a hybrid architecture. In each module you'll get more references, which you should follow up on to learn more. Also watch for links within the text - click on each one to explore that topic. There's a lot here - so focus on understanding the overall system first, then come back and explore each section.
This module covers Container technologies and how they differ from Virtual Machines. Then you'll learn about the need for container orchestration using Kubernetes. We'll start with some computer architecture basics, and introduce the Linux operating system in case you're new to that topic. There are many references and links throughout the module, and a further set of references at the bottom of this module.
You may now start the Virtual Machine you created in the prerequisites. You will run all commands from that environment.
In this section you will learn more about the design of the primary operating system (Linux) used with a Kubernetes Cluster.
NOTE: This is not meant to be a comprehensive discussion of the merits of an operating system or its ecosystem. The goal is to understand the salient features in each architecture as they pertain to processing large sets of data.
When working with large-scale, distributed data processing systems, there are two primary concepts to keep in mind: you should move as much processing to the data location as possible, and the storage system should be abstracted so that the code does not have to determine which node holds its data. Both Windows and Linux have specific attributes to keep in mind as you select hardware, drivers, and configurations for your systems.
The general rules for storage in a distributed data processing environment are that the disks should be as fast as possible, that you should optimize for the best number of nodes that will store and process the data, and that the abstraction system (in the case of BDC, this includes HDFS and relational storage) is at the latest version and configured with optimized settings.
In general, Linux treats almost everything as a file or a process. For general concepts for Linux storage, see the references at the bottom of this module.
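To see the "everything is a file" model in action, you can read hardware and system information with the same tools you use for ordinary files. A minimal sketch, assuming a standard Linux shell (device names such as sda vary by system):

```bash
# Processor details are exposed as a readable "file" under /proc
head -n 5 /proc/cpuinfo

# Block devices appear as special files under /dev
# (device names such as sda vary by system)
ls -l /dev/sd* 2>/dev/null

# Mounted file systems are visible the same way
head -n 5 /proc/mounts
```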
Both Windows and Linux (in the x86 architecture) are Symmetric Multiprocessing systems, which means that all processors are addressed as a single unit. In general, distributed processing systems should have more (and more powerful) processors at the "head" node. General Purpose (GP) processors are normally used in these systems, but for certain uses such as Deep Learning or Artificial Intelligence, purpose-built processors such as Graphics Processing Units (GPUs) are used within the environment, and in some cases, Advanced RISC Machine (ARM) chips can be brought into play for specific tasks. For the purposes of this workshop, you will not need these latter two technologies, although both Windows and Linux support them.
NOTE: Linux can be installed on chipsets other than x86 compatible systems. In this workshop, you will focus on x86 systems.
Storage Nodes in a distributed processing system need a nominal amount of memory, and the processing nodes in the framework (Spark) included with SQL Server big data clusters need much more. Both Linux and Windows support large amounts of memory (most often as much as the system can hold) natively, and no special configuration for memory is needed.
Modern operating systems use a temporary area on storage that acts as memory, both for light caching and to extend RAM. This swapping should be avoided as much as possible in both Windows and Linux.
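On Linux, a quick way to check whether a system is dipping into this temporary area (called swap) is shown below. This is a sketch using standard utilities; the vm.swappiness value shown is a common starting point, not a universal recommendation:

```bash
# Show memory and swap usage in human-readable form
free -h

# List active swap devices and how much of each is in use
swapon --show

# Inspect how aggressively the kernel swaps (0-100; lower = less swapping)
cat /proc/sys/vm/swappiness

# Reduce swapping on a data-processing node (requires root; lasts until reboot)
sudo sysctl vm.swappiness=10
```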
While technically a processor feature, NUMA (Non-Uniform Memory Access) allows different memory configurations to be mixed in a single server, grouped by processor set. Both Linux and Windows support NUMA access.
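On Linux you can inspect the NUMA layout with the numactl utility, a sketch assuming that package is installed (it is not part of a minimal install), and where ./my_data_process is a placeholder for your own binary:

```bash
# Show NUMA nodes, the CPUs assigned to each, and per-node memory
numactl --hardware

# Run a process bound to NUMA node 0's CPUs and memory
# (./my_data_process is a hypothetical placeholder)
numactl --cpunodebind=0 --membind=0 ./my_data_process
```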
The general guidance for large-scale distributed processing systems is that the network between them be as fast and collision-free (in the case of TCP/IP) as possible. The networking stacks in both Windows and Linux require no special settings within the operating system, but you should thoroughly understand the driver settings for each NIC installed in your system, ensure that each setting is optimized for your networking hardware and topology, and confirm that the drivers are at the latest tested versions.
Although a full discussion of the TCP/IP protocol is beyond the scope of this workshop, it's important that you have a good understanding of how it works, since it permeates every part of a Kubernetes architecture. You can get a quick overview of TCP/IP here.
The other rule you should keep in mind is that, in general, only the presentation of the data should be handled by the workstation or device that accesses the solution. Do not move large amounts of data back and forth over the network to process them locally. That being said, there are times (such as certain IoT scenarios) where subsets of data should be moved to the client, but you will not face those scenarios in this workshop.
The best way to learn an operating system is to install it and perform real-world tasks. (A good place to learn a lot more about Linux is here). For this workshop, the essential concepts you need from the SQL Server perspective are:
Linux Concept | Description |
---|---|
Distributions | Unlike Windows, which is written and controlled by Microsoft, Linux consists of only a small Kernel, and all other parts of the operating system are created by commercial entities or the public and packaged into a Distribution. These Distributions have all of the complementary functions to the operating system, and in some cases a graphical interface and other files. The Distributions supported by SQL Server are Red Hat, Ubuntu, and SUSE. |
Package Managers | Software installation on Linux can be done manually by copying files or compiling source code. A Package Manager is a tool that simplifies this process, and is based on the Distribution. The two package managers you will see most often in SQL Server are yum and apt (see the example after this table). |
File Systems | Like Windows, organized as a tree, but referenced by a forward-slash /. There are no drive letters in Linux - everything is "mounted" to what looks like a directory. |
Access and Authentication | Users and Groups are stored in protected files, called /etc/passwd and /etc/group. These files and locations may be augmented or slightly different based on the distribution. By default, each user has very low privileges and must be granted access to files or directories. The sudo command allows you to run as a privileged user (known as root) or as another user. |
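As an example, installing the same package looks slightly different under each package manager. A sketch using curl as the package, assuming an Ubuntu system for apt and a Red Hat system for yum:

```bash
# Ubuntu / Debian family (apt)
sudo apt-get update
sudo apt-get install -y curl

# Red Hat family (yum)
sudo yum install -y curl
```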
The essential commands you should know for this workshop are listed below, with links to learn more. In Linux you can often send the results of one command to another using the "pipe" symbol, similar to PowerShell: |. You'll find an example after the table.
Linux Command | Description |
---|---|
man | Shows help on a command or concept. You can also add --help to most commands for a quick syntax display. Similar to HELP in Windows. |
cat | Display File Contents. Similar to TYPE in Windows. |
cd | Changes Directory. Same as in Windows. |
chgrp | Change file group access. |
chown | Change the file owner. Similar to ICACLS /setowner in Windows. |
cp | Copy source file into destination. Similar to COPY in Windows. |
df | Shows free space on mounted devices. Similar to dir \| find "bytes free" in Windows. |
file | Determine file type. |
find | Find files. Similar to DIR filename /S in Windows. |
grep | Search files for regular expressions. Similar to FIND in Windows. |
head | Display the first few lines of a file. |
ln | Create a soft link (symbolic link) to a file or directory. Similar to MKLINK in Windows. |
ls | List directory contents and file information. Similar to DIR in Windows. |
mkdir | Create a new directory dirname. Same as in Windows. |
more | Display data in paginated form. Same as in Windows. An improved version of this command is less. |
mount | Makes a drive, network location, and many other objects available to the operating system so that you can work with it. |
mv | Move (rename) an oldname to newname. Similar to MOVE or REN in Windows. |
nano | Create and edit text files. |
pwd | Print current working directory. |
rm | Remove (Delete) filename. Similar to DEL in Windows. |
rmdir | Delete an existing directory provided it is empty. Same as in Windows. |
sudo | Elevate commands that follow to sysadmin privileges. |
tail | Prints last few lines in a file. |
touch | Update access and modification time of a file. Similar to ECHO > test.txt in Windows. |
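Here is a short sketch that chains several of the commands above with pipes; the file names and paths are illustrative (for example, /var/log/syslog varies by distribution):

```bash
# Show only the mounted file systems that live on physical devices
df -h | grep "^/dev"

# List the five most recently modified files in the current directory
ls -lt | head -n 5

# Search a log file for errors and page through the results
# (/var/log/syslog is illustrative; the path varies by distribution)
cat /var/log/syslog | grep -i "error" | less
```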
A longer explanation of system administration for Linux is here.
Activity: Work with Linux Commands
Steps
Open this link to run a Linux Emulator in a browser
Find the mounted file systems, and then show the free space in them.
Show the files in the current directory.
Create a new directory, navigate to it, and create a file called test.txt with the words This is a test in it. (hint: use the nano editor or the echo command)
Display the contents of that file.
Show the help for the cat command.
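If you get stuck, the following is one possible solution to the steps above (the directory name is illustrative, and your emulator may not include every command):

```bash
# Find mounted file systems and their free space, then list files
df -h
ls -la

# Create a directory, navigate to it, and create test.txt
mkdir workshop
cd workshop
echo "This is a test" > test.txt

# Display the contents of the file
cat test.txt

# Show the help for the cat command
man cat
```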
Bare-metal installations of an operating system such as Windows are deployed on hardware using a Kernel, along with additional software (such as drivers) that exposes all of the hardware to applications through a set of calls.
One abstraction layer above installing software directly on hardware is using a Hypervisor. In essence, this layer uses the base operating system to emulate hardware. You install an operating system (called a Guest OS) on the Hypervisor (called the Host), and the Guest OS acts as if it is on bare-metal.
In this abstraction level, you have full control (and responsibility) for the entire operating system, but not the hardware. This isolates all process space and provides an entire "Virtual Machine" to applications. For scale-out systems, a Virtual Machine allows for a distribution and control of complete computer environments using only software.
You can read the details of Microsoft's Hypervisor here, and the VMWare approach is detailed here. Another popular Hypervisor is VirtualBox, which you can read more about here.
The next level of abstraction is a Container. There are various types of Container technologies; in this workshop, you will focus on the docker container format, as implemented in the Moby framework.
A Container is provided by a Container Runtime (such as containerd), which sits on the operating system (Windows or Linux). In this abstraction, you do not control the hardware or the operating system. In general, you create a manifest (in the case of Docker, a "dockerfile"), which is a text file describing the binaries, code, files, or other assets you want to run in a Container. You then "build" that file into a binary object called an "Image", which you can then store in a "Repository". Now you can use the runtime commands to download (pull) that Image from the Repository and run it on your system. This running, protected memory space is now a "Container".
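A minimal sketch of that workflow, assuming Docker is installed; the image name myrepo/myimage and the Dockerfile contents are illustrative, and the push step assumes you are logged in to a Repository:

```bash
# 1. Create a manifest (a "dockerfile") describing the Image
cat <<'EOF' > Dockerfile
FROM ubuntu:20.04
RUN apt-get update && apt-get install -y curl
CMD ["bash"]
EOF

# 2. Build the manifest into an Image and tag it
docker build -t myrepo/myimage:1.0 .

# 3. (Optional) push the Image to a Repository you are logged in to
docker push myrepo/myimage:1.0

# 4. Pull the Image and run it - the running instance is a Container
docker pull myrepo/myimage:1.0
docker run -it myrepo/myimage:1.0
```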
NOTE: The Container service runs natively on Linux, but in Windows it often runs in a Virtual Machine environment running Linux as a VM. So in Windows you may see several layers of abstraction for Containers to run.
This abstraction holds everything for an application, isolating it from other running processes. It is also completely portable - you can create an Image on one system, and another system can run it so long as a Container Runtime (such as Docker) is installed. Containers also start very quickly and are easy to create (a process called Composing) using a simple text file with instructions of what to install on the Image. The instructions pull the base Image, and then any binaries you want to install. Several pre-built Containers are already available; SQL Server is one of these. You can read more about installing SQL Server on Container Runtimes (such as Docker) here.
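For instance, a sketch of pulling and starting the pre-built SQL Server Container (the tag and the password shown are illustrative; the EULA and SA password environment variables are required by the image):

```bash
# Pull the SQL Server 2019 container image from the Microsoft Container Registry
docker pull mcr.microsoft.com/mssql/server:2019-latest

# Run it, accepting the EULA and setting an SA password (use a strong one)
docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" \
   -p 1433:1433 --name sql1 \
   -d mcr.microsoft.com/mssql/server:2019-latest

# Verify the Container is running
docker ps
```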
You can have several Containers running at any one time, based on the hardware resources of the system where you run them. For scale-out systems, a Container allows for distribution and control of complete applications using only declarative commands.
You can read more about Container Runtimes (Such as Docker) here.
For Big Data systems, having lots of Containers is very advantageous to segment purpose and performance profiles. However, dealing with many Container Images, allowing persisted storage, and interconnecting them for network and internetwork communications is a complex task - this is the job of a Container Orchestration tool. One such tool is Kubernetes, an open-source Container orchestrator that can scale Container deployments according to need. The following table defines some important Container Orchestration (such as Kubernetes or OpenShift) terminology:
Component | Used for |
Cluster | A Container Orchestration (such as Kubernetes or OpenShift) cluster is a set of machines, known as Nodes. One node controls the cluster and is designated the master node; the remaining nodes are worker nodes. The Container Orchestration *master* is responsible for distributing work between the workers, and for monitoring the health of the cluster. |
Node | A Node runs containerized applications. It can be either a physical machine or a virtual machine. A Cluster can contain a mixture of physical machine and virtual machines as Nodes. |
Pod | A Pod is the atomic deployment unit of a Cluster. A pod is a logical group of one or more Containers and associated resources needed to run an application. Each Pod runs on a Node; a Node can run one or more Pods. The Cluster master automatically assigns Pods to Nodes in the Cluster. |
You can learn much more about Container Orchestration systems here. We're using the Azure Kubernetes Service (AKS) as a backup in this workshop, and they have a great set of tutorials for you to learn more here.
In SQL Server Big Data Clusters, the Container Orchestration system (such as Kubernetes or OpenShift) is responsible for the state of the BDC: it builds and configures the Nodes, assigns Pods to Nodes, creates and manages the Persistent Volumes (durable storage), and manages the operation of the Cluster.
NOTE: The OpenShift Container Platform is a commercially supported Platform as a Service (PaaS) based on Kubernetes from RedHat. Many shops require a commercial vendor to implement and support Kubernetes.
(You'll cover the storage aspects of Container Orchestration in more detail in a moment.)
All of this orchestration is controlled by another set of text files, called "Manifests", which you will learn more about in the Modules that follow. You declare the state of the layout, networking, storage, redundancy and more in these files, load them to the Kubernetes environment, and it is responsible for instantiating and maintaining a Cluster with that configuration.
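To make this concrete, here is a minimal sketch of such a manifest, applied with kubectl; the names and the nginx image are illustrative placeholders, and real BDC manifests are generated for you and are far more involved:

```bash
# Declare a Deployment of three nginx Pods and hand it to Kubernetes,
# which is then responsible for keeping three replicas running
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      containers:
      - name: web
        image: nginx:1.21
        ports:
        - containerPort: 80
EOF

# Confirm the declared state was instantiated
kubectl get deployments
kubectl get pods -l app=example
```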
Activity: Familiarize Yourself with Container Orchestration using minikube
To practice with Kubernetes, you will use an online emulator to work with the minikube platform. This is a small Virtual Machine with a single Node acting as a full Cluster.
Steps
Open this resource, and complete the first module. (You can return to it later to complete all exercises if you wish)
Traditional storage uses a call from the operating system to an underlying I/O system, as you learned earlier. These file systems are either directly connected to the operating system or appear to be connected directly using a Storage Area Network. The blocks of data are stored and managed by the operating system.
For large scale-out data systems, the mounting point for an I/O is another abstraction. For SQL Server BDC, the most commonly used scale-out file system is the Hadoop Distributed File System, or HDFS. HDFS is a set of Java code that gathers disparate disk subsystems into a Cluster which is comprised of various Nodes - a NameNode, which manages the cluster's metadata, and DataNodes that physically store the data. Files and directories are represented on the NameNode by a structure called inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas.
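Interacting with HDFS looks much like working with a local file system, but through the hdfs client. A sketch assuming an HDFS client is configured against a cluster; the paths and file names are illustrative:

```bash
# List the root of the distributed file system
hdfs dfs -ls /

# Create a directory and copy a local file into HDFS
hdfs dfs -mkdir -p /data/sales
hdfs dfs -put ./sales.csv /data/sales/

# Read the file back and check how much space the directory uses
hdfs dfs -cat /data/sales/sales.csv | head -n 5
hdfs dfs -du -h /data
```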
With an abstraction such as Containers, storage becomes an issue for two reasons: The storage can disappear when the Container is removed, and other Containers and technologies can't access storage easily within a Container.
To solve this, Container Runtimes (such as Docker) implemented the concept of Volumes, and Container Orchestration systems extended this concept. Using a specific protocol and command, the Container Orchestration system (and specifically, SQL Server BDC) mounts the storage as a Persistent Volume and uses a construct called a Persistent Volume Claim to access it. A Container Volume is a mounted directory which is accessible to the Containers in a Pod within the Node.
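A minimal sketch of a Persistent Volume Claim, again expressed as a declarative manifest; the name, size, and storage class are illustrative, and BDC deployments declare these for you:

```bash
# Request 10Gi of durable storage; Kubernetes binds this claim
# to a matching Persistent Volume, which a Pod can then mount
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: default
EOF

# Check whether the claim has been bound to a volume
kubectl get pvc example-pvc
```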
You'll cover Volumes in more depth in a future module as you learn how the SQL Server BDC takes advantage of these constructs.
You can read more about HDFS here.
There are three primary tools and utilities you will use to control the SQL Server big data cluster:
- kubectl
- azdata (Not covered in this course)
- Azure Data Studio (Used for the final Module)
The kubectl command accesses the Application Programming Interfaces (APIs) from Kubernetes (other clustering systems may have different command-line tools). The utility can be installed on your workstation using this process, and it is also available in the Azure Cloud Shell with no installation.
A full list of the kubectl commands is here. You can use these commands for troubleshooting the SQL Server BDC as well.
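A few of the kubectl commands you will reach for most often when troubleshooting; the namespace and Pod names below are illustrative, since a BDC deploys into a namespace you name at creation time:

```bash
# List the Nodes in the cluster and their status
kubectl get nodes

# List the Pods in a namespace (replace mssql-cluster with your BDC namespace)
kubectl get pods -n mssql-cluster

# Show detailed state and recent events for a Pod (pod name is illustrative)
kubectl describe pod master-0 -n mssql-cluster

# Stream the logs from that Pod
kubectl logs master-0 -n mssql-cluster --follow
```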
You'll explore further operations with these tools in the rest of the course.
Azure Data Studio is a cross-platform database tool to manage and program on-premises and cloud data platforms on Windows, MacOS, and Linux. It is extensible, and one of these extensions is how you will work with Jupyter Notebooks. It is built on the Visual Studio Code shell. The editor in Azure Data Studio has Intellisense, code snippets, source control integration, and an integrated terminal.
If you have not completed the prerequisites for this workshop you can install Azure Data Studio from this location, and you will install the Extension to work with SQL Server big data clusters in a future module.
You can learn more about Azure Data Studio here.
You'll explore further operations with the Azure Data Studio in the final module of this course.
Activity: Practice with Notebooks
Steps
Open this reference, and review the instructions you see there. You can clone this Notebook to work with it later.
Activity: Azure Data Studio Notebooks Overview
Steps
Open this reference, and read the tutorial - you do not have to follow the steps, but you can if time permits.
- Official Documentation for this section - Wide World Importers Data Dictionary and company description
- Understanding the Big Data Landscape
- Linux for the Windows Admin
- Container Runtimes (Such as Docker) Guide
- Video introduction to Kubernetes
- Complete course on the Azure Kubernetes Service (AKS)
- Working with Spark
- Full tutorial on Jupyter Notebooks
Next, Continue to 02 - Hardware and Virtualization environment for Kubernetes.