
Commit 079c7dc

Merge branch 'master' into 4-1_Pipeline

2 parents b77858f + 1bc7c52

20 files changed: +646 / -40 lines

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
# Setup Kubernetes

Kubernetes will be used as the cluster for this tutorial. It acts as a managing layer for the deployments. To follow this tutorial, ensure that you have a Kubernetes cluster up and running.

Depending on your situation you can use a local cluster (running on a single node) or a cloud cluster.

## Local Setup

A local cluster can be set up with the `minikube` application. It is a cheap and easy option for testing deployments.

> Be aware that your operating system and hardware need to support virtualization.

To install `minikube` follow the up-to-date instructions (available for Linux / macOS / Windows) provided [here](https://kubernetes.io/docs/tasks/tools/install-minikube/). Specific instructions for Arch Linux can be obtained [here](http://blog.programmableproduction.com/2018/03/08/Archlinux-Setup-Minikube-using-KVM/).

After installing, start the cluster with `minikube start`; this will also configure your local `kubectl` for it. The final output should look similar to this:

```text
😄 minikube v1.6.2 on Arch 18.1.5
✨ Selecting 'kvm2' driver from user configuration (alternates: [none])
🔥 Creating kvm2 VM (CPUs=2, Memory=2000MB, Disk=20000MB) ...
🐳 Preparing Kubernetes v1.17.0 on Docker '19.03.5' ...
💾 Downloading kubeadm v1.17.0
💾 Downloading kubelet v1.17.0
🚜 Pulling images ...
🚀 Launching Kubernetes ...
⌛ Waiting for cluster to come online ...
🏄 Done! kubectl is now configured to use "minikube"
```

### Dashboard

To verify that the cluster started, open the dashboard like this:

```bash
$ minikube dashboard
```

If the browser window does not open, you can obtain the URL via `minikube dashboard --url`.

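As an additional, optional check (not part of the original instructions), you can ask `kubectl` directly whether it can reach the new cluster:

```bash
# The single minikube node should be listed with STATUS "Ready".
kubectl get nodes
```
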
### Config file

If you need to refer to the configuration of your minikube instance, you will find it in the `~/.kube/config` file.

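If you would rather not open the file directly, the same information can be inspected through `kubectl` (an optional addition, assuming the default setup created by minikube):

```bash
# Show which context (cluster) kubectl currently talks to.
kubectl config current-context

# Print only the kubeconfig entries for that active context.
kubectl config view --minify
```
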
## Cloud Setup

Instructions for the cloud setup can be found [here](https://github.com/clc3-CloudComputing/clc3-ws19/tree/master/3%20Kubernetes/exercise%203.1).

TODO: Write down what parameters (URL, port) we need from this and where we get them (screenshot)
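Once `kubectl` is configured for the cloud cluster, one possible way to look up the cluster URL and port mentioned in the TODO above (an assumption on our part, not something the linked exercise prescribes) is:

```bash
# Prints the address (host and port) of the Kubernetes control plane
# for the cluster kubectl is currently configured against.
kubectl cluster-info
```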

setup/02 - virtual environments/README.md renamed to 01 - Setup/02 - virtual environments/README.md

Lines changed: 2 additions & 1 deletion
@@ -10,6 +10,7 @@ $ conda env create -f environment.yml
 
 or install the required packages with: `conda install numpy pandas luigi scikit-learn`
 
+Additionally you need to install the `pykube` and `azure-storage` packages via pip by executing `pip install pykube azure-storage`.
 ## Virtualenv
 
 For a pip environment you can use the `requirements.txt` in this directory to install the dependencies:
@@ -18,4 +19,4 @@ For a pip environment you can use the `requirements.txt` in this directory to in
 pip install -r requirements.txt
 ```
 
-or install the required packages with: `pip install numpy pandas luigi scikit-learn`
+or install the required packages with: `pip install numpy pandas luigi scikit-learn pykube azure-storage`
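
A quick, optional way to confirm that the extra packages are importable in the active environment (this check is not part of the original README):

```bash
# Both imports should succeed silently if the installation worked.
python -c "import pykube, azure.storage.blob; print('ok')"
```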
Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
azure-common==1.1.24
azure-nspkg==3.0.2
azure-storage==0.36.0
certifi==2019.11.28
cffi==1.13.2
chardet==3.0.4
cryptography==2.8
docutils==0.16
httplib2==0.17.0
idna==2.8
joblib==0.14.1
lockfile==0.12.2
luigi==2.8.11
numpy==1.18.1
oauth2client==4.1.3
oauthlib==3.1.0
pandas==1.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.19
pykube==0.15.0
python-daemon==2.2.4
python-dateutil==2.8.1
pytz==2019.3
PyYAML==5.3
requests==2.22.0
requests-oauthlib==1.3.0
rsa==4.0
scikit-learn==0.22.1
scipy==1.4.1
six==1.14.0
tornado==5.1.1
tzlocal==2.0.0
urllib3==1.25.8
File renamed without changes.
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# Move data into the 'cloud'

This module will go into some detail on how and why we have to move data into the cloud. It will also explain some of the more intricate parts of the source code in this directory.

## Why

Currently our sample solution only works with local data, meaning it loads local files and writes its output to the local hard drive. In order to run our solution in a cluster we need to provide the same source files from a place the cluster can reach. In addition to that we need a place to store our results.

Because our cluster should be runnable everywhere we will set up some storage in the cloud and use a GitHub Gist to serve our source CSV files. In production you will probably want to serve your source files from a local HDFS cluster or something similar.

## Reading source files

For the luigi example project we have two source files in CSV format. To make it easy for us we will simply serve them from a GitHub Gist, which is publicly available and free to use. The source files can be downloaded [here](https://gist.github.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f). If you click on `RAW` (read: show the raw file) you obtain a link that is reachable by anyone.

- [CSV 1](https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV)
- [CSV 2](https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file2.CSV)

Fortunately for us, pandas supports reading CSV files directly from an http(s) source. This is done by passing the URL to the `pd.read_csv` function:

```python
data_in: DataFrame = pd.read_csv(self.gist_input_url, sep=";")
```

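If you want to try this outside of the luigi task, a minimal standalone version (using CSV 1 from the gist above; the variable names are ours, not part of the pipeline code) looks roughly like this:

```python
import pandas as pd

# Read the first sample file straight from the gist's raw URL.
url = (
    "https://gist.githubusercontent.com/falknerdominik/"
    "425d72f02bd58cb5d42c3ddc328f505f/raw/"
    "4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV"
)
data_in = pd.read_csv(url, sep=";")
print(data_in.head())
```
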
## Saving the result

Because Kubernetes decides on which node the code is executed, the program cannot make any assumptions about the location or the underlying filesystem structure. Instead, the results are saved to a storage that is reachable from the cluster. There are multiple storage providers (Google Cloud Storage, Dropbox, AWS S3, ...), but for the purposes of this how-to Azure Blob Storage is used (this can be replaced by any option you like - or whichever one you still have credits left for).

### Creating the Storage [Azure]

Before creating a blob storage, you have to create a storage account. You can do this by simply searching for `storage account` and selecting `create`. Then fill out the form as shown below:

![Creating an Azure Storage Account](assets/blob_storage.png)

The storage account determines which plan you use (how much you pay); the actual data will reside in one of several storage types. After that, select your new account and search for `containers`. Add a new one as shown in the screenshot below:

![Creating a Storage Container](assets/storage_container.png)

Selecting the container will show you what's inside of it. A container like this can be used as a simple blob storage. The pipeline will write all results here. For now, take note of the container name (the one you filled in the form above).

To access the storage the pipeline needs a connection string. It contains every detail on how to connect securely to your storage (account name, account key). It can be found inside your storage account in the `Access keys` section; take note of it.

> Be careful with connection strings. Do not commit them to your Git repository, they grant total access (depending on
> your policies) to your storage!

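If you prefer the command line over the portal, the same resources can be created with the Azure CLI. The sketch below assumes an existing resource group and uses placeholder names (`my-resource-group`, `mystorageaccount`, `results`) that are not part of this tutorial:

```bash
# Create the storage account (the pricing tier is chosen via --sku).
az storage account create \
  --name mystorageaccount \
  --resource-group my-resource-group \
  --location westeurope \
  --sku Standard_LRS

# Print the connection string the pipeline needs.
az storage account show-connection-string \
  --name mystorageaccount \
  --resource-group my-resource-group

# Create the blob container the pipeline will write to.
az storage container create \
  --name results \
  --account-name mystorageaccount
```
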
### Saving results with luigi

With the connection string and container name in hand you are well prepared to save the data into your blob storage. In `simple_workflow.py` add these two strings:

```python
azure_connection_string = '<Insert-Connection-String>'
container_name = '<Insert-Container-Name>'
```

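To keep the secret out of the repository (see the warning above), one alternative sketch is to read the connection string from an environment variable instead; the variable name `AZURE_STORAGE_CONNECTION_STRING` is just an example, not something the tutorial defines:

```python
import os

# Read the secret from the environment instead of hard-coding it.
azure_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
container_name = '<Insert-Container-Name>'
```
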
> Conveniently, luigi supports Azure Blob Storage out-of-the-box via the `luigi.contrib.azureblob.AzureBlobTarget` class. The
> connection is provided using the `luigi.contrib.azureblob.AzureBlobClient` class.

After that, execute the pipeline and the results get saved to your storage. The output should look similar to this:

```text
DEBUG: Checking if PreprocessAllFiles() is complete
DEBUG: Checking if Preprocess(gist_input_url=https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV, connection_string=<CONNECTION_STRING>, filename=test_file1.CSV) is complete
DEBUG: Checking if Preprocess(gist_input_url=https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file2.CSV, connection_string=<CONNECTION_STRING>, filename=test_file2.CSV) is complete
INFO: Informed scheduler that task PreprocessAllFiles__99914b932b has status PENDING
INFO: Informed scheduler that task Preprocess_DefaultEndpoints_test_file2_CSV_https___gist_git_6ab2dd2a85 has status PENDING
INFO: Informed scheduler that task Preprocess_DefaultEndpoints_test_file1_CSV_https___gist_git_62fd631f6d has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
...
DEBUG:luigi-interface:There are no more tasks to run at this time
INFO: Worker Worker(salt=542199482, workers=1, host=andromeda, username=dfalkner, pid=11595) was stopped. Shutting down Keep-Alive thread
INFO:luigi-interface:Worker Worker(salt=542199482, workers=1, host=andromeda, username=dfalkner, pid=11595) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 3 ran successfully:
    - 2 Preprocess(...)
    - 1 PreprocessAllFiles()

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====
```
2 binary image files added (72.9 KB and 23.4 KB).
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
"""
This module includes an example preprocessing step.
"""

from pandas import DataFrame


def drop_nan_columns(data: DataFrame) -> DataFrame:
    """
    Drop all columns that are missing more than 20 percent of their values.

    :param data: Input DataFrame which should be preprocessed.
    :return: DataFrame where columns missing more than 20 percent of their values are deleted.
    """
    # dropna keeps a column only if it has at least `thresh` non-missing values,
    # i.e. at least 80 percent of the rows must contain a value for that column.
    data = data.dropna(axis=1, thresh=(len(data) * 80) / 100)
    return data


def drop_duplicates(data: DataFrame) -> DataFrame:
    """
    Drop duplicated rows.

    :param data: Input DataFrame which should be preprocessed.
    :return: DataFrame where duplicated rows are removed.
    """
    data = data.drop_duplicates()
    return data
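
A minimal way to try these helpers on toy data (this snippet is an illustration and assumes `preprocess.py` is importable from the current directory; it is not part of the repository):

```python
import numpy as np
import pandas as pd

from preprocess import drop_nan_columns, drop_duplicates

# Toy frame: column "c" is entirely missing and the last row duplicates the first.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 1],
    "b": ["x", "y", "z", "w", "x"],
    "c": [np.nan] * 5,
})

cleaned = drop_duplicates(drop_nan_columns(df))
print(cleaned)  # column "c" and the duplicated last row are gone
```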
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
"""Preprocessing example to show how luigi works (only one preprocessing step will be executed!)."""
from typing import Generator

import luigi
import pandas as pd
from luigi.contrib.azureblob import AzureBlobTarget, AzureBlobClient
from pandas import DataFrame

from preprocess import drop_nan_columns


class Preprocess(luigi.Task):
    """
    Applies general preprocessing steps to a single CSV file.
    """
    gist_input_url: str = luigi.Parameter()
    connection_string: str = luigi.Parameter()
    filename: str = luigi.Parameter()
    container_name: str = luigi.Parameter()

    def run(self):
        # read data from the gist url
        data_in: DataFrame = pd.read_csv(self.gist_input_url, sep=";")
        data_preprocessed = drop_nan_columns(data_in)

        # write the preprocessed contents to the azure blob file
        with self.output().open("w") as output_file:
            data_preprocessed.to_csv(output_file)

    def output(self) -> luigi.Target:
        # save the output in the azure blob storage
        # noinspection PyTypeChecker
        return AzureBlobTarget(
            container=self.container_name,
            blob=self.filename,
            client=AzureBlobClient(connection_string=self.connection_string),
        )


class PreprocessAllFiles(luigi.WrapperTask):
    """
    Applies the defined preprocessing steps to all files in the gist.
    """
    # gist where the CSV files are stored
    gist_url = 'https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/'
    # connection string obtained for the storage account via azure
    azure_connection_string = '<Insert-Connection-String>'
    container_name = '<Insert-Container-Name>'

    def requires(self) -> Generator[luigi.Task, None, None]:
        for filename in ['test_file1.CSV', 'test_file2.CSV']:
            yield Preprocess(
                gist_input_url=f'{self.gist_url}{filename}',
                filename=filename,
                connection_string=self.azure_connection_string,
                container_name=self.container_name,
            )


if __name__ == "__main__":
    luigi.build([PreprocessAllFiles()], local_scheduler=True)
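
Since the module already calls `luigi.build` with the local scheduler, one way to run the whole pipeline (assuming the connection-string and container-name placeholders have been filled in and the dependencies from the requirements file are installed) is simply:

```bash
# Runs PreprocessAllFiles and its two Preprocess sub-tasks without a central luigid.
python simple_workflow.py
```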
