# Move data into the 'cloud'

This module explains how and why we move data into the cloud. It also walks through some of the more
intricate parts of the source code in this directory.

## Why

Currently our sample solution only works with local data, meaning it loads local files and writes
the results to the local hard drive. In order to run our solution in a cluster, we need to provide the
same source files from a place the cluster can reach. In addition, we need a place to store our results.

Because our cluster should be runnable anywhere, we will set up some storage in the cloud and use a
GitHub Gist to serve our source CSV files. In production you will probably want to serve your source
files from a local HDFS cluster or something similar.

## Reading source files

For the luigi example project we have two source files in CSV format. To make things easy, we simply
paste them into a GitHub Gist, which is publicly available and free to use. The source files can be
downloaded [here](https://gist.github.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f).
If you click on `RAW` (read: show the raw file) you obtain a link that is reachable by anyone.

- [CSV 1](https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV)
- [CSV 2](https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file2.CSV)

Fortunately for us, pandas supports reading CSV files directly from an http(s) source. This is done
by passing the URL to the `pd.read_csv` function:

```python
import pandas as pd
from pandas import DataFrame

# `self.gist_input_url` is the raw Gist URL of one of the source files
data_in: DataFrame = pd.read_csv(self.gist_input_url, sep=";")
```
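The same call works against any file-like object, which makes it easy to try the parsing locally. A
quick self-contained sketch with made-up data (the real Gist files' column names may differ):

```python
import io

import pandas as pd

# A tiny stand-in for one of the Gist files; note the ';' separator.
raw = "name;value\nalpha;1\nbeta;2\n"

# pd.read_csv accepts URLs, paths, and file-like objects alike.
data_in = pd.read_csv(io.StringIO(raw), sep=";")
print(data_in.shape)  # (2, 2)
```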

## Saving the result

Because Kubernetes decides on which node the code is executed, the program cannot make any
assumptions about the location or the underlying filesystem structure. Instead, the results are
saved to a storage that is reachable from the cluster. There are multiple storage providers
(Google Cloud Storage, Dropbox, AWS S3, ...), but for the purposes of this how-to Azure Blob Storage
is used (this can be replaced by any option you like, or whichever one you still have credits for).

### Creating the Storage [Azure]

Before creating a blob storage, you have to create a storage account. You can do this by simply
searching for `storage account` and selecting `create`. Then fill out the form as shown below:

 

The storage account decides which plan you use (how much you pay); the actual data will reside in
one of many storage types. After that, select your new account and search for `containers`. Add a
new one as shown in the screenshot below:

 

Selecting the container shows you what is inside of it. A container like this can be used as a
simple blob storage. The pipeline will write all results here. For now, take note of the container
name (the one you filled in the form above).

To access the storage, the pipeline needs a connection string. It contains every detail on how to
connect securely to your storage (account name, account key). It can be found inside your storage
account in the `Access keys` section; take a note of it.
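The connection string itself is just a `;`-separated list of `key=value` pairs. A small sketch with
a fabricated example value (real account keys are long base64 strings) shows how the account name
and key could be pulled out:

```python
# Fabricated example value, not a real key.
conn = (
    "DefaultEndpointsProtocol=https;"
    "AccountName=mystorageaccount;"
    "AccountKey=abc123==;"
    "EndpointSuffix=core.windows.net"
)

# Split into key=value pairs; the AccountKey may contain '=' padding,
# so split each pair only on the first '='.
parts = dict(p.split("=", 1) for p in conn.split(";") if p)
print(parts["AccountName"])  # mystorageaccount
print(parts["AccountKey"])   # abc123==
```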

> Be careful with connection strings. Do not commit them to your Git repository; they grant total
> access (depending on your policies) to your storage!

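One way to keep the secret out of the repository entirely is to read it from the environment. A
minimal sketch (the variable names `AZURE_CONNECTION_STRING` and `AZURE_CONTAINER_NAME` are
assumptions for illustration, not anything luigi requires):

```python
import os

# Hypothetical variable names; export them in your shell or CI before
# starting the pipeline instead of hardcoding the secret in source.
azure_connection_string = os.environ.get("AZURE_CONNECTION_STRING", "")
container_name = os.environ.get("AZURE_CONTAINER_NAME", "luigi-results")
```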
### Saving results with luigi

With the connection string and container name in hand, you are well prepared to save the data into
your blob storage. In `simple_workflow.py`, add these two strings:

```python
azure_connection_string = '<Insert-Connection-String>'
container_name = '<Insert-Container-Name>'
```

> Conveniently, luigi supports Azure Blob Storage out-of-the-box using the
> `luigi.contrib.azureblob.AzureBlobStorage` class. The connection is provided using the
> `luigi.contrib.azureblob.AzureBlobClient` class.

After that, execute the pipeline and the results are saved to your storage. The output should look
similar to this:

```text
DEBUG: Checking if PreprocessAllFiles() is complete
DEBUG: Checking if Preprocess(gist_input_url=https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV, connection_string=<CONNECTION_STRING>, filename=test_file1.CSV) is complete
DEBUG: Checking if Preprocess(gist_input_url=https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file2.CSV, connection_string=<CONNECTION_STRING>, filename=test_file2.CSV) is complete
INFO: Informed scheduler that task PreprocessAllFiles__99914b932b has status PENDING
INFO: Informed scheduler that task Preprocess_DefaultEndpoints_test_file2_CSV_https___gist_git_6ab2dd2a85 has status PENDING
INFO: Informed scheduler that task Preprocess_DefaultEndpoints_test_file1_CSV_https___gist_git_62fd631f6d has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
...
DEBUG:luigi-interface:There are no more tasks to run at this time
INFO: Worker Worker(salt=542199482, workers=1, host=andromeda, username=dfalkner, pid=11595) was stopped. Shutting down Keep-Alive thread
INFO:luigi-interface:Worker Worker(salt=542199482, workers=1, host=andromeda, username=dfalkner, pid=11595) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 3 ran successfully:
    - 2 Preprocess(...)
    - 1 PreprocessAllFiles()

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====
```