This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Commit 2f4613c

Add README.md for 'Move data into the cloud' section
1 parent 4dbc469 commit 2f4613c

File tree

5 files changed: +108 −11 lines changed

03 - Luigi and Kubernetes/01 - Move data into the cloud/README.md

Lines changed: 101 additions & 7 deletions
# Move data into the 'cloud'

This module goes into some detail on how and why we have to move data into the cloud. It also explains some of the more
intricate parts of the source code in this directory.
## Why
Currently our sample solution only works with local data, meaning it loads local files and writes its output
to the local hard drive. In order to run our solution in a cluster we need to provide the same source files
from a place the cluster can reach. In addition, we need a place to store our results.
Because our cluster should be runnable anywhere, we will set up some storage in the cloud and use a GitHub Gist to serve
our source CSV files. In production you will probably want to serve your source files from a local HDFS cluster or something
similar.
15+
16+
## Reading source files
17+
For the luigi example project we have two source files in CSV format. To make it easy for us, we simply paste
them into a GitHub Gist, which is publicly available and free to use. The source files can be downloaded [here](https://gist.github.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f).
If you click on `Raw` (read: show the raw file) you obtain a link that is reachable by anyone.
- [CSV 1](https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV)
- [CSV 2](https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file2.CSV)
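Both raw links share the same Gist revision and only differ in the trailing filename, which is also how `simple_workflow.py` assembles them. A quick sketch of that construction:

```python
# base raw URL of the Gist revision (the same one used in simple_workflow.py)
gist_url = ('https://gist.githubusercontent.com/falknerdominik/'
            '425d72f02bd58cb5d42c3ddc328f505f/raw/'
            '4ad926e347d01f45496ded5292af9a5a5d67c850/')

# each raw link is just the base URL plus the filename
urls = [f'{gist_url}{name}' for name in ['test_file1.CSV', 'test_file2.CSV']]
print(urls[0])
```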
Fortunately for us, pandas supports reading CSV files directly from an http(s) source. This is done by passing the URL to the
`pd.read_csv` function.
```python
data_in: DataFrame = pd.read_csv(self.gist_input_url, sep=";")
```
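`read_csv` treats URLs and file-like objects the same way, so the call can be sanity-checked offline. A minimal sketch with an in-memory stand-in for the Gist file (the column names here are made up for illustration):

```python
import io

import pandas as pd
from pandas import DataFrame

# a tiny semicolon-separated sample standing in for the Gist CSV
csv_text = "id;value\n1;10\n2;20\n"

# sep=";" must match the separator used in the source files
data_in: DataFrame = pd.read_csv(io.StringIO(csv_text), sep=";")
print(data_in.shape)  # (2, 2)
```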
## Saving the result
Because kubernetes decides on which node the code is executed, the program cannot make any assumptions about its location or the underlying
filesystem structure. Therefore, the results are saved to storage that is reachable from the cluster. There are
multiple storage providers (Google Cloud Storage, Dropbox, AWS S3, ...), but for the purposes of this how-to Azure Blob Storage
is used (this can be replaced by any option you like, or whichever one you still have credits for).
### Creating the Storage [Azure]
Before creating a blob storage, you have to create a storage account. You can do this by simply searching for `storage account` and selecting
`create`. Then fill out the form as shown below:
![Creating an Azure Storage Account](assets/blob_storage.png)
The storage account determines which plan you use (how much you
pay); the actual data resides in one of several storage types. After that, select your new account and search for `containers`.
Add a new one as shown in the screenshot below:
![Creating a Storage Container](assets/storage_container.png)
Selecting the container shows you what's inside of it. A container like this can be used as a simple blob storage;
the pipeline will write all results here. For now, take note of the container name (the one you filled into the form above).
To access the storage, the pipeline needs a connection string. It contains every detail on how to connect securely to your storage (account name, account key).
It can be found inside your storage account in the `Access keys` section; take note of it.
> Be careful with connection strings. Do not commit them to your Git repository; they grant total access (depending on
> your policies) to your storage!
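One common way to honor this warning is to read the connection string from the environment instead of hard-coding it. A small sketch; the variable names below are only a suggestion, not something luigi requires:

```python
import os

# read secrets from the environment so they never end up in the repository;
# fall back to the documentation placeholders when the variables are unset
azure_connection_string = os.environ.get(
    'AZURE_STORAGE_CONNECTION_STRING', '<Insert-Connection-String>')
container_name = os.environ.get(
    'AZURE_STORAGE_CONTAINER', '<Insert-Container-Name>')
```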
### Saving results with luigi
With the connection string and container name in hand you are well prepared to save the data into your blob storage.
In `simple_workflow.py`, add these two strings:
```python
azure_connection_string = '<Insert-Connection-String>'
container_name = '<Insert-Container-Name>'
```
> Conveniently, luigi supports Azure Blob Storage out of the box via the `luigi.contrib.azureblob.AzureBlobTarget` class. The
> connection is provided using the `luigi.contrib.azureblob.AzureBlobClient` class.
After that, execute the pipeline and the results are saved to your storage. The output should look similar to this:
```text
DEBUG: Checking if PreprocessAllFiles() is complete
DEBUG: Checking if Preprocess(gist_input_url=https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file1.CSV, connection_string=<CONNECTION_STRING>, filename=test_file1.CSV) is complete
DEBUG: Checking if Preprocess(gist_input_url=https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/test_file2.CSV, connection_string=<CONNECTION_STRING>, filename=test_file2.CSV) is complete
INFO: Informed scheduler that task PreprocessAllFiles__99914b932b has status PENDING
INFO: Informed scheduler that task Preprocess_DefaultEndpoints_test_file2_CSV_https___gist_git_6ab2dd2a85 has status PENDING
INFO: Informed scheduler that task Preprocess_DefaultEndpoints_test_file1_CSV_https___gist_git_62fd631f6d has status PENDING
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 3
...
DEBUG:luigi-interface:There are no more tasks to run at this time
INFO: Worker Worker(salt=542199482, workers=1, host=andromeda, username=dfalkner, pid=11595) was stopped. Shutting down Keep-Alive thread
INFO:luigi-interface:Worker Worker(salt=542199482, workers=1, host=andromeda, username=dfalkner, pid=11595) was stopped. Shutting down Keep-Alive thread
INFO:
===== Luigi Execution Summary =====

Scheduled 3 tasks of which:
* 3 ran successfully:
    - 2 Preprocess(...)
    - 1 PreprocessAllFiles()

This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====
```
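A side note on the ids in the log above: luigi builds a task id from the task family, its parameter values, and a truncated MD5 hash of the JSON-serialized parameters. `PreprocessAllFiles` takes no parameters, so its hash is simply that of `{}`. A sketch of the hashing step, assuming luigi's usual task-id scheme:

```python
import hashlib

# for a task without parameters the serialized parameter dict is just '{}';
# luigi keeps the first 10 hex characters of its MD5 hash
param_hash = hashlib.md5('{}'.encode('utf-8')).hexdigest()[:10]
print(param_hash)  # 99914b932b, matching PreprocessAllFiles__99914b932b above
```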


03 - Luigi and Kubernetes/01 - Move data into the cloud/simple_workflow.py

Lines changed: 5 additions & 2 deletions
```diff
@@ -16,6 +16,7 @@ class Preprocess(luigi.Task):
     gist_input_url: str = luigi.Parameter()
     connection_string: str = luigi.Parameter()
     filename: str = luigi.Parameter()
+    container_name: str = luigi.Parameter()

     def run(self):
         # read data from url
@@ -30,7 +31,7 @@ def output(self) -> luigi.Target:
         # save the output in the azure blob storage
         # noinspection PyTypeChecker
         return AzureBlobTarget(
-            container=r'clcstoragecontainer',
+            container=self.container_name,
             blob=self.filename,
             client=AzureBlobClient(
                 connection_string=self.connection_string)
@@ -44,14 +45,16 @@ class PreprocessAllFiles(luigi.WrapperTask):
     # gist where the CSV files are stored
     gist_url = 'https://gist.githubusercontent.com/falknerdominik/425d72f02bd58cb5d42c3ddc328f505f/raw/4ad926e347d01f45496ded5292af9a5a5d67c850/'
     # connection string obtained for the storage unit via azure
-    azure_connection_string = 'DefaultEndpointsProtocol=https;AccountName=storageaccountclc;AccountKey=soGFPvXy+lmdLUvj3v0qK7q0rtHe5kdNBL4w2cQd6qqhQ7py5CJQDUEvyqq6AyWnn+AWV/kiIStjDQgXlri7ng==;EndpointSuffix=core.windows.net'
+    azure_connection_string = '<Insert-Connection-String>'
+    container_name = '<Insert-Container-Name>'

     def requires(self) -> Generator[luigi.Task, None, None]:
         for filename in ['test_file1.CSV', 'test_file2.CSV']:
             yield Preprocess(
                 gist_input_url=f'{self.gist_url}(unknown)',
                 filename=filename,
                 connection_string=self.azure_connection_string,
+                container_name=self.container_name,
             )
```

03 - Luigi and Kubernetes/Readme.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -2,6 +2,6 @@

 This module will run you through the steps necessary to work with luigi in kubernetes.

-1. [Move Data into the cloud]()
+1. [Move Data into the cloud](https://github.com/falknerdominik/luigi_with_kubernetes_summary/blob/master/03%20-%20Luigi%20and%20Kubernetes/01%20-%20Move%20data%20into%20the%20cloud/README.md)
 2. [Use Kubernetes API]()
-3. [Inspect results]()
+3. [Inspect results / Dashboard?]()
```

0 commit comments
