
Commit 13d6fb2

Author: Alex Ioannides
Merge pull request #10 from AlexIoannides/oliverw1-cherry-picked
Prevent untested dependencies from being packaged [cherry picked from #9 ]
2 parents 1462058 + 6175a79 commit 13d6fb2

File tree: 8 files changed (+79, -84 lines)


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -2,7 +2,9 @@
 */__pycache__/*
 *?metastore_db/*
 */spark_warehouse/*
+.mypy_cache/
 .vscode/*
+.venv
 venv/*
 loaded_data/*
 derby.log

Pipfile

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ name = "pypi"
 [packages]

 [dev-packages]
-pyspark = "==2.3.1"
+pyspark = "==2.4.0"
 ipython = "*"
 "flake8" = "*"

Pipfile.lock

Lines changed: 41 additions & 48 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 16 additions & 16 deletions
@@ -13,14 +13,14 @@ The basic project structure is as follows:

 ```bash
 root/
+|-- configs/
+|   |-- etl_config.json
 |-- dependencies/
 |   |-- logging.py
 |   |-- spark.py
 |-- jobs/
 |   |-- etl_job.py
-|-- configs/
-|   |-- etl_config.json
-|   tests/
+|-- tests/
 |   |-- test_data/
 |   |-- | -- employees/
 |   |-- | -- employees_report/
@@ -31,19 +31,19 @@ root/
 |   Pipfile.lock
 ```

-The main Python module containing the ETL job (which will be sent to the Spark cluster), is `jobs/etl_job.py`. Any external configuration parameters required by `etl_job.py` are stored in JSON format in `configs/etl_config.json`. Additional modules that support this job can be kept in the `dependencies` folder (more on this later). In the project's root we include `build_dependencies.sh`, which is bash script for building these dependencies into a zip-file to be sent to the cluster (`packages.zip`). Unit test modules are kept in the `tests` folder and small chunks of representative input and output data, to be use with the tests, are kept in `tests/test_data` folder.
+The main Python module containing the ETL job (which will be sent to the Spark cluster), is `jobs/etl_job.py`. Any external configuration parameters required by `etl_job.py` are stored in JSON format in `configs/etl_config.json`. Additional modules that support this job can be kept in the `dependencies` folder (more on this later). In the project's root we include `build_dependencies.sh`, which is a bash script for building these dependencies into a zip-file to be sent to the cluster (`packages.zip`). Unit test modules are kept in the `tests` folder and small chunks of representative input and output data, to be used with the tests, are kept in `tests/test_data` folder.

 ## Structure of an ETL Job

-In order to facilitate easy debugging and testing, we recommend that the 'Transformation' step be isolated from the 'Extract' and 'Load' steps, into it's own function - taking input data arguments in the form of DataFrames and returning the transformed data as a single DataFrame. Then, the code that surrounds the use of the transformation function in the `main()` job function, is concerned with Extracting the data, passing it to the transformation function and then Loading (or writing) the results to their ultimate destination. Testing is simplified, as mock or test data can be passed to the transformation function and the results explicitly verified, which would not be possible if all of the ETL code resided in `main()` and referenced production data sources and destinations.
+In order to facilitate easy debugging and testing, we recommend that the 'Transformation' step be isolated from the 'Extract' and 'Load' steps, into its own function - taking input data arguments in the form of DataFrames and returning the transformed data as a single DataFrame. Then, the code that surrounds the use of the transformation function in the `main()` job function, is concerned with Extracting the data, passing it to the transformation function and then Loading (or writing) the results to their ultimate destination. Testing is simplified, as mock or test data can be passed to the transformation function and the results explicitly verified, which would not be possible if all of the ETL code resided in `main()` and referenced production data sources and destinations.

-More generally, transformation functions should be designed to be idempotent. This is technical way of saying that the repeated application of the transformation function should have no impact on the fundamental state of output data, until the instance when the input data changes. One of the key advantages of idempotent ETL jobs, is that they can be set to run repeatedly (e.g. by using `cron` to trigger the `spark-submit` command above, on a pre-defined schedule), rather than having to factor-in potential dependencies on other ETL jobs completing successfully.
+More generally, transformation functions should be designed to be _idempotent_. This is a technical way of saying that the repeated application of the transformation function should have no impact on the fundamental state of output data, until the moment the input data changes. One of the key advantages of idempotent ETL jobs, is that they can be set to run repeatedly (e.g. by using `cron` to trigger the `spark-submit` command above, on a pre-defined schedule), rather than having to factor-in potential dependencies on other ETL jobs completing successfully.

 ## Passing Configuration Parameters to the ETL Job

-Although it is possible to pass arguments to `etl_job.py`, as you would for any generic Python module running as a 'main' program - by specifying them after the module's filename and then parsing these command line arguments - this can get very complicated, very quickly, especially when there are lot of parameters (e.g. credentials for multiple databases, table names, SQL snippets, etc.). This also makes debugging the code from within a Python interpreter is extremely awkward, as you don't have access to the command line arguments that would ordinarily be passed to the code, when calling it from the command line.
+Although it is possible to pass arguments to `etl_job.py`, as you would for any generic Python module running as a 'main' program - by specifying them after the module's filename and then parsing these command line arguments - this can get very complicated, very quickly, especially when there are lot of parameters (e.g. credentials for multiple databases, table names, SQL snippets, etc.). This also makes debugging the code from within a Python interpreter extremely awkward, as you don't have access to the command line arguments that would ordinarily be passed to the code, when calling it from the command line.

-A much more effective solution is to send Spark a separate file - e.g. using the `--files configs/etl_config.json` flag with `spark-subit` - containing the configuration in JSON format, which can be parsed into a Python dictionary in one line of code with `json.loads(config_file_contents)`. Testing the code from within a Python interactive console session is also greatly simplified, as all one has to do to access configuration parameters for testing, is to copy and paste the contents of the file - e.g.,
+A much more effective solution is to send Spark a separate file - e.g. using the `--files configs/etl_config.json` flag with `spark-submit` - containing the configuration in JSON format, which can be parsed into a Python dictionary in one line of code with `json.loads(config_file_contents)`. Testing the code from within a Python interactive console session is also greatly simplified, as all one has to do to access configuration parameters for testing, is to copy and paste the contents of the file - e.g.,

 ```python
 import json
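
To make the recommended structure concrete, here is a minimal, hypothetical sketch of an ETL job written this way. It is not the repository's `jobs/etl_job.py`: the column names, the `steps_per_floor` parameter and the parquet paths are illustrative assumptions. The point is the shape - a pure `transform_data` function that can be fed small test DataFrames directly, wrapped by a `main()` that does the extracting and loading (and, by writing with overwrite semantics, stays idempotent on repeated runs).

```python
# Hypothetical sketch only - names, columns and paths are illustrative.
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F


def transform_data(df: DataFrame, steps_per_floor: int) -> DataFrame:
    """Pure transformation: DataFrame in, DataFrame out, no reads or writes."""
    return df.withColumn('steps_to_desk', F.col('floor') * steps_per_floor)


def main() -> None:
    """Extract, transform and load - only this function touches storage."""
    spark = SparkSession.builder.appName('my_etl_job').getOrCreate()
    data = spark.read.parquet('tests/test_data/employees')             # Extract
    data_transformed = transform_data(data, steps_per_floor=21)        # Transform
    data_transformed.write.mode('overwrite').parquet('loaded_data/')   # Load
    spark.stop()


if __name__ == '__main__':
    main()
```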
@@ -61,10 +61,10 @@ In this project, functions that can be used across different ETL jobs are kept i
 from dependencies.spark import start_spark
 ```

-This package, together with any additional dependencies referenced within it, must be to copied to each Spark node for all jobs that use `dependencies` to run. This can be achieved in one of several ways:
+This package, together with any additional dependencies referenced within it, must be copied to each Spark node for all jobs that use `dependencies` to run. This can be achieved in one of several ways:

 1. send all dependencies as a `zip` archive together with the job, using `--py-files` with Spark submit;
-2. formally package and upload `dependencies` to somewhere like the `PyPi` archive (or a private version) and then run `pip3 install dependencies` on each node; or,
+2. formally package and upload `dependencies` to somewhere like the `PyPI` archive (or a private version) and then run `pip3 install dependencies` on each node; or,
 3. a combination of manually copying new modules (e.g. `dependencies`) to the Python path of each node and using `pip3 install` for additional dependencies (e.g. for `requests`).

 Option (1) is by far the easiest and most flexible approach, so we will make use of this for now. To make this task easier, especially when modules such as `dependencies` have additional dependencies (e.g. the `requests` package), we have provided the `build_dependencies.sh` bash script for automating the production of `packages.zip`, given a list of dependencies documented in `Pipfile` and managed by the `pipenv` python application (discussed below).
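
As a rough illustration of what option (1) involves, the sketch below zips the local `dependencies` package into a `packages.zip` archive suitable for `--py-files`. This is not the repository's `build_dependencies.sh` (which is a bash script and also bundles the third-party packages resolved from `Pipfile`); it is a hedged Python approximation covering the local code only.

```python
# Hypothetical approximation of the archive that build_dependencies.sh produces;
# it only bundles the local 'dependencies' package, not third-party packages.
import zipfile
from pathlib import Path


def build_packages_zip(package_dir: str = 'dependencies',
                       archive_path: str = 'packages.zip') -> None:
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for path in Path(package_dir).rglob('*.py'):
            # Keep the package prefix (e.g. dependencies/spark.py) so that
            # 'from dependencies.spark import start_spark' works on the cluster.
            archive.write(path, arcname=str(path))


if __name__ == '__main__':
    build_packages_zip()
```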
@@ -77,7 +77,7 @@ Assuming that the `$SPARK_HOME` environment variable points to your local Spark
 $SPARK_HOME/bin/spark-submit \
 --master local[*] \
 --packages 'com.somesparkjar.dependency:1.0.0' \
---py-files dependencies.zip \
+--py-files packages.zip \
 --files configs/etl_config.json \
 jobs/etl_job.py
 ```
@@ -96,7 +96,7 @@ Full details of all possible options can be found [here](http://spark.apache.org

 It is not practical to test and debug Spark jobs by sending them to a cluster using `spark-submit` and examining stack traces for clues on what went wrong. A more productive workflow is to use an interactive console session (e.g. IPython) or a debugger (e.g. the `pdb` package in the Python standard library or the Python debugger in Visual Studio Code). In practice, however, it can be hard to test and debug Spark jobs in this way, as they implicitly rely on arguments that are sent to `spark-submit`, which are not available in a console or debug session.

-We wrote the `start_spark` function - found in `dependencies/spark.py` - to facilitate the development of Spark jobs that are aware of the context in which they are being executed - i.e. as `spark-submit` jobs or within an IPython console, etc. The expected location of the Spark and job configuration parameters required by the job, is contingent on which execution context has been detected. The doscstring for `start_spark` gives the precise details,
+We wrote the `start_spark` function - found in `dependencies/spark.py` - to facilitate the development of Spark jobs that are aware of the context in which they are being executed - i.e. as `spark-submit` jobs or within an IPython console, etc. The expected location of the Spark and job configuration parameters required by the job, is contingent on which execution context has been detected. The docstring for `start_spark` gives the precise details,

 ```python
 def start_spark(app_name='my_spark_app', master='local[*]', jar_packages=[],
@@ -112,25 +112,25 @@ def start_spark(app_name='my_spark_app', master='local[*]', jar_packages=[],
 This function also looks for a file ending in 'config.json' that
 can be sent with the Spark job. If it is found, it is opened,
 the contents parsed (assuming it contains valid JSON for the ETL job
-configuration), into a dict of ETL job configuration parameters,
+configuration) into a dict of ETL job configuration parameters,
 which are returned as the last element in the tuple returned by
 this function. If the file cannot be found then the return tuple
 only contains the Spark session and Spark logger objects and None
 for config.

 The function checks the enclosing environment to see if it is being
 run from inside an interactive console session or from an
-environment which has a `DEBUG` environment varibale set (e.g.
+environment which has a `DEBUG` environment variable set (e.g.
 setting `DEBUG=1` as an environment variable as part of a debug
-configuration within an IDE such as Visual Studio Code or PyCharm in
+configuration within an IDE such as Visual Studio Code or PyCharm.
 In this scenario, the function uses all available function arguments
 to start a PySpark driver from the local PySpark package as opposed
 to using the spark-submit and Spark cluster defaults. This will also
 use local module imports, as opposed to those in the zip archive
 sent to spark via the --py-files flag in spark-submit.

 :param app_name: Name of Spark app.
-:param master: Cluster connection details (defaults to local[*].
+:param master: Cluster connection details (defaults to local[*]).
 :param jar_packages: List of Spark JAR package names.
 :param files: List of files to send to Spark cluster (master and
 workers).
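
Taken together, the docstring implies a calling pattern along the following lines. This is a hedged sketch rather than code from the repository: only `app_name` and `files` come from the documented signature, and the config key is invented for illustration.

```python
from dependencies.spark import start_spark

# Per the docstring above, start_spark returns a tuple of
# (Spark session, Spark logger, config dict), with None in place of the
# config dict if no '*config.json' file was sent with the job.
spark, log, config = start_spark(
    app_name='my_etl_job',
    files=['configs/etl_config.json'])

if config is None:
    # The config file was not shipped with the job (e.g. --files was omitted).
    raise RuntimeError('no ETL job configuration was found')

input_path = config['input_path']  # hypothetical key in etl_config.json
```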
