
Commit fa5661b

⚙️ Staged data format should be CSV; minor formatting
1 parent d6adb39 commit fa5661b

File tree

2 files changed: +7 −7 lines changed


dags/scripts/spark/malware_file_detection.py

+6 −6

@@ -1,8 +1,8 @@
-"""This is one place where machine learning with Spark could occur. A previously
-trained classification algorithm could perhaps be used to classify the new batch of
-incoming data and predict whether or not the features in there describe a malicious
-file. Additional ML engineering can be done to feed the algorithm with new data
-and also improve its accuracy, but that is out of the scope of this project.
+"""This is one place where data processing or machine learning with Spark could occur.
+A previously trained classification algorithm could perhaps be used to classify the
+new batch of incoming data and predict whether or not the features in there describe
+a malicious file. Additional ML engineering can be done to feed the algorithm with
+new data and also improve its accuracy, but that is out of the scope of this project.
 
 In this script, I am just selecting some columns that I think might be useful to
 display on a daily dashboard, and will not be doing any machine learning. The
@@ -37,4 +37,4 @@
 
 # sys.argv[2] is also the full S3 URI for the output destination folder that EMR will write to
 # this is going to be the `stage` folder on S3
-new_df.write.format("parquet").mode("overwrite").save(sys.argv[2])
+new_df.write.format("csv").mode("overwrite").save(sys.argv[2])
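One practical consequence of this change: Spark's `df.write.format("csv").save(path)` produces a folder of `part-*` files (plus a `_SUCCESS` marker), not a single CSV file, so any downstream consumer of the `stage` folder has to gather the parts. A minimal sketch of that, in pure Python, assuming a local copy of the staged folder (the helper name `read_staged_csv` is hypothetical, not from this repo):

```python
import csv
import glob
import os


def read_staged_csv(stage_dir: str) -> list[list[str]]:
    """Collect rows from every CSV part file Spark wrote into `stage_dir`.

    Spark writes one part file per output partition; sorting the file
    names keeps the row order deterministic across runs.
    """
    rows: list[list[str]] = []
    for part in sorted(glob.glob(os.path.join(stage_dir, "part-*"))):
        with open(part, newline="") as fh:
            rows.extend(csv.reader(fh))
    return rows
```

In this pipeline the gathering is presumably done by whatever loads the stage folder into Redshift (a `COPY` from an S3 prefix handles part files natively), so a helper like this would only matter for local inspection or testing.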

dags/utils.py

+1 −1

@@ -135,7 +135,7 @@ def _pause_redshift_cluster(cluster_identifier: str):
     cluster_state = redshift_hook.cluster_status(cluster_identifier=cluster_identifier)
 
     try:
-        if cluster_state == 'paused':
+        if cluster_state == "paused":
             return
 
         redshift_hook.get_conn().pause_cluster(ClusterIdentifier=cluster_identifier)
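The guard shown in this hunk makes the pause idempotent: if the cluster already reports `"paused"`, the helper returns before issuing the `pause_cluster` API call. A minimal sketch of that control flow, with a hypothetical stub standing in for Airflow's Redshift hook (the real hook talks to AWS; this stub only records calls):

```python
class FakeRedshiftHook:
    """Hypothetical stand-in for the Airflow Redshift hook, for illustration only."""

    def __init__(self, state: str):
        self._state = state
        self.pause_calls = 0

    def cluster_status(self, cluster_identifier: str) -> str:
        return self._state

    def get_conn(self):
        # The real hook returns a boto3 Redshift client here.
        return self

    def pause_cluster(self, ClusterIdentifier: str):
        self.pause_calls += 1
        self._state = "paused"


def pause_redshift_cluster(redshift_hook, cluster_identifier: str) -> None:
    # Mirrors the guard in dags/utils.py: skip the API call when already paused.
    cluster_state = redshift_hook.cluster_status(cluster_identifier=cluster_identifier)
    if cluster_state == "paused":
        return
    redshift_hook.get_conn().pause_cluster(ClusterIdentifier=cluster_identifier)
```

Running the helper twice against the same stub shows the second call is a no-op, which is exactly the behavior the early `return` buys in a DAG that may retry the task.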
