DO NOT MERGE - Pipeline performance test project #4154
Signed-off-by: Laura Couto <laurarccouto@gmail.com>
Thanks! I couldn't run the pipeline because the data is missing, so I only reviewed it at a high level.
Can you add a description to the PR explaining how to use this pipeline test? I see that most of the pipelines here mock their work with `sleep`. Why did you end up going with this implementation?
For example, if I want to answer the question "does Kedro run too slowly when it needs to connect to a database?", what command should I run?
performance-test/conf/README.md
@@ -0,0 +1,20 @@
# What is this for?

This folder should be used to store configuration files used by Kedro or by separate tools.
Is there any specific configuration that needs to be documented? Otherwise, I think we can remove this from our project.
@@ -0,0 +1,98 @@
# performance-test
It might be more helpful to document how this project should be used. Otherwise, I suggest removing it, as this template doesn't add much information for us.
def register_pipelines(self) -> Dict[str, Pipeline]:
    from performance_test.pipelines.expense_analysis import (
        pipeline as expense_analysis_pipeline,
    )

    return {
        "__default__": expense_analysis_pipeline.create_pipeline(),
        "expense_analysis": expense_analysis_pipeline.create_pipeline(),
    }
Does this belong in `pipeline_registry.py`?
…o into pipeline-performance-test
notebook
ruff~=0.1.8
scikit-learn~=1.5.1; python_version >= "3.9"
scikit-learn<=1.4.0,>=1.0; python_version < "3.9"
`pyspark` should probably be here.
…o into pipeline-performance-test
I'm still unable to run the pipeline. Am I supposed to get the data from somewhere? Can we merge this folder with Ankita's
Thanks @lrcouto, I was able to get the pipeline to run! (Thanks for helping with the setup.) It looks good to me, just some minor comments.
I think it'd be nice for this project to be its own separate repository that we could use to run performance tests, instead of being part of the Kedro code base, but I'm keen to hear what others think.
@@ -0,0 +1 @@
{}
We can get rid of this folder entirely; it's generated by viz.
@@ -0,0 +1,19 @@
# performance-test
We could also add setup instructions here, just so it's recorded somewhere!
I can run the pipeline successfully too, with one extra step: installing Java on GitPod.
kedro run --params=hook_delay=5,dataset_load_delay=5,file_save_delay=5
Apart from @ankatiyar's comment, I have some minor comments about making the parameter names consistent. As we discussed, a few preset configurations would be helpful so people know how to use the configuration to test (we'll likely need these presets anyway to run benchmarks automatically).
If I understand correctly I don't expect any difference between:
kedro run --file-save-delay=5
and kedro run --file-load-delay=5
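As a rough illustration of the preset idea mentioned above, the sketch below maps preset names to delay values and renders the matching `kedro run --params=...` command. The preset names (`baseline`, `slow_io`, `slow_hooks`) and the `build_command` helper are hypothetical and not part of this PR; only the parameter names and the `--params` format come from the thread.

```python
from typing import Dict

# Hypothetical presets: the names and values are illustrative assumptions.
PRESETS: Dict[str, Dict[str, int]] = {
    "baseline": {"hook_delay": 0, "dataset_load_delay": 0, "file_save_delay": 0},
    "slow_io": {"hook_delay": 0, "dataset_load_delay": 5, "file_save_delay": 5},
    "slow_hooks": {"hook_delay": 5, "dataset_load_delay": 0, "file_save_delay": 0},
}


def build_command(preset: str) -> str:
    """Render the `kedro run` command line for a named preset."""
    params = ",".join(f"{key}={value}" for key, value in PRESETS[preset].items())
    return f"kedro run --params={params}"


if __name__ == "__main__":
    for name in PRESETS:
        print(f"{name}: {build_command(name)}")
```

A benchmark script could loop over such presets and time each run, which would keep the tested configurations in one place.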
@@ -0,0 +1,3 @@
hook_delay: 0
dataset_load_delay: 0
file_save_delay: 0
Can we choose one name? Either data_save_delay or file_load_delay.
We could just call them "save_delay" and "load_delay" maybe?
sounds good
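For readers following the naming discussion, here is a minimal sketch of how a pair of `load_delay` and `save_delay` values might be applied around dataset I/O. `DelayedMemoryDataset` is a hypothetical stand-alone class written for illustration; it does not use Kedro's dataset API and is not the implementation in this PR.

```python
import time
from typing import Any, Callable


class DelayedMemoryDataset:
    """Hypothetical in-memory dataset that sleeps before each load and save."""

    def __init__(
        self,
        load_delay: float = 0,
        save_delay: float = 0,
        sleep: Callable[[float], Any] = time.sleep,
    ) -> None:
        self._load_delay = load_delay
        self._save_delay = save_delay
        self._sleep = sleep  # injectable so tests do not have to wait
        self._data: Any = None

    def save(self, data: Any) -> None:
        self._sleep(self._save_delay)  # simulate a slow write
        self._data = data

    def load(self) -> Any:
        self._sleep(self._load_delay)  # simulate a slow read
        return self._data
```

With symmetric names, a single integer per direction is enough to describe the simulated latency, whatever the underlying storage is.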
The project is currently located at https://github.com/kedro-org/pipeline-performance-test. Closing this PR since it's not necessary anymore.
Description
Kedro project made to simulate delays and latency at specific points of a Kedro pipeline. Pass the desired delays in seconds using the `--params` flag. For example: `kedro run --params=hook_delay=5,dataset_load_delay=5,file_save_delay=5`
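To make the description concrete, the following sketch shows one way a `key=value,key=value` string like the one above could be turned into delays and applied with `time.sleep`. It is plain Python under stated assumptions: `parse_params` and `simulate_delay` are hypothetical helpers, not Kedro's own parameter parser and not code from this project.

```python
import time
from typing import Callable, Dict


def parse_params(raw: str) -> Dict[str, float]:
    """Parse 'key=value,key=value' into a dict of delays in seconds."""
    params: Dict[str, float] = {}
    for item in raw.split(","):
        key, _, value = item.partition("=")
        params[key.strip()] = float(value)
    return params


def simulate_delay(seconds: float, sleep: Callable[[float], None] = time.sleep) -> float:
    """Block for `seconds` and return the measured wall-clock duration."""
    start = time.perf_counter()
    sleep(seconds)
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = parse_params("hook_delay=5,dataset_load_delay=5,file_save_delay=5")
    for name, seconds in delays.items():
        # Scaled down by 100 so the demo finishes quickly.
        print(name, round(simulate_delay(seconds / 100), 3))
```

Because the injected delay is known, subtracting it from the measured run time gives a rough estimate of the overhead added by everything else.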
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.