# support for multiple requirements.txt #77
To clarify: is the requirements.txt referred to here the one inside the project folder that the user has created? e.g., for the templates/ml-intermediate project it would be https://github.com/ploomber/projects/blob/master/templates/ml-intermediate/requirements.txt? I'm trying out the tutorial here https://soopervisor.readthedocs.io/en/latest/tutorials/airflow.html to understand the soopervisor workflow and have the issues below:
If I manually install the latest Flask version I get this error:
yes, the typical structure is:
Use the Argo exporter; Airflow is a bit harder to get working because Airflow has too many dependencies and it tends to break the environment (as you saw). You don't need to have Kubernetes running. You can follow these quick instructions: https://soopervisor.readthedocs.io/en/latest/cookbook/kubernetes.html but if you want to see the full end-to-end workflow working (you'll probably need this once you start making code changes), check this one out: https://soopervisor.readthedocs.io/en/latest/tutorials/kubernetes.html let me know if you have questions about the tutorial or the code in general
good question. Currently, we require

yes, the convention will be
no, they should be optional. For example, if the user has a task "fit" in their pipeline, but there is no
this is related to the previous question. the only mandatory one is
Following the example here: https://soopervisor.readthedocs.io/en/latest/cookbook/kubernetes.html, these are my observations and high-level design. @edublancas please provide your inputs:
Some doubts here:
is copying all the files in the project folder. We need to copy only the relevant requirements*.lock.txt file (exclude the remaining requirements files using the exclude parameter?).
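A minimal sketch of that exclusion idea (the helper name and signature are hypothetical, not soopervisor's actual API):

```python
# Hypothetical sketch: copy the project, but exclude every requirements
# lock file except the one this image needs
import shutil
from glob import glob
from pathlib import Path


def copy_project(src, dst, keep):
    """Copy src to dst, keeping only the `keep` requirements lock file"""
    locks = {Path(p).name for p in glob(str(Path(src, 'requirements*.lock.txt')))}
    locks.discard(keep)

    shutil.copytree(src, dst, ignore=shutil.ignore_patterns(*locks))
```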
So I have made some changes in docker.py here (looping over all the requirements.*.lock.txt files): https://github.com/neelasha23/soopervisor/blob/issue_77/src/soopervisor/commons/docker.py Now the issue is: it runs fine for the first requirements file, but for the subsequent ones it throws an error. On debugging I realized that at this point here: https://github.com/ploomber/ploomber/blob/4bf4d937e1dee8d3defe9a3e89e83720860819d2/src/ploomber/io/_commander.py#L260 for the first run the dst folder (which looks like …). So I added an explicit delete method in ploomber/io/_commander.py like so:
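A minimal sketch of what such a delete method might look like, assuming it simply removes the destination directory (the snippet itself isn't shown in this thread):

```python
# Hypothetical sketch of the delete method added to Commander
import shutil
from pathlib import Path


class Commander:
    # ... existing Commander methods ...

    def rm(self, dst):
        """Delete the destination directory so the next copy starts clean"""
        dst = Path(dst)

        if dst.is_dir():
            shutil.rmtree(dst)
```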
and called this method here: https://github.com/neelasha23/soopervisor/blob/2bcd0099300fc67e5f5508dc9b1eb1846da0879d/src/soopervisor/commons/docker.py#L187 But this doesn't seem to solve the problem. Although the dst folder gets deleted with this change, it somehow comes back before the second run. Any suggestions on what exactly I'm missing here?
some feedback:
We want to generate one image per pattern, not one image per task. Think of a scenario where a pipeline has 100 tasks but only a requirements.a.txt and a requirements.b.txt: we want to generate two images, not 100.
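A rough sketch of the one-image-per-pattern idea (the helper, the lock-file naming, and the use of a build arg to select the requirements file are assumptions for illustration, not soopervisor's actual implementation):

```python
import subprocess
from glob import glob
from pathlib import Path


def build_images(pkg_name, version):
    """Build one image per requirements file, not one per task"""
    images = []

    for req in sorted(glob('requirements*.lock.txt')):
        # requirements.fit-__.lock.txt -> suffix "fit-__";
        # plain requirements.lock.txt -> no suffix (the default image)
        parts = Path(req).name.split('.')
        suffix = parts[1] if len(parts) == 4 else None
        tag = f'{pkg_name}-{suffix}:{version}' if suffix else f'{pkg_name}:{version}'

        # same Dockerfile for every image; only the requirements file changes
        # (assumes the Dockerfile reads an ARG named REQS)
        subprocess.run(['docker', 'build', '-t', tag,
                        '--build-arg', f'REQS={req}', '.'], check=True)
        images.append(tag)

    return images
```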
yeah that's fine. we can copy the entire project into all images
Yes
What would the benefit be?
that's fine. tasks can use different images, and there would be no problems when passing input/output data. But note something: after we generate the docker image (I say *the* because we only support one right now), we pass the name and tag so the next step uses it to schedule the jobs. The implementation of that next step depends on the backend. For example, when using Argo, we need to pass the name and tag to a specific field in the argo.yml file; when using AWS Batch, we need to pass the name and tag to the boto3 API call that schedules the job. Given that our current priority is the AWS Batch export, let's first implement it there.
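For reference, a sketch of where the name and tag would land in the boto3 call (the job name, queue, command, and registry values are placeholders):

```python
import boto3

client = boto3.client('batch', region_name='us-east-1')

# the image built for this task's pattern goes into containerProperties
job_def = client.register_job_definition(
    jobDefinitionName='ml-basic-fit',
    type='container',
    containerProperties={
        'image': '123456789.dkr.ecr.us-east-1.amazonaws.com/ml-basic-fit:latest',
        'vcpus': 2,
        'memory': 4096,
        'command': ['ploomber', 'task', 'fit'],
    },
)

client.submit_job(
    jobName='fit',
    jobQueue='my-job-queue',
    jobDefinition=job_def['jobDefinitionArn'],
)
```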
I went through the code to try to find what the error is but I couldn't. My suggestion is to use pdb and go line by line during the execution of
Yes, I realised this once I started making the code changes after commenting, and it's currently as per the requirement above.
When I started reading up on Docker Compose, I realised it's useful for spinning up multiple containers corresponding to different types of applications, so I don't think it's needed for this use case.
I debugged again and realised I should change the current directory back to the parent of

However, now it seems Docker is not able to create images with
Contents of eda project before running
Contents of each image:
Can we drop the
This makes sense, but we already have a wildcard feature to match resources (e.g. requesting a GPU) using wildcards (aka the
Okay. So the requirements files will still be named requirements.fit-*.txt etc., but while generating the image it would be
correct!
I get the same error with __ as well:

Issues when executing
soopervisor/src/soopervisor/commons/docker.py, lines 76 to 81 at commit 3ee5dec
I'm not sure how this piece will change for multiple requirements/environment files.
Update: I have successfully submitted this example https://github.com/ploomber/projects/blob/master/templates/ml-basic/pipeline.yaml to AWS Batch with a single requirements.lock.txt file.
@neelasha23 so I assume all of the above were solved?
With respect to code changes, point 4 is still an open question; I haven't made any changes in this if block as of yet. The other points are an issue for the ml-online project, but I'm working with ml-basic right now for testing end-to-end. This one is also open (image creation fails with __): I get the same error with __ as well: `invalid argument "eda-clean-__:latest" for "-t, --tag" flag: invalid reference format`
For the first one, the setup is in essence similar to the reqs.txt; I don't see a point of having/supporting both.
Yes, I will try out the online template with a fresh environment. But since it has a setup.py file, the control doesn't enter the else block where I have made changes for generating multiple images, so I dropped it as of now.
In any case the image name is getting converted to this format:
and matching the keys with task names using regex downstream.
hi, sorry for the delay. the notification got lost in my inbox. Re
https://docs.docker.com/engine/reference/commandline/tag/#extended-description so maybe the solution is to just append some fixed string at the end?
(I'm unsure what's the best suffix, but let's use ploomber for now) Re
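To illustrate one way this could work (an assumption, not the final implementation): substitute the wildcard with the fixed suffix and validate the result against the reference-format grammar linked above, since a name component can't end in `__`:

```python
import re

# one path component of a Docker image name, per the distribution spec:
# lowercase alphanumerics joined by '.', '_', '__', or one or more '-'
COMPONENT = re.compile(r'[a-z0-9]+(?:(?:\.|__|_|-+)[a-z0-9]+)*')


def safe_image_name(name, suffix='ploomber'):
    """Replace the __ wildcard with a fixed suffix so the name stays valid"""
    candidate = name.replace('__', suffix)

    if not COMPONENT.fullmatch(candidate):
        raise ValueError(f'cannot build a valid image name from {name!r}')

    return candidate


# e.g. safe_image_name('eda-clean-__') == 'eda-clean-ploomber'
```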
Updates:
I see the following log for the first job:
This is probably happening because the images were built on an Apple M1. I'm trying the solutions and will post an update once it works.
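One common fix for this situation (an assumption here, not the confirmed resolution) is to force an amd64 build so the image matches AWS Batch's x86_64 hosts:

```python
import subprocess

# build for linux/amd64 even on an Apple M1 (arm64) host
subprocess.run(
    ['docker', 'build', '--platform', 'linux/amd64',
     '-t', 'ml-basic:latest', '.'],
    check=True,
)
```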
(Note: this only applies to Docker-based exporters like AWS and Airflow)
Currently, when generating the Docker image to run tasks, users need to list the dependencies they want to include. For example, a user may have the following:
```
# content of requirements.txt
scikit-learn
pandas
```
However, in some cases, users may have a long list of dependencies. This is often the case when using many ML packages that implement different models (to find which model works best). Unfortunately, long environments are sometimes impossible to set up if two or more dependencies have conflicts, e.g. scikit-learn requires `numpy==1.2` but xgboost requires `numpy>1.2`. If that happens, users will be unable to create a single Docker image.

To fix this, we should allow them to declare multiple `requirements.txt` files and then generate multiple Docker images; note that this does not require multiple Dockerfiles, since we can re-use the same one with a different requirements.txt and tag the builds differently, so they effectively become multiple images.

To keep things simple, we can have a naming convention. For example, if a DAG has the tasks load, clean, fit-0, fit-1, and fit-2, we could do something like this:
- If a user has `requirements.txt` and nothing else, generate a single image.
- If a user has `requirements.txt` and any other files that match the format `requirements.*.txt`: match filenames to task names, see below.

### Pattern matching
(we can use fnmatch for this)
Say the user has `requirements.txt` and `requirements.fit-*.txt`:

- use `requirements.fit-*.txt` for tasks fit-0, fit-1, and fit-2
- use `requirements.txt` for all the other tasks

Open question: is the `*` problematic? I gave it a try and there don't seem to be any issues on UNIX if a filename has `*` in it. The other thing is to ensure that Docker allows `*` in image names. If not, we should use another wildcard (perhaps double underscore, `__`).
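A minimal sketch of the proposed matching with fnmatch (the function name is illustrative; the `__` variant would work the same way after translating `__` back to `*`):

```python
from fnmatch import fnmatch


def requirements_for(task_name, patterns):
    """Return the requirements file to use for a task; requirements.txt
    is the fallback when no pattern matches"""
    for pattern in patterns:
        # e.g. pattern 'fit-*' extracted from requirements.fit-*.txt
        if fnmatch(task_name, pattern):
            return f'requirements.{pattern}.txt'

    return 'requirements.txt'


# tasks fit-0, fit-1, fit-2 map to requirements.fit-*.txt, the rest to requirements.txt
assert requirements_for('fit-0', ['fit-*']) == 'requirements.fit-*.txt'
assert requirements_for('load', ['fit-*']) == 'requirements.txt'
```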
### Changes should be applied here
The `docker.build` function currently returns a list of image names; with this change, it should be extended to return all the generated images.

### Modifying the AWS Batch exporter
Once images are pushed, the AWS Batch exporter needs to pattern-match the task names to the right images.
### Testing
We should test that, given some combinations of `requirements.txt` files, the `docker` CLI is called with the right arguments, and that the tasks are scheduled in AWS with the right arguments as well.
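A sketch of that kind of test, using pytest fixtures and unittest.mock; `build_images` refers to the hypothetical helper sketched earlier in this thread, not an actual soopervisor function:

```python
from unittest import mock


def test_one_image_per_requirements_file(tmp_path, monkeypatch):
    monkeypatch.chdir(tmp_path)
    (tmp_path / 'requirements.lock.txt').write_text('pandas')
    (tmp_path / 'requirements.fit-__.lock.txt').write_text('scikit-learn')

    with mock.patch('subprocess.run') as run:
        # build_images: the hypothetical helper sketched earlier
        images = build_images('eda', 'latest')

    # one docker build per requirements file, with the expected tags
    # (in practice the tags would also go through name sanitization)
    tags = [call.args[0][call.args[0].index('-t') + 1]
            for call in run.call_args_list]
    assert tags == ['eda-fit-__:latest', 'eda:latest'] == images
```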