-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[MXNET-311] Change the docker image for Installation Guide Test - needs sudo #10510
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are putting the docker in official mxnet docker hub. People might get confused, with base-cuda9.
Try changing the name but its up to you.
I don't think pushing to dockerhub is needed but if there is a request, I will change the name and create a new PR. |
echo | ||
echo "### Testing Virtualenv ###" | ||
echo "${virtualenv_commands}" | ||
echo | ||
nvidia-docker run --rm nvidia/cuda:9.0-cudnn7-devel bash -c "${virtualenv_commands}" | ||
nvidia-docker run --rm mxnet/base-cuda9 bash -c "${virtualenv_commands}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be more maintainable if we push this docker image for nightly tests to the docker registry (maybe with a different name as @gautamkmr suggested). The release doesn't have to block on this though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where can I see the contents of this image? Also it would be good to keep consistency with nvidia images (I assume they are taken as base anyway) and name and branch them with minor versions of cuda: mxnet/base:cuda9.0
mxnet/base:cuda9.1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you elaborate why we need to maintain two parallel solutions and environments? We've already taken care of problems like these in the docker images that are running in our CI - why not just use those?
Ah, is it because you're trying to replicate what we're basically doing in our containers, just from a users point of view and that's why a pretty much empty docker container is required? |
I think pushing to dockerhub could cause confusion since this is basically an empty cuda9 container and is not really related to mxnet itself. Why can't we just create a simple Dockerfile (exactly what you did in order to create that image locally), put it here and re-use it instead of relying on the local cache or the dockerhub? |
Currently, this change is to run the installation guide test locally for the 1.2 release. As a part of migration to new CI, all of these tests will use the docker images available in this repo and I will add a new one to the ci folder if needed. |
I will add a dockerfile to the repo for this test but only during migration. Doing so now would require temporary changes to the test script to build the docker image. There is no point of doing that when we move to using the pipeline soon. |
I see, thanks for elaborating. So just to clarify: This means if a new slave for the internal CI gets deployed or if the docker cache gets cleaned, this task is going to fail, right? Fine with me as a temporary solution, we just have to be aware of that risk. |
Correct, it is far from ideal. But hoping that this CI setup will be deprecated soon and 1.2 is the last release happening on this one. So we just need to run this test once to validate the release branch. |
Alrighty, fine with me :) Could you please add a jira ticket before we merge? |
@marcoabreu jira is added. Can we merge this ? |
* Remove Fermi from cmake (#10486) * updated R docs (#10473) * [MXNET-120] Float16 support for distributed training (#10183) * send as char * fix bug on pull response, and rowsparse on worker side * three modes * default to mode 0 and add support for row sparse * refactor sparse * rowsparse numbytes fixes * WIP tests * update test sync * remove prints * refactoring * Revert "refactoring" This reverts commit 05ffa1b. * undo refactoring to keep PR simple * add wait to stored in pull default * lint fixes * undo static cast for recvblob * lint fixes * mode 1 changes * sparse bug fix dtype * mshadow default * remove unused var * remove debug statements * clearer variables, reduced multiplication, const vars * add const for more vars, comments * comment syntax, code watcher, test default val * remove unnecessary print in test * trigger ci * multi precision mode (debugging race condition) * working rsp pushes * finish multiprecision for row sparse * rename num-bytes * fix bug due to rename of numbytes, and remove debug logs * address comments * add integration test * trigger ci * integration test * integration test * fix path of script * update mshadow * disable f16c for amalgamation * fix amalgamation build * trigger ci * disable f16c for jetson * Fix rat excludes (#10499) * MXNET-308 added missing license (#10497) * refactored example (#10484) * [MXNET-298] Scala Infer API docs landing page (#10474) * changed url references from dmlc to apache/incubator-mxnet * prepping scala landing pages * infer api info added * Fix infer storage type (#10507) * Fix infer_storage_type * Add test * Fix lint * Trigger CI * [MXNET-306] Add slice_like operator (#10491) * add slice_like and doc * pass unittest and lint * Minor simplifications in ci/build.py (#10496) * [MXNET-305] Scala tutorial table fix (#10488) * initial update on setting up scala ide with mxnet * moving images to web-data project * updated links to images; added readme for root folder * scala hello world feature added * workaround for make transitive error * fixed systempath * minor updates * table fix * added some spacing * more spacing * added ability to set search path for Accelerate library * [MXNET-311] change test needs a docker with sudo, hence image changed (#10510) * added new metric * changed back from branch to upstream * lint changed * Fixed typo * Clarified interpretation * Changes for variable names * fixed variable names * replay the unit tests * changed comment
…e#10524) * Remove Fermi from cmake (apache#10486) * updated R docs (apache#10473) * [MXNET-120] Float16 support for distributed training (apache#10183) * send as char * fix bug on pull response, and rowsparse on worker side * three modes * default to mode 0 and add support for row sparse * refactor sparse * rowsparse numbytes fixes * WIP tests * update test sync * remove prints * refactoring * Revert "refactoring" This reverts commit 05ffa1b. * undo refactoring to keep PR simple * add wait to stored in pull default * lint fixes * undo static cast for recvblob * lint fixes * mode 1 changes * sparse bug fix dtype * mshadow default * remove unused var * remove debug statements * clearer variables, reduced multiplication, const vars * add const for more vars, comments * comment syntax, code watcher, test default val * remove unnecessary print in test * trigger ci * multi precision mode (debugging race condition) * working rsp pushes * finish multiprecision for row sparse * rename num-bytes * fix bug due to rename of numbytes, and remove debug logs * address comments * add integration test * trigger ci * integration test * integration test * fix path of script * update mshadow * disable f16c for amalgamation * fix amalgamation build * trigger ci * disable f16c for jetson * Fix rat excludes (apache#10499) * MXNET-308 added missing license (apache#10497) * refactored example (apache#10484) * [MXNET-298] Scala Infer API docs landing page (apache#10474) * changed url references from dmlc to apache/incubator-mxnet * prepping scala landing pages * infer api info added * Fix infer storage type (apache#10507) * Fix infer_storage_type * Add test * Fix lint * Trigger CI * [MXNET-306] Add slice_like operator (apache#10491) * add slice_like and doc * pass unittest and lint * Minor simplifications in ci/build.py (apache#10496) * [MXNET-305] Scala tutorial table fix (apache#10488) * initial update on setting up scala ide with mxnet * moving images to web-data project * updated links to images; added readme for root folder * scala hello world feature added * workaround for make transitive error * fixed systempath * minor updates * table fix * added some spacing * more spacing * added ability to set search path for Accelerate library * [MXNET-311] change test needs a docker with sudo, hence image changed (apache#10510) * added new metric * changed back from branch to upstream * lint changed * Fixed typo * Clarified interpretation * Changes for variable names * fixed variable names * replay the unit tests * changed comment
Description
The installation guide test used to pull a Cuda docker image from dockerhub and run tests inside this container. Earlier this test was running in cuda7.5 containers which had 'sudo' installed inside it. I changed this to cuda9 a couple of days back.
However, the latest cuda docker images -
nvidia/cuda:9.0-cudnn7-devel
andnvidia/cuda:8.0-cudnn5-devel
do not havesudo
installed which is needed by the scripts in the mxnet installation page.Hence I have created a simple docker image on the host called mxnet/base-cuda9 which only has the following -
nvidia/cuda:9.0-cudnn7-devel
sudo
Please Note:
@anirudh2290 This needs to go into 1.2 to make a nightly test pass.
@gautamkmr @b0noI Can you please review!
Checklist
Essentials
Changes
Comments