Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-311] Change the docker image for Installation Guide Test - needs sudo #10510

Merged
merged 1 commit into from
Apr 12, 2018

Conversation

mbaijal
Copy link
Contributor

@mbaijal mbaijal commented Apr 11, 2018

Description

The installation guide test used to pull a Cuda docker image from dockerhub and run tests inside this container. Earlier this test was running in cuda7.5 containers which had 'sudo' installed inside it. I changed this to cuda9 a couple of days back.

However, the latest cuda docker images - nvidia/cuda:9.0-cudnn7-devel and nvidia/cuda:8.0-cudnn5-devel do not have sudo installed which is needed by the scripts in the mxnet installation page.

Hence I have created a simple docker image on the host called mxnet/base-cuda9 which only has the following -

  1. nvidia/cuda:9.0-cudnn7-devel
  2. sudo

Please Note:

  1. These tests are going to run on the official MXNet CI soon and these docker images will not be needed at that point since I plan to fix the script itself.
  2. The dockerfile and docker Image are only present on the Jenkins setup where this nightly test is running. Please add a comment below if you would like me to add this image to the MXNet dockerhub repo.

@anirudh2290 This needs to go into 1.2 to make a nightly test pass.
@gautamkmr @b0noI Can you please review!

Checklist

Essentials

  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Created a new docker image on the slave where these nightly tests run, I can publish this image to dockerhub if needed.
  • Changed the installation guide GPU tests to run in this new docker container

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@mbaijal mbaijal changed the title [WIP] Change the docker image for Installation Guide Test - needs sudo Change the docker image for Installation Guide Test - needs sudo Apr 11, 2018
Copy link
Contributor

@goswamig goswamig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are putting the docker in official mxnet docker hub. People might get confused, with base-cuda9.

Try changing the name but its up to you.

@mbaijal
Copy link
Contributor Author

mbaijal commented Apr 11, 2018

I don't think pushing to dockerhub is needed but if there is a request, I will change the name and create a new PR.

echo
echo "### Testing Virtualenv ###"
echo "${virtualenv_commands}"
echo
nvidia-docker run --rm nvidia/cuda:9.0-cudnn7-devel bash -c "${virtualenv_commands}"
nvidia-docker run --rm mxnet/base-cuda9 bash -c "${virtualenv_commands}"
Copy link
Member

@anirudh2290 anirudh2290 Apr 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be more maintainable if we push this docker image for nightly tests to the docker registry (maybe with a different name as @gautamkmr suggested). The release doesn't have to block on this though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where can I see the contents of this image? Also it would be good to keep consistency with nvidia images (I assume they are taken as base anyway) and name and branch them with minor versions of cuda: mxnet/base:cuda9.0 mxnet/base:cuda9.1

Copy link
Contributor

@marcoabreu marcoabreu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate why we need to maintain two parallel solutions and environments? We've already taken care of problems like these in the docker images that are running in our CI - why not just use those?

@marcoabreu
Copy link
Contributor

Ah, is it because you're trying to replicate what we're basically doing in our containers, just from a users point of view and that's why a pretty much empty docker container is required?

@marcoabreu
Copy link
Contributor

marcoabreu commented Apr 11, 2018

I think pushing to dockerhub could cause confusion since this is basically an empty cuda9 container and is not really related to mxnet itself. Why can't we just create a simple Dockerfile (exactly what you did in order to create that image locally), put it here and re-use it instead of relying on the local cache or the dockerhub?

@marcoabreu marcoabreu dismissed their stale review April 11, 2018 22:02

non-blocking

@mbaijal
Copy link
Contributor Author

mbaijal commented Apr 11, 2018

Currently, this change is to run the installation guide test locally for the 1.2 release.
Yes as you pointed out, it should be an (almosyt) empty container for this test similar to the user and hence I do not want to test it on the existing containers.

As a part of migration to new CI, all of these tests will use the docker images available in this repo and I will add a new one to the ci folder if needed.

@mbaijal
Copy link
Contributor Author

mbaijal commented Apr 11, 2018

I will add a dockerfile to the repo for this test but only during migration. Doing so now would require temporary changes to the test script to build the docker image. There is no point of doing that when we move to using the pipeline soon.

@marcoabreu
Copy link
Contributor

I see, thanks for elaborating. So just to clarify: This means if a new slave for the internal CI gets deployed or if the docker cache gets cleaned, this task is going to fail, right?

Fine with me as a temporary solution, we just have to be aware of that risk.

@mbaijal
Copy link
Contributor Author

mbaijal commented Apr 11, 2018

Correct, it is far from ideal. But hoping that this CI setup will be deprecated soon and 1.2 is the last release happening on this one. So we just need to run this test once to validate the release branch.

@marcoabreu
Copy link
Contributor

Alrighty, fine with me :)

Could you please add a jira ticket before we merge?

@mbaijal mbaijal changed the title Change the docker image for Installation Guide Test - needs sudo [MXNET-311] Change the docker image for Installation Guide Test - needs sudo Apr 12, 2018
@anirudh2290
Copy link
Member

@marcoabreu jira is added. Can we merge this ?

@marcoabreu marcoabreu merged commit ceb810c into apache:master Apr 12, 2018
dabraude pushed a commit to dabraude/incubator-mxnet that referenced this pull request Apr 12, 2018
zheng-da pushed a commit to zheng-da/incubator-mxnet that referenced this pull request Jun 28, 2018
piiswrong pushed a commit that referenced this pull request Jun 28, 2018
* Remove Fermi from cmake (#10486)

* updated R docs (#10473)

* [MXNET-120] Float16 support for distributed training (#10183)

* send as char

* fix bug on pull response, and rowsparse on worker side

* three modes

* default to mode 0 and add support for row sparse

* refactor sparse

* rowsparse numbytes fixes

* WIP tests

* update test sync

* remove prints

* refactoring

* Revert "refactoring"

This reverts commit 05ffa1b.

* undo refactoring to keep PR simple

* add wait to stored in pull default

* lint fixes

* undo static cast for recvblob

* lint fixes

* mode 1 changes

* sparse bug fix dtype

* mshadow default

* remove unused var

* remove debug statements

* clearer variables, reduced multiplication, const vars

* add const for more vars, comments

* comment syntax, code watcher, test default val

* remove unnecessary print in test

* trigger ci

* multi precision mode (debugging race condition)

* working rsp pushes

* finish multiprecision for row sparse

* rename num-bytes

* fix bug due to rename of numbytes, and remove debug logs

* address comments

* add integration test

* trigger ci

* integration test

* integration test

* fix path of script

* update mshadow

* disable f16c for amalgamation

* fix amalgamation build

* trigger ci

* disable f16c for jetson

* Fix rat excludes (#10499)

* MXNET-308 added missing license (#10497)

* refactored example (#10484)

* [MXNET-298] Scala Infer API docs landing page (#10474)

* changed url references from dmlc to apache/incubator-mxnet

* prepping scala landing pages

* infer api info added

* Fix infer storage type (#10507)

* Fix infer_storage_type

* Add test

* Fix lint

* Trigger CI

* [MXNET-306] Add slice_like operator (#10491)

* add slice_like and doc

* pass unittest and lint

* Minor simplifications in ci/build.py (#10496)

* [MXNET-305] Scala tutorial table fix (#10488)

* initial update on setting up scala ide with mxnet

* moving images to web-data project

* updated links to images; added readme for root folder

* scala hello world feature added

* workaround for make transitive error

* fixed systempath

* minor updates

* table fix

* added some spacing

* more spacing

* added ability to set search path for Accelerate library

* [MXNET-311] change test needs a docker with sudo, hence image changed (#10510)

* added new metric

* changed back from branch to upstream

* lint changed

* Fixed typo

* Clarified interpretation

* Changes for variable names

* fixed variable names

* replay the unit tests

* changed comment
XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018
…e#10524)

* Remove Fermi from cmake (apache#10486)

* updated R docs (apache#10473)

* [MXNET-120] Float16 support for distributed training (apache#10183)

* send as char

* fix bug on pull response, and rowsparse on worker side

* three modes

* default to mode 0 and add support for row sparse

* refactor sparse

* rowsparse numbytes fixes

* WIP tests

* update test sync

* remove prints

* refactoring

* Revert "refactoring"

This reverts commit 05ffa1b.

* undo refactoring to keep PR simple

* add wait to stored in pull default

* lint fixes

* undo static cast for recvblob

* lint fixes

* mode 1 changes

* sparse bug fix dtype

* mshadow default

* remove unused var

* remove debug statements

* clearer variables, reduced multiplication, const vars

* add const for more vars, comments

* comment syntax, code watcher, test default val

* remove unnecessary print in test

* trigger ci

* multi precision mode (debugging race condition)

* working rsp pushes

* finish multiprecision for row sparse

* rename num-bytes

* fix bug due to rename of numbytes, and remove debug logs

* address comments

* add integration test

* trigger ci

* integration test

* integration test

* fix path of script

* update mshadow

* disable f16c for amalgamation

* fix amalgamation build

* trigger ci

* disable f16c for jetson

* Fix rat excludes (apache#10499)

* MXNET-308 added missing license (apache#10497)

* refactored example (apache#10484)

* [MXNET-298] Scala Infer API docs landing page (apache#10474)

* changed url references from dmlc to apache/incubator-mxnet

* prepping scala landing pages

* infer api info added

* Fix infer storage type (apache#10507)

* Fix infer_storage_type

* Add test

* Fix lint

* Trigger CI

* [MXNET-306] Add slice_like operator (apache#10491)

* add slice_like and doc

* pass unittest and lint

* Minor simplifications in ci/build.py (apache#10496)

* [MXNET-305] Scala tutorial table fix (apache#10488)

* initial update on setting up scala ide with mxnet

* moving images to web-data project

* updated links to images; added readme for root folder

* scala hello world feature added

* workaround for make transitive error

* fixed systempath

* minor updates

* table fix

* added some spacing

* more spacing

* added ability to set search path for Accelerate library

* [MXNET-311] change test needs a docker with sudo, hence image changed (apache#10510)

* added new metric

* changed back from branch to upstream

* lint changed

* Fixed typo

* Clarified interpretation

* Changes for variable names

* fixed variable names

* replay the unit tests

* changed comment
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants