This repository was archived by the owner on Jun 6, 2024. It is now read-only.
2020 June~July Release #4575
Closed
Description
This plan captures the work in June 2020. This is a 6-week iteration with a major focus on system features and improvements, and we will ship it at the beginning of July.
We'll continue the work we started in the previous iteration and start to pick up items from our roadmap.
Release Manager
Endgame
Code freeze: 7.3
Scrum demo date: 7.9
Endgame & retrospective date: 7.20
Plan Items
Below is a summary of the top level plan items.
- UI improvement to support end-to-end marketplace needs, e.g. COVID-19 research site, AI EDU market, etc. @debuggy @yangou1988 @TobeyQin (TODO: @debuggy book a prototype review on 6/28) 2 weeks for dev. Refine team storage description [for marketplace] #4671 TestOwner: @TobeyQin @qfyin
- Set up latency benchmark report for 100 virtual nodes --> 1500 virtual nodes * 8 job @ydye Almost done, waiting for automation script. Based on Ansible. TestOwner: @Binyang2014
  - Configuration generation role. (virtual-kubelet + openpai + ssl cert)
  - Environment setup role. (virtual-kubelet + openpai + locust)
  - [ ] Auto stress test script and stress test role. (locust)
- GPU Fairness Usage (user gpu usage script improvement) @Binyang2014 GPU fairness usage #4266
- Storage management experience: i.e. read-only storage @abuccts TestOwner: @hzy46 Read-only storage's permission is still RW #4688
- Disable HTTP, use HTTPS instead @abuccts TestOwner: @yiyione TestDone
- resolve always retry for port conflict: Change to use Hash(podUid, portName, portIndex) to calculate the port number, avoid always retry #4384 @Binyang2014 TestOwner: @abuccts TestDone
- Support nested AD group in AAD Mode Get Recursive Nested AD Users in AAD Mode #3440 @ydye TestOwner: @scarlett2018 TestDone
- Code
- Test
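The port-conflict item above (#4384) derives the port number from Hash(podUid, portName, portIndex). A minimal sketch of that idea follows. The hash function (SHA-256), the modulo mapping, the port window, and the name `hashed_port` are all assumptions made for illustration; the plan does not specify them, and this is not the project's actual code.

```python
import hashlib

# Assumed port window for illustration; the plan does not state the real range.
PORT_RANGE_START = 20000
PORT_RANGE_SIZE = 10000


def hashed_port(pod_uid: str, port_name: str, port_index: int) -> int:
    """Derive a port from Hash(podUid, portName, portIndex).

    SHA-256 and the modulo mapping are assumptions of this sketch.
    """
    key = f"{pod_uid}:{port_name}:{port_index}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes as an unsigned integer and fold it into the window.
    offset = int.from_bytes(digest[:8], "big") % PORT_RANGE_SIZE
    return PORT_RANGE_START + offset


if __name__ == "__main__":
    # Different pod UIDs generally map to different ports for the same port name.
    for uid in ("pod-a", "pod-b"):
        print(uid, [hashed_port(uid, "http", i) for i in range(3)])
```

Because the pod UID is part of the key, a retried pod with a new UID would land on a different candidate port, which is presumably how the change avoids hitting the same conflict on every retry.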
Engineering Improvements
- Engineering Excellence
- Add SDK to rest-server CI @yiyione PR: Add JS SDK tests to CI #4631
- SDK api tests in Azure Pipelines (e.g. int bed tests)
- Webportal/VScode use SDK + SDK improvement @yiyione
- submit-job-v2 plugin (clone & submit jobs) PR: Use JS SDK in submit-job-v2 plugin #4613 TestOwner: @debuggy
- webportal job related API TestOwner: @debuggy
- Storage API in webportal TestOwner: @hzy46 TestDone
- JS SDK update. PR
- Webportal update. PR: [webportal] Update storage API using JS SDK #4660
- VSCode convert old SDK code into 0.1.0 PR TestOwner: @yiyione @qfyin
- Code coverage @abuccts
- Delete YARN code [Rest Server] Remove YARN code #4635
- VSCode code coverage @yiyione
- Add coverage report to CI
- JS SDK code coverage @yiyione
- Add coverage report to CI PR
- Add more tests
Bug fixes
- Align webportal submit defaulting with backend defaulting. Resolves Align webportal submit defaulting with backend defaulting #4691 TestOwner: @yqwang-ms TestDone
- For tensorboard v2 the logdir is not right Tensorboard logdir not right #4618 TestOwner: @hzy46 TestDone
- Jobs mounted to the same AzureBlob cannot run on the same host Jobs which mounted to the same Azure-Blob can not run on the same node #4637 @abuccts TestOwner: @Binyang2014 TestDone
- SSH barrier bug fix @Binyang2014 TestOwner: @abuccts TestDone
- WebPortal submit job help link broken. Resolves WebPortal submit job help link broken #4690
- When job/task not completed, its completion time on WebPortal and RestServer is 0 (1970/1/1), confusing #4711
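The completion-time item above (#4711) arises because a timestamp of 0 renders as the Unix epoch, 1970/1/1. The sketch below shows one way a client could guard against that. It assumes the timestamp is in milliseconds since the epoch and that 0 or a missing value means "not completed"; the function name `format_completion_time` is hypothetical, and this is not the fix that was actually shipped.

```python
from datetime import datetime, timezone
from typing import Optional


def format_completion_time(completed_ms: Optional[int]) -> str:
    """Render a completion timestamp given in milliseconds since the epoch.

    Treating None and 0 as "not completed" is an assumption of this sketch.
    """
    if not completed_ms:
        # Avoid showing 1970-01-01 for a job or task that is still running.
        return "N/A"
    moment = datetime.fromtimestamp(completed_ms / 1000, tz=timezone.utc)
    return moment.isoformat()


if __name__ == "__main__":
    print(format_completion_time(0))
    print(format_completion_time(1593561600000))
```

With this guard, an unfinished job shows "N/A" instead of a date in 1970, while real completion times are formatted as UTC ISO 8601 strings.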
Doc
- Doc enhancement. Issue: Document Enhancement #4686 PR: Document Enhance #4700
Pending Items (moved to next iteration)
- The defaulting should only be done after submission phase #4576 @debuggy
- Surface more backend errors to users in the Job Details page (@qfyin, @yqwang-ms) Enrich job debugging info #4649
- @yqwang-ms listed the top potential errors in OneNote
- Show "More Diagnostics" in job detail page Add more task debug info in jobInfo api response [rest-server] #4667 Add task debug info in job detail page [webportal] #4670 @debuggy TestOwner: @yqwang-ms TestDone
- Also show "More Diagnostics" in job retry history page #4689 @debuggy
- Move "More Diagnostics" button to job level and disable its scroll button #4676 @debuggy
- Show events in job detail and history page Enrich job debugging info #4649 @debuggy
- List the potential scenarios and possible display experiences (i.e. advanced diagnostic mode) @debuggy
- Quick start for Azure, and deploy CNI issue @ydye
- GPU utilization: improve training cost ROI for EDU scenario @hzy46 (TODO: book a scenario vs. design meeting) Use alert-manager to alert/kill low GPU efficiency job #4623
- Webportal use new storage API (through SDK) + support DShuttle type @qfyin / @yiyione Webportal use new storage API and support DShuttle type #4605
- DShuttle integration and internal deploy & usage @Binyang2014 (depending on the above item; TODO: book an end-to-end user experience review meeting) Dshuttle integration Plan #4599
- New RestServer Arch: RestServer -> DB -> ApiServer @hzy46 /@yqwang-ms New RestServer Architecture: RestServer -> DB -> ApiServer #4651
- open bugs: bug
Deferred Items
N/A