feat: add log exporting to e2e tests #308

RobotSail · 2024-10-25T22:08:27Z

Currently, the training library runs through a series of end-to-end tests which ensure there are
no bugs in the code being tested. However, we do not perform any form of validation to ensure that
the training logic and quality have not diminished.

This presents an issue where we can be "correct" in the sense that no hard errors are hit,
while invisible bugs are introduced that cause models to regress in training quality, or other
bugs that affect the models themselves seep in.

This commit fixes that problem by introducing the ability to export the training loss data itself
from the test and render the loss curve using matplotlib.
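
As a rough illustration of the idea, not the code in this PR, the sketch below reads loss values from a log and draws the curve with matplotlib. It assumes the log is a JSON Lines file in which each record carries a `total_loss` field; that field name, the function names `read_losses` and `plot_losses`, and the output file name are all hypothetical.

```python
import json
from pathlib import Path


def read_losses(log_path, key="total_loss"):
    """Collect loss values from a JSON Lines log.

    Assumption: each line is a JSON object and the loss is stored under
    `key`. Blank lines, non-JSON lines, and records without `key` are skipped.
    """
    losses = []
    for line in Path(log_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output in the log
        if isinstance(record, dict) and key in record:
            losses.append(float(record[key]))
    return losses


def plot_losses(losses, out_path="loss.png"):
    """Render the loss curve to an image file and return its path."""
    # Imported here so that read_losses works without matplotlib installed.
    import matplotlib

    matplotlib.use("Agg")  # headless backend, suitable for CI runners
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(1, len(losses) + 1), losses)
    ax.set_xlabel("step")
    ax.set_ylabel("loss")
    ax.set_title("Training loss")
    fig.savefig(out_path)
    plt.close(fig)
    return out_path
```

A CI step could then call `plot_losses(read_losses("training_log.jsonl"))`, where the log path is again a placeholder.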

When the results are output, they can be found under the "Summary" tab of a GitHub Actions run.
For example:
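
To illustrate the mechanism rather than the exact code in this PR: GitHub Actions shows on a run's Summary page any Markdown that a step appends to the file named by the `GITHUB_STEP_SUMMARY` environment variable. The sketch below assumes a hypothetical helper name, `append_step_summary`, and does nothing when that variable is unset.

```python
import os


def append_step_summary(markdown, env=os.environ):
    """Append Markdown to the job summary file, if GitHub Actions provides one.

    Returns True when the text was written, False when GITHUB_STEP_SUMMARY
    is not set (for example, when running outside of GitHub Actions).
    """
    summary_path = env.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return False
    with open(summary_path, "a", encoding="utf-8") as f:
        # Ensure the entry ends with exactly one newline.
        f.write(markdown.rstrip("\n") + "\n")
    return True
```

A call such as `append_step_summary("### Training loss\n\n![loss curve](https://example.com/loss.png)")` would add a heading and an image reference. The URL is a placeholder, since the summary can only display an image that is reachable at some URL.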

Resolves #179

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>

JamesKunstle

lgtm!

RobotSail · 2024-11-13T14:47:04Z

@nathan-weinberg I've updated the CI scripts with your feedback. Please take another pass when you get a chance and make sure that we didn't miss anything.

nathan-weinberg

I'd like the version number commenting to be consistent with how it is everywhere else, but otherwise LGTM

nathan-weinberg · 2024-11-13T17:34:56Z

Can we squash commits before merging? Great work on this, @RobotSail. Excited to see it in action!

RobotSail · 2024-11-13T19:26:47Z

@nathan-weinberg This has been squashed, I'll remove the hold since that's the only issue.

mergify bot added the CI/CD, ci-failure, and dependencies labels Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 4742edf to 8f77076 October 25, 2024 22:11

mergify bot added ci-failure and removed ci-failure labels Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 8f77076 to 00e0231 October 25, 2024 22:13

mergify bot removed the ci-failure label Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 00e0231 to 4d3e3a7 October 25, 2024 22:17

RobotSail requested review from danmcp, Maxusmusti, JamesKunstle, aldopareja, nathan-weinberg and cdoern October 25, 2024 22:18

danmcp requested changes Oct 25, 2024

RobotSail force-pushed the official-loss-printing branch from 4d3e3a7 to 82d5711 October 26, 2024 17:25

danmcp reviewed Oct 26, 2024

JamesKunstle approved these changes Oct 28, 2024

mergify bot added the one-approval label Oct 28, 2024

nathan-weinberg requested changes Oct 29, 2024

RobotSail force-pushed the official-loss-printing branch 2 times, most recently from 1fd7c48 to 387828b November 6, 2024 14:23

RobotSail force-pushed the official-loss-printing branch from 387828b to 039b743 November 13, 2024 14:43

nathan-weinberg approved these changes Nov 13, 2024

mergify bot removed the one-approval label Nov 13, 2024

danmcp approved these changes Nov 13, 2024

nathan-weinberg added the hold label Nov 13, 2024

RobotSail force-pushed the official-loss-printing branch from ab6151d to c809c73 November 13, 2024 19:26

RobotSail removed the hold label Nov 13, 2024

nathan-weinberg removed request for aldopareja, Maxusmusti and cdoern November 13, 2024 19:53

mergify bot merged commit ff36e64 into instructlab:main Nov 13, 2024
14 checks passed
