feat: add log exporting to e2e tests #308

Merged
merged 1 commit into from
Nov 13, 2024

Conversation

RobotSail
Member

@RobotSail RobotSail commented Oct 25, 2024

Currently, the training library runs through a series of end-to-end tests which check that the
code under test runs without errors. However, we do not perform any form of validation to ensure that
the training logic and quality have not degraded.

This presents an issue where we can be "correct" in the sense that no hard errors are hit,
while invisible bugs are introduced that cause models to regress in training quality, or that
let other defects affecting the models themselves seep in.

This commit addresses that problem by introducing the ability to export the training loss data
from the test and render the loss curve using matplotlib.
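
As a rough illustration of the idea (not the code in this PR), the sketch below reads loss values from a JSON Lines training log and renders them with matplotlib. The log layout, the `total_loss` key, and the helper names `read_losses` and `render_loss_curve` are assumptions made for the example.

```python
import json
from pathlib import Path


def read_losses(log_path, key="total_loss"):
    """Collect loss values from a JSONL log, skipping lines without `key`.

    Both the JSONL layout and the default key name are assumptions.
    """
    losses = []
    for line in Path(log_path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record:
            losses.append(float(record[key]))
    return losses


def render_loss_curve(losses, output_path):
    """Plot loss against step index and save the figure as an image."""
    # Imported here so read_losses stays usable without matplotlib installed.
    import matplotlib

    matplotlib.use("Agg")  # headless backend, suitable for CI runners
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.plot(range(len(losses)), losses)
    ax.set_xlabel("step")
    ax.set_ylabel("training loss")
    ax.set_title("Training loss")
    fig.savefig(output_path)
    plt.close(fig)
```

A test job could call `render_loss_curve(read_losses("training_log.jsonl"), "loss.png")` after training finishes; both file names here are placeholders.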

When the results are output, they can be found under the "Summary" tab of a GitHub Actions run.
For example:

[Image: Screenshot 2024-10-25 at 6 18 14 PM]
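
For context, GitHub Actions shows on the run's summary page any Markdown that a step appends to the file named by the `GITHUB_STEP_SUMMARY` environment variable. The sketch below assumes that mechanism; whether this PR's workflows use it exactly this way is not shown here, and the helper names and report layout are hypothetical.

```python
import os


def summarize_losses(losses):
    """Build a small Markdown report for a list of loss values."""
    if not losses:
        return "### Training loss\n\nNo loss values were recorded.\n"
    rows = [
        ("steps", str(len(losses))),
        ("first loss", f"{losses[0]:.4f}"),
        ("final loss", f"{losses[-1]:.4f}"),
        ("minimum loss", f"{min(losses):.4f}"),
    ]
    lines = ["### Training loss", "", "| metric | value |", "| --- | --- |"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines) + "\n"


def append_to_job_summary(markdown):
    """Append Markdown to the job summary; return False outside GitHub Actions."""
    summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not summary_path:
        return False
    with open(summary_path, "a", encoding="utf-8") as handle:
        handle.write(markdown)
    return True
```

Appending rather than overwriting keeps anything earlier steps have already written to the summary.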

Resolves #179

Signed-off-by: Oleg S <97077423+RobotSail@users.noreply.github.com>

@mergify mergify bot added the CI/CD, ci-failure, and dependencies labels Oct 25, 2024
@RobotSail RobotSail force-pushed the official-loss-printing branch from 4742edf to 8f77076 Compare October 25, 2024 22:11
@mergify mergify bot added ci-failure and removed ci-failure labels Oct 25, 2024
@RobotSail RobotSail force-pushed the official-loss-printing branch from 8f77076 to 00e0231 Compare October 25, 2024 22:13
@mergify mergify bot removed the ci-failure label Oct 25, 2024
@RobotSail RobotSail force-pushed the official-loss-printing branch from 00e0231 to 4d3e3a7 Compare October 25, 2024 22:17
@RobotSail RobotSail force-pushed the official-loss-printing branch from 4d3e3a7 to 82d5711 Compare October 26, 2024 17:25
Contributor

@JamesKunstle JamesKunstle left a comment

lgtm!

@mergify mergify bot added the one-approval label Oct 28, 2024
@RobotSail RobotSail force-pushed the official-loss-printing branch 2 times, most recently from 1fd7c48 to 387828b Compare November 6, 2024 14:23
@RobotSail RobotSail force-pushed the official-loss-printing branch from 387828b to 039b743 Compare November 13, 2024 14:43
@RobotSail
Member Author

@nathan-weinberg I've updated the CI scripts with your feedback. Please take another pass when you get a chance and make sure that we didn't miss anything.

Member

@nathan-weinberg nathan-weinberg left a comment

I'd like the version number commenting to be consistent with how it is everywhere else, but otherwise LGTM

@mergify mergify bot removed the one-approval label Nov 13, 2024
@nathan-weinberg
Member

Can we squash commits before merging? Great work on this, @RobotSail. Excited to see it in action!

@RobotSail RobotSail force-pushed the official-loss-printing branch from ab6151d to c809c73 Compare November 13, 2024 19:26
@RobotSail
Member Author

@nathan-weinberg This has been squashed. I'll remove the hold since that was the only issue.

@RobotSail RobotSail removed the hold label Nov 13, 2024
@mergify mergify bot merged commit ff36e64 into instructlab:main Nov 13, 2024
14 checks passed
Labels
CI/CD (Affects CI/CD configuration), dependencies (Pull requests that update a dependency file)
Development

Successfully merging this pull request may close these issues.

Include loss curve in E2E tests
4 participants