Fix things, prepare for v0.2 #85

Merged
merged 3 commits into main from prepare_for_v0_2 on Dec 6, 2024
Conversation

@jlamypoirier jlamypoirier commented Dec 4, 2024

✨ Description

  • [Breaking] Drop backward compatibility for configs, except for model configs (so we keep checkpoint compatibility)
  • [Breaking] Remove backward compatibility for old state dict checkpoints (we don't have any of these anyway)
  • Postpone remaining v0.2 todos to v0.3
  • Fix random dataset when the experiment directory is None
  • Fix tensor-parallel backup attention
  • Fix NaN reporting in check_parallel_match (see the NaN-aware comparison sketch after this list)
  • Fix debug_tensor_parallel for sequence-tensor-parallel
  • Fix presents wrongly set for the last micro-batch with sequence-data-parallel, causing a crash ([bug] Crash with sequence-data-parallel #57)
  • Add tensor_and_sequence_data distributed dim so we can reconstruct micro-sequences properly (needed for test below)
  • Add test for sequence-data-parallel.
  • Explicitly shut down process groups to prevent a warning (see the shutdown sketch after this list)
  • Remove some cross-file test dependencies
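
As a rough illustration of the NaN-reporting fix, here's a minimal sketch of a NaN-aware tensor comparison. The helper name `check_match_with_nans` and the log format are hypothetical, not the actual `check_parallel_match` implementation:

```python
import logging

import torch

logger = logging.getLogger(__name__)


def check_match_with_nans(local: torch.Tensor, reference: torch.Tensor) -> bool:
    # Hypothetical sketch: count and report NaNs instead of letting them
    # silently fail the comparison (NaN != NaN under plain allclose).
    local_nans = torch.isnan(local).sum().item()
    reference_nans = torch.isnan(reference).sum().item()
    if local_nans or reference_nans:
        logger.warning(f"NaNs detected: local={local_nans}, reference={reference_nans}")
    # equal_nan=True treats NaNs at matching positions as equal.
    return torch.allclose(local, reference, equal_nan=True)
```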
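
And the explicit process-group shutdown could look like this sketch, assuming `torch.distributed` is in use; the helper name is hypothetical:

```python
import torch.distributed as dist


def shutdown_process_groups() -> None:
    # Hypothetical sketch: destroying the default process group before exit
    # avoids the warning PyTorch emits when groups are still alive at
    # interpreter shutdown.
    if dist.is_initialized():
        dist.destroy_process_group()
```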

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier requested a review from tscholak December 4, 2024 01:05
@tscholak tscholak requested a review from charlesGE December 4, 2024 01:53
```python
        return metadata
    path = config.path / "metadata.yaml"
    logger.warning(f"Loading metadata from {path}")
    return CheckpointMetadata.from_dict(yaml.safe_load(path.open("r")))
```
@charlesGE can you confirm that we don't need this anymore please?

```python
@pytest.mark.depends(on=["test_model_sf"])
def test_model_sdp2():
    # Sequence-data-parallel
    run_test_script(
```
+1

@tscholak tscholak left a comment

LGTM, but I'd like to defer to @charlesGE who should try to work with ghcr.io/servicenow/fast-llm:sha-85dfeea and see if our experiments can be continued with it.

@jlamypoirier jlamypoirier merged commit 82189e1 into main Dec 6, 2024
4 checks passed
@jlamypoirier jlamypoirier deleted the prepare_for_v0_2 branch December 6, 2024 21:57
Successfully merging this pull request may close these issues:

  • [bug] Nans and/or desync for sequence-tensor-parallel
  • [bug] Crash with sequence-data-parallel #57