Skip to content

Fix things, prepare for v0.2#85

Merged
jlamypoirier merged 3 commits intomainfrom
prepare_for_v0_2
Dec 6, 2024
Merged

Fix things, prepare for v0.2#85
jlamypoirier merged 3 commits intomainfrom
prepare_for_v0_2

Conversation

@jlamypoirier
Copy link
Collaborator

@jlamypoirier jlamypoirier commented Dec 4, 2024

✨ Description

  • [Breaking] Drop backward compatibility for configs, except for model configs (so we keep checkpoint compatibility)
  • [Breaking] Remove backward compatibility for old state dict checkpoints (we don't have any of these anyway)
  • Postpone remaining v0.2 todos to v0.3
  • Fix random dataset when the experiment directory is None
  • Fix tensor-parallel backup attention
  • Fix nan reporting in check_parallel_match
  • Fix debug_tensor_parallel for sequence-tensor-parallel
  • Fix presents wrongly set for the last micro-batch with sequence-data-parallel, causing a crash ([bug] Crash with sequence-data-parallel #57)
  • Add tensor_and_sequence_data distributed dim so we can reconstruct micro-sequences properly (needed for test below)
  • Add test for sequence-data-parallel.
  • Explicitly shutdown process groups to prevent a warning
  • Remove some cross-file test dependencies

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier requested a review from tscholak December 4, 2024 01:05
@tscholak tscholak requested a review from charlesGE December 4, 2024 01:53
return metadata
path = config.path / f"metadata.yaml"
logger.warning(f"Loading metadata from {path}")
return CheckpointMetadata.from_dict(yaml.safe_load(path.open("r")))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@charlesGE can you confirm that we don't need this anymore please?

@pytest.mark.depends(on=["test_model_sf"])
def test_model_sdp2():
# Sequence-data-parallel
run_test_script(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator

@tscholak tscholak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'd like to defer to @charlesGE who should try to work with ghcr.io/servicenow/fast-llm:sha-85dfeea and see if our experiments can be continued with it.

@jlamypoirier jlamypoirier merged commit 82189e1 into main Dec 6, 2024
@jlamypoirier jlamypoirier deleted the prepare_for_v0_2 branch December 6, 2024 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] Nans and/or desync for sequence-tensor-parallel. [bug] Crash with sequence-data-parallel

2 participants