Abra merge test #2870

Merged
merged 73 commits on Jan 17, 2024
Changes from all commits
73 commits
1dccb14
Bump torch to 2.1.1 version (#2717)
j316chuck Nov 30, 2023
b11f7b6
Add more info when run doesnt complete (#2751)
aspfohl Dec 1, 2023
5d20db1
Lower sequence generation length on code gen to be dependent on max c…
bmosaicml Dec 4, 2023
8957550
Remove flatten params (#2761)
mvpatel2000 Dec 7, 2023
1a8a664
fix lint (#2767)
mvpatel2000 Dec 7, 2023
e87c06d
lint (#2768)
mvpatel2000 Dec 7, 2023
7f55b7a
Use time.tokens for speedmonitor instead of dataset length (#2762)
mvpatel2000 Dec 7, 2023
cb8f937
remove exception (#2759)
mvpatel2000 Dec 8, 2023
f097fd7
time to clean up time parsing 😉 (#2770)
aspfohl Dec 9, 2023
236b738
Upgrade RunConfig compute specification (#2772)
aspfohl Dec 11, 2023
39d6df4
Use async logging in MLflowLogger (#2693)
chenmoneygithub Dec 11, 2023
c04405e
Fix FSDP _param_init_fn to not reinit parameters multiple times (#2765)
dakinggg Dec 11, 2023
bc50049
Gate FSDP param init test on torch 2.1 (#2774)
dakinggg Dec 11, 2023
aad8901
Parallelize OCI multipart download (#2750)
coryMosaicML Dec 12, 2023
f497e60
[UCVolumes] Add support for list API (#2769)
panchalhp-db Dec 12, 2023
a7cad7c
Add the memory timeline profiling support through the PyTorch profile…
cli99 Dec 12, 2023
db3d187
Improve torch memory profiling arguments processing (#2777)
cli99 Dec 13, 2023
0d61164
Add platform AWS and bump aws ofi nccl version (#2776)
willgleich Dec 13, 2023
776d172
Extend checkpoint loading to accept a validation function (#2726)
irenedea Dec 14, 2023
09f4580
Fix checkpoint validation tests for torch 1.13 (#2779)
irenedea Dec 14, 2023
7e0e40a
Bump version to 0.17.2 (#2780)
mvpatel2000 Dec 14, 2023
45bb135
bump transformers version (#2781)
dakinggg Dec 15, 2023
84059b6
Bump sphinxext-opengraph from 0.9.0 to 0.9.1 (#2784)
dependabot[bot] Dec 18, 2023
15324c7
Bump coverage[toml] from 7.3.0 to 7.3.3 (#2783)
dependabot[bot] Dec 18, 2023
b8363bb
Update torch requirement from <2.1.2,>=1.13.1 to >=1.13.1,<2.1.3 (#2785)
dependabot[bot] Dec 18, 2023
af8797d
[UCVolumes] Rely on databricks-sdk auth for the right requirements (#…
panchalhp-db Dec 19, 2023
420cb07
Enable system metrics in mosaic mlflow logger (#2775)
chenmoneygithub Dec 19, 2023
f24f43c
Update parse_uri (#2787)
irenedea Dec 20, 2023
96df92d
default-no-memory-timeline (#2790)
cli99 Dec 20, 2023
a8a261b
Add eot token to ICL generate kwargs (#2782)
bmosaicml Dec 20, 2023
ff145d3
Add nightly image for torch 2.2.0 12-20-23 (#2791)
j316chuck Dec 21, 2023
a3ea7a4
Add torch nightly 12-13 (#2792)
j316chuck Dec 21, 2023
2aa50e7
Add process group as arg to FSDP (#2794)
mvpatel2000 Dec 26, 2023
910223e
Bump coverage[toml] from 7.3.3 to 7.3.4 (#2798)
dependabot[bot] Dec 28, 2023
db424e5
Fix load_ignore_keys with rng (#2803)
mvpatel2000 Jan 2, 2024
070095e
Bump ipykernel from 6.26.0 to 6.28.0 (#2806)
dependabot[bot] Jan 2, 2024
e274ca0
Bump junitparser from 3.1.0 to 3.1.1 (#2805)
dependabot[bot] Jan 2, 2024
9110c57
Bump pytest from 7.4.3 to 7.4.4 (#2807)
dependabot[bot] Jan 2, 2024
ed4e07c
Avoid futures on close for MosaicML logger (#2804)
mvpatel2000 Jan 2, 2024
ee7cb69
check (#2812)
mvpatel2000 Jan 2, 2024
52ac18c
Better communication computation overlap (#2811)
snarayan21 Jan 2, 2024
80b35a7
Improve error message for speed monitor (#2801)
mvpatel2000 Jan 4, 2024
f6ca956
bump torch version (#2814)
mvpatel2000 Jan 4, 2024
4af5076
bump vision (#2815)
mvpatel2000 Jan 4, 2024
206a9ea
fix rng load (#2816)
mvpatel2000 Jan 4, 2024
e5240d2
Correct multi-unshard stream patching for torch 2.2.0dev, and stream …
snarayan21 Jan 4, 2024
2deccf2
fix profiler (#2818)
mvpatel2000 Jan 4, 2024
5592e41
Bump traitlets from 5.13.0 to 5.14.1 (#2822)
dependabot[bot] Jan 8, 2024
c22c61a
All unshard streams wait on computation every step (#2823)
snarayan21 Jan 8, 2024
23bc6fb
Add encoding=utf-8 (#2824)
dakinggg Jan 8, 2024
a36fb74
Fix import for daily test (#2826)
snarayan21 Jan 8, 2024
f50dcaf
[MLFlowObjectStore] [1/2] Base implementation for MLFlowObjectStore (…
jerrychen109 Jan 8, 2024
0aa95e0
Remove fused layernorm (already deprecated for 2 versions) (#2827)
mvpatel2000 Jan 9, 2024
737b462
checkpoint saver tracks all checkpoints/intervals in state (#2819)
aspfohl Jan 9, 2024
c82dcc4
code-quality timeout update (#2830)
aspfohl Jan 9, 2024
7b70dde
[S] Fix how single value tensors are logged (#2831)
aspfohl Jan 9, 2024
94e0386
Adds DTensor Support (#2821)
mvpatel2000 Jan 9, 2024
eb4fbd0
Remove duplicate checkpoint verifications (#2828)
eracah Jan 10, 2024
c48e6fe
Fix seed for FSDP wrap (#2833)
mvpatel2000 Jan 10, 2024
6c63f2e
Remove fsdp patch for comm overlap (#2836)
mvpatel2000 Jan 11, 2024
83fb295
allow hsdp (#2838)
mvpatel2000 Jan 11, 2024
55341aa
Bump torch 2.1.2 (#2840)
mvpatel2000 Jan 12, 2024
2ff7c27
Upgrade pyright to 1.1.310 (#2841)
b-chu Jan 12, 2024
56fa4bd
[MLFlowObjectStore] [2/2] Support checkpointing with MLFlow (#2810)
jerrychen109 Jan 12, 2024
c9f0c21
update nightly to torch 2.3 (#2842)
j316chuck Jan 13, 2024
c19fd36
Pin sphinxcontrib applehelp (#2854)
mvpatel2000 Jan 13, 2024
027c3d0
Update setup.py (#2855)
j316chuck Jan 13, 2024
a2ae299
Torch 2.3 patch (#2849)
dakinggg Jan 14, 2024
abff6de
Update mosaicml-cli requirement from <0.6,>=0.5.25 to >=0.5.25,<0.7 (…
dependabot[bot] Jan 15, 2024
d497d8f
Rewrite to use individual state functions (#2860)
mvpatel2000 Jan 15, 2024
1bc8d0a
Add custom stopping criteria to ICL generate tasks (#2800)
bmosaicml Jan 15, 2024
31ea664
Add save_ignore_keys (#2868)
mvpatel2000 Jan 16, 2024
f5978a8
fix conflicts
cli99 Jan 16, 2024
28 changes: 19 additions & 9 deletions .github/mcli/mcli_pytest.py
@@ -6,7 +6,7 @@
import argparse
import time

from mcli import RunConfig, RunStatus, create_run, follow_run_logs, stop_run, wait_for_run_status
from mcli import RunConfig, RunStatus, create_run, follow_run_logs, wait_for_run_status

if __name__ == '__main__':

@@ -67,8 +67,6 @@

export COMMON_ARGS="-v --durations=20 -m '{args.pytest_markers}' {s3_bucket_flag} {clear_tmp_path_flag}"

export PYTHONUNBUFFERED=1

make test PYTEST='{args.pytest_command}' EXTRA_ARGS="$COMMON_ARGS --codeblocks"

make test-dist PYTEST='{args.pytest_command}' EXTRA_ARGS="$COMMON_ARGS" WORLD_SIZE=2
@@ -79,13 +77,25 @@
'''
config = RunConfig(
name=name,
cluster=args.cluster,
gpu_type=args.gpu_type,
gpu_num=args.gpu_num,
compute={
'cluster': args.cluster,
'gpu_type': args.gpu_type,
'gpus': args.gpu_num
},
image=args.image,
integrations=[git_integration],
command=command,
scheduling={'max_duration': args.timeout / 60 / 60},
env_variables=[
{
'key': 'MOSAICML_PLATFORM',
'value': 'False',
},
{
'key': 'PYTHONUNBUFFERED',
'value': '1',
},
],
)

# Create run
@@ -102,7 +112,7 @@
print(line, end='')

print('[GHA] Run completed. Waiting for run to finish...')
run = wait_for_run_status(run, status='completed')
run = wait_for_run_status(run, status=RunStatus.COMPLETED)

# Fail if command exited with non-zero exit code or timed out
assert run.status == RunStatus.COMPLETED
# Fail if command exited with non-zero exit code or timed out (didn't reach COMPLETED)
assert run.status == RunStatus.COMPLETED, f'Run {run.name} did not complete: {run.status} ({run.reason})'
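The `mcli_pytest.py` hunk above moves the cluster and GPU settings into a `compute` dict and the environment variables into `env_variables` (the RunConfig upgrade from #2772). A rough stand-alone sketch of the new shape — plain dicts stand in for the real `mcli.RunConfig`, and the cluster/image values are made up for illustration:

```python
# Sketch of the post-#2772 RunConfig layout: cluster/gpu_type/gpu_num become a
# `compute` dict, and PYTHONUNBUFFERED moves from the shell script into
# `env_variables`. Values below are illustrative, not real cluster names.

def build_run_config_kwargs(cluster, gpu_type, gpu_num, image, command, timeout_secs):
    """Assemble kwargs in the new nested shape shown in the diff above."""
    return {
        'name': 'mcli-pytest',
        'compute': {'cluster': cluster, 'gpu_type': gpu_type, 'gpus': gpu_num},
        'image': image,
        'command': command,
        'scheduling': {'max_duration': timeout_secs / 60 / 60},  # seconds -> hours
        'env_variables': [
            {'key': 'MOSAICML_PLATFORM', 'value': 'False'},
            {'key': 'PYTHONUNBUFFERED', 'value': '1'},
        ],
    }

kwargs = build_run_config_kwargs('r1z1', 'a100_40gb', 8, 'mosaicml/pytorch', 'make test', 7200)
print(kwargs['compute']['gpus'])             # 8
print(kwargs['scheduling']['max_duration'])  # 2.0
```

The same hunk also replaces the string status `'completed'` with the `RunStatus.COMPLETED` enum member in `wait_for_run_status`, and attaches a reason to the final assertion so a failed run reports why.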
2 changes: 1 addition & 1 deletion .github/workflows/code-quality.yaml
@@ -18,7 +18,7 @@ defaults:
jobs:
code-quality:
runs-on: ubuntu-20.04
timeout-minutes: 10
timeout-minutes: 15
strategy:
matrix:
python_version:
5 changes: 5 additions & 0 deletions .github/workflows/pr-cpu.yaml
@@ -27,6 +27,11 @@ jobs:
markers: 'not daily and not remote and not gpu and not vision and not doctest'
pytest_command: 'coverage run -m pytest'
composer_package_name: 'mosaicml'
# - name: 'cpu-3.10-2.2'
# container: mosaicml/pytorch:2.2.0_cu121-nightly20231213-python3.10-ubuntu20.04
# markers: 'not daily and not remote and not gpu and not vision and not doctest'
# pytest_command: 'coverage run -m pytest'
# composer_package_name: 'mosaicml'
- name: 'cpu-vision'
container: mosaicml/pytorch_vision:1.13.1_cpu-python3.10-ubuntu20.04
markers: 'not daily and not remote and not gpu and vision and not doctest'
5 changes: 5 additions & 0 deletions .github/workflows/pr-gpu.yaml
@@ -17,6 +17,11 @@ jobs:
markers: 'not daily and not remote and gpu and (doctest or not doctest)'
pytest_command: 'coverage run -m pytest'
composer_package_name: 'mosaicml'
# - name: 'gpu-3.10-2.2'
# container: mosaicml/pytorch:2.2.0_cu121-nightly20231213-python3.10-ubuntu20.04
# markers: 'not daily and not remote and gpu and (doctest or not doctest)'
# pytest_command: 'coverage run -m pytest'
# composer_package_name: 'mosaicml'
name: ${{ matrix.name }}
if: github.repository_owner == 'mosaicml'
with:
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -110,7 +110,7 @@ repos:
types: [python]
pass_filenames: false
args: [--warnings]
additional_dependencies: ["pyright@1.1.256"]
additional_dependencies: ["pyright@1.1.310"]
- repo: https://github.com/trufflesecurity/trufflehog.git
rev: v3.40.0
hooks:
6 changes: 3 additions & 3 deletions CODEOWNERS
@@ -17,11 +17,11 @@
# as an owner for all sections, so anyone on Composer Eng can approve any Composer PR
# According to the CODEOWNER docs, the last match takes precedence, so @mosaicml/composer-team-eng
# must be mentioned for each rule below.
/composer/algorithms/ @dskhudia @mvpatel2000 @nik-mosaic
/composer/algorithms/ @mosaicml/composer-team-eng
/composer/cli/ @mosaicml/composer-team-eng
/composer/datasets/ @mosaicml/composer-team-eng
/composer/functional/ @dblalock @mvpatel2000
/composer/loggers/ @eracah @dakinggg
/composer/functional/ @mosaicml/composer-team-eng @dblalock
/composer/loggers/ @mosaicml/composer-team-eng @eracah @dakinggg
/composer/loss/ @mosaicml/composer-team-eng
/composer/metrics/ @mosaicml/composer-team-eng
/composer/models/ @mosaicml/composer-team-eng
2 changes: 0 additions & 2 deletions composer/algorithms/__init__.py
@@ -46,7 +46,6 @@ def apply(self, state: State, event: Event, logger: Logger):
from composer.algorithms.cutout import CutOut
from composer.algorithms.ema import EMA
from composer.algorithms.factorize import Factorize
from composer.algorithms.fused_layernorm import FusedLayerNorm
from composer.algorithms.gated_linear_units import GatedLinearUnits
from composer.algorithms.ghost_batchnorm import GhostBatchNorm
from composer.algorithms.gradient_clipping import GradientClipping
@@ -79,7 +78,6 @@ def apply(self, state: State, event: Event, logger: Logger):
'CutOut',
'EMA',
'Factorize',
'FusedLayerNorm',
'GatedLinearUnits',
'GhostBatchNorm',
'GradientClipping',
@@ -6,7 +6,8 @@
from composer.utils import MissingConditionalImportError

try:
from composer.algorithms.alibi.attention_surgery_functions import _bert, _gpt2 # pyright: reportUnusedImport=none
from composer.algorithms.alibi.attention_surgery_functions import _bert # pyright: ignore[reportUnusedImport]
from composer.algorithms.alibi.attention_surgery_functions import _gpt2 # pyright: ignore[reportUnusedImport]
from composer.algorithms.alibi.attention_surgery_functions.utils import policy_registry
except ImportError as e:
raise MissingConditionalImportError(extra_deps_group='nlp', conda_package='transformers') from e
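The alibi `__init__.py` hunk above swaps a file-wide pyright directive for per-import ignore comments, which keeps the suppression scoped to exactly the imports that are intentionally "unused". A minimal sketch of the two styles — the stdlib modules here are stand-ins for the `_bert`/`_gpt2` surgery modules:

```python
# Old style (removed in this PR): a file-level directive disables the check
# for the entire file:
#     # pyright: reportUnusedImport=none
# New style: scope the suppression to each individual import.
import json  # pyright: ignore[reportUnusedImport]
import math  # pyright: ignore[reportUnusedImport]

# In the real code these imports run for their side effects (registering
# attention-surgery functions), so pyright would otherwise flag them unused.
print('imports registered')
```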
10 changes: 6 additions & 4 deletions composer/algorithms/alibi/attention_surgery_functions/_bert.py
@@ -1,6 +1,7 @@
# Copyright 2022 MosaicML Composer authors
# SPDX-License-Identifier: Apache-2.0

import copy
import math
from types import MethodType
from typing import Optional, Tuple
@@ -20,13 +21,14 @@ def bert_embedding_converter(module: torch.nn.Module, module_index: int, max_seq
"""
assert isinstance(module, (BertEmbeddings, RobertaEmbeddings))
del module_index # unused
zero_and_freeze_expand_position_embeddings(module,
new_module = copy.deepcopy(module)
zero_and_freeze_expand_position_embeddings(new_module,
max_sequence_length,
position_embedding_attribute='position_embeddings')

module_device = next(module.parameters()).device
module.register_buffer('position_ids', torch.arange(max_sequence_length).expand((1, -1)).to(module_device))
return module
module_device = next(new_module.parameters()).device
new_module.register_buffer('position_ids', torch.arange(max_sequence_length).expand((1, -1)).to(module_device))
return new_module


@policy_registry.register(BertSelfAttention, RobertaSelfAttention)
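The `_bert.py` hunk above changes `bert_embedding_converter` to deep-copy the module, mutate the copy, and return it, instead of editing the caller's module in place. A minimal torch-free sketch of that copy-then-mutate pattern, with a plain class standing in for a `BertEmbeddings` module (names here are illustrative, not Composer's API):

```python
# Copy-then-mutate module surgery: the converter returns a modified clone and
# leaves the original object untouched, mirroring the _bert.py fix above.
import copy

class FakeEmbeddings:
    """Stand-in for a transformers embedding module."""
    def __init__(self, max_len):
        self.position_ids = list(range(max_len))

def embedding_converter(module, max_sequence_length):
    new_module = copy.deepcopy(module)  # never mutate the caller's module
    new_module.position_ids = list(range(max_sequence_length))
    return new_module

orig = FakeEmbeddings(4)
converted = embedding_converter(orig, 8)
print(len(orig.position_ids), len(converted.position_ids))  # 4 8
```

The design point is that surgery functions applied through a registry may be called on shared modules; returning a fresh copy prevents an in-place edit from leaking into other users of the original module.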
15 changes: 10 additions & 5 deletions composer/algorithms/colout/colout.py
@@ -29,10 +29,12 @@
__all__ = ['ColOut', 'ColOutTransform', 'colout_batch']


def colout_batch(sample: Union[ImgT, Tuple[ImgT, ImgT]],
p_row: float = 0.15,
p_col: float = 0.15,
resize_target: Union[bool, str] = 'auto') -> Union[ImgT, Tuple[ImgT, ImgT]]:
def colout_batch(
sample: Union[ImgT, Tuple[ImgT, ImgT]],
p_row: float = 0.15,
p_col: float = 0.15,
resize_target: Union[bool,
str] = 'auto') -> Union[torch.Tensor, ImgT, Tuple[Tensor, Tensor], Tuple[ImgT, ImgT]]:
"""Applies ColOut augmentation to a batch of images and (optionally) targets,
dropping the same random rows and columns from all images and targets in a batch.

@@ -136,7 +138,10 @@ def __init__(self, p_row: float = 0.15, p_col: float = 0.15, resize_target: Unio
self.p_col = p_col
self.resize_target = resize_target

def __call__(self, sample: Union[ImgT, Tuple[ImgT, ImgT]]) -> Union[ImgT, Tuple[ImgT, ImgT]]:
def __call__(
self, sample: Union[ImgT,
Tuple[ImgT,
ImgT]]) -> Union[torch.Tensor, ImgT, Tuple[Tensor, Tensor], Tuple[ImgT, ImgT]]:
"""Drops random rows and columns from up to two images.

Args:
17 changes: 9 additions & 8 deletions composer/algorithms/factorize/factorize_modules.py
@@ -327,8 +327,8 @@ def solution_for_rank(self, input: torch.Tensor, rank: int) -> LowRankSolution:

def apply_solution(self, solution: LowRankSolution):
self.latent_size = solution.rank
self.module0.out_channels = solution.rank
self.module1.in_channels = solution.rank
self.module0.out_channels = solution.rank # pyright: ignore[reportGeneralTypeIssues]
self.module1.in_channels = solution.rank # pyright: ignore[reportGeneralTypeIssues]
_apply_solution_to_module_parameters(solution, self.module0, self.module1, transpose=False)

@staticmethod
@@ -452,8 +452,8 @@ def solution_for_rank(self, input: torch.Tensor, rank: int) -> LowRankSolution:

def apply_solution(self, solution: LowRankSolution) -> None:
self.latent_size = solution.rank
self.module0.out_features = solution.rank
self.module1.in_features = solution.rank
self.module0.out_features = solution.rank # pyright: ignore[reportGeneralTypeIssues]
self.module1.in_features = solution.rank # pyright: ignore[reportGeneralTypeIssues]
_apply_solution_to_module_parameters(solution, self.module0, self.module1, transpose=True)

@staticmethod
@@ -471,9 +471,10 @@ def max_allowed_latent_channels(in_features: int, out_features: int) -> int:

@staticmethod
def from_linear(module: torch.nn.Linear, module_ix: int = -1, **kwargs) -> FactorizedLinear:
ret = FactorizedLinear(in_features=module.in_features,
out_features=module.out_features,
bias=((module.bias is not None) and (module.bias is not False)),
**kwargs)
ret = FactorizedLinear(
in_features=module.in_features,
out_features=module.out_features,
bias=(module.bias is not None and module.bias is not False), # pyright: ignore[reportUnnecessaryComparison]
**kwargs)
ret.reset_parameters()
return ret
92 changes: 0 additions & 92 deletions composer/algorithms/fused_layernorm/README.md

This file was deleted.

14 changes: 0 additions & 14 deletions composer/algorithms/fused_layernorm/__init__.py

This file was deleted.
