[air/train/docs] Add trainer user guide and update trainer docs #27389

Merged · 44 commits · Aug 4, 2022

Changes from 36 commits

Commits
e9b690e  tryout sphinx... (xwjiang2010, Aug 2, 2022)
6705481  update-annotations (richardliaw, Aug 2, 2022)
8b46170  remove-train-tf-example (richardliaw, Aug 2, 2022)
28e8241  remove train torch dataset example (richardliaw, Aug 2, 2022)
333dfec  Remove tune train tf example (richardliaw, Aug 2, 2022)
0d2a201  reference-benchmarks (richardliaw, Aug 2, 2022)
e7d4047  update train faq (richardliaw, Aug 2, 2022)
e842a72  skeleton code for trainer guide. (xwjiang2010, Aug 2, 2022)
fd9db20  Merge branch 'trainer_doc' of https://github.com/xwjiang2010/ray into… (richardliaw, Aug 2, 2022)
6c09d36  add more trainer doc. (xwjiang2010, Aug 2, 2022)
dbffccd  Merge branch 'trainer_doc' of https://github.com/xwjiang2010/ray into… (xwjiang2010, Aug 2, 2022)
2563512  add more (xwjiang2010, Aug 2, 2022)
51d1831  get-build (richardliaw, Aug 2, 2022)
88fa63c  add-trainers (richardliaw, Aug 3, 2022)
4421c58  Merge branch 'master' into trainer_doc (richardliaw, Aug 3, 2022)
c0c8384  update-trainer-docs (richardliaw, Aug 3, 2022)
0b01075  update-trainer (richardliaw, Aug 3, 2022)
11e0a2d  update (richardliaw, Aug 3, 2022)
ac80e54  update (richardliaw, Aug 3, 2022)
0858d8f  Merge branch 'master' into air/doc/trainer-doc (Aug 3, 2022)
37b9d7e  Add lightgbm section (Aug 3, 2022)
c4c3172  Update results section, link to GBDT (Aug 3, 2022)
4a8f6ad  Merge remote-tracking branch 'upstream/master' into air/doc/trainer-doc (Aug 3, 2022)
1be37e4  Add torch regression example (Aug 3, 2022)
ab77cf7  update BUILD (Aug 3, 2022)
4f13d4c  Add tensorflow example (Aug 3, 2022)
0bcc0a2  remove-trainer-scaling (richardliaw, Aug 3, 2022)
939d653  Merge branch 'master' into trainer_doc (richardliaw, Aug 3, 2022)
4fe5ebf  Update trainer.rst (xwjiang2010, Aug 3, 2022)
c42db91  update trainer and fix docstrings (richardliaw, Aug 3, 2022)
9c6763e  Merge branch 'trainer_doc' of https://github.com/xwjiang2010/ray into… (richardliaw, Aug 3, 2022)
005023c  Update code examples (richardliaw, Aug 3, 2022)
86f1a83  update (richardliaw, Aug 3, 2022)
bfdded7  update (richardliaw, Aug 3, 2022)
672a929  update-comments (richardliaw, Aug 3, 2022)
d4932d4  update (richardliaw, Aug 4, 2022)
e9b7c47  lint (richardliaw, Aug 4, 2022)
42a39e9  update-regression-test (richardliaw, Aug 4, 2022)
b98b47c  Merge branch 'master' into trainer_doc (richardliaw, Aug 4, 2022)
6f530f0  Merge remote-tracking branch 'upstream/master' into air/doc/trainer-doc (Aug 4, 2022)
5c3027d  update-docs (richardliaw, Aug 4, 2022)
2ce2069  Fix hf trainer but don't run on CI (Aug 4, 2022)
dc8b71a  Fix GPU test import (Aug 4, 2022)
99bb074  fix torch trainer example (Aug 4, 2022)
2 changes: 1 addition & 1 deletion doc/source/_toc.yml
@@ -17,7 +17,7 @@ parts:
- file: ray-air/preprocessors
- file: ray-air/checkpoints
- file: ray-air/check-ingest
- file: ray-air/config-scaling
- file: ray-air/trainer
- file: ray-air/tuner
- file: ray-air/predictors
- file: ray-air/examples/serving_guide
3 changes: 3 additions & 0 deletions doc/source/ray-air/benchmarks.rst
@@ -173,6 +173,7 @@ We test out the performance across different cluster sizes and data sizes.
- 434.95 s (2 epochs, 746.29 images/sec)
- `python pytorch_training_e2e.py --data-size-gb=100 --num-workers=16`

.. _pytorch-training-parity:

Pytorch Training Parity
-----------------------
@@ -207,6 +208,8 @@ Performance may vary greatly across different model, hardware, and cluster confi
- `python workloads/torch_benchmark.py run --num-runs 3 --num-epochs 20 --num-workers 16 --cpus-per-worker 4 --use-gpu`


.. _tf-training-parity:

Tensorflow Training Parity
--------------------------

35 changes: 0 additions & 35 deletions doc/source/ray-air/config-scaling.rst

This file was deleted.

72 changes: 0 additions & 72 deletions doc/source/ray-air/doc_code/config_scaling.py

This file was deleted.

86 changes: 86 additions & 0 deletions doc/source/ray-air/doc_code/hf_trainer.py
@@ -0,0 +1,86 @@
# __hf_trainer_start__

# Based on
# huggingface/notebooks/examples/language_modeling_from_scratch.ipynb

# Hugging Face imports
from datasets import load_dataset
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

import ray
from ray.train.huggingface import HuggingFaceTrainer
from ray.air.config import ScalingConfig

model_checkpoint = "gpt2"
tokenizer_checkpoint = "sgugger/gpt2-like-tokenizer"
block_size = 128

datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)


def tokenize_function(examples):
return tokenizer(examples["text"])


tokenized_datasets = datasets.map(
tokenize_function, batched=True, num_proc=1, remove_columns=["text"]
)


def group_texts(examples):
# Concatenate all texts.
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could add padding instead if the
    # model supported it. Customize this part to your needs.
total_length = (total_length // block_size) * block_size
# Split by chunks of max_len.
result = {
k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result


lm_datasets = tokenized_datasets.map(
group_texts,
batched=True,
batch_size=1000,
num_proc=1,
)
ray_train_ds = ray.data.from_huggingface(lm_datasets["train"])
# The wikitext dataset names its evaluation split "validation".
ray_evaluation_ds = ray.data.from_huggingface(lm_datasets["validation"])


def trainer_init_per_worker(train_dataset, eval_dataset, **config):
model_config = AutoConfig.from_pretrained(model_checkpoint)
model = AutoModelForCausalLM.from_config(model_config)
args = transformers.TrainingArguments(
output_dir=f"{model_checkpoint}-wikitext2",
evaluation_strategy="epoch",
learning_rate=2e-5,
weight_decay=0.01,
)
return transformers.Trainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)


scaling_config = ScalingConfig(num_workers=3)
# If using GPUs, use the below scaling config instead.
# scaling_config = ScalingConfig(num_workers=3, use_gpu=True)
trainer = HuggingFaceTrainer(
trainer_init_per_worker=trainer_init_per_worker,
scaling_config=scaling_config,
datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
)
result = trainer.fit()
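# The returned Result exposes the reported metrics and the last checkpoint
# as result.metrics and result.checkpoint.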

# __hf_trainer_end__
76 changes: 76 additions & 0 deletions doc/source/ray-air/doc_code/hvd_trainer.py
@@ -0,0 +1,76 @@
import ray
import ray.train as train
import ray.train.torch # Need this to use `train.torch.get_device()`
import horovod.torch as hvd
import torch
import torch.nn as nn
from ray.air import session, Checkpoint
from ray.train.horovod import HorovodTrainer
from ray.air.config import ScalingConfig

input_size = 1
layer_size = 15
output_size = 1
num_epochs = 3


class NeuralNetwork(nn.Module):
def __init__(self):
super(NeuralNetwork, self).__init__()
self.layer1 = nn.Linear(input_size, layer_size)
self.relu = nn.ReLU()
self.layer2 = nn.Linear(layer_size, output_size)

def forward(self, input):
return self.layer2(self.relu(self.layer1(input)))


def train_loop_per_worker():
hvd.init()
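    # Each worker operates on an automatically split shard of the "train"
    # dataset that is passed to the HorovodTrainer below.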
dataset_shard = session.get_dataset_shard("train")
model = NeuralNetwork()
device = train.torch.get_device()
model.to(device)
loss_fn = nn.MSELoss()
lr_scaler = 1
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * lr_scaler)
# Horovod: wrap optimizer with DistributedOptimizer.
optimizer = hvd.DistributedOptimizer(
optimizer,
named_parameters=model.named_parameters(),
op=hvd.Average,
)
for epoch in range(num_epochs):
model.train()
for inputs, labels in iter(
dataset_shard.to_torch(
label_column="y",
label_column_dtype=torch.float,
feature_column_dtypes=torch.float,
batch_size=32,
)
):
            # Tensor.to() is not in-place; reassign to move data to the device.
            inputs = inputs.to(device)
            labels = labels.to(device)
outputs = model(inputs)
loss = loss_fn(outputs, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"epoch: {epoch}, loss: {loss.item()}")
session.report(
{},
checkpoint=Checkpoint.from_dict(dict(model=model.state_dict())),
)


train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
scaling_config = ScalingConfig(num_workers=3)
# If using GPUs, use the below scaling config instead.
# scaling_config = ScalingConfig(num_workers=3, use_gpu=True)
trainer = HorovodTrainer(
train_loop_per_worker=train_loop_per_worker,
scaling_config=scaling_config,
datasets={"train": train_dataset},
)
result = trainer.fit()
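# result.checkpoint wraps the dict checkpoint reported on the last epoch, so
# the trained model state is available as result.checkpoint.to_dict()["model"].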
13 changes: 13 additions & 0 deletions doc/source/ray-air/doc_code/lightgbm_trainer.py
@@ -0,0 +1,13 @@
import ray

from ray.train.lightgbm import LightGBMTrainer
from ray.air.config import ScalingConfig

train_dataset = ray.data.from_items([{"x": x, "y": x + 1} for x in range(32)])
trainer = LightGBMTrainer(
label_column="y",
params={"objective": "regression"},
scaling_config=ScalingConfig(num_workers=3),
datasets={"train": train_dataset},
)
result = trainer.fit()
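# The trained LightGBM model and final training metrics are available
# afterwards as result.checkpoint and result.metrics.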
56 changes: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
# flake8: noqa
# isort: skip_file

# __air_session_start__

import tensorflow as tf
from ray.air import session
from ray.air.checkpoint import Checkpoint
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer


def build_model() -> tf.keras.Model:
model = tf.keras.Sequential(
[
tf.keras.layers.InputLayer(input_shape=(1,)),
tf.keras.layers.Dense(10),
tf.keras.layers.Dense(1),
]
)
return model


def train_func():
ckpt = session.get_checkpoint()
if ckpt:
with ckpt.as_directory() as loaded_checkpoint_dir:
import tensorflow as tf

model = tf.keras.models.load_model(loaded_checkpoint_dir)
else:
model = build_model()

model.save("my_model", overwrite=True)
session.report(
metrics={"iter": 1}, checkpoint=Checkpoint.from_directory("my_model")
)


scaling_config = ScalingConfig(num_workers=2)
trainer = TensorflowTrainer(
train_loop_per_worker=train_func, scaling_config=scaling_config
)
result = trainer.fit()
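# result.checkpoint wraps the "my_model" directory reported above via
# session.report().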

# trainer2 will pick up from the checkpoint saved by the first trainer.
trainer2 = TensorflowTrainer(
train_loop_per_worker=train_func,
scaling_config=scaling_config,
    # This is ultimately what is accessed through
    # ``session.get_checkpoint()``.
resume_from_checkpoint=result.checkpoint,
)
result2 = trainer2.fit()

# __air_session_end__
16 changes: 16 additions & 0 deletions doc/source/ray-air/doc_code/rl_trainer.py
@@ -0,0 +1,16 @@
from ray.air.config import RunConfig, ScalingConfig
from ray.train.rl import RLTrainer

trainer = RLTrainer(
run_config=RunConfig(stop={"training_iteration": 5}),
scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
algorithm="PPO",
config={
"env": "CartPole-v0",
"framework": "tf",
"evaluation_num_workers": 1,
"evaluation_interval": 1,
"evaluation_config": {"input": "sampler"},
},
)
result = trainer.fit()
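# A possible follow-up, sketched under the assumption that the Ray AIR
# RLPredictor API is available:
#
#   from ray.train.rl import RLPredictor
#   predictor = RLPredictor.from_checkpoint(result.checkpoint)
#   # predictor.predict(obs_batch) then returns the policy's actions.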