Commit c1b3b6f: Sync contents
1 parent f77a8a3 commit c1b3b6f

96 files changed, +3213 -2677 lines changed


.sync_state

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
 {
-  "last_synced_sha": "5d73981d7ab362ab5715f22894a234f45956c9c0",
-  "last_sync_time": "2025-09-23T18:32:46.985949"
+  "last_synced_sha": "b0f49367e227ac8f6f4e25bb580f1d7a7e56655b",
+  "last_sync_time": "2025-09-30T23:38:41.029668"
 }

CONTRIBUTING.md

Lines changed: 81 additions & 0 deletions
# Development

## Organization of training scripts

We're designing the codebase with the following goals:

1. Low barrier to entry: it should be dead simple to run something and see numbers go up.
2. Extensible: it should be possible to pass in custom datasets and evals and to control all the hyperparameters.
3. Science-friendly: it should be easy to run sweeps and analyze the results.

To achieve this, we use the following structure around training scripts:

- There's a main training function, such as `sft.py` or `rl_bandit/train.py`, which contains the main loop.
- This script contains a detailed config object (`Config`), which isn't constructable from the command line.
- The config contains members that specify things like datasets and evals. These should be chz configs (with a `.build` method that constructs the actual object) or callables (we recommend using `functools.partial`). This way, the config is serializable, which is useful for sweeps.
- There's an auxiliary script, called something like `sft_cli.py` or `rl_bandit/train_cli.py`, which contains a smaller config object (`CLIConfig`) that is constructable from the command line. This script lets people get started with the library without digging into a lot of code and learning about new classes.
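The serializable-config idea in the bullets above can be sketched as follows. This is a minimal illustration, not the actual library code: the names (`build_dataset`, `dataset_builder`, `train`) are hypothetical, and a plain dataclass stands in for a chz config.

```python
from dataclasses import dataclass, field
from functools import partial
from typing import Callable


def build_dataset(path: str, limit: int = 3) -> list[str]:
    # Hypothetical stand-in for a heavyweight dataset constructor.
    return [f"{path}:{i}" for i in range(limit)]


@dataclass(frozen=True)
class Config:
    # The dataset is stored as a zero-argument callable, so the config
    # itself stays cheap to construct and easy to serialize for sweeps.
    learning_rate: float = 1e-4
    dataset_builder: Callable[[], list[str]] = field(
        default=partial(build_dataset, "train.jsonl")
    )


def train(cfg: Config) -> int:
    dataset = cfg.dataset_builder()  # build the heavyweight object lazily
    return len(dataset)


print(train(Config()))  # prints 3
```

The key design choice is that the config holds a *recipe* for the dataset rather than the dataset itself, so the full config can be printed, logged, and varied across a sweep.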

## Async

Async is very useful for RL, where it allows us to make many queries in parallel (e.g., sampling calls). For all of the interfaces used in RL (such as the `Env` class), all methods that take a nontrivial amount of time should be async. For some of the other code, such as `sft.py`, we've chosen not to use async methods, just to make it more beginner-friendly, since many Python programmers are not familiar with async.
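As a rough illustration of why async helps here (the `sample` function below is a hypothetical stand-in for a slow sampling call, not a real API), a rollout can fan out many calls concurrently:

```python
import asyncio


async def sample(prompt: str) -> str:
    # Hypothetical stand-in for a slow call to an inference server.
    await asyncio.sleep(0.01)
    return prompt.upper()


async def rollout_group(prompts: list[str]) -> list[str]:
    # All sampling calls are issued concurrently, not one at a time,
    # so total latency is roughly that of the slowest call.
    return list(await asyncio.gather(*(sample(p) for p in prompts)))


results = asyncio.run(rollout_group(["a", "b", "c"]))
```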
## Typing

Please use typing wherever possible; avoid `Any` and `type: ignore`, and prefer casting. However, avoid convoluted generics or code that's much more verbose just to satisfy the type checker. Prefer single types over union types.

## Classes

There are a lot of different classes, which might make the code feel less approachable. However, they follow *the builder pattern*, and the code should be less confusing once you know the pattern.

We can illustrate the pattern with the two main examples:

- A `SupervisedDatasetBuilder` is a configuration object that builds a `SupervisedDataset`.
- An `RLDatasetBuilder` is a configuration object that builds an `RLDataset`, which generates batches of `EnvGroupBuilder` objects, each of which generates a group of `Env` objects.

Here, `SupervisedDatasetBuilder`, `RLDatasetBuilder`, and `EnvGroupBuilder` are all configuration objects with a `__call__` method that builds another object. You can see these objects in [supervised/types.py](tinker_cookbook/supervised/types.py) and [rl/types.py](tinker_cookbook/rl/types.py).

In general, we use a lot of configuration objects with a `__call__` method that returns a heavyweight object (like a dataset). We use `chz` for the configuration objects -- it's similar to a dataclass but has some extra features that are nice for configs. We use either dataclasses or regular Python classes for the heavyweight objects.
## Envs

An `Env` is an RL environment. For those with an RL background, it roughly corresponds to an MDP or a POMDP; however, we use it in more general settings (such as multi-agent ones) that don't strictly correspond to the MDP/POMDP formalism. It's roughly analogous to the concept of an Env in OpenAI Gym, but unlike OpenAI Gym, we don't have a `reset` method; rather, the env should be discarded after a rollout. Any shared resources should be maintained by whatever object creates the envs.

The `Env`s are created by `EnvGroupBuilder`s. The group of envs returned by an `EnvGroupBuilder` has something in common: either they correspond to the same task (in which case we can use this information for variance reduction, as in GRPO, which centers advantages per group), or we can use the group to define a multi-agent environment.

- One common multi-agent environment uses a pairwise preference model to compare pairs of completions.
- We can also use the group to define a two-player game. Some two-player games, such as tic-tac-toe, are currently supported through the [textarena](tinker_cookbook/rl/textarena_envs.py) environments.
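To make the shape of an env concrete, here is a hypothetical one-step environment; the method names are illustrative stand-ins, not the actual `Env` interface. Note that the methods are async, and there is no `reset`: the env is thrown away after one rollout.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class GuessEnv:
    # Hypothetical one-step environment: reward 1.0 iff the action
    # matches the target string.
    target: str

    async def initial_observation(self) -> str:
        return "guess the secret word"

    async def step(self, action: str) -> tuple[float, bool]:
        # Returns (reward, episode_done); no reset -- discard after use.
        return (1.0 if action == self.target else 0.0), True


async def rollout(env: GuessEnv, action: str) -> float:
    await env.initial_observation()
    reward, _done = await env.step(action)
    return reward


reward = asyncio.run(rollout(GuessEnv(target="tinker"), "tinker"))
```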
## Notation

We use subscripts to indicate the shapes of objects. For example, `tokens_P_G_T` denotes a three-dimensional array of tokens, with `P` problems, `G` groups, and `T` tokens per group, so `tokens_P_G_T[p][g][t]` refers to a single token. In many cases, the arrays will be ragged; e.g., the `T` axis will have different lengths for different `(p, g)`. Sometimes a given dimension is flattened from two dimensions: if we write `tokens_PG_T`, we have a two-dimensional array whose 0th dimension is flattened from the `P` and `G` dimensions.

### Common Dimension Names

Here are the standard dimension subscripts used throughout the codebase:

- `_D`: Data/datum dimension (for training data items)
- `_G`: Group dimension (for multiple attempts/rollouts of the same problem)
- `_P`: Problem dimension (for different problems/prompts)
- `_T`: Token/time dimension (for sequences)

The relationship between dimensions in RL:

- A batch contains multiple problems (`_P`)
- Each problem spawns multiple attempts/environments (`_G`), forming a group
- Each attempt produces one trajectory
- Advantages are normalized within each group (across the `_G` dimension)

Examples:

- `env_group_builders_P`: A list of environment builders, one per problem
- `trajectories_G`: Multiple trajectories from attempts at the same problem
- `rewards_G`: Rewards for each attempt within a group
- `tokens_P_G_T`: Tokens with problem, group, and time dimensions
- `data_D`: A list of training data items
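The naming convention and the per-group normalization can be illustrated with a small ragged array (the reward values are made up for the example):

```python
# rewards_P_G: ragged 2-D list with P problems and a group of G attempts
# per problem (G can differ across problems).
rewards_P_G = [[1.0, 0.0, 1.0], [0.0, 0.0]]

# Center rewards within each group (across the _G dimension), as in GRPO.
advantages_P_G = []
for rewards_G in rewards_P_G:
    mean = sum(rewards_G) / len(rewards_G)
    advantages_P_G.append([r - mean for r in rewards_G])
```

After centering, the advantages within each group sum to (numerically) zero, which is the variance-reduction trick that grouping enables.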
## Testing

TODO(tianyi): add testing info

# Call for Proposals

TODO(tianyi): add

FEATURED_PROJECTS.md

Lines changed: 1 addition & 0 deletions
// TODO(tianyi): any launch partners we can talk about

README.md

Lines changed: 28 additions & 42 deletions
@@ -1,58 +1,44 @@
-Implementations of post-training algorithms using the Tinker API. See [public documentation](https://tinker-docs.thinkingmachines.dev/cookbook).
+// TODO(tianyi) Fancy hero image?
 
-There are several main directories, including different types of algorithms and datasets.
 
-- [supervised](tinker_cookbook/supervised): supervised learning, aka supervised fine-tuning (SFT)
-- [preference](tinker_cookbook/preference): preference datasets that can be used for training reward models or training policies with direct preference optimization (DPO)
-- [rl](tinker_cookbook/rl): reinforcement learning on general MDPs.
+Tinker cookbook collects recommended programming patterns, reusable utilities, and extensible abstractions to help people build on [Tinker](https://tinker-docs.thinkingmachines.ai/).
 
-The user-friendly training entrypoints can be found in [supervised/train_cli.py](tinker_cookbook/supervised/train_cli.py) and [rl/train_cli.py](tinker_cookbook/rl/train_cli.py).
+## Installation
 
-## Classes
+1. Obtain a Tinker API token and export it as `TINKER_API_KEY`. // TODO(tianyi): add onboarding flow link
+2. Install the tinker Python client via `pip install git+https://github.com/thinking-machines-lab/tinker.git` // TODO(tianyi): update to pypi
+3. As a starting point, we recommend cloning this repo locally and installing it via `pip install -e .`.
 
-There are a lot of different classes, which might make the code feel less approachable. However, they follow *the builder pattern*, and the code should be less confusing when you know the pattern.
+## Usage
 
-We can illustrate the pattern with the two main examples:
+We built the Tinker cookbook to allow flexible usage. You can run our examples, build your own training loop, or simply import useful utilities from this repo.
 
-- A `SupervisedDatasetBuilder` is a configuration object which builds a `SupervisedDataset`.
-- An `RLDatasetBuilder` is a configuration object which builds an `RLDataset`, which generates batches of `EnvGroupBuilder` objects, which each generate a group of `Env` objects.
+### Running our examples
 
-Here, the `SupervisedDatasetBuilder`, `RLDatasetBuilder`, and `EnvGroupBuilder` are all configuration objects, which have a `__call__` method that builds another object. You can see these objects in [supervised/types.py](tinker_cookbook/supervised/types.py) and [rl/types.py](tinker_cookbook/rl/types.py).
+`tinker_cookbook/supervised/train.py` and `tinker_cookbook/rl/train.py` contain our reference entrypoints for supervised learning and reinforcement learning, respectively.
 
-In general, we use a lot of configuration objects, with a `__call__` method that returns a heavyweight object (like a dataset). We use `chz` for the configuration objects -- it's similar to a dataclass but with some extra features that are nice for configs. We use either dataclasses or regular python classes for the heavyweight objects.
+Navigate to `tinker_cookbook/recipes` and you will find ready-to-go post-training examples. Here is a list of examples you can try:
+- `chat_sl` shows supervised fine-tuning on Tulu3
+- `prompt_distillation` XXXX // TODO(tianyi): add take away message
+- `math_rl` demonstrates Reinforcement Learning with Verifiable Rewards (RLVR) on math problems
+- `multiplayer_rl` leverages the flexibility of Tinker to learn on multiplayer / multi-model games
+- `tool_use/search` replicates a recent academic paper on using RL to teach a model to use a vector search tool
 
-## Envs
+### Building your own
 
-An `Env` is an RL environment. For those with an RL background, it roughly corresponds to an MDP or a POMDP, however we use in more general cases (such as multi-agent settings) that don't strictly correspond to the MDP/POMDP formalism. It's roughly analogous the concept of an Env in OpenAI Gym, but unlike OpenAI Gym, we don't have a `reset` method; rather, the env should be discarded after a rollout. Any shared resources should be maintained by whatever object is creating the envs.
+`sl_basic.py` and `rl_basic.py` remove most of our abstractions and provide clean starting points for building your own projects.
 
-The `Env`s are created by `EnvGroupBuilder`s. The group of envs returned by `EnvGroupBuilder` have something in common; either they correspond to the same task (in which case we can use this information for variance reduction, as in GRPO, which centers per group); or, we can use the group to define a multi-agent environment.
+### Import our utilities
 
-- One common multi-agent environment is where we use a pairwise preference model to compare pairs of completions.
-- We can also use the group to define a two-player game. Some two player games such as tic-tac-toe are currently supported through the [textarena](tinker_cookbook/rl/textarena_envs.py) environments.
+Tinker cookbook includes several patterns we like. Here's a quick overview:
+- [renderers]() converts tokens to and from structured chat message objects
+- [hyperparam_utils]() helps calculate hyperparameters suitable for LoRAs
+- [evaluation]() shows how to evaluate Tinker models and integrates with InspectAI to make evaluating on standard benchmarks easy
 
+## Contributing
 
-## Notation
+We welcome community contributions to Tinker cookbook. At the same time, we want to keep this official repo lean and hackable.
+If you build a cool project, please share it with us; we'd love to highlight it in `FEATURED_PROJECTS.md`.
+If you want to help us improve the core utilities in Tinker cookbook, please familiarize yourself with `CONTRIBUTING.md`. We also post ideas there on where we could use help.
 
-We'll use subscripts to indicate the shapes of objects. For example, `tokens_P_G_T` indicates a three-dimensional array of tokens, with `P` problems, `G` groups, and `T` tokens per groups, so `tokens_P_G_T[p][g][t]` should refer to a single token. In many cases, the arrays will be ragged. E.g., the `T` axis will have different lengths for different `(p,g)`. Sometimes, a given dimension will be flattened from two dimensions. If we write `tokens_PG_T`, that means that we have a two dimensional array, where the 0th dimension is flattened from the `P` and `G` dimensions.
-
-### Common Dimension Names
-
-Here are the standard dimension subscripts used throughout the codebase:
-
-- `_D`: Data/Datum dimension (for training data items)
-- `_G`: Group dimension (for multiple attempts/rollouts of the same problem)
-- `_P`: Problem dimension (for different problems/prompts)
-- `_T`: Token/Time dimension (for sequences)
-
-The relationship between dimensions in RL:
-- A batch contains multiple problems (`_P`)
-- Each problem spawns multiple attempts/environments (`_G`), forming a group
-- Each attempt produces one trajectory
-- Advantages are normalized within each group (across the `_G` dimension)
-
-Examples:
-- `env_group_builders_P`: A list of environment builders, one per problem
-- `trajectories_G`: Multiple trajectories from attempts at the same problem
-- `rewards_G`: Rewards for each attempt within a group
-- `tokens_P_G_T`: Tokens with problem, group, and time dimensions
-- `data_D`: A list of training data items
+For general feedback, you can XXXX // TODO(tianyi): check with clare

STYLE_GUIDE.md

Lines changed: 0 additions & 30 deletions
This file was deleted.

invocations.md

Lines changed: 0 additions & 71 deletions
This file was deleted.

pyproject.toml

Lines changed: 9 additions & 0 deletions
@@ -14,6 +14,7 @@ dependencies = [
     "torch",
     "transformers",
     "blobfile",
+    "inspect-ai",
 ]
 
 [project.optional-dependencies]
@@ -28,12 +29,17 @@ envs = [
     "pylatexenc",
     "sympy",
     "textarena",
+    "math-verify",
+    "scipy",
 ]
 vector-search = [
     "chromadb",
     "google-genai",
     "huggingface_hub"
 ]
+wandb = [
+    "wandb",
+]
 
 [build-system]
 requires = ["hatchling"]
@@ -44,3 +50,6 @@ packages = ["tinker_cookbook"]
 
 [tool.pytest.ini_options]
 python_files = ["test_*.py", "*_test.py"]
+
+[tool.ruff]
+line-length = 100
