Implementations of post-training algorithms using the Tinker API. See the [public documentation](https://tinker-docs.thinkingmachines.dev/cookbook).

The code is organized into several main directories, covering different types of algorithms and datasets:

- [supervised](tinker_cookbook/supervised): supervised learning, aka supervised fine-tuning (SFT)
- [preference](tinker_cookbook/preference): preference datasets that can be used for training reward models or training policies with direct preference optimization (DPO)
- [rl](tinker_cookbook/rl): reinforcement learning on general MDPs

The user-friendly training entrypoints can be found in [supervised/train_cli.py](tinker_cookbook/supervised/train_cli.py) and [rl/train_cli.py](tinker_cookbook/rl/train_cli.py).

## Classes

There are a lot of different classes, which might make the code feel less approachable. However, they follow the *builder pattern*, and the code should be easier to follow once you know the pattern.

We can illustrate the pattern with two main examples:

- A `SupervisedDatasetBuilder` is a configuration object which builds a `SupervisedDataset`.
- An `RLDatasetBuilder` is a configuration object which builds an `RLDataset`, which generates batches of `EnvGroupBuilder` objects, which each generate a group of `Env` objects.

Here, `SupervisedDatasetBuilder`, `RLDatasetBuilder`, and `EnvGroupBuilder` are all configuration objects with a `__call__` method that builds another object. You can see these types in [supervised/types.py](tinker_cookbook/supervised/types.py) and [rl/types.py](tinker_cookbook/rl/types.py).

In general, we use a lot of configuration objects with a `__call__` method that returns a heavyweight object (like a dataset). We use `chz` for the configuration objects -- it's similar to a dataclass but with some extra features that are nice for configs. We use either dataclasses or regular Python classes for the heavyweight objects.
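
For intuition, here is a minimal sketch of the pattern, using a plain dataclass as a stand-in for `chz` and hypothetical names (the real interfaces live in the `types.py` files linked above):

```python
from dataclasses import dataclass


class ToyDataset:
    """Heavyweight object: holds the actual data (hypothetical stand-in)."""

    def __init__(self, path: str, batch_size: int):
        # Expensive work (loading files, tokenizing, etc.) happens here.
        self.examples = open(path).read().splitlines()
        self.batch_size = batch_size


@dataclass
class ToyDatasetBuilder:
    """Lightweight config object: cheap to construct, easy to log and serialize."""

    path: str
    batch_size: int = 32

    def __call__(self) -> ToyDataset:
        # Building the heavyweight object is deferred until the builder is called.
        return ToyDataset(self.path, self.batch_size)


# Configs can be passed around freely; the dataset is only built where it's needed.
builder = ToyDatasetBuilder(path="train.jsonl", batch_size=64)
dataset = builder()
```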

## Envs

An `Env` is an RL environment. For those with an RL background, it roughly corresponds to an MDP or a POMDP; however, we also use it in more general settings (such as multi-agent setups) that don't strictly fit the MDP/POMDP formalism. It's roughly analogous to the concept of an Env in OpenAI Gym, but unlike OpenAI Gym, we don't have a `reset` method; rather, the env should be discarded after a rollout. Any shared resources should be maintained by whatever object is creating the envs.
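
As a rough sketch of this lifecycle (with a hypothetical interface, not the actual one in [rl/types.py](tinker_cookbook/rl/types.py)): build a fresh env per rollout, run one episode, and throw the env away.

```python
class CountdownEnv:
    """Toy single-use env: no reset(); a new instance is built for each rollout."""

    def __init__(self, num_steps: int):
        self.remaining = num_steps

    def observe(self) -> str:
        return f"{self.remaining} steps remaining"

    def step(self, action: str) -> tuple[float, bool]:
        self.remaining -= 1
        reward = 1.0 if action == "tick" else 0.0
        done = self.remaining <= 0
        return reward, done


def rollout(env: CountdownEnv, policy) -> float:
    """Run one episode; the env is not reused afterwards."""
    total, done = 0.0, False
    while not done:
        reward, done = env.step(policy(env.observe()))
        total += reward
    return total


# Shared resources (datasets, judges, clients) live in whatever creates the envs,
# so discarding each env after its rollout is cheap.
returns = [rollout(CountdownEnv(num_steps=3), policy=lambda obs: "tick") for _ in range(4)]
```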

The `Env`s are created by `EnvGroupBuilder`s. The envs in a group returned by an `EnvGroupBuilder` have something in common: either they correspond to the same task (in which case we can use this information for variance reduction, as in GRPO, which centers rewards within each group), or the group can be used to define a multi-agent environment.

- One common multi-agent setup uses a pairwise preference model to compare pairs of completions.
- We can also use the group to define a two-player game. Some two-player games such as tic-tac-toe are currently supported through the [textarena](tinker_cookbook/rl/textarena_envs.py) environments.

## Notation

We'll use subscripts to indicate the shapes of objects. For example, `tokens_P_G_T` indicates a three-dimensional array of tokens, indexed by `P` problems, `G` rollouts per problem (the group dimension), and `T` tokens, so `tokens_P_G_T[p][g][t]` refers to a single token. In many cases, the arrays will be ragged; e.g., the `T` axis will have different lengths for different `(p, g)`. Sometimes a given dimension is flattened from two dimensions: `tokens_PG_T` denotes a two-dimensional array whose 0th dimension is flattened from the `P` and `G` dimensions.
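
For example, here is a toy (ragged) nested-list instance of this notation, and how the `P` and `G` dimensions get flattened into a single `PG` dimension (illustrative values only):

```python
# P=2 problems, G=2 rollouts per problem; the T axis is ragged.
tokens_P_G_T = [
    [[101, 7, 9], [101, 4]],        # problem 0: two rollouts of different lengths
    [[101, 8, 8, 2], [101, 3, 5]],  # problem 1
]
assert tokens_P_G_T[1][0][3] == 2   # tokens_P_G_T[p][g][t] is a single token

# Flattening the P and G dimensions gives a two-dimensional (still ragged) array.
tokens_PG_T = [rollout for group in tokens_P_G_T for rollout in group]
assert len(tokens_PG_T) == 4        # P * G sequences
```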

### Common Dimension Names

Here are the standard dimension subscripts used throughout the codebase:

- `_D`: Data/Datum dimension (for training data items)
- `_G`: Group dimension (for multiple attempts/rollouts of the same problem)
- `_P`: Problem dimension (for different problems/prompts)
- `_T`: Token/Time dimension (for sequences)

The relationship between dimensions in RL:
- A batch contains multiple problems (`_P`)
- Each problem spawns multiple attempts/environments (`_G`), forming a group
- Each attempt produces one trajectory
- Advantages are normalized within each group (across the `_G` dimension)
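
A minimal sketch of the per-group normalization step (GRPO-style mean-centering; the exact normalization used in the codebase may differ, e.g. it may also divide by a per-group std):

```python
import numpy as np


def center_advantages(rewards_P_G: np.ndarray) -> np.ndarray:
    """Subtract each group's mean reward, so advantages are centered across _G."""
    # rewards_P_G has shape (P, G): one reward per attempt, grouped by problem.
    return rewards_P_G - rewards_P_G.mean(axis=1, keepdims=True)


rewards_P_G = np.array([[1.0, 0.0, 1.0],   # problem 0: three attempts
                        [0.0, 0.0, 1.0]])  # problem 1
advantages_P_G = center_advantages(rewards_P_G)
# Each row now sums to zero: attempts that beat their group's average get positive advantage.
```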

Examples:
- `env_group_builders_P`: A list of environment group builders, one per problem
- `trajectories_G`: Multiple trajectories from attempts at the same problem
- `rewards_G`: Rewards for each attempt within a group
- `tokens_P_G_T`: Tokens with problem, group, and time dimensions
- `data_D`: A list of training data items