|
-Implementations of post-training algorithms using the Tinker API. See [public documentation](https://tinker-docs.thinkingmachines.dev/cookbook).
+// TODO(tianyi) Fancy hero image?
|
-There are several main directories, including different types of algorithms and datasets.
|
-- [supervised](tinker_cookbook/supervised): supervised learning, aka supervised fine-tuning (SFT)
-- [preference](tinker_cookbook/preference): preference datasets that can be used for training reward models or training policies with direct preference optimization (DPO)
-- [rl](tinker_cookbook/rl): reinforcement learning on general MDPs
+The Tinker Cookbook collects recommended programming patterns, reusable utilities, and extensible abstractions to help people build on [Tinker](https://tinker-docs.thinkingmachines.ai/).
|
-The user-friendly training entrypoints can be found in [supervised/train_cli.py](tinker_cookbook/supervised/train_cli.py) and [rl/train_cli.py](tinker_cookbook/rl/train_cli.py).
+## Installation
|
-## Classes
+1. Obtain a Tinker API token and export it as `TINKER_API_KEY`. // TODO(tianyi): add onboarding flow link
+2. Install the Tinker Python client via `pip install git+https://github.com/thinking-machines-lab/tinker.git` // TODO(tianyi): update to pypi
+3. As a starting point, we recommend cloning this repo locally and installing it via `pip install -e .`.
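+
+After these three steps, a quick sanity check can confirm that everything is wired up. The snippet below is illustrative only and is not part of the cookbook; it just verifies that the API key is exported and that both packages import:
+
+```python
+# Illustrative post-install check; not part of the cookbook itself.
+import os
+
+assert os.environ.get("TINKER_API_KEY"), "Export TINKER_API_KEY before using the Tinker API."
+
+import tinker           # the Tinker client installed in step 2
+import tinker_cookbook  # this repo, installed with `pip install -e .` in step 3
+
+print("tinker and tinker_cookbook imported successfully")
+```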
|
-There are a lot of different classes, which might make the code feel less approachable. However, they follow *the builder pattern*, and the code should be less confusing when you know the pattern.
+## Usage
|
-We can illustrate the pattern with two main examples:
+We built the Tinker Cookbook to allow flexible usage. You can run our examples, build your own training loop, or simply import useful utilities from this repo.
|
-- A `SupervisedDatasetBuilder` is a configuration object which builds a `SupervisedDataset`.
-- An `RLDatasetBuilder` is a configuration object which builds an `RLDataset`, which generates batches of `EnvGroupBuilder` objects, which each generate a group of `Env` objects.
+### Running our examples
|
-Here, the `SupervisedDatasetBuilder`, `RLDatasetBuilder`, and `EnvGroupBuilder` are all configuration objects, which have a `__call__` method that builds another object. You can see these objects in [supervised/types.py](tinker_cookbook/supervised/types.py) and [rl/types.py](tinker_cookbook/rl/types.py).
+`tinker_cookbook/supervised/train.py` and `tinker_cookbook/rl/train.py` contain our reference entrypoints for supervised learning and reinforcement learning, respectively.
|
-In general, we use a lot of configuration objects, with a `__call__` method that returns a heavyweight object (like a dataset). We use `chz` for the configuration objects -- it's similar to a dataclass but with some extra features that are nice for configs. We use either dataclasses or regular Python classes for the heavyweight objects.
+Navigate to `tinker_cookbook/recipes` and you will find ready-to-go post-training examples. Here is the list of examples you can try out:
+- `chat_sl` shows supervised fine-tuning on Tulu3
+- `prompt_distillation` XXXX // TODO(tianyi): add take away message
+- `math_rl` demonstrates Reinforcement Learning with Verifiable Rewards (RLVR) on math problems
+- `multiplayer_rl` leverages the flexibility of Tinker to learn on multiplayer / multi-model games
+- `tool_use/search` replicates a recent academic paper on using RL to teach a model to use a vector search tool
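+
+If you prefer to explore from Python, the recipes can also be listed programmatically. This is just an illustrative snippet; it assumes the editable install from the Installation section and that `recipes` is an importable package, neither of which the cookbook requires you to rely on:
+
+```python
+# Illustrative: enumerate the example recipes shipped with the cookbook.
+# Assumes `pip install -e .` and that tinker_cookbook/recipes is a package.
+import pkgutil
+
+import tinker_cookbook.recipes as recipes
+
+for module_info in pkgutil.iter_modules(recipes.__path__):
+    print(module_info.name)  # e.g. chat_sl, math_rl, ...
+```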
|
-## Envs
+### Building your own
|
-An `Env` is an RL environment. For those with an RL background, it roughly corresponds to an MDP or a POMDP; however, we use it in more general cases (such as multi-agent settings) that don't strictly correspond to the MDP/POMDP formalism. It's roughly analogous to the concept of an Env in OpenAI Gym, but unlike OpenAI Gym, we don't have a `reset` method; rather, the env should be discarded after a rollout. Any shared resources should be maintained by whatever object is creating the envs.
+`sl_basic.py` and `rl_basic.py` remove most of our abstractions and provide clean starting points for building your own projects.
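+
+To give a feel for what "building your own" means, here is a schematic sketch of a minimal supervised loop written directly against the Tinker client. The call names (`ServiceClient`, `create_lora_training_client`, `forward_backward`, `optim_step`) follow the Tinker documentation, but treat the exact signatures and the datum construction as assumptions; `sl_basic.py` is the authoritative reference.
+
+```python
+# Schematic sketch only -- see sl_basic.py for the real, tested version.
+# Signatures and field names below are assumptions based on the Tinker docs.
+import tinker
+from tinker import types
+
+service_client = tinker.ServiceClient()
+training_client = service_client.create_lora_training_client(
+    base_model="meta-llama/Llama-3.2-1B",  # any base model available to your account
+)
+
+# Toy pre-tokenized data: each inner list is one training sequence of token ids.
+toy_batches = [[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]] for _ in range(3)]
+
+for batch_tokens in toy_batches:
+    batch = [
+        types.Datum(
+            model_input=types.ModelInput.from_ints(tokens[:-1]),
+            loss_fn_inputs={
+                "target_tokens": tokens[1:],           # next-token targets
+                "weights": [1.0] * (len(tokens) - 1),  # 1.0 = include token in the loss
+            },
+        )
+        for tokens in batch_tokens
+    ]
+    fwd_bwd = training_client.forward_backward(batch, loss_fn="cross_entropy")
+    optim = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
+    fwd_bwd.result()  # both calls return futures; block to surface errors
+    optim.result()
+```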
|
-The `Env`s are created by `EnvGroupBuilder`s. The group of envs returned by an `EnvGroupBuilder` has something in common: either the envs correspond to the same task (in which case we can use this information for variance reduction, as in GRPO, which centers advantages per group), or we can use the group to define a multi-agent environment.
+### Importing our utilities
|
-- One common multi-agent environment is one where we use a pairwise preference model to compare pairs of completions.
-- We can also use the group to define a two-player game. Some two-player games such as tic-tac-toe are currently supported through the [textarena](tinker_cookbook/rl/textarena_envs.py) environments.
+The Tinker Cookbook includes several patterns we like. Here's a quick overview; a short import sketch follows the list:
+- [renderers]() convert tokens to and from structured chat message objects
+- [hyperparam_utils]() helps calculate hyperparameters suitable for LoRAs
+- [evaluation]() shows how to evaluate Tinker models and integrates with InspectAI to make evaluating on standard benchmarks easy
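+
+As an example, pulling a single helper into your own code might look like the sketch below; the helper name and the model string are assumptions for illustration, so check `hyperparam_utils` itself before relying on them:
+
+```python
+# Minimal sketch of importing one cookbook utility on its own.
+# NOTE: the helper name is an assumption; check tinker_cookbook/hyperparam_utils.py.
+from tinker_cookbook import hyperparam_utils
+
+# LoRA runs generally want a larger learning rate than full fine-tuning;
+# this helper suggests a reasonable default for a given base model.
+lr = hyperparam_utils.get_lr("meta-llama/Llama-3.1-8B")
+print(f"suggested LoRA learning rate: {lr}")
+```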
|
+## Contributing
|
-## Notation
+We welcome community contributions to the Tinker Cookbook. At the same time, we want to keep this official repo lean and hackable.
+If you build a cool project, please share it with us; we'd love to highlight it in `FEATURED_PROJECTS.md`.
+If you want to help us improve the core utilities in the Tinker Cookbook, please familiarize yourself with `CONTRIBUTING.md`. We also post ideas on where we could use help.
|
-We'll use subscripts to indicate the shapes of objects. For example, `tokens_P_G_T` indicates a three-dimensional array of tokens, with `P` problems, `G` groups, and `T` tokens per group, so `tokens_P_G_T[p][g][t]` should refer to a single token. In many cases, the arrays will be ragged. E.g., the `T` axis will have different lengths for different `(p,g)`. Sometimes, a given dimension will be flattened from two dimensions. If we write `tokens_PG_T`, that means that we have a two-dimensional array, where the 0th dimension is flattened from the `P` and `G` dimensions.
-
-### Common Dimension Names
-
-Here are the standard dimension subscripts used throughout the codebase:
-
-- `_D`: Data/Datum dimension (for training data items)
-- `_G`: Group dimension (for multiple attempts/rollouts of the same problem)
-- `_P`: Problem dimension (for different problems/prompts)
-- `_T`: Token/Time dimension (for sequences)
-
-The relationship between dimensions in RL:
-- A batch contains multiple problems (`_P`)
-- Each problem spawns multiple attempts/environments (`_G`), forming a group
-- Each attempt produces one trajectory
-- Advantages are normalized within each group (across the `_G` dimension)
-
-Examples:
-- `env_group_builders_P`: A list of environment builders, one per problem
-- `trajectories_G`: Multiple trajectories from attempts at the same problem
-- `rewards_G`: Rewards for each attempt within a group
-- `tokens_P_G_T`: Tokens with problem, group, and time dimensions
-- `data_D`: A list of training data items
+For general feedback, you can XXXX // TODO(tianyi): check with clare