|
-Implementations of post-training algorithms using the Tinker API. See [public documentation](https://tinker-docs.thinkingmachines.dev/cookbook).
+// TODO(tianyi) Fancy hero image?
|
-There are several main directories, including different types of algorithms and datasets.
|
-- [supervised](tinker_cookbook/supervised): supervised learning, aka supervised fine-tuning (SFT)
-- [preference](tinker_cookbook/preference): preference datasets that can be used for training reward models or training policies with direct preference optimization (DPO)
-- [rl](tinker_cookbook/rl): reinforcement learning on general MDPs
+The Tinker Cookbook collects recommended programming patterns, reusable utilities, and extensible abstractions to help people build on [Tinker](https://tinker-docs.thinkingmachines.ai/).
|
-The user-friendly training entrypoints can be found in [supervised/train_cli.py](tinker_cookbook/supervised/train_cli.py) and [rl/train_cli.py](tinker_cookbook/rl/train_cli.py).
+## Installation
|
-## Classes
+1. Obtain a Tinker API token and export it as `TINKER_API_KEY`. // TODO(tianyi): add onboarding flow link
+2. Install the Tinker Python client via `pip install git+https://github.com/thinking-machines-lab/tinker.git` // TODO(tianyi): update to pypi
+3. As a starting point, we recommend cloning this repo locally and installing it via `pip install -e .`.
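+
+After these three steps, a quick sanity check can confirm that everything is wired up. The snippet below is illustrative only and is not part of the cookbook; it just verifies that the API key is exported and that both packages import:
+
+```python
+# Illustrative post-install check; not part of the cookbook itself.
+import os
+
+assert os.environ.get("TINKER_API_KEY"), "Export TINKER_API_KEY before using the Tinker API."
+
+import tinker           # the Tinker client installed in step 2
+import tinker_cookbook  # this repo, installed with `pip install -e .` in step 3
+
+print("tinker and tinker_cookbook imported successfully")
+```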
|
-There are a lot of different classes, which might make the code feel less approachable. However, they follow *the builder pattern*, and the code should be less confusing when you know the pattern.
+## Usage
|
-We can illustrate the pattern with two main examples:
+We built the Tinker Cookbook to allow flexible usage. You can run our examples, build your own training loop, or simply import useful utilities from this repo.
|
-- A `SupervisedDatasetBuilder` is a configuration object which builds a `SupervisedDataset`.
-- An `RLDatasetBuilder` is a configuration object which builds an `RLDataset`, which generates batches of `EnvGroupBuilder` objects, which each generate a group of `Env` objects.
+### Running our examples
|
-Here, the `SupervisedDatasetBuilder`, `RLDatasetBuilder`, and `EnvGroupBuilder` are all configuration objects, which have a `__call__` method that builds another object. You can see these objects in [supervised/types.py](tinker_cookbook/supervised/types.py) and [rl/types.py](tinker_cookbook/rl/types.py).
+`tinker_cookbook/supervised/train.py` and `tinker_cookbook/rl/train.py` contain our reference entrypoints for supervised learning and reinforcement learning, respectively.
|
-In general, we use a lot of configuration objects, with a `__call__` method that returns a heavyweight object (like a dataset). We use `chz` for the configuration objects -- it's similar to a dataclass but with some extra features that are nice for configs. We use either dataclasses or regular Python classes for the heavyweight objects.
+Navigate to `tinker_cookbook/recipes` and you will find ready-to-go post-training examples. Here is the list of examples you can try out:
+- `chat_sl` shows supervised fine-tuning on Tulu3
+- `prompt_distillation` XXXX // TODO(tianyi): add take away message
+- `math_rl` demonstrates Reinforcement Learning with Verifiable Rewards (RLVR) on math problems
+- `multiplayer_rl` leverages the flexibility of Tinker to learn on multiplayer / multi-model games
+- `tool_use/search` replicates a recent academic paper on using RL to teach a model to use a vector search tool
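+
+If you prefer to explore from Python, the recipes can also be listed programmatically. This is just an illustrative snippet; it assumes the editable install from the Installation section and that `recipes` is an importable package, neither of which the cookbook requires you to rely on:
+
+```python
+# Illustrative: enumerate the example recipes shipped with the cookbook.
+# Assumes `pip install -e .` and that tinker_cookbook/recipes is a package.
+import pkgutil
+
+import tinker_cookbook.recipes as recipes
+
+for module_info in pkgutil.iter_modules(recipes.__path__):
+    print(module_info.name)  # e.g. chat_sl, math_rl, ...
+```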
|
-## Envs
+### Building your own
|
-An `Env` is an RL environment. For those with an RL background, it roughly corresponds to an MDP or a POMDP; however, we use it in more general cases (such as multi-agent settings) that don't strictly correspond to the MDP/POMDP formalism. It's roughly analogous to the concept of an Env in OpenAI Gym, but unlike OpenAI Gym, we don't have a `reset` method; rather, the env should be discarded after a rollout. Any shared resources should be maintained by whatever object is creating the envs.
+`sl_basic.py` and `rl_basic.py` remove most of our abstractions and provide clean starting points for building your own projects.
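+
+To give a feel for what "building your own" means, here is a schematic sketch of a minimal supervised loop written directly against the Tinker client. The call names (`ServiceClient`, `create_lora_training_client`, `forward_backward`, `optim_step`) follow the Tinker documentation, but treat the exact signatures and the datum construction as assumptions; `sl_basic.py` is the authoritative reference.
+
+```python
+# Schematic sketch only -- see sl_basic.py for the real, tested version.
+# Signatures and field names below are assumptions based on the Tinker docs.
+import tinker
+from tinker import types
+
+service_client = tinker.ServiceClient()
+training_client = service_client.create_lora_training_client(
+    base_model="meta-llama/Llama-3.2-1B",  # any base model available to your account
+)
+
+# Toy pre-tokenized data: each inner list is one training sequence of token ids.
+toy_batches = [[[1, 2, 3, 4, 5], [1, 2, 3, 4, 5, 6]] for _ in range(3)]
+
+for batch_tokens in toy_batches:
+    batch = [
+        types.Datum(
+            model_input=types.ModelInput.from_ints(tokens[:-1]),
+            loss_fn_inputs={
+                "target_tokens": tokens[1:],           # next-token targets
+                "weights": [1.0] * (len(tokens) - 1),  # 1.0 = include token in the loss
+            },
+        )
+        for tokens in batch_tokens
+    ]
+    fwd_bwd = training_client.forward_backward(batch, loss_fn="cross_entropy")
+    optim = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
+    fwd_bwd.result()  # both calls return futures; block to surface errors
+    optim.result()
+```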
|
-The `Env`s are created by `EnvGroupBuilder`s. The group of envs returned by an `EnvGroupBuilder` has something in common: either the envs correspond to the same task (in which case we can use this information for variance reduction, as in GRPO, which centers advantages per group), or we can use the group to define a multi-agent environment.
+### Importing our utilities
|
-- One common multi-agent environment is one where we use a pairwise preference model to compare pairs of completions.
-- We can also use the group to define a two-player game. Some two-player games such as tic-tac-toe are currently supported through the [textarena](tinker_cookbook/rl/textarena_envs.py) environments.
+The Tinker Cookbook includes several patterns we like. Here's a quick overview; a short import sketch follows the list:
+- [renderers]() convert tokens to and from structured chat message objects
+- [hyperparam_utils]() helps calculate hyperparameters suitable for LoRAs
+- [evaluation]() shows how to evaluate Tinker models and integrates with InspectAI to make evaluating on standard benchmarks easy
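+
+As an example, pulling a single helper into your own code might look like the sketch below; the helper name and the model string are assumptions for illustration, so check `hyperparam_utils` itself before relying on them:
+
+```python
+# Minimal sketch of importing one cookbook utility on its own.
+# NOTE: the helper name is an assumption; check tinker_cookbook/hyperparam_utils.py.
+from tinker_cookbook import hyperparam_utils
+
+# LoRA runs generally want a larger learning rate than full fine-tuning;
+# this helper suggests a reasonable default for a given base model.
+lr = hyperparam_utils.get_lr("meta-llama/Llama-3.1-8B")
+print(f"suggested LoRA learning rate: {lr}")
+```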
|
+## Contributing
|
-## Notation
+We welcome community contributions to the Tinker Cookbook. At the same time, we want to keep this official repo lean and hackable.
+If you build a cool project, please share it with us; we'd love to highlight it in `FEATURED_PROJECTS.md`.
+If you want to help us improve the core utilities in the Tinker Cookbook, please familiarize yourself with `CONTRIBUTING.md`. We also post ideas on where we could use help.
|
-We'll use subscripts to indicate the shapes of objects. For example, `tokens_P_G_T` indicates a three-dimensional array of tokens, with `P` problems, `G` groups, and `T` tokens per group, so `tokens_P_G_T[p][g][t]` should refer to a single token. In many cases, the arrays will be ragged. E.g., the `T` axis will have different lengths for different `(p,g)`. Sometimes, a given dimension will be flattened from two dimensions. If we write `tokens_PG_T`, that means that we have a two-dimensional array, where the 0th dimension is flattened from the `P` and `G` dimensions.
-
-### Common Dimension Names
-
-Here are the standard dimension subscripts used throughout the codebase:
-
-- `_D`: Data/Datum dimension (for training data items)
-- `_G`: Group dimension (for multiple attempts/rollouts of the same problem)
-- `_P`: Problem dimension (for different problems/prompts)
-- `_T`: Token/Time dimension (for sequences)
-
-The relationship between dimensions in RL:
-- A batch contains multiple problems (`_P`)
-- Each problem spawns multiple attempts/environments (`_G`), forming a group
-- Each attempt produces one trajectory
-- Advantages are normalized within each group (across the `_G` dimension)
-
-Examples:
-- `env_group_builders_P`: A list of environment builders, one per problem
-- `trajectories_G`: Multiple trajectories from attempts at the same problem
-- `rewards_G`: Rewards for each attempt within a group
-- `tokens_P_G_T`: Tokens with problem, group, and time dimensions
-- `data_D`: A list of training data items
+For general feedback, you can XXXX // TODO(tianyi): check with clare