git clone https://github.com/llama2d/llama2d.git --recursive
cd transformers && pip install -e . && cd ..
pip install -r requirements.txt
playwright install
pre-commit install
Secrets are posted in this Slack thread.
- Download the gcp-vision.json credential file from our Slack channel and put it in secrets/.
- Run the Modal login command in the Slack channel. It looks like this:
modal token set --token-id <secret> --token-secret <secret>
Datasets are defined in the src/llama2d/datasets/ directory.
Every row of a dataset is defined by a prompt, a 2d "screen", and an output.
However, a row is converted into pure tokens before being fed into Llama - see this dataset for an example.
You can visualize a dataset on Huggingface by copying all the numbers in a row and pasting them into this webpage.
We will have lots of synthetic datasets, e.g. the Zoo Compass dataset defined in src/llama2d/datasets/synthetic/zoo_compass.py.
These datasets are simple. They each spit out a bunch of rows with prompt: str, screen: Llama2dScreen, and output: str.
It is easy to create a Llama2dScreen:
from llama2d.vision import Llama2dScreen
screen = Llama2dScreen()
screen.push_word(word="north",xy=(0.5,0))
screen.push_word(word="south",xy=(0.5,1))
screen.push_word(word="east",xy=(1,0.5))
screen.push_word(word="west",xy=(0,0.5))
To create this dataset, look at it in your console, and publish it to Huggingface, run the following:
python -m llama2d.datasets.synthetic.zoo_compass
I recommend reading the Zoo Compass dataset code for reference.
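For orientation, the row shape described above can be sketched in plain Python. This is a minimal sketch, not the Zoo Compass implementation: SketchScreen is a hypothetical stand-in for Llama2dScreen that only mimics the push_word(word=..., xy=...) call shown earlier, and SketchRow, make_compass_row, and the prompt wording are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SketchScreen:
    """Hypothetical stand-in for Llama2dScreen: words with (x, y) positions."""

    words: List[Tuple[str, Tuple[float, float]]] = field(default_factory=list)

    def push_word(self, word: str, xy: Tuple[float, float]) -> None:
        # Mirrors the keyword arguments used in the Llama2dScreen example.
        self.words.append((word, xy))


@dataclass
class SketchRow:
    """Hypothetical row: a prompt, a 2d screen, and an expected output."""

    prompt: str
    screen: SketchScreen
    output: str


def make_compass_row(direction: str) -> SketchRow:
    # Same layout as the example above: x and y are fractions of the screen.
    layout = {
        "north": (0.5, 0.0),
        "south": (0.5, 1.0),
        "east": (1.0, 0.5),
        "west": (0.0, 0.5),
    }
    screen = SketchScreen()
    for word, xy in layout.items():
        screen.push_word(word=word, xy=xy)
    x, y = layout[direction]
    # Assumed prompt/output format, chosen only for this sketch.
    return SketchRow(
        prompt=f"Which word is at x={x}, y={y}?",
        screen=screen,
        output=direction,
    )


rows = [make_compass_row(d) for d in ("north", "south", "east", "west")]
```

Each element of rows carries the three fields named above; the real dataset code converts such rows into tokens before training.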
This dataset contains over 600 retail websites. The task is next-token prediction.
Here, the prompt and output are empty. The website text is all in the screen.
The model is trained to predict the next token of the website text. It is NOT trained to predict the position of the next token.
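The objective above can be illustrated with a small sketch. It assumes the screen has already been flattened into parallel lists of token ids and (x, y) positions; the function name next_token_pairs and the toy ids are hypothetical, and the real training code may batch and mask differently. In a causal model each prediction conditions on all earlier tokens; for brevity the sketch shows only the most recent input. The point is that positions appear on the input side, while the target is a token id alone.

```python
from typing import List, Tuple

Position = Tuple[float, float]


def next_token_pairs(
    token_ids: List[int], positions: List[Position]
) -> List[Tuple[Tuple[int, Position], int]]:
    """Pair each (token, position) input with the id of the token that follows it."""
    if len(token_ids) != len(positions):
        raise ValueError("token_ids and positions must have the same length")
    pairs = []
    for i in range(len(token_ids) - 1):
        model_input = (token_ids[i], positions[i])  # position is input only
        target = token_ids[i + 1]  # target is a token id, never a position
        pairs.append((model_input, target))
    return pairs


# Toy values, not real Llama token ids.
pairs = next_token_pairs([11, 22, 33], [(0.1, 0.2), (0.3, 0.2), (0.5, 0.2)])
```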
This dataset is implemented in src/llama2d/datasets/pretraining.py.
To collect this dataset and upload it to Huggingface, run the file:
python -m src.llama2d.datasets.pretraining
This dataset contains ~1000 tasks from the Mind2Web dataset.
The task is to take an intention and a screenshot of a webpage and choose the correct action to take.
To download this dataset, first download the Mind2Web mhtml files generated by Andrew Stelmach.
The zip with the files is here. Download it and unzip it into src/data/mind2web-mhtml. Your src/data/mind2web-mhtml directory should look like this:
src/data/mind2web-mhtml
├── 0004f2a7-90d6-4f96-902a-b1d25d39a93d_before.mhtml
├── 00068a1e-b6a3-4c53-a60c-3ed777d4b05d_before.mhtml
├── 00146964-4b74-4e28-8292-5810a604639a_before.mhtml
├── 0018120a-8da1-4a36-a1c4-b4642c97211b_before.mhtml
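Before processing, it can help to confirm the files landed where expected. The following is a minimal sketch using only the standard library; the helper name find_before_snapshots is hypothetical, and the *_before.mhtml pattern is inferred from the listing above rather than taken from the project code.

```python
from pathlib import Path
from typing import List


def find_before_snapshots(root: str = "src/data/mind2web-mhtml") -> List[Path]:
    """Return the *_before.mhtml files directly inside root, sorted by name."""
    directory = Path(root)
    if not directory.is_dir():
        raise FileNotFoundError(
            f"{root} does not exist; unzip the mhtml files there first"
        )
    return sorted(directory.glob("*_before.mhtml"))
```

Calling find_before_snapshots() from the repository root and checking the length of the result gives a quick sanity check that the unzip went to the right place.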
To process and cache the Mind2Web dataset, run the following:
python -m llama2d.datasets.mind2web
To train a model with Modal, change your directory to src/llama2d/modal/ and run, for example:
modal run train.py --dataset hf_dataset.py --repo src/llama2d/llama2d-mind2web --no-peft --num-epochs 4
peft here refers to LoRA (parameter-efficient fine-tuning), so --no-peft trains without it. hf_dataset.py means we are using a dataset uploaded to Huggingface (thanks Matthew!). src/llama2d/llama2d-mind2web is the Huggingface repo containing the dataset.
To add a requirement, add it to requirements.in, run pip-compile, and run pip-sync.
Run black . --exclude '/transformers/|/venv/' to format the code.
Pre-commit hooks are used to maintain code quality.