
Commit 3f9ab26: Update README.md

1 parent 2919e7c

1 file changed (+69, -2 lines)


README.md

Lines changed: 69 additions & 2 deletions
### Advanced Usage

__Get the newline or sentence boundary probabilities for a text:__

```python
# returns newline probabilities (supports batching!)
wtp.predict_proba(text)
# ...
wtp.predict_proba(text, lang_code="en", style="ud")
```

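Turning a probability vector like this into hard splits is a simple thresholding loop. The sketch below is illustrative only: it assumes `predict_proba` returns one boundary probability per input character (verify the actual return shape), and `split_at_boundaries` is not part of the wtpsplit API.

```python
def split_at_boundaries(text, probs, threshold=0.5):
    # Cut the text after every character whose boundary probability
    # exceeds the threshold (illustrative helper, not wtpsplit API).
    sentences, start = [], 0
    for i, p in enumerate(probs):
        if p > threshold:
            sentences.append(text[start : i + 1].strip())
            start = i + 1
    if start < len(text):
        # keep any trailing text that had no boundary after it
        sentences.append(text[start:].strip())
    return [s for s in sentences if s]

# toy probabilities with peaks after each '.'
print(split_at_boundaries("Hi. Ok.", [0.0, 0.0, 0.9, 0.0, 0.0, 0.0, 0.8]))  # → ['Hi.', 'Ok.']
```

A higher threshold trades recall for precision: fewer, more confident splits.
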

__Load a WtP model in [HuggingFace `transformers`](https://github.com/huggingface/transformers):__

```python
# import wtpsplit to register the custom models
import wtpsplit
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini") # or some other model name
```

__** NEW ** Adapt to your own corpus using WtP_Punct:__

Clone the repository:

```
git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit
```

Create your data:

```python
import torch

torch.save(
    {
        "en": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "dummy-dataset.pth"
)
```

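The saved layout nests language → task → dataset name. As a sanity check, the same structure can be walked in plain Python before saving; the `list_datasets` helper below is illustrative, not part of wtpsplit:

```python
# The eval data maps: language -> "sentence" -> dataset name ->
# {"meta": {"train_data": [...]}, "data": [...test sentences...]}.
eval_data = {
    "en": {
        "sentence": {
            "dummy-dataset": {
                "meta": {"train_data": ["train sentence 1", "train sentence 2"]},
                "data": ["test sentence 1", "test sentence 2"],
            }
        }
    }
}

def list_datasets(data):
    # Enumerate every (language, dataset) pair in the nested layout,
    # e.g. to check what --include_langs can select.
    return [
        (lang, name)
        for lang, tasks in data.items()
        for name in tasks.get("sentence", {})
    ]

print(list_datasets(eval_data))  # → [('en', 'dummy-dataset')]
```
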
Run adaptation:

```
python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en
```

This should print something like:

```
en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json
```

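The results path printed above is a plain JSON file, so it can be inspected with the standard library. The exact schema depends on your datasets, and `load_results` below is just an illustrative helper:

```python
import json
from pathlib import Path

def load_results(path):
    # Parse the intrinsic results JSON written by adapt.py; the schema
    # varies with your datasets, so simply return the parsed object.
    return json.loads(Path(path).read_text())

# Adjust the path to wherever adapt.py wrote your results, e.g.:
# results = load_results("wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json")
# print(json.dumps(results, indent=2))
```
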
i.e., it runs adaptation on your data and saves the mixtures and evaluation results. You can then load and use the mixture like this:

```python
from wtpsplit import WtP
import skops.io as sio

wtp = WtP(
    "wtp-bert-mini",
    mixtures=sio.load(
        "wtpsplit/.cache/wtp-bert-mini.skops",
        ["numpy.float32", "numpy.float64", "sklearn.linear_model._logistic.LogisticRegression"],
    ),
)

wtp.split("your text here", lang_code="en", style="dummy-dataset")
```

... and adjust the dataset name, language, and model in the above to your needs.

## Reproducing the paper

`configs/` contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this:
