### Advanced Usage

__Get the newline or sentence boundary probabilities for a text:__

```python
# returns newline probabilities (supports batching!)
wtp.predict_proba(text)

# returns sentence boundary probabilities for the given style
wtp.predict_proba(text, lang_code="en", style="ud")
```
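The probabilities can also be thresholded by hand if you want to control the splitting yourself. A minimal sketch, using a made-up `probs` array in place of the real `predict_proba` output (assumed here to be one probability per character):

```python
import numpy as np

text = "Hello world. How are you?"

# made-up per-character newline probabilities standing in for wtp.predict_proba(text)
probs = np.zeros(len(text))
probs[11] = 0.9  # position of the "." ending the first sentence

threshold = 0.5
cut_points = np.where(probs > threshold)[0] + 1  # split *after* each boundary character
sentences = [
    text[i:j].strip()
    for i, j in zip([0, *cut_points], [*cut_points, len(text)])
]
```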

__Load a WtP model in [HuggingFace `transformers`](https://github.com/huggingface/transformers):__

```python
# import wtpsplit to register the custom models
import wtpsplit
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("benjamin/wtp-bert-mini")  # or some other model name
```

__**NEW** Adapt to your own corpus using WtP_Punct:__

Clone the repository:

```
git clone https://github.com/bminixhofer/wtpsplit
cd wtpsplit
```

Create your data:

```python
import torch

torch.save(
    {
        "en": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ],
                }
            }
        }
    },
    "dummy-dataset.pth",
)
```
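If your sentences already live in plain lists, a small helper (hypothetical, not part of wtpsplit) can build the nested structure shown above:

```python
def make_eval_data(lang, dataset_name, train_sentences, test_sentences):
    # hypothetical helper: builds the nested dict from the example above
    # ({lang: {"sentence": {dataset: {"meta": {"train_data": ...}, "data": ...}}}})
    return {
        lang: {
            "sentence": {
                dataset_name: {
                    "meta": {"train_data": list(train_sentences)},
                    "data": list(test_sentences),
                }
            }
        }
    }

data = make_eval_data(
    "en",
    "dummy-dataset",
    ["train sentence 1", "train sentence 2"],
    ["test sentence 1", "test sentence 2"],
)
```

Save the result with `torch.save(data, "dummy-dataset.pth")` as in the example above.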

Run adaptation:

```
python3 wtpsplit/evaluation/adapt.py --model_path=benjamin/wtp-bert-mini --eval_data_path dummy-dataset.pth --include_langs=en
```

This should print something like:

```
en dummy-dataset U=0.500 T=0.667 PUNCT=0.667
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 30.52it/s]
Wrote mixture to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini.skops
Wrote results to /Users/bminixhofer/Documents/wtpsplit/wtpsplit/.cache/wtp-bert-mini_intrinsic_results.json
```

This runs adaptation on your data and saves the mixtures and evaluation results. You can then load and use the mixture like this:
212+
213+ ``` python
214+ from wtpsplit import WtP
215+ import skops.io as sio
216+
217+ wtp = WtP(
218+ " wtp-bert-mini" ,
219+ mixtures = sio.load(
220+ " wtpsplit/.cache/wtp-bert-mini.skops" ,
221+ [" numpy.float32" , " numpy.float64" , " sklearn.linear_model._logistic.LogisticRegression" ],
222+ ),
223+ )
224+
225+ wtp.split(" your text here" , lang_code = " en" , style = " dummy-dataset" )
226+ ```

Adjust the dataset name, language and model in the above to your needs.

## Reproducing the paper

`configs/` contains the configs for the runs from the paper. We trained on a TPUv3-8. Launch training like this: