Commit 2ebe22a

improve documentation
1 parent 834f918 commit 2ebe22a

20 files changed, +318716 -282 lines changed

README.md

Lines changed: 12 additions & 15 deletions
@@ -41,19 +41,20 @@ Run notebooks in the order listed
 * 100_train_test_split - split similar names into train and test sets, removing bad pairs
   * input: similar-v2, pref-names, bad-pairs
   * output: train-v2, test-v2
+*
 * 200_generate_triplets - generate triplets from training data
   * input: train-v2
   * output: triplets
 * 204_generate_subword_tokenizer - train a subword tokenizer
   * input: triplets, pref-names, train-v2
   * output: subword-tokenizer
-* 205_augment_common_non_negatives - augment common non-negatives with additional names
-  * input: common-non-negatives, triplets, name-variants, given-nicknames
-  * output: common-non-negatives-augmented
+* 205_generate_common_non_negatives - generate pairs of names that are not negative examples
+  * input: std-buckets, pref-names, triplets, given-nicknames
+  * output: common-non-negatives
 * 206_analyze_triplets - review triplets (optional)
-  * input: triplets, pref-names, common-non-negatives-augmented,
+  * input: triplets, pref-names, common-non-negatives,
 * 207_augment_triplets - augment triplets with additional triplets
-  * input: triplets, pref-names, common-non-negatives-augmented, subword-tokenizer
+  * input: triplets, pref-names, common-non-negatives, subword-tokenizer
   * output: triplets-augmented
 * 220_create_language_model_dataset - create a dataset to train roberta
   * input: pref-names, tree-hr-parquet
@@ -62,16 +63,16 @@ Run notebooks in the order listed
   * input: all-tree-hr-names-sample, pref-names
   * output: roberta
 * 222_train_cross_encoder - train a cross-encoder model
-  * input: roberta, triplets-augmented, pref-names
+  * input: roberta, triplets-augmented
   * output: cross-encoder
 * 223_generate_triplets_from_cross_encoder - generate triplets for training the bi-encoder from the cross-encoder
-  * input: pref-names, train-v2, common-non-negatives-augmented, std-buckets, cross-encoder
+  * input: pref-names, train-v2, common-non-negatives, std-buckets, cross-encoder
   * output: cross-encoder-triplets-0 and cross-encoder-triplets-common (run twice)
 * 224_train_bi_encoder - train a bi-encoder model
   * input: cross-encoder-triplets-common-0-augmented, subword-tokenizer
   * output: bi-encoder
 * 230_eval_bi_encoder - evaluate a bi-encoder model, used to pick hyperparameters
-  * input: subword-tokenizer, bi-encoder, pref-names, triplets, common-non-negatives-augmented
+  * input: subword-tokenizer, bi-encoder, pref-names, triplets, common-non-negatives
 * 240_create_clusters_from_buckets - split buckets into clusters using the cross encoder; clusters in the same bucket form a super-cluster
   * input: std-buckets, subword-tokenizer, cross-encoder, bi-encoder, pref-names
   * output: clusters, super-clusters
@@ -101,24 +102,20 @@ Run notebooks in the order listed
   * f"../data/models/bi_encoder-{given_surname}-{model_type}.pth"
 * clusters - similar names from the same bucket
   * f"../data/processed/clusters_{given_surname}-{scorer}-{linkage}-{similarity_threshold}-{cluster_freq_normalizer}.json"
-* !!! common-non-negatives - pairs of names that may be similar (are not negative)
-  * f"../references/common_{given_surname}_non_negatives.csv"
-* common-non-negatives-augmented - pairs of names that may be similar (are not negative), augmented
-  * f"../data/processed/common_{given_surname}_non_negatives-augmented.csv"
+* common-non-negatives - pairs of names that may be similar (are not negative)
+  * f"../data/processed/common_{given_surname}_non_negatives.csv"
 * cross-encoder - model to evaluate the similarity of two names
   * f"../data/models/cross-encoder-{given_surname}-10m-265-same-all"
 * cross-encoder-triplets-0 - triplets generated from cross-encoder with num_easy_negs=0
   * f"../data/processed/cross-encoder-triplets-{given_surname}-0.csv"
 * cross-encoder-triplets-common - triplets generated from cross-encoder with num_easy_negs='common'
   * f"../data/processed/cross-encoder-triplets-{given_surname}-common.csv"
-* cross-encoder-triplets-common-0-augmented = cross-encoder-triplets-common + cross-encoder-triplets-0 + test-triplets-augmented
+* cross-encoder-triplets-common-0-augmented = cross-encoder-triplets-common + cross-encoder-triplets-0 + triplets-augmented
   * f"../data/processed/cross-encoder-triplets-{given_surname}-common-0-augmented.csv"
 * dissimilar-v2 - pairs of names from tree-record attachments that are probably not similar
   * f"s3://familysearch-names/processed/tree-hr-{given_surname}-dissimilar-v2.csv.gz"
 * given-nicknames - nicknames for given names (hand curated from a variety of sources)
   * f"../references/givenname_nicknames.csv"
-* !!! name-variants - ???
-  * f"../references/{given_surname}_variants.csv"
 * nearby-clusters - for each cluster, list the nearby clusters
   * f"../data/processed/nearby_clusters_{given_surname}-{scorer}-{linkage}-{similarity_threshold}-{cluster_freq_normalizer}.json"
 * pref-names - preferred names from the tree
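
The common-non-negatives artifact described above is a two-column CSV (name1, name2) that downstream notebooks (206, 207, 230) use to decide whether a pair of names may be similar. A minimal sketch of loading it into an order-independent pair set, assuming only pandas and the column names shown in the notebook diffs below; the sample data is made up:

```python
import io

import pandas as pd


def load_non_negative_pairs(csv_file):
    """Read a name1,name2 CSV and return a set containing both orderings,
    so membership tests don't depend on pair order."""
    df = pd.read_csv(csv_file)
    pairs = set()
    for name1, name2 in df[["name1", "name2"]].itertuples(index=False):
        pairs.add((name1, name2))
        pairs.add((name2, name1))
    return pairs


if __name__ == "__main__":
    # hypothetical sample standing in for common_{given_surname}_non_negatives.csv
    sample = io.StringIO("name1,name2\nann,anna\njon,john\n")
    pairs = load_non_negative_pairs(sample)
    print(("anna", "ann") in pairs)  # True
```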

notebooks/205_augment_non_negatives.ipynb renamed to notebooks/205_generate_common_non_negatives.ipynb

Lines changed: 51 additions & 104 deletions
@@ -16,9 +16,9 @@
 "id": "028823c5",
 "metadata": {},
 "source": [
-"# Augment common non-negatives\n",
+"# Generate common non-negatives\n",
 "\n",
-"Add triplets and name variants and nicknames to common non-negatives"
+"Add existing standard, triplets, and nicknames to common non-negatives"
 ]
 },
 {
@@ -28,6 +28,8 @@
 "metadata": {},
 "outputs": [],
 "source": [
+"import re\n",
+"\n",
 "import pandas as pd\n",
 "from tqdm.auto import tqdm\n",
 "\n",
@@ -41,14 +43,16 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"given_surname = \"given\"\n",
+"given_surname = \"surname\"\n",
+"\n",
+"num_common_names = 10000\n",
 "\n",
-"common_non_negatives_path = f\"../references/common_{given_surname}_non_negatives.csv\"\n",
+"pref_path = f\"s3://familysearch-names/processed/tree-preferred-{given_surname}-aggr.csv.gz\"\n",
+"std_path = f\"../references/std_{given_surname}.txt\"\n",
 "triplets_path=f\"../data/processed/tree-hr-{given_surname}-triplets-v2-1000.csv.gz\"\n",
-"name_variants_path = f\"../references/{given_surname}_variants.csv\"\n",
 "given_nicknames_path = \"../references/givenname_nicknames.csv\"\n",
 "\n",
-"augmented_path = f\"../data/processed/common_{given_surname}_non_negatives-augmented.csv\""
+"non_negatives_path = f\"../data/processed/common_{given_surname}_non_negatives.csv\""
 ]
 },
 {
@@ -64,7 +68,7 @@
 "id": "401ad99c",
 "metadata": {},
 "source": [
-"### read common non-negatives"
+"### read preferred names"
 ]
 },
 {
@@ -74,23 +78,52 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"common_non_negatives_df = read_csv(common_non_negatives_path)\n",
-"print(len(common_non_negatives_df))\n",
-"common_non_negatives_df.head(3)"
+"pref_df = read_csv(pref_path)\n",
+"common_names = set([name for name in pref_df['name'][:num_common_names].tolist() \\\n",
+"                    if len(name) > 1 and re.fullmatch(r'[a-z]+', name)])\n",
+"len(common_names)"
+]
+},
+{
+"cell_type": "markdown",
+"id": "2be4909b",
+"metadata": {},
+"source": [
+"## Start with FS buckets"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "ab78be66",
+"id": "6a359ed7",
 "metadata": {},
 "outputs": [],
 "source": [
+"common_names_set = set(common_names)\n",
 "common_non_negatives = set()\n",
-"for name1, name2 in common_non_negatives_df.values.tolist():\n",
-"    common_non_negatives.add((name1, name2))\n",
-"    common_non_negatives.add((name2, name1))\n",
-"len(common_non_negatives)"
+"\n",
+"with open(std_path) as f:\n",
+"    for ix, line in enumerate(f.readlines()):\n",
+"        line = line.strip()\n",
+"        head_names, tail_names = line.split(':')\n",
+"        head_names = head_names.strip()\n",
+"        tail_names = tail_names.strip()\n",
+"        names = set()\n",
+"        if len(head_names):\n",
+"            names |= set(head_names.split(' '))\n",
+"        if len(tail_names):\n",
+"            names |= set(tail_names.split(' '))\n",
+"        names = [name for name in names if len(name) > 0]\n",
+"        for name1 in names:\n",
+"            if name1 not in common_names_set:\n",
+"                continue\n",
+"            for name2 in names:\n",
+"                if name2 not in common_names_set:\n",
+"                    continue\n",
+"                if name1 == name2:\n",
+"                    continue\n",
+"                common_non_negatives.add((name1, name2))\n",
+"print(len(common_non_negatives))"
 ]
 },
 {
@@ -132,39 +165,6 @@
 "len(common_non_negatives)"
 ]
 },
-{
-"cell_type": "markdown",
-"id": "2a9e8224",
-"metadata": {},
-"source": [
-"### add name variants"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "7174dff9",
-"metadata": {},
-"outputs": [],
-"source": [
-"name_variants_df = read_csv(name_variants_path)\n",
-"print(len(name_variants_df))\n",
-"name_variants_df.head(3)"
-]
-},
-{
-"cell_type": "code",
-"execution_count": null,
-"id": "fc2e63ac",
-"metadata": {},
-"outputs": [],
-"source": [
-"for name1, name2 in name_variants_df.values.tolist():\n",
-"    common_non_negatives.add((name1, name2))\n",
-"    common_non_negatives.add((name2, name1))\n",
-"len(common_non_negatives)"
-]
-},
 {
 "cell_type": "markdown",
 "id": "b7d9ce98",
@@ -197,7 +197,7 @@
 "id": "e0df5a95",
 "metadata": {},
 "source": [
-"## Save augmented non-negatives"
+"## Save common non-negatives"
 ]
 },
 {
@@ -211,66 +211,13 @@
 "for name1, name2 in common_non_negatives:\n",
 "    records.append({'name1': name1, 'name2': name2})\n",
 "df = pd.DataFrame(records)\n",
-"df.to_csv(augmented_path, index=False)"
-]
-},
-{
-"cell_type": "markdown",
-"id": "21fa63a6",
-"metadata": {},
-"source": [
-"## Miscellaneous\n",
-"\n",
-"Generate common non-negatives from existing standard"
-]
-},
-{
-"cell_type": "raw",
-"id": "9278c668",
-"metadata": {},
-"source": [
-"common_names_set = set(common_names)\n",
-"\n",
-"with open(f\"../references/std_{given_surname}.txt\") as f:\n",
-"    for ix, line in enumerate(f.readlines()):\n",
-"        line = line.strip()\n",
-"        head_names, tail_names = line.split(':')\n",
-"        head_names = head_names.strip()\n",
-"        tail_names = tail_names.strip()\n",
-"        names = set()\n",
-"        if len(head_names):\n",
-"            names |= set(head_names.split(' '))\n",
-"        if len(tail_names):\n",
-"            names |= set(tail_names.split(' '))\n",
-"        names = [name for name in names if len(name) > 0]\n",
-"        for i in range(0, len(names)):\n",
-"            if names[i] not in common_names_set:\n",
-"                continue\n",
-"            for j in range(i+1, len(names)):\n",
-"                if names[j] not in common_names_set:\n",
-"                    continue\n",
-"                name1 = names[i]\n",
-"                name2 = names[j]\n",
-"                if name1 > name2:\n",
-"                    name1, name2 = name2, name1\n",
-"                common_non_negatives.add(f\"{name1}:{name2}\")\n",
-"print(len(common_non_negatives))\n",
-"\n",
-"variants = []\n",
-"for name_pair in sorted(common_non_negatives):\n",
-"    name1, name2 = name_pair.split(':')\n",
-"    if name1 > name2:\n",
-"        print(\"ERROR\", name1, name2)\n",
-"    variants.append({\"name1\": name1, \"name2\": name2})\n",
-"print(len(variants))\n",
-"df = pd.DataFrame(variants)\n",
-"df.to_csv(common_non_negatives_path, index=False)"
+"df.to_csv(non_negatives_path, index=False)"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"id": "dd31da69",
+"id": "4a0ac472",
 "metadata": {},
 "outputs": [],
 "source": []

notebooks/206_analyze_triplets.ipynb

Lines changed: 19 additions & 5 deletions
@@ -45,14 +45,14 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"given_surname = \"given\"\n",
+"given_surname = \"surname\"\n",
 "sample_frac = 1.0\n",
-"num_common_names = 1000\n",
-"num_semi_common_names = 1500\n",
+"num_common_names = 1000 if given_surname == \"given\" else 2500\n",
+"num_semi_common_names = 1500 if given_surname == \"given\" else 4000\n",
 "\n",
 "pref_path = f\"s3://familysearch-names/processed/tree-preferred-{given_surname}-aggr.csv.gz\"\n",
 "triplets_path=f\"../data/processed/tree-hr-{given_surname}-triplets-v2-1000.csv.gz\"\n",
-"common_non_negatives_path = f\"../data/processed//common_{given_surname}_non_negatives-augmented.csv\""
+"common_non_negatives_path = f\"../data/processed/common_{given_surname}_non_negatives.csv\""
 ]
 },
 {
@@ -117,6 +117,17 @@
 "triplets_df[(triplets_df['anchor'] == 'zsuzsanna') | (triplets_df['positive'] == 'zsuzsanna')]"
 ]
 },
+{
+"cell_type": "code",
+"execution_count": null,
+"id": "05ce6ee1",
+"metadata": {},
+"outputs": [],
+"source": [
+"name = 'quass'\n",
+"triplets_df[(triplets_df['anchor'] == name) | (triplets_df['positive'] == name)]"
+]
+},
 {
 "cell_type": "markdown",
 "id": "086d25ac",
@@ -312,7 +323,10 @@
 "id": "b09c7c6c",
 "metadata": {},
 "source": [
-"## Review semi-common non-negatives that aren't represented in anchor-pos pairs"
+"## Review semi-common non-negatives that aren't represented in anchor-pos pairs\n",
+"\n",
+"**TODO:** We should ask someone to review these pairs and take out the non-non-negatives (non-matches), \n",
+"and then somehow add the remaining matches when we augment the triplets in notebook 207."
 ]
 },
 {

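The cell added to notebook 206 filters the triplets DataFrame to rows where a chosen name appears as either the anchor or the positive. A hedged, standalone sketch of that boolean-mask pattern; the tiny DataFrame is made up, while the real one is loaded from `triplets_path`:

```python
import pandas as pd

# made-up triplets standing in for the real triplets DataFrame
triplets_df = pd.DataFrame({
    "anchor":   ["quass", "smith", "jones"],
    "positive": ["kwass", "smyth", "quass"],
    "negative": ["brown", "jones", "smith"],
})

name = "quass"
# keep rows where the name appears in either the anchor or positive column
mask = (triplets_df["anchor"] == name) | (triplets_df["positive"] == name)
print(triplets_df[mask])  # rows 0 and 2
```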