From 398e88cfe344f69574e08a2419e8346511c96561 Mon Sep 17 00:00:00 2001
From: Insop Song
Date: Sun, 25 Jul 2021 12:40:43 -0700
Subject: [PATCH] Revert previous commit, winnow is correct as "narrow down"

---
 vsm_01_distributional.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/vsm_01_distributional.ipynb b/vsm_01_distributional.ipynb
index df569e9..aa0812b 100644
--- a/vsm_01_distributional.ipynb
+++ b/vsm_01_distributional.ipynb
@@ -180,7 +180,7 @@
 "1. Scan through your corpus building a dictionary $d$ mapping word-pairs to co-occurrence values. Every time a pair of words $w$ and $w'$ occurs in the same context (as you defined it in 1), increment $d[(w, w')]$ by whatever value is determined by your weighting scheme. You'd increment by $1$ with the weighting scheme that simply counts co-occurrences.\n",
 "\n",
 "1. Using the count dictionary $d$ that you collected in 3, establish your full vocabulary $V$, an ordered list of words types. \n",
-" 1. For large collections of documents, $|V|$ will typically be huge. You will probably want to window the vocabulary at this point. \n",
+" 1. For large collections of documents, $|V|$ will typically be huge. You will probably want to winnow (narrow down) the vocabulary at this point. \n",
 " 1. You might do this by filtering to a specific subset, or just imposing a minimum count threshold. \n",
 " 1. You might impose a minimum count threshold even if $|V|$ is small — for words with very low counts, you simply don't have enough evidence to support good representations.\n",
 " 1. For words outside the vocabulary you choose, you could ignore them entirely or accumulate all their values into a designated _UNK_ vector.\n",
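
For readers of the restored cell: the steps it describes (count co-occurrences into a dictionary $d$, winnow the vocabulary with a minimum count threshold, and accumulate out-of-vocabulary values into a designated _UNK_ vector) can be sketched roughly as below. This is not code from the notebook; the names build_count_matrix, window_size, min_count, and the $UNK token are illustrative assumptions.

from collections import defaultdict

def build_count_matrix(corpus, window_size=2, min_count=2, unk="$UNK"):
    """Count co-occurrences within a fixed window, winnow the vocabulary
    with a minimum count threshold, and fold out-of-vocabulary words into
    a designated UNK row/column. `corpus` is an iterable of token lists."""
    d = defaultdict(float)      # (w, w') -> co-occurrence value
    totals = defaultdict(int)   # w -> raw frequency, used for winnowing
    for tokens in corpus:
        for i, w in enumerate(tokens):
            totals[w] += 1
            for w2 in tokens[i + 1 : i + 1 + window_size]:
                d[(w, w2)] += 1.0   # simple counting scheme: increment by 1
                d[(w2, w)] += 1.0
    # Winnow (narrow down) the vocabulary with a minimum count threshold.
    vocab = sorted(w for w, c in totals.items() if c >= min_count) + [unk]
    idx = {w: i for i, w in enumerate(vocab)}
    def lookup(w):
        return idx.get(w, idx[unk])   # out-of-vocabulary mass goes to UNK
    X = [[0.0] * len(vocab) for _ in vocab]
    for (w, w2), value in d.items():
        X[lookup(w)][lookup(w2)] += value
    return vocab, X

# Toy usage with a pre-tokenized corpus:
toy_corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab, X = build_count_matrix(toy_corpus, window_size=2, min_count=2)
print(vocab)   # ['cat', 'sat', 'the', '$UNK']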