|
25 | 25 | "\n",
|
26 | 26 | "If you took the `triangulate.py` route, you eliminated noise from data to triangulate latitude/longitude coordinates of an actual stuffed unicorn, hidden on campus. And it's not like we controlled the noise at all: the data we generated was **really** noisy. Every data point was generated with a variance of several kilometers.\n",
|
27 | 27 | "\n",
|
28 |
| - "If you walked through the Row of Puzzles, you've worked a truly dazzling array of Python language features. You wrote a decorator to inspect a strange function, you used `requests` and `numpy` to piece together *audio* files! That's freaking amazing.\n", |
| 28 | + "If you walked through the Row of Puzzles, you've used a truly dazzling array of Python language features. You wrote a decorator to inspect a strange function, you used `requests` and `numpy` to piece together *audio* files! That's freaking amazing.\n", |
29 | 29 | "\n",
|
30 | 30 | "Take a pause, breathe, and pat yourself on the back.\n",
|
31 | 31 | "\n",
|
|
67 | 67 | "\n",
|
68 | 68 | "Notice that every user that rated both movies rated them pretty similarly (i.e., the values in the two columns are very close to each other). Based on that, we can conclude that Inside Out is pretty similar to Frozen 2, and if you like one movie, you'll probably like the other. Similarly, if you hate one movie, you'll probably hate the other.\n",
|
69 | 69 | "\n",
|
70 |
| - "We'll compute the \"closeness\" of movies using **cosine similarity**. But first, let's load our data. <br />\n", |
| 70 | + "We'll formalize and compute the \"closeness\" of movies using **cosine similarity**. But first, let's load our data. <br />\n", |
71 | 71 | "*This data comes from CS 124: From Languages to Information*"
|
72 | 72 | ]
|
73 | 73 | },
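If you want a feel for what the loading step might look like, here's a rough sketch. The file name and layout below are assumptions for illustration only; use whatever the assignment's starter code actually provides.

```python
import pandas as pd

# Hypothetical file name/layout -- the assignment's starter code defines
# the real ones. One row per user, one column per movie.
ratings = pd.read_csv('ratings.csv', index_col=0)
ratings.head()
```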
|
|
142 | 142 | "metadata": {},
|
143 | 143 | "source": [
|
144 | 144 | "### `clean_data(ratings)`\n",
|
145 |
| - "Great! We've got our data loaded! Now, let's clean our data. For cosine similarity, we need each column to have norm 1. That is, it's length, as a 9125-dimensional vector, should be 1. Recall that the length of a vector is the square root of the sum of its entries (this is the Pythagorean Theorem, also called the Euclidean norm).\n", |
| 145 | + "Great! We've got our data loaded! Now, let's clean it. For cosine similarity, we need each column to have norm 1. That is, it's length, as a 9125-dimensional vector, should be 1. Recall that the length of a vector is the square root of the sum of its entries, squared (this is the Pythagorean Theorem, also called the Euclidean norm). For example, if $x = (x_1, x_2, \\dots, x_n)$, then\n", |
| 146 | + "$$\\lVert x \\rVert = \\sqrt{x_1^2 + x_2^2 + \\cdots + x_n^2}$$\n", |
146 | 147 | "\n",
|
147 | 148 | "You can compute the norm of a vector using `np.linalg.norm`. That function also supports an `axis` keyword argument, which allows you to compute the norm \"along a given axis,\" to use Michael's terminology. **Be careful:** some movies don't have ratings, so their norm will be 0. To avoid a divide-by-zero issue, leave those columns untouched. It might help to treat their norms as though they're 1, so you don't modify their values when renormalizing.\n",
|
148 | 149 | "\n",
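As a minimal sketch of this step, assuming `ratings` is a 2-D NumPy array with one column per movie (your actual data structure may differ):

```python
import numpy as np

def clean_data(ratings):
    # Norm of each column; axis=0 collapses the rows (users).
    norms = np.linalg.norm(ratings, axis=0)
    # Unrated movies have all-zero columns, hence norm 0. Treating those
    # norms as 1 leaves the zero columns unchanged and avoids dividing by 0.
    norms[norms == 0] = 1
    return ratings / norms
```

Because `norms` has one entry per column, the division broadcasts across columns, scaling each movie's ratings by that movie's norm.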
|
|
201 | 202 | "Unicornelius 0.680 * 4 + 0.737 * 3 = 4.931 -> 0.707 \n",
|
202 | 203 | "```\n",
|
203 | 204 | "\n",
|
204 |
| - "Notice that this vector is the same size as each of the movie vectors (it'll have 671 entries)... That's because we can think of this vector as a vector which represents the *perfect movie* for this user.\n", |
| 205 | + "Notice that this vector is the same size as each of the movie vectors (it'll have 671 entries)... That hints towards the significance of the vector: we can think of it as a vector which represents the *perfect movie* for this user.\n", |
205 | 206 | "\n",
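Here's one way the weighted sum could look in code; the function name and the dict-of-ratings input are illustrative assumptions, not the assignment's required interface:

```python
import numpy as np

def profile_vector(clean_ratings, user_ratings):
    # One entry per user (per row of the normalized matrix).
    profile = np.zeros(clean_ratings.shape[0])
    for movie, rating in user_ratings.items():
        # Weight each rated movie's unit-norm column by the raw rating.
        profile += rating * clean_ratings[:, movie]
    # Renormalize so the profile itself has norm 1 (skip an all-zero profile).
    norm = np.linalg.norm(profile)
    return profile / norm if norm > 0 else profile
```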
|
206 | 207 | "The cosine similarity between two vectors $x = (x_1, x_2, \\dots, x_n)$ and $y = (y_1, y_2, \\dots, y_n)$ (which both have norm 1) is defined as their dot product, or the sum of element-wise products of their entries: $x_1 y_1 + x_2 y_2 + \\cdots + x_n y_n$. This will be a number between 0 and 1 with higher values representing more similar vectors. You can think of the cosine similarity as an estimation of the \"closeness\" between the two vectors.\n",
|
207 | 208 | "\n",
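For instance, with two hand-picked unit-norm vectors:

```python
import numpy as np

x = np.array([0.6, 0.8])  # norm: sqrt(0.36 + 0.64) = 1
y = np.array([0.8, 0.6])  # norm 1 as well
print(np.dot(x, y))       # 0.6*0.8 + 0.8*0.6 = 0.96 -> quite similar
```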
|
|