|
15 | 15 | "source": [
|
16 | 16 | "## CONTENTS\n",
|
17 | 17 | "\n",
|
18 |
| - "* Language Recognition" |
| 18 | + "* Language Recognition\n", |
| 19 | + "* Author Recognition" |
19 | 20 | ]
|
20 | 21 | },
|
21 | 22 | {
|
|
30 | 31 | "\n",
|
31 | 32 | "First we need to build our dataset. We will take as input text in English and in German and we will extract n-gram character models (in this case, *bigrams* for n=2). For English, we will use *Flatland* by Edwin Abbott and for German *Faust* by Goethe.\n",
|
32 | 33 | "\n",
|
33 |
| - "Let's build our text models for each language, which will hold the probability of each bigram occurring in the text." |
| 34 | + "Let's build our text models for each language, which will hold the probability of each bigram occuring in the text." |
34 | 35 | ]
|
35 | 36 | },
|
36 | 37 | {
|
37 | 38 | "cell_type": "code",
|
38 | 39 | "execution_count": 1,
|
39 |
| - "metadata": { |
40 |
| - "collapsed": true |
41 |
| - }, |
| 40 | + "metadata": {}, |
42 | 41 | "outputs": [],
|
43 | 42 | "source": [
|
44 | 43 | "from utils import open_data\n",
|
|
67 | 66 | {
|
68 | 67 | "cell_type": "code",
|
69 | 68 | "execution_count": 2,
|
70 |
| - "metadata": { |
71 |
| - "collapsed": true |
72 |
| - }, |
| 69 | + "metadata": {}, |
73 | 70 | "outputs": [],
|
74 | 71 | "source": [
|
75 | 72 | "from learning import NaiveBayesLearner\n",
|
|
91 | 88 | {
|
92 | 89 | "cell_type": "code",
|
93 | 90 | "execution_count": 3,
|
94 |
| - "metadata": { |
95 |
| - "collapsed": true |
96 |
| - }, |
| 91 | + "metadata": {}, |
97 | 92 | "outputs": [],
|
98 | 93 | "source": [
|
99 | 94 | "def recognize(sentence, nBS, n):\n",
|
|
106 | 101 | " for b, p in P_sentence.dictionary.items():\n",
|
107 | 102 | " ngrams += [b]*p\n",
|
108 | 103 | " \n",
|
| 104 | + " print(ngrams)\n", |
| 105 | + " \n", |
109 | 106 | " return nBS(ngrams)"
|
110 | 107 | ]
|
111 | 108 | },
|
|
121 | 118 | "execution_count": 4,
|
122 | 119 | "metadata": {},
|
123 | 120 | "outputs": [
|
| 121 | + { |
| 122 | + "name": "stdout", |
| 123 | + "output_type": "stream", |
| 124 | + "text": [ |
| 125 | + "[(' ', 'i'), ('i', 'c'), ('c', 'h'), (' ', 'b'), ('b', 'i'), ('i', 'n'), ('i', 'n'), (' ', 'e'), ('e', 'i'), (' ', 'p'), ('p', 'l'), ('l', 'a'), ('a', 't'), ('t', 'z')]\n" |
| 126 | + ] |
| 127 | + }, |
124 | 128 | {
|
125 | 129 | "data": {
|
126 | 130 | "text/plain": [
|
|
141 | 145 | "execution_count": 5,
|
142 | 146 | "metadata": {},
|
143 | 147 | "outputs": [
|
| 148 | + { |
| 149 | + "name": "stdout", |
| 150 | + "output_type": "stream", |
| 151 | + "text": [ |
| 152 | + "[(' ', 't'), ('t', 'u'), ('u', 'r'), ('r', 't'), ('t', 'l'), ('l', 'e'), ('e', 's'), (' ', 'f'), ('f', 'l'), ('l', 'y'), (' ', 'h'), ('h', 'i'), ('i', 'g'), ('g', 'h')]\n" |
| 153 | + ] |
| 154 | + }, |
144 | 155 | {
|
145 | 156 | "data": {
|
146 | 157 | "text/plain": [
|
|
161 | 172 | "execution_count": 6,
|
162 | 173 | "metadata": {},
|
163 | 174 | "outputs": [
|
| 175 | + { |
| 176 | + "name": "stdout", |
| 177 | + "output_type": "stream", |
| 178 | + "text": [ |
| 179 | + "[(' ', 'd'), ('d', 'e'), ('e', 'r'), ('e', 'r'), (' ', 'p'), ('p', 'e'), ('e', 'l'), ('l', 'i'), ('i', 'k'), ('k', 'a'), ('a', 'n'), (' ', 'i'), ('i', 's'), ('s', 't'), (' ', 'h'), ('h', 'i'), ('i', 'e')]\n" |
| 180 | + ] |
| 181 | + }, |
164 | 182 | {
|
165 | 183 | "data": {
|
166 | 184 | "text/plain": [
|
|
181 | 199 | "execution_count": 7,
|
182 | 200 | "metadata": {},
|
183 | 201 | "outputs": [
|
| 202 | + { |
| 203 | + "name": "stdout", |
| 204 | + "output_type": "stream", |
| 205 | + "text": [ |
| 206 | + "[(' ', 'a'), ('a', 'n'), ('n', 'd'), (' ', 't'), (' ', 't'), ('t', 'h'), ('t', 'h'), ('h', 'u'), ('u', 's'), ('h', 'e'), (' ', 'w'), ('w', 'i'), ('i', 'z'), ('z', 'a'), ('a', 'r'), ('r', 'd'), (' ', 's'), ('s', 'p'), ('p', 'o'), ('o', 'k'), ('k', 'e')]\n" |
| 207 | + ] |
| 208 | + }, |
184 | 209 | {
|
185 | 210 | "data": {
|
186 | 211 | "text/plain": [
|
|
202 | 227 | "source": [
|
203 | 228 | "You can add more languages if you want, the algorithm works for as many as you like! Also, you can play around with *n*. Here we used 2, but other numbers work too (even though 2 suffices). The algorithm is not perfect, but it has high accuracy even for small samples like the ones we used. That is because English and German are very different languages. The closer together languages are (for example, Norwegian and Swedish share a lot of common ground) the lower the accuracy of the classifier."
|
204 | 229 | ]
|
| 230 | + }, |
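| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "As a rough sketch (not a definitive recipe), here is how a third language might be added: build one more character model and rebuild the classifier with an extra entry. The Italian text file below is hypothetical, and we assume the English and German bigram models above are named `P_flatland` and `P_faust`, were built with `NgramCharModel`, and that the classifier's `dist` uses labels like `('English', 1)`; adjust these names to match the code above if they differ." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Sketch: adding a third language (file path, model names and labels are assumptions)\n", |
| | + "from utils import open_data\n", |
| | + "from text import *\n", |
| | + "from learning import NaiveBayesLearner\n", |
| | + "\n", |
| | + "italian = open_data(\"IT-text/divina_commedia.txt\").read()  # hypothetical dataset\n", |
| | + "wordseq = words(italian)\n", |
| | + "\n", |
| | + "P_italian = NgramCharModel(2, wordseq)  # same constructor assumed as for English/German\n", |
| | + "\n", |
| | + "# Rebuild the classifier with all three languages\n", |
| | + "dist = {('English', 1): P_flatland, ('German', 1): P_faust, ('Italian', 1): P_italian}\n", |
| | + "nBS = NaiveBayesLearner(dist, simple=True)\n", |
| | + "\n", |
| | + "recognize(\"nel mezzo del cammin di nostra vita\", nBS, 2)" |
| | + ] |
| | + }, |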
| 231 | + { |
| 232 | + "cell_type": "markdown", |
| 233 | + "metadata": {}, |
| 234 | + "source": [ |
| 235 | + "## AUTHOR RECOGNITION\n", |
| 236 | + "\n", |
| 237 | + "Another similar application to language recognition is recognizing who is more likely to have written a sentence, given text written by them. Here we will try and predict text from Edwin Abbott and Jane Austen. They wrote *Flatland* and *Pride and Prejudice* respectively.\n", |
| 238 | + "\n", |
| 239 | + "We are optimistic we can determine who wrote what based on the fact that Abbott wrote his novella on much later date than Austen, which means there will be linguistic differences between the two works. Indeed, *Flatland* uses more modern and direct language while *Pride and Prejudice* is written in a more archaic tone containing more sophisticated wording.\n", |
| 240 | + "\n", |
| 241 | + "Similarly with Language Recognition, we will first import the two datasets. This time though we are not looking for connections between characters, since that wouldn't give that great results. Why? Because both authors use English and English follows a set of patterns, as we show earlier. Trying to determine authorship based on this patterns would not be very efficient.\n", |
| 242 | + "\n", |
| 243 | + "Instead, we will abstract our querying to a higher level. We will use words instead of characters. That way we can more accurately pick at the differences between their writing style and thus have a better chance at guessing the correct author.\n", |
| 244 | + "\n", |
| 245 | + "Let's go right ahead and import our data:" |
| 246 | + ] |
| 247 | + }, |
| 248 | + { |
| 249 | + "cell_type": "code", |
| 250 | + "execution_count": 8, |
| 251 | + "metadata": {}, |
| 252 | + "outputs": [], |
| 253 | + "source": [ |
| 254 | + "from utils import open_data\n", |
| 255 | + "from text import *\n", |
| 256 | + "\n", |
| 257 | + "flatland = open_data(\"EN-text/flatland.txt\").read()\n", |
| 258 | + "wordseq = words(flatland)\n", |
| 259 | + "\n", |
| 260 | + "P_Abbott = UnigramWordModel(wordseq, 5)\n", |
| 261 | + "\n", |
| 262 | + "pride = open_data(\"EN-text/pride.txt\").read()\n", |
| 263 | + "wordseq = words(pride)\n", |
| 264 | + "\n", |
| 265 | + "P_Austen = UnigramWordModel(wordseq, 5)" |
| 266 | + ] |
| 267 | + }, |
| 268 | + { |
| 269 | + "cell_type": "markdown", |
| 270 | + "metadata": {}, |
| 271 | + "source": [ |
| 272 | + "This time we set the `default` parameter of the model to 5, instead of 0. If we leave it at 0, then when we get a sentence containing a word we have not seen from that particular author, the chance of that sentence coming from that author is exactly 0 (since to get the probability, we multiply all the separate probabilities; if one is 0 then the result is also 0). To avoid that, we tell the model to add 5 to the count of all the words that appear.\n", |
| 273 | + "\n", |
| 274 | + "Next we will build the Naive Bayes Classifier:" |
| 275 | + ] |
| 276 | + }, |
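| | + { |
| | + "cell_type": "markdown", |
| | + "metadata": {}, |
| | + "source": [ |
| | + "This check is only an informal sketch: it assumes the word models can be queried like a dictionary for a word's probability (e.g. `P_Abbott['the']`), as `CountingProbDist`-based models allow, and the unseen word is just an arbitrary example." |
| | + ] |
| | + }, |
| | + { |
| | + "cell_type": "code", |
| | + "execution_count": null, |
| | + "metadata": {}, |
| | + "outputs": [], |
| | + "source": [ |
| | + "# Sketch: with default=5 an unseen word gets a small nonzero probability;\n", |
| | + "# with default=0 the second lookup would be exactly 0, zeroing out the whole product\n", |
| | + "print(P_Abbott['the'])         # a word that certainly appears in Flatland\n", |
| | + "print(P_Abbott['xylophone'])   # a word that almost certainly does not appear" |
| | + ] |
| | + }, |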
| 277 | + { |
| 278 | + "cell_type": "code", |
| 279 | + "execution_count": 9, |
| 280 | + "metadata": {}, |
| 281 | + "outputs": [], |
| 282 | + "source": [ |
| 283 | + "from learning import NaiveBayesLearner\n", |
| 284 | + "\n", |
| 285 | + "dist = {('Abbott', 1): P_Abbott, ('Austen', 1): P_Austen}\n", |
| 286 | + "\n", |
| 287 | + "nBS = NaiveBayesLearner(dist, simple=True)" |
| 288 | + ] |
| 289 | + }, |
| 290 | + { |
| 291 | + "cell_type": "markdown", |
| 292 | + "metadata": {}, |
| 293 | + "source": [ |
| 294 | + "Now that we have build our classifier, we will start classifying. First, we need to convert the given sentence to the format the classifier needs. That is, a list of words." |
| 295 | + ] |
| 296 | + }, |
| 297 | + { |
| 298 | + "cell_type": "code", |
| 299 | + "execution_count": 10, |
| 300 | + "metadata": {}, |
| 301 | + "outputs": [], |
| 302 | + "source": [ |
| 303 | + "def recognize(sentence, nBS):\n", |
| 304 | + " sentence = sentence.lower()\n", |
| 305 | + " sentence_words = words(sentence)\n", |
| 306 | + " \n", |
| 307 | + " return nBS(sentence_words)" |
| 308 | + ] |
| 309 | + }, |
| 310 | + { |
| 311 | + "cell_type": "markdown", |
| 312 | + "metadata": {}, |
| 313 | + "source": [ |
| 314 | + "First we will input a sentence that is something Abbott would write. Note the use of square and the simpler language." |
| 315 | + ] |
| 316 | + }, |
| 317 | + { |
| 318 | + "cell_type": "code", |
| 319 | + "execution_count": 11, |
| 320 | + "metadata": {}, |
| 321 | + "outputs": [ |
| 322 | + { |
| 323 | + "data": { |
| 324 | + "text/plain": [ |
| 325 | + "'Abbott'" |
| 326 | + ] |
| 327 | + }, |
| 328 | + "execution_count": 11, |
| 329 | + "metadata": {}, |
| 330 | + "output_type": "execute_result" |
| 331 | + } |
| 332 | + ], |
| 333 | + "source": [ |
| 334 | + "recognize(\"the square is mad\", nBS)" |
| 335 | + ] |
| 336 | + }, |
| 337 | + { |
| 338 | + "cell_type": "markdown", |
| 339 | + "metadata": {}, |
| 340 | + "source": [ |
| 341 | + "The classifier correctly guessed Abbott.\n", |
| 342 | + "\n", |
| 343 | + "Next we will input a more sophisticated sentence, similar to the style of Austen." |
| 344 | + ] |
| 345 | + }, |
| 346 | + { |
| 347 | + "cell_type": "code", |
| 348 | + "execution_count": 12, |
| 349 | + "metadata": {}, |
| 350 | + "outputs": [ |
| 351 | + { |
| 352 | + "data": { |
| 353 | + "text/plain": [ |
| 354 | + "'Austen'" |
| 355 | + ] |
| 356 | + }, |
| 357 | + "execution_count": 12, |
| 358 | + "metadata": {}, |
| 359 | + "output_type": "execute_result" |
| 360 | + } |
| 361 | + ], |
| 362 | + "source": [ |
| 363 | + "recognize(\"a most peculiar acquaintance\", nBS)" |
| 364 | + ] |
| 365 | + }, |
| 366 | + { |
| 367 | + "cell_type": "markdown", |
| 368 | + "metadata": {}, |
| 369 | + "source": [ |
| 370 | + "The classifier guessed correctly again.\n", |
| 371 | + "\n", |
| 372 | + "You can try more sentences on your own. Unfortunately though, since the datasets are pretty small, chances are the guesses will not always be correct." |
| 373 | + ] |
205 | 374 | }
|
206 | 375 | ],
|
207 | 376 | "metadata": {
|
|
220 | 389 | "name": "python",
|
221 | 390 | "nbconvert_exporter": "python",
|
222 | 391 | "pygments_lexer": "ipython3",
|
223 |
| - "version": "3.5.3" |
| 392 | + "version": "3.6.3" |
224 | 393 | }
|
225 | 394 | },
|
226 | 395 | "nbformat": 4,
|
|
0 commit comments