Commit 6d66019

add simple exercises
1 parent 4061ac7 commit 6d66019

3 files changed (+85, -50 lines)


1 - Streamed Corpora.ipynb

Lines changed: 32 additions & 37 deletions
@@ -1,7 +1,7 @@
 {
 "metadata": {
 "name": "",
-"signature": "sha256:00cb81ac4679ddb7ca14eda1511c747897a3d026bb9a633e1cb444a0a64fa06e"
+"signature": "sha256:39725d0baebcf62b577927ded7ef2aa3f9d5a34b1b355b3ebc62bcdd8f53f0c9"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
@@ -93,7 +93,7 @@
 " # get information (metadata) about all files in the tarball\n",
 " file_infos = [file_info for file_info in tf if file_info.isfile()]\n",
 " \n",
-" # print one of them; for example, the last one\n",
+" # print one of them; for example, the first one\n",
 " message = tf.extractfile(file_infos[0]).read()\n",
 " print(message)"
 ],
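
The cell in this hunk reads the 20 newsgroups archive with `tarfile`. As a self-contained sketch of the same pattern (assuming the dataset sits at `./data/20news-bydate.tar.gz`, the path used later in the notebook):

```python
import tarfile

# open the gzipped tarball and list its regular files (skipping directories)
with tarfile.open('./data/20news-bydate.tar.gz', 'r:gz') as tf:
    file_infos = [file_info for file_info in tf if file_info.isfile()]
    # extract and print the first message, as raw bytes
    message = tf.extractfile(file_infos[0]).read()
    print(message)
```
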
@@ -107,14 +107,14 @@
 "source": [
 "This text is typical of real-world data. It contains a mix of relevant text, metadata (email headers), and downright noise. Even its relevant content is unstructured, with email addresses, people's names, quotations etc.\n",
 "\n",
-"Most machine learning methods, topic modeling included, are only as good as the data you give it. At this point, we generally want to clean the data as much as possible. While the subsequent steps in the machine learning pipeline are more or less automated, handling the raw data should reflect the intended purpose of the application, its business logic, idiosyncracies, sanity check (aren't we accidentally receiving and parsing *image data instead of plain text*?). As always with automated processing it's [garbage in, garbage out](http://en.wikipedia.org/wiki/Garbage_in,_garbage_out)."
+"Most machine learning methods, topic modeling included, are only as good as the data you give it. At this point, we generally want to clean the data as much as possible. While the subsequent steps in the machine learning pipeline are more or less automated, handling the raw data should reflect the intended purpose of the application, its business logic, idiosyncracies, sanity check (aren't we accidentally receiving and parsing *image data* instead of *plain text*?). As always with automated processing it's [garbage in, garbage out](http://en.wikipedia.org/wiki/Garbage_in,_garbage_out)."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"As an example, let's write a function that aims to extract only the chunk of relevant text, ignoring the email headers:"
+"As an example, let's write a function that aims to extract only the chunk of relevant text from each message, ignoring email headers:"
 ]
 },
 {
@@ -130,7 +130,7 @@
 " message = gensim.utils.to_unicode(message, 'latin1').strip()\n",
 " blocks = message.split(u'\\n\\n')\n",
 " # skip email headers (first block) and footer (last block)\n",
-" content = u'\\n\\n'.join(blocks[1:-1])\n",
+" content = u'\\n\\n'.join(blocks[1:])\n",
 " return content\n",
 "\n",
 "print process_message(message)"
@@ -145,23 +145,23 @@
 "source": [
 "Feel free to modify this function and test out other ideas for clean up. The flexibility Python gives you in processing text is superb -- [it'd be a crime](http://radimrehurek.com/2014/03/data-streaming-in-python-generators-iterators-iterables/) to hide the processing behind opaque APIs, exposing only one or two tweakable parameters.\n",
 "\n",
-"There are a handful of handy Python libraries for text cleanup: [jusText](https://github.com/miso-belica/jusText) removes HTML boilerplate and extracting \"main text\". FIXME NLTK, Pattern and TextBlob for tokenization and generic NLP. Tokenization, sentence splitting, chunking, POS tagging."
+"There are a handful of handy Python libraries for text cleanup: [jusText](https://github.com/miso-belica/jusText) removes HTML boilerplate and extracts \"main text\" of a web page. [NLTK](http://www.nltk.org/), [Pattern](http://www.clips.ua.ac.be/pattern) and [TextBlob](http://textblob.readthedocs.org/en/dev/) are good for tokenization, POS tagging, sentence splitting and generic NLP, with a nice Pythonic interface. None of them scales very well though, so keep the inputs small."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"**Exercise (10 min)**: FIXME"
+"**Exercise (5 min)**: Modify the `process_message` function to ignore message footers, too."
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"It's a good practice to inspect your data visually, at each point as it passes through your data processing pipeline. Simple printing (logging) a few arbitrary entries, ala UNIX `head`, does wonders for spotting unexpected bugs. *Oh, bad encoding! What is Chinese doing there, we were told all texts are English only? Do these rubbish tokens come from embedded images? How come everything's empty?* Taking a text with \"hey, let's tokenize it into a bag of words blindly, like they do in the tutorials, push it through this magical unicorn ML library and hope for the best\" is ill advised.\n",
+"It's a good practice to inspect your data visually, at each point as it passes through your data processing pipeline. Simple printing (logging) a few arbitrary entries, ala UNIX `head`, does wonders for spotting unexpected bugs. *Oh, bad encoding! What is Chinese doing there, we were told all texts are English only? Do these rubbish tokens come from embedded images? How come everything's empty?* Taking a text with \"hey, let's tokenize it into a bag of words blindly, like they do in the tutorials, push it through this magical unicorn machine learning library and hope for the best\" is ill advised.\n",
 "\n",
-"Another good practice is to keep internal strings as Unicode, and only encode/decode on IO (preferably using UTF8). As of Python 3.3, there is practically no memory penalty for using decoded Unicode ([PEP 393](http://legacy.python.org/dev/peps/pep-0393/))."
+"Another good practice is to keep internal strings as Unicode, and only encode/decode on IO (preferably using UTF8). As of Python 3.3, there is practically no memory penalty for using Unicode over UTF8 byte strings ([PEP 393](http://legacy.python.org/dev/peps/pep-0393/))."
 ]
 },
 {
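
One possible solution to the new 5-minute exercise (a sketch, not the notebook's own answer; the helper name is made up) is to drop the last block as well, restoring the `blocks[1:-1]` slice from the original version of the cell:

```python
import gensim.utils

def process_message_without_footer(message):
    """Like process_message(), but also strips the signature/footer block."""
    message = gensim.utils.to_unicode(message, 'latin1').strip()
    blocks = message.split(u'\n\n')
    # skip email headers (first block) and the footer (last block)
    return u'\n\n'.join(blocks[1:-1])
```
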
@@ -211,7 +211,7 @@
 "source": [
 "This uses the `process_message()` we wrote above, to process each message in turn. The messages are extracted on-the-fly, one after another, using a generator.\n",
 "\n",
-"Such **data streaming** is a very important pattern: real data is typically too large to fit into RAM, and we don't need all of it in RAM at the same time anyway -- that's just wasteful. With streamed data, we can process arbitrarily large input, reading the data from a file on disk, shared network disk, or even more exotic remote network protocols."
+"Such **data streaming** is a very important pattern: real data is typically too large to fit into RAM, and we don't need all of it in RAM at the same time anyway -- that's just wasteful. With streamed data, we can process arbitrarily large input, reading the data from a file on disk, SQL database, shared network disk, or even more exotic remote network protocols."
 ]
 },
 {
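
A minimal sketch of the streaming pattern this paragraph describes: a generator that yields one cleaned message at a time, so memory use stays flat no matter how large the archive is. (The notebook's own helper is `iter_20newsgroups`, whose body is outside this diff; this stand-in assumes `process_message()` from the earlier cell.)

```python
import itertools
import tarfile

def stream_messages(fname):
    """Yield cleaned messages from the tarball one by one, lazily."""
    with tarfile.open(fname, 'r:gz') as tf:
        for file_info in tf:
            if file_info.isfile():
                yield process_message(tf.extractfile(file_info).read())

# consume only as much of the stream as needed, e.g. the first two documents
print(list(itertools.islice(stream_messages('./data/20news-bydate.tar.gz'), 2)))
```
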
@@ -327,7 +327,7 @@
 "\n",
 "print(gensim.utils.lemmatize(\"worked\"))\n",
 "print(gensim.utils.lemmatize(\"working\"))\n",
-"print(gensim.utils.lemmatize(\"The big fat cows jumped over a quick brown foxes.\"))"
+"print(gensim.utils.lemmatize(\"I was working with a working class hero.\"))"
 ],
 "language": "python",
 "metadata": {},
@@ -367,7 +367,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"**Task (10 min)**: Modify `split_words()` to ignore (=not return) generic words, such as \"do\", \"then\", \"be\". These are called stopwords and we may want to remove them because some topic modeling algorithms are sensitive to their presence. An example of common stopwords for English is in `from gensim.parsing.preprocessing import STOPWORDS`."
+"**Exercise (10 min)**: Modify `tokenize()` to ignore (=not return) generic words, such as \"do\", \"then\", \"be\", \"as\"... These are called stopwords and we may want to remove them because some topic modeling algorithms are sensitive to their presence. An example of common stopwords set for English is in `from gensim.parsing.preprocessing import STOPWORDS`."
 ]
 },
 {
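
A sketch of the kind of change this exercise asks for, mirroring the `split_words()` filter that appears further down in this commit: drop stopwords and very short tokens while tokenizing.

```python
from gensim.parsing.preprocessing import STOPWORDS
import gensim.utils

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase and tokenize, ignoring stopwords and tokens shorter than 4 chars."""
    return [word
            for word in gensim.utils.tokenize(text, lower=True)
            if word not in stopwords and len(word) > 3]

print(tokenize("Do not be as generic as these words are."))
```
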
@@ -386,7 +386,7 @@
 "\n",
 "[Named entity recognition (NER)](http://en.wikipedia.org/wiki/Named-entity_recognition) is the task of locating chunks of text that refer to people, locations, organizations etc.\n",
 "\n",
-"Detecting collocations and named entities often has a significant business value: \"General Electric\" stays a single entity (token), rather than two words \"general\" and \"electric\". Same with \"Marathon Petroleum\", \"George Bush\" etc -- the model doesn't confuse its topics via words coming from unrelated entities, such as \"Korea\" and \"Carolina\" via \"North\"."
+"Detecting collocations and named entities often has a significant business value: \"General Electric\" stays a single entity (token), rather than two words \"general\" and \"electric\". Same with \"Marathon Petroleum\", \"George Bush\" etc -- a topic model doesn't confuse its topics via words coming from unrelated entities, such as \"Korea\" and \"Carolina\" via \"North\"."
 ]
 },
 {
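
A toy, self-contained illustration of the collocation detection the next cell builds with NLTK: frequent, strongly associated word pairs are found so they can later be glued into single tokens such as `new_york`. The sample text is made up.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("new york is big , new york is busy , "
         "new jersey is near new york").split()
bcf = BigramCollocationFinder.from_words(words)
# rank candidate pairs by pointwise mutual information and keep the top few
print([' '.join(pair) for pair in bcf.nbest(BigramAssocMeasures.pmi, 5)])
```
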
@@ -412,12 +412,12 @@
 " tcf = TrigramCollocationFinder.from_words(words)\n",
 " tcf.apply_freq_filter(min_freq)\n",
 " trigrams = [' '.join(w) for w in tcf.nbest(TrigramAssocMeasures.chi_sq, top_n)]\n",
-" logging.info(\"%i trigrams found: %s...\" % (len(trigrams), list(trigrams)[:20]))\n",
+" logging.info(\"%i trigrams found: %s...\" % (len(trigrams), trigrams[:20]))\n",
 "\n",
 " bcf = tcf.bigram_finder()\n",
 " bcf.apply_freq_filter(min_freq)\n",
 " bigrams = [' '.join(w) for w in bcf.nbest(BigramAssocMeasures.pmi, top_n)]\n",
-" logging.info(\"%i bigrams found: %s...\" % (len(bigrams), list(bigrams)[:20]))\n",
+" logging.info(\"%i bigrams found: %s...\" % (len(bigrams), bigrams[:20]))\n",
 "\n",
 " pat_gram2 = re.compile('(%s)' % '|'.join(bigrams), re.UNICODE)\n",
 " pat_gram3 = re.compile('(%s)' % '|'.join(trigrams), re.UNICODE)\n",
@@ -438,13 +438,21 @@
 " def __init__(self, fname):\n",
 " self.fname = fname\n",
 " logging.info(\"collecting ngrams from %s\" % self.fname)\n",
+" # generator of documents; one element = list of words\n",
 " documents = (self.split_words(text) for text in iter_20newsgroups(self.fname, log_every=1000))\n",
+" # generator: concatenate (chain) all words into a single sequence, lazily\n",
 " words = itertools.chain.from_iterable(documents)\n",
 " self.bigrams, self.trigrams = best_ngrams(words)\n",
 "\n",
-" def __iter__(self):\n",
-" for message in iter_20newsgroups(self.fname):\n",
-" yield self.tokenize(message)\n",
+" def split_words(self, text, stopwords=STOPWORDS):\n",
+" \"\"\"\n",
+" Break text into a list of single words. Ignore any token that falls into\n",
+" the `stopwords` set.\n",
+"\n",
+" \"\"\"\n",
+" return [word\n",
+" for word in gensim.utils.tokenize(text, lower=True)\n",
+" if word not in STOPWORDS and len(word) > 3]\n",
 "\n",
 " def tokenize(self, message):\n",
 " \"\"\"\n",
@@ -459,15 +467,9 @@
 " text = re.sub(self.bigrams, lambda match: match.group(0).replace(u' ', u'_'), text)\n",
 " return text.split()\n",
 "\n",
-" def split_words(self, text, stopwords=STOPWORDS):\n",
-" \"\"\"\n",
-" Break text into a list of single words. Ignore any token that falls into\n",
-" the `stopwords` set.\n",
-"\n",
-" \"\"\"\n",
-" return [word\n",
-" for word in gensim.utils.tokenize(text, lower=True)\n",
-" if word not in STOPWORDS and len(word) > 3]\n",
+" def __iter__(self):\n",
+" for message in iter_20newsgroups(self.fname):\n",
+" yield self.tokenize(message)\n",
 "\n",
 "%time collocations_corpus = Corpus20News_Collocations('./data/20news-bydate.tar.gz')\n",
 "print(list(itertools.islice(collocations_corpus, 2)))"
@@ -493,7 +495,7 @@
 " \"\"\"Convenience fnc: return the first `n` elements of the stream, as plain list.\"\"\"\n",
 " return list(itertools.islice(stream, n))\n",
 "\n",
-"def best_phrases(document_stream, top_n=1000, prune_at=100000):\n",
+"def best_phrases(document_stream, top_n=1000, prune_at=50000):\n",
 " \"\"\"Return a set of `top_n` most common noun phrases.\"\"\"\n",
 " np_counts = {}\n",
 " for docno, doc in enumerate(document_stream):\n",
@@ -510,8 +512,8 @@
 " # only consider multi-word NEs where each word contains at least one letter\n",
 " if u' ' not in np:\n",
 " continue\n",
-" # ignore phrases that contains too short/non-alphanetic words\n",
-" if all(len([ch for ch in word if ch.isalpha()]) > 2 for word in np.split()):\n",
+" # ignore phrases that contain too short/non-alphabetic words\n",
+" if all(word.isalpha() and len(word) > 2 for word in np.split()):\n",
 " np_counts[np] = np_counts.get(np, 0) + 1\n",
 "\n",
 " sorted_phrases = sorted(np_counts, key=lambda np: -np_counts[np])\n",
@@ -565,13 +567,6 @@
 "metadata": {},
 "outputs": []
 },
-{
-"cell_type": "markdown",
-"metadata": {},
-"source": [
-"**Exercise**: FIXME"
-]
-},
 {
 "cell_type": "heading",
 "level": 2,
