|
1 | 1 | {
|
2 | 2 | "metadata": {
|
3 | 3 | "name": "",
|
4 | | - "signature": "sha256:d47ab3bdbb8947837ed647f17839e988544b737b9660a40130ff49b8b17c1d91" |
| 4 | + "signature": "sha256:faa812492cedf41f121b213c09df73e486bfb9dcde47a584b0596882dbc13c23" |
5 | 5 | },
|
6 | 6 | "nbformat": 3,
|
7 | 7 | "nbformat_minor": 0,
|
|
462 | 462 | "A common pattern that we'll be using is **combining the efficiency of in-memory arrays** (numpy, scipy.sparse) with the **scalability of data streaming**. Instead of processing one document at a time (slow), or all documents at once (non-scalable), we'll be reading **a chunk of documents** into RAM (= as many documents as RAM allows), processing this chunk, then throwing it away and streaming a new chunk into RAM."
|
463 | 463 | ]
|
464 | 464 | },
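| + { |
| + "cell_type": "markdown", |
| + "metadata": {}, |
| + "source": [ |
| + "As a minimal, illustrative sketch of this pattern (not tied to any particular corpus), a small helper generator can collect up to `chunk_size` items from any stream into a plain in-memory list, yield that list for processing, and only then move on to the next chunk. The `corpus.txt` file and the `process()` call in the commented-out usage below are hypothetical placeholders:" |
| + ] |
| + }, |
| + { |
| + "cell_type": "code", |
| + "collapsed": false, |
| + "input": [ |
| + "def iter_chunks(stream, chunk_size):\n", |
| + "    \"\"\"Yield successive in-memory lists of at most `chunk_size` items from `stream`.\"\"\"\n", |
| + "    chunk = []\n", |
| + "    for item in stream:\n", |
| + "        chunk.append(item)\n", |
| + "        if len(chunk) == chunk_size:\n", |
| + "            yield chunk  # process this chunk in RAM...\n", |
| + "            chunk = []   # ...then throw it away and stream in the next one\n", |
| + "    if chunk:\n", |
| + "        yield chunk  # the final, possibly smaller, chunk\n", |
| + "\n", |
| + "print(list(iter_chunks(range(10), 4)))\n", |
| + "\n", |
| + "# hypothetical usage on a real corpus stored one document per line:\n", |
| + "# with open('corpus.txt') as fin:\n", |
| + "#     for chunk in iter_chunks(fin, 10000):\n", |
| + "#         process(chunk)  # e.g. build a numpy/scipy matrix from the chunk, then discard it" |
| + ], |
| + "language": "python", |
| + "metadata": {}, |
| + "outputs": [] |
| + }, |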
|
| 465 | + { |
| 466 | + "cell_type": "heading", |
| 467 | + "level": 3, |
| 468 | + "metadata": {}, |
| 469 | + "source": [ |
| 470 | + "Itertools" |
| 471 | + ] |
| 472 | + }, |
| 473 | + { |
| 474 | + "cell_type": "markdown", |
| 475 | + "metadata": {}, |
| 476 | + "source": [ |
| 477 | + "A [built-in Python library](https://docs.python.org/2/library/itertools.html) for efficient work data streams (iterables, iterators, generators):" |
| 478 | + ] |
| 479 | + }, |
| 480 | + { |
| 481 | + "cell_type": "code", |
| 482 | + "collapsed": false, |
| 483 | + "input": [ |
| 484 | + "import itertools\n", |
| 485 | + "\n", |
| 486 | + "infinite_stream = OddNumbers()\n", |
| 487 | + "\n", |
| 488 | + "# compute the first 10 items (and no more) & print them\n", |
| 489 | + "print(list(itertools.islice(infinite_stream, 10)))\n", |
| 490 | + "\n", |
| 491 | + "# lazily concatenate streams; the result is also infinite\n", |
| 492 | + "concat_stream = itertools.chain('abcde', infinite_stream)\n", |
| 493 | + "print(list(itertools.islice(concat_stream, 10)))\n", |
| 494 | + "\n", |
| 495 | + "numbered_stream = enumerate(infinite_stream) # also infinite\n", |
| 496 | + "print(list(itertools.islice(numbered_stream, 10)))\n", |
| 497 | + "\n", |
| 498 | + "# etc; see the itertools docs for more examples" |
| 499 | + ], |
| 500 | + "language": "python", |
| 501 | + "metadata": {}, |
| 502 | + "outputs": [ |
| 503 | + { |
| 504 | + "output_type": "stream", |
| 505 | + "stream": "stdout", |
| 506 | + "text": [ |
| 507 | + "[1, 3, 5, 7, 9, 11, 13, 15, 17, 19]\n", |
| 508 | + "['a', 'b', 'c', 'd', 'e', 1, 3, 5, 7, 9]\n", |
| 509 | + "[(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 13), (7, 15), (8, 17), (9, 19)]\n" |
| 510 | + ] |
| 511 | + } |
| 512 | + ], |
| 513 | + "prompt_number": 17 |
| 514 | + }, |
| 515 | + { |
| 516 | + "cell_type": "markdown", |
| 517 | + "metadata": {}, |
| 518 | + "source": [ |
| 519 | + "The examples above show another useful pattern: take a small sample of the stream (e.g. the first ten elements) and convert them into plain Python list, with `list(islice(stream, 10))`. To convert an entire stream into list, simply `list(stream)` (watch out for RAM here though, especially with infinite streams!). Nothing beats the simplicity of `list(stream)` for debugging purposes." |
| 520 | + ] |
| 521 | + }, |
465 | 522 | {
|
466 | 523 | "cell_type": "heading",
|
467 | 524 | "level": 2,
|
|
474 | 531 | "cell_type": "markdown",
|
475 | 532 | "metadata": {},
|
476 | 533 | "source": [
|
477 | | - "At any point, you can save the notebook (any notebook) to disk by pressing `CTRL`+`s` or `CMD`+`s`. This will **save all your changes**, including cell outputs.\n", |
| 534 | + "At any point, you can save the notebook (any notebook) to disk by pressing `CTRL`+`s` (or `CMD`+`s`). This will **save all changes you've made to the notebook**, including cell outputs, locally to your disk.\n", |
478 | 535 | "\n",
|
479 | | - "To discard your notebook changes, simply checkout the file again from git (or extract it again from the repository ZIP archive). This will reset the notebook to its original state, **losing all changes**." |
| 536 | + "To discard your notebook changes, simply checkout the notebook file again from git (or extract it again from the repository ZIP archive). This will reset the notebook to its original state, **losing all changes changes**." |
480 | 537 | ]
|
481 | 538 | }
|
482 | 539 | ],
|