diff --git a/.gitignore b/.gitignore index 644ffd486..f013fa1a0 100644 --- a/.gitignore +++ b/.gitignore @@ -44,6 +44,8 @@ tex/modeller-images tex/modeller.markdown tex/objmodel-images tex/objmodel.markdown +tex/ocr-images +tex/ocr.markdown tex/pedometer-images tex/pedometer.markdown tex/sample-images diff --git a/blockcode/blockcode.markdown b/blockcode/blockcode.markdown index 35a8f39c8..aa04a5f26 100644 --- a/blockcode/blockcode.markdown +++ b/blockcode/blockcode.markdown @@ -1,9 +1,11 @@ title: Blockcode: A visual programming toolkit -author: Dethe Elze +author: Dethe Elza + +_[Dethe](https://twitter.com/dethe) is a geek dad, aesthetic programmer, mentor, and creator of the [Waterbear](http://waterbearlang.com/) visual programming tool. He co-hosts the Maker Education Salons and wants to fill the world with robotic origami rabbits._ In block-based programming languages, you write programs by dragging and connecting blocks that represent parts of the program. Block-based languages differ from conventional programming languages, in which you type words and symbols. -Learning a programming language can be difficult because they are extremely sensitive to even the slightest of typos. Most programming languages are case-sensitive, have obscure syntax, and will refuse to run if you get so much as a semicolon in the wrong place --- or worse, leave one out. Further, most programming languages in use today are based on English and their syntax cannot be localized. +Learning a programming language can be difficult because programming languages are extremely sensitive to even the slightest typo. Most programming languages are case-sensitive, have obscure syntax, and will refuse to run if you get so much as a semicolon in the wrong place --- or worse, leave one out. Further, most programming languages in use today are based on English and their syntax cannot be localized. In contrast, a well-done block language can eliminate syntax errors completely. You can still create a program which does the wrong thing, but you cannot create one with the wrong syntax: the blocks just won't fit that way. Block languages are more discoverable: you can see all the constructs and libraries of the language right in the list of blocks. Further, blocks can be localized into any human language without changing the meaning of the programming language. @@ -33,11 +35,11 @@ There is nothing stopping us from adding additional stages to be more like a tra ### Web Applications -In order to make the tool available to the widest possible audience, it is web-native. It's written in HTML, CSS, and JavaScript, so it should work in most browsers and platforms. +In order to make the tool available to the widest possible audience, it is web-native. It's written in HTML, CSS, and JavaScript, so it should work in most browsers and platforms. -Modern web browsers are powerful platforms, with a rich set of tools for building great apps. If something about the implementation became too complex, I took that as a sign that I wasn't doing it "the web way" and, where possible, tried to re-think how to better leverage the tools built into the browser. +Modern web browsers are powerful platforms, with a rich set of tools for building great apps. If something about the implementation became too complex, I took that as a sign that I wasn't doing it "the web way" and, where possible, tried to re-think how to better leverage the tools built into the browser.
-An important difference between web applications and traditional desktop or server applications is the lack of a `main()` or other entry point. There is no explicit run loop because that is already built into the browser and implicit on every web page. All our code will be parsed and executed on load, at which point we can register for events we are interested in for interacting with the user. After the first run, all further interaction with our code will be through callbacks we set up and register, whether we register those for events (like mouse movement), timeouts (fired with the periodicity we specify), or frame handlers (called for each screen redraw, generally 60 frames per second). The browser does not expose full-featured threads either (only shared-nothing web workers). +An important difference between web applications and traditional desktop or server applications is the lack of a `main()` or other entry point. There is no explicit run loop because that is already built into the browser and implicit on every web page. All our code will be parsed and executed on load, at which point we can register for events we are interested in for interacting with the user. After the first run, all further interaction with our code will be through callbacks we set up and register, whether we register those for events (like mouse movement), timeouts (fired with the periodicity we specify), or frame handlers (called for each screen redraw, generally 60 frames per second). The browser does not expose full-featured threads either (only shared-nothing web workers). ## Stepping Through the Code @@ -45,7 +47,7 @@ I've tried to follow some conventions and best practices throughout this projec The code style is procedural, not object-oriented or functional. We could do the same things in any of these paradigms, but that would require more setup code and wrappers to impose on what exists already for the DOM. Recent work on [Custom Elements](http://webcomponents.org/) makes it easier to work with the DOM in an OO way, and there has been a lot of great writing on [Functional JavaScript](https://leanpub.com/javascript-allonge/read), but either would require a bit of shoe-horning, so it felt simpler to keep it procedural. -There are eight source files in this project, but `index.html` and `blocks.css` are basic structure and style for the app and won't be discussed. Two of the JavaScript files won't be discussed in any detail either: `util.js` contains some helpers and serves as a bridge between different browser implementations --- similar to a library like jQuery but in less than 50 lines of code. `file.js` is a similar utility used for loading and saving files and serializing scripts. +There are eight source files in this project, but `index.html` and `blocks.css` are basic structure and style for the app and won't be discussed. Two of the JavaScript files won't be discussed in any detail either: `util.js` contains some helpers and serves as a bridge between different browser implementations --- similar to a library like jQuery but in less than 50 lines of code. `file.js` is a similar utility used for loading and saving files and serializing scripts.
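To give a flavour of what `util.js` provides, here is a hedged sketch of an `elem()` helper like the one `createBlock()` calls below; the repository's actual implementation may differ in its details:

```javascript
// Sketch only: builds a DOM element from a tag name, an attribute map,
// and a list of children (strings become text nodes).
function elem(name, attrs, children){
    var e = document.createElement(name);
    Object.keys(attrs || {}).forEach(function(key){
        e.setAttribute(key, attrs[key]);
    });
    (children || []).forEach(function(child){
        if (typeof child === 'string'){
            child = document.createTextNode(child);
        }
        e.appendChild(child);
    });
    return e;
}
```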
These are the remaining files: @@ -77,8 +79,8 @@ The `createBlock(name, value, contents)` function returns a block as a DOM eleme ```javascript function createBlock(name, value, contents){ - var item = elem('div', - {'class': 'block', draggable: true, 'data-name': name}, + var item = elem('div', + {'class': 'block', draggable: true, 'data-name': name}, [name] ); if (value !== undefined && value !== null){ @@ -89,7 +91,7 @@ The `createBlock(name, value, contents)` function returns a block as a DOM eleme elem('div', {'class': 'container'}, contents.map(function(block){ return createBlock.apply(null, block); }))); - }else if (typeof contents === 'string'){ + }else if (typeof contents === 'string'){ // Add units (degrees, etc.) specifier item.appendChild(document.createTextNode(' ' + contents)); } @@ -97,7 +99,7 @@ The `createBlock(name, value, contents)` function returns a block as a DOM eleme } ``` -We have some utilities for handling blocks as DOM elements: +We have some utilities for handling blocks as DOM elements: - `blockContents(block)` retrieves the child blocks of a container block. It always returns a list if called on a container block, and always returns null on a simple block - `blockValue(block)` returns the numerical value of the input on a block if the block has an input field of type number, or null if there is no input element for the block @@ -116,8 +118,8 @@ We have some utilities for handling blocks as DOM elements: } function blockUnits(block){ - if (block.children.length > 1 && - block.lastChild.nodeType === Node.TEXT_NODE && + if (block.children.length > 1 && + block.lastChild.nodeType === Node.TEXT_NODE && block.lastChild.textContent){ return block.lastChild.textContent.slice(1); } @@ -195,7 +197,7 @@ While we are dragging, the `dragenter`, `dragover`, and `dragout` events give us return; } // Necessary. Allows us to drop. - if (evt.preventDefault) { evt.preventDefault(); } + if (evt.preventDefault) { evt.preventDefault(); } if (dragType === 'menu'){ // See the section on the DataTransfer object. evt.dataTransfer.dropEffect = 'copy'; @@ -216,7 +218,7 @@ When we release the mouse, we get a `drop` event. This is where the magic happen var dropType = 'script'; if (matches(dropTarget, '.menu')){ dropType = 'menu'; } // stops the browser from redirecting. - if (evt.stopPropagation) { evt.stopPropagation(); } + if (evt.stopPropagation) { evt.stopPropagation(); } if (dragType === 'script' && dropType === 'menu'){ trigger('blockRemoved', dragTarget.parentElement, dragTarget); dragTarget.parentElement.removeChild(dragTarget); @@ -277,7 +279,7 @@ We use `scriptDirty` to keep track of whether the script has been modified since var scriptDirty = false; ``` -When we want to notify the system to run the script during the next frame handler, we call `runSoon()` which sets the `scriptDirty` flag to `true`. The system calls `run()` on every frame, but returns immediately unless `scriptDirty` is set. When `scriptDirty` is set, it runs all the script blocks, and also triggers events to let the specific language handle any tasks it needs before and after the script is run. This decouples the blocks-as-toolkit from the turtle language to make the blocks re-usable (or the language pluggable, depending how you look at it). +When we want to notify the system to run the script during the next frame handler, we call `runSoon()` which sets the `scriptDirty` flag to `true`. The system calls `run()` on every frame, but returns immediately unless `scriptDirty` is set. 
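The whole pattern fits in a few lines. This sketch uses the chapter's `runSoon()` and `scriptDirty` names, with a hypothetical `runScriptBlocks()` standing in for the real script runner:

```javascript
var scriptDirty = false;

function runSoon(){ scriptDirty = true; }

function frameHandler(){
    if (scriptDirty){
        scriptDirty = false;
        runScriptBlocks(); // hypothetical: walk the script blocks and run each one
    }
    requestAnimationFrame(frameHandler);
}
requestAnimationFrame(frameHandler);
```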
When `scriptDirty` is set, it runs all the script blocks, and also triggers events to let the specific language handle any tasks it needs before and after the script is run. This decouples the blocks-as-toolkit from the turtle language to make the blocks re-usable (or the language pluggable, depending how you look at it). As part of running the script, we iterate over each block, calling `runEach(evt)` on it, which sets a class on the block, then finds and executes its associated function. If we slow things down, you should be able to watch the code execute as each block highlights to show when it is running. @@ -342,7 +344,7 @@ We define `repeat(block)` here, outside of the turtle language, because it is ge \aosafigure[240pt]{blockcode-images/turtle_example.png}{Example of Turtle code running}{500l.blockcode.turtle} -Turtle programming is a style of graphics programming, first popularized by Logo, where you have an imaginary turtle carrying a pen walking on the screen. You can tell the turtle to pick up the pen (stop drawing, but still move), put the pen down (leaving a line everywhere it goes), move forward a number of steps, or turn a number of degrees. Just those commands, combined with looping, can create amazingly intricate images. +Turtle programming is a style of graphics programming, first popularized by Logo, where you have an imaginary turtle carrying a pen walking on the screen. You can tell the turtle to pick up the pen (stop drawing, but still move), put the pen down (leaving a line everywhere it goes), move forward a number of steps, or turn a number of degrees. Just those commands, combined with looping, can create amazingly intricate images. In this version of turtle graphics we have a few extra blocks. Technically we don't need both `turn right` and `turn left` because you can have one and get the other with negative numbers. Likewise `move back` can be done with `move forward` and negative numbers. In this case it felt more balanced to have both. @@ -359,7 +361,7 @@ The image above was formed by putting two loops inside another loop and adding a var WIDTH, HEIGHT, position, direction, visible, pen, color; ``` -The `reset()` function clears all the state variables to their defaults. If we were to support multiple turtles, these variables would be encapsulated in an object. We also have a utility, `deg2rad(deg)`, because we work in degrees in the UI, but we draw in radians. Finally, `drawTurtle()` draws the turtle itself. The default turtle is simply a triangle, but you could override this to get a more "turtle-looking" turtle. +The `reset()` function clears all the state variables to their defaults. If we were to support multiple turtles, these variables would be encapsulated in an object. We also have a utility, `deg2rad(deg)`, because we work in degrees in the UI, but we draw in radians. Finally, `drawTurtle()` draws the turtle itself. The default turtle is simply a triangle, but you could override this to get a more "turtle-looking" turtle. Note that `drawTurtle` uses the same primitive operations that we define to implement the turtle drawing. Sometimes you don't want to reuse code at different abstraction layers, but when the meaning is clear it can be a big win for code size and performance. @@ -390,7 +392,7 @@ Note that `drawTurtle` uses the same primitive operations that we define to impl } ``` -We have a special block to draw a circle with a given radius at the current mouse position. 
We special-case `drawCircle` because, while you can certainly draw a circle by repeating `MOVE 1 RIGHT 1` 360 times, controlling the size of the circle is very difficult that way. +We have a special block to draw a circle with a given radius at the current mouse position. We special-case `drawCircle` because, while you can certainly draw a circle by repeating `MOVE 1 RIGHT 1` 360 times, controlling the size of the circle is very difficult that way. ```javascript function drawCircle(radius){ @@ -491,7 +493,7 @@ Now we can use the functions above, with the `Menu.item` function from `menu.js` ### Why Not Use MVC? -Model-View-Controller (MVC) was a good design choice for Smalltalk programs in the '80s and it can work in some variation or other for web apps, but it isn't the right tool for every problem. All the state (the "model" in MVC) is captured by the block elements in a block language anyway, so replicating it into Javascript has little benefit unless there is some other need for the model (if we were editing shared, distributed code, for instance). +Model-View-Controller (MVC) was a good design choice for Smalltalk programs in the '80s and it can work in some variation or other for web apps, but it isn't the right tool for every problem. All the state (the "model" in MVC) is captured by the block elements in a block language anyway, so replicating it into JavaScript has little benefit unless there is some other need for the model (if we were editing shared, distributed code, for instance). An early version of Waterbear went to great lengths to keep the model in JavaScript and sync it with the DOM, until I noticed that more than half the code and 90% of the bugs were due to keeping the model in sync with the DOM. Eliminating the duplication allowed the code to be simpler and more robust, and with all the state on the DOM elements, many bugs could be found simply by looking at the DOM in the developer tools. So in this case there is little benefit to building further separation of MVC than we already have in HTML/CSS/JavaScript. @@ -501,14 +503,14 @@ Building a small, tightly scoped version of the larger system I work on has been #### Small Experiments Make Failure OK -Some of the experiments I was able to do with this stripped-down block language were: +Some of the experiments I was able to do with this stripped-down block language were: -- using HTML5 drag-and-drop, +- using HTML5 drag-and-drop, - running blocks directly by iterating through the DOM calling associated functions, - separating the code that runs cleanly from the HTML DOM, - simplified hit testing while dragging, -- building our own tiny vector and sprite libraries (for the game blocks), and -- "live coding" where the results are shown whenever you change the block script. +- building our own tiny vector and sprite libraries (for the game blocks), and +- "live coding" where the results are shown whenever you change the block script. The thing about experiments is that they do not have to succeed. We tend to gloss over failures and dead ends in our work (where failures are punished instead of treated as important vehicles for learning), but failures are essential if you are going to push forward. While I did get the HTML5 drag-and-drop working, the fact that it isn't supported at all on any mobile browser means it is a non-starter for Waterbear.
Separating the code out and running code by iterating through the blocks worked so well that I've already begun bringing those ideas to Waterbear, with excellent improvements in testing and debugging. The simplified hit testing, with some modifications, is also coming back to Waterbear, as are the tiny vector and sprite libraries. Live coding hasn't made it to Waterbear yet, but once the current round of changes stabilizes I may introduce it. diff --git a/build.py b/build.py index 4ad1e3472..681de9eac 100644 --- a/build.py +++ b/build.py @@ -15,6 +15,7 @@ def main(chapters=[], epub=False, pdf=False, html=False, mobi=False, pandoc_epub chapter_dirs = [ 'dagoba', + 'ocr', 'contingent', 'same-origin-policy', 'blockcode', @@ -68,6 +69,7 @@ def main(chapters=[], epub=False, pdf=False, html=False, mobi=False, pandoc_epub ] image_paths = [ + './ocr/ocr-images', './contingent/contingent-images', './same-origin-policy/same-origin-policy-images', './blockcode/blockcode-images', @@ -193,7 +195,7 @@ def build_mobi(): def build_html(chapter_markdowns): run('mkdir -p html/content/pages') temp = 'python _build/preprocessor.py --chapter {chap} --html-refs --html-paths --output={md}.1 --latex {md}' - temp2 = 'pandoc --csl=minutiae/ieee.csl --bibliography=tex/500L.bib -t html -f markdown+citations -o html/content/pages/{basename}.md {md}.1' + temp2 = 'pandoc --csl=minutiae/ieee.csl --mathjax --bibliography=tex/500L.bib -t html -f markdown+citations -o html/content/pages/{basename}.md {md}.1' temp3 = './_build/fix_html_title.sh html/content/pages/{basename}.md' for i, markdown in enumerate(chapter_markdowns): basename = os.path.splitext(os.path.split(markdown)[1])[0] diff --git a/ci/README.rst b/ci/README.rst index 2746a5e6a..549607f52 100644 --- a/ci/README.rst +++ b/ci/README.rst @@ -65,7 +65,7 @@ Copy the tests/ folder from this code base to test_repo and commit it:: cp -r /this/directory/tests /path/to/test_repo/ cd /path/to/test_repo git add tests/ - git commit -m”add tests” + git commit -m "add tests" The repo observer will need its own clone of the code:: @@ -110,7 +110,7 @@ to make a new commit. Go to your master repo and make an arbitrary change:: cd /path/to/test_repo touch new_file git add new_file - git commit -m"new file" new_file + git commit -m "new file" new_file then repo_observer.py will realize that there's a new commit and will notify the dispatcher. You can see the output in their respective shells, so you diff --git a/ci/ci.markdown b/ci/ci.markdown index 8cbe242c4..fd618f655 100644 --- a/ci/ci.markdown +++ b/ci/ci.markdown @@ -179,7 +179,7 @@ Copy the tests folder from this code base to `test_repo` and commit it: $ cp -r /this/directory/tests /path/to/test_repo/ $ cd /path/to/test\_repo $ git add tests/ -$ git commit -m”add tests” +$ git commit -m "add tests" ``` Now you have a commit in the master repository. @@ -227,7 +227,7 @@ modified this assumption for simplicity. The observer must know which repository to observe. We previously created a clone of our repository at `/path/to/test_repo_clone_obs`. -The repository will use this clone to detect changes. To allow the +The observer will use this clone to detect changes. To allow the repository observer to use this clone, we pass it the path when we invoke the `repo_observer.py` file. The repository observer will use this clone to pull from the main repository.
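For example, once both clones exist, the observer might be started with something like the following. This is an invocation sketch; the flag name and dispatcher address are assumptions based on the chapter's defaults, so check `repo_observer.py` itself:

```bash
$ python repo_observer.py --dispatcher-server=localhost:8888 \
      /path/to/test_repo_clone_obs
```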
diff --git a/contingent/contingent.markdown b/contingent/contingent.markdown index 474727669..4b1cc29cb 100644 --- a/contingent/contingent.markdown +++ b/contingent/contingent.markdown @@ -19,8 +19,7 @@ things; he loves seeing the spark of wonder and delight in people's eyes when someone shares a novel, surprising, or beautiful idea. Daniel lives in Atlanta with a microbiologist and four aspiring rocketeers._ -Introduction -============ +## Introduction Build systems have long been a standard tool within computer programming. @@ -32,7 +31,7 @@ It not only lets you declare that an output file depends upon one (or more) inputs, but lets you do this recursively. A program, for example, might depend upon an object file -which itself depends upon the corresponding source code:: +which itself depends upon the corresponding source code: ``` prog: main.o @@ -70,8 +69,7 @@ The problem, again, is cross-referencing. Where do cross-references tend to emerge? In text documents, documentation, and printed books! -The Problem: Building Document Systems -====================================== +## The Problem: Building Document Systems Systems to rebuild formatted documents from source texts always seem to do too much work, or too little. @@ -132,7 +130,7 @@ If you later reconsider the tutorial’s chapter title — after all, the word “newcomer” sounds so antique, as if your users are settlers who have just arrived in pioneer Wyoming — then you would edit the first line of `tutorial.rst` -and write something better:: +and write something better: ``` -Newcomers Tutorial @@ -279,8 +277,7 @@ This can happen for many kinds of cross reference that Sphinx supports: chapter titles, section titles, paragraphs, classes, methods, and functions. -Build Systems and Consistency -============================= +## Build Systems and Consistency The problem outlined above is not specific to Sphinx. Not only does it haunt other document systems, like LaTeX, @@ -290,7 +287,7 @@ with the venerable `make` utility, if their assets happen to cross-reference in interesting ways. As the problem is ancient and universal, -its solution is of equally long lineage:: +its solution is of equally long lineage: ```bash $ rm -r _build/ @@ -332,8 +329,7 @@ while performing the fewest possible rebuild steps. While Contingent can be applied to any problem domain, we will run it against a small version of the problem outlined above. -Linking Tasks To Make a Graph -============================= +## Linking Tasks To Make a Graph Any build system needs a way to link inputs and outputs. The three markup texts in our discussion above, @@ -544,8 +540,7 @@ at either end of the edge. But in return for this redundancy, the data structure supports the fast lookup that Contingent needs. -The Proper Use of Classes -========================= +## The Proper Use of Classes You may have been surprised by the absence of classes in the above discussion @@ -637,7 +632,7 @@ and that the nodes themselves in these early examples are simply strings. Coming from other languages and traditions, one might have expected to see -user-defined classes and interfaces for everything in the system:: +user-defined classes and interfaces for everything in the system: ```java Graph g = new ConcreteGraph(); @@ -862,8 +857,7 @@ will eventually have Contingent do for us: the graph `g` captures the inputs and consequences for the various artifacts in our project's documentation. 
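As a minimal Python sketch of that redundant, two-way edge storage (the chapter's real `Graph` class offers a richer API than this):

```python
from collections import defaultdict

class Graph(object):
    """Sketch: every edge is recorded twice, once per direction."""
    def __init__(self):
        self._inputs_of = defaultdict(set)        # task -> tasks it depends on
        self._consequences_of = defaultdict(set)  # task -> tasks that depend on it

    def add_edge(self, input_task, consequence_task):
        self._consequences_of[input_task].add(consequence_task)
        self._inputs_of[consequence_task].add(input_task)

    def immediate_consequences_of(self, task):
        return sorted(self._consequences_of[task])
```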
-Learning Connections -==================== +## Learning Connections We now have a way for Contingent to keep track of tasks and the relationships between them. @@ -1311,8 +1305,7 @@ at its disposal, Contingent knows all the things to rebuild if the inputs to any tasks change. -Chasing Consequences -==================== +## Chasing Consequences Once the initial build has run to completion, Contingent needs to monitor the input files for changes. @@ -1542,8 +1535,7 @@ nevertheless returned the same value means that all further downstream tasks were insulated from the change and did not get re-invoked. -Conclusion -========== +## Conclusion There exist languages and programming methodologies under which Contingent would be a suffocating forest of tiny classes diff --git a/functionalDB/functionalDB.markdown b/functionalDB/functionalDB.markdown index e2c32dcc5..174523568 100644 --- a/functionalDB/functionalDB.markdown +++ b/functionalDB/functionalDB.markdown @@ -721,7 +721,7 @@ Our data model is based on accumulation of facts (i.e., datoms) over time. For t ### Query Language -Let's look at an example query in our proposed language. This query asks: "What are the names and birthday of entities who like pizza, speak English, and who have a birthday this month?" +Let's look at an example query in our proposed language. This query asks: "What are the names and birthdays of entities who like pizza, speak English, and who have a birthday this month?" ```clojure { :find [?nm ?bd ] :where [ @@ -1281,12 +1281,12 @@ The twist to the index structure is that now we hold a binding pair of the entit At the end of phase 3 of our example execution, we have the following structure at hand: ```clojure - {[1 "?e"] { - [:likes nil] ["Pizza" nil] - [:name nil] ["USA" "?nm"] - [:speaks nil] ["English" nil] - [:birthday nil] ["July 4, 1776" "?bd"]} - }} +{[1 "?e"] { + [:likes nil] ["Pizza" nil] + [:name nil] ["USA" "?nm"] + [:speaks nil] ["English" nil] + [:birthday nil] ["July 4, 1776" "?bd"] +}} ``` #### Phase 4: Unify and Report diff --git a/ocr/data.csv b/ocr/code/data.csv similarity index 100% rename from ocr/data.csv rename to ocr/code/data.csv diff --git a/ocr/dataLabels.csv b/ocr/code/dataLabels.csv similarity index 100% rename from ocr/dataLabels.csv rename to ocr/code/dataLabels.csv diff --git a/ocr/neural_network_design.py b/ocr/code/neural_network_design.py similarity index 100% rename from ocr/neural_network_design.py rename to ocr/code/neural_network_design.py diff --git a/ocr/nn.json b/ocr/code/nn.json similarity index 100% rename from ocr/nn.json rename to ocr/code/nn.json diff --git a/ocr/ocr.html b/ocr/code/ocr.html similarity index 100% rename from ocr/ocr.html rename to ocr/code/ocr.html diff --git a/ocr/ocr.js b/ocr/code/ocr.js similarity index 100% rename from ocr/ocr.js rename to ocr/code/ocr.js diff --git a/ocr/ocr.py b/ocr/code/ocr.py similarity index 96% rename from ocr/ocr.py rename to ocr/code/ocr.py index 90bdff3ac..fb304fe6e 100644 --- a/ocr/ocr.py +++ b/ocr/code/ocr.py @@ -75,13 +75,13 @@ def train(self, training_data_array): actual_vals = [0] * 10 # actual_vals is a python list for easy initialization and is later turned into an np matrix (2 lines down).
actual_vals[data['label']] = 1 output_errors = np.mat(actual_vals).T - np.mat(y2) - hiddenErrors = np.multiply(np.dot(np.mat(self.theta2).T, output_errors), self.sigmoid_prime(sum1)) + hidden_errors = np.multiply(np.dot(np.mat(self.theta2).T, output_errors), self.sigmoid_prime(sum1)) # Step 4: Update weights - self.theta1 += self.LEARNING_RATE * np.dot(np.mat(hiddenErrors), np.mat(data['y0'])) + self.theta1 += self.LEARNING_RATE * np.dot(np.mat(hidden_errors), np.mat(data['y0'])) self.theta2 += self.LEARNING_RATE * np.dot(np.mat(output_errors), np.mat(y1).T) self.hidden_layer_bias += self.LEARNING_RATE * output_errors - self.input_layer_bias += self.LEARNING_RATE * hiddenErrors + self.input_layer_bias += self.LEARNING_RATE * hidden_errors def predict(self, test): y1 = np.dot(np.mat(self.theta1), np.mat(test).T) diff --git a/ocr/server.py b/ocr/code/server.py similarity index 94% rename from ocr/server.py rename to ocr/code/server.py index a40076028..b8a3e77f6 100644 --- a/ocr/server.py +++ b/ocr/code/server.py @@ -24,8 +24,8 @@ class JSONHandler(BaseHTTPServer.BaseHTTPRequestHandler): def do_POST(s): response_code = 200 response = "" - varLen = int(s.headers.get('Content-Length')) - content = s.rfile.read(varLen); + var_len = int(s.headers.get('Content-Length')) + content = s.rfile.read(var_len); payload = json.loads(content); if payload.get('train'): diff --git a/ocr/ocr-images/ann.png b/ocr/ocr-images/ann.png new file mode 100644 index 000000000..6a503c073 Binary files /dev/null and b/ocr/ocr-images/ann.png differ diff --git a/ocr/ocr.markdown b/ocr/ocr.markdown new file mode 100644 index 000000000..01cc6ba14 --- /dev/null +++ b/ocr/ocr.markdown @@ -0,0 +1,779 @@

title: Optical Character Recognition (OCR)
author: Marina Samuel

## Introduction

What if your computer could wash your dishes, do your laundry, cook you dinner, and clean your home? I think I can safely say that most people would be happy to get a helping hand! But what would it take for a computer to be able to perform these tasks, in exactly the same way that humans can?

The famous computer scientist Alan Turing proposed the Turing Test as a way to identify whether a machine could have intelligence indistinguishable from that of a human being. The test involves a human posing questions to two hidden entities, one human and the other a machine, and trying to identify which is which. If the interrogator is unable to identify the machine, then the machine is considered to have human-level intelligence.

While there is a lot of controversy surrounding whether the Turing Test is a valid assessment of intelligence, and whether we can build such intelligent machines, there is no doubt that machines with some degree of intelligence already exist. There is currently software that helps robots navigate an office and perform small tasks, or helps those suffering from Alzheimer's. More common examples of Artificial Intelligence (A.I.) are the way that Google estimates what you're looking for when you search for some keywords, or the way that Facebook decides what to put in your news feed.

One well-known application of A.I. is Optical Character Recognition (OCR). An OCR system is a piece of software that can take images of handwritten characters as input and convert them into machine-readable text. While you may not think twice when depositing a handwritten cheque into a bank machine that confirms the deposit value, there is some interesting work going on in the background.
This chapter will examine a working example of a simple OCR system that recognizes numerical digits using an Artificial Neural Network (ANN). But first, let's establish a bit more context.

## What is Artificial Intelligence?
\label{sec.ocr.ai}

While Turing's definition of intelligence sounds reasonable, at the end of the day what constitutes intelligence is fundamentally a philosophical debate. Computer scientists have, however, categorized certain types of systems and algorithms into branches of AI. Each branch is used to solve certain sets of problems. These branches include the following examples, as well as [many others](http://www-formal.stanford.edu/jmc/whatisai/node2.html):

- Logical and probabilistic deduction and inference based on some predefined knowledge of a world. e.g. [Fuzzy inference](http://www.cs.princeton.edu/courses/archive/fall07/cos436/HIDDEN/Knapp/fuzzy004.htm) can help a thermostat decide to turn on the air conditioning when it detects that the temperature is hot and the atmosphere is humid
- Heuristic search. e.g. Searching can be used to find the best possible next move in a game of chess by searching all possible moves and choosing the one that most improves your position
- Machine learning (ML) with feedback models. e.g. Pattern-recognition problems like OCR.

In general, ML involves using large data sets to train a system to identify patterns. The training data sets may be labelled, meaning the system's expected outputs are specified for given inputs, or unlabelled, meaning expected outputs are not specified. Algorithms that train systems with unlabelled data are called _unsupervised_ algorithms and those that train with labelled data are called _supervised_. Although many ML algorithms and techniques exist for creating OCR systems, ANNs are one simple approach.

## Artificial Neural Networks
### What Are ANNs?
\label{sec.ocr.ann}

An ANN is a structure consisting of interconnected nodes that communicate with one another. The structure and its functionality are inspired by neural networks found in a biological brain. [Hebbian Theory](http://www.nbb.cornell.edu/neurobio/linster/BioNB420/hebb.pdf) explains how these networks can learn to identify patterns by physically altering their structure and link strengths. Similarly, a typical ANN (shown in \aosafigref{500l.ocr.ann}) has connections between nodes that carry a weight that is updated as the network learns. The nodes labelled "+1" are called _biases_. The leftmost blue column contains the _input nodes_, the middle column contains _hidden nodes_, and the rightmost column contains _output nodes_. There may be many columns of hidden nodes, known as _hidden layers_.

\aosafigure[360pt]{ocr-images/ann.png}{An Artificial Neural Network}{500l.ocr.ann}

The values inside all of the circular nodes in \aosafigref{500l.ocr.ann} represent the output of the node. If we call the output of the $n$th node from the top in layer $L$ $a^{(L)}_n$, and the weight of the connection between the $i$th node in layer $L$ and the $j$th node in layer $L+1$ $w^{(L)}_{ji}$, then the output of node $a^{(2)}_2$ is:

$$
a^{(2)}_2 = f(w^{(1)}_{21}x_1 + w^{(1)}_{22}x_2 + b^{(1)}_{2})
$$

where $f(.)$ is known as the _activation function_ and $b$ is the _bias_. An activation function is the decision-maker for what type of output a node has. A bias is an additional node with a fixed output of 1 that may be added to an ANN to improve its accuracy.
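To make this concrete, here is a small worked example. The weights, bias, and inputs below are made up purely for illustration, and $f$ is the sigmoid activation function $f(z) = 1/(1+e^{-z})$ that we will adopt later:

$$
a^{(2)}_2 = f(0.2 \cdot 1 + (-0.4) \cdot 0 + 0.1) = f(0.3) = \frac{1}{1+e^{-0.3}} \approx 0.574
$$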
We'll see more details on both of these in \aosasecref{sec.ocr.feedforward}.

This type of network topology is called a _feedforward_ neural network because there are no cycles in the network. ANNs with nodes whose outputs feed into their inputs are called recurrent neural networks. There are many algorithms that can be applied to train feedforward ANNs; one commonly used algorithm is called _backpropagation_. The OCR system we will implement in this chapter will use backpropagation.

### How Do We Use ANNs?

Like most other ML approaches, the first step for using backpropagation is to decide how to transform or reduce our problem into one that can be solved by an ANN. In other words, how can we manipulate our input data so we can feed it into the ANN? For the case of our OCR system, we can use the positions of the pixels for a given digit as input. It is worth noting that, oftentimes, choosing the input data format is not this simple. If we were analyzing large images to identify shapes in them, for instance, we may need to pre-process the image to identify contours within it. These contours would be the input.

Once we've decided on our input data format, what's next? Since backpropagation is a supervised algorithm, it will need to be trained with labelled data, as mentioned in \aosasecref{sec.ocr.ai}. Thus, when passing the pixel positions as training input, we must also pass the associated digit. This means that we must find or gather a large data set of drawn digits and associated values.

The next step is to partition the data set into a training set and validation set. The training data is used to run the backpropagation algorithm to set the weights of the ANN. The validation data is used to make predictions using the trained network and compute its accuracy. If we were comparing the performance of backpropagation vs. another algorithm on our data, we would [split the data](http://www-group.slac.stanford.edu/sluo/Lectures/stat_lecture_files/sluo2006lec7.pdf) into 50% for training, 25% for comparing the performance of the two algorithms (validation set), and the final 25% for testing the accuracy of the chosen algorithm (test set). Since we're not comparing algorithms, we can group one of the 25% sets as part of the training set and use 75% of the data to train the network and 25% for validating that it was trained well.

The purpose of identifying the accuracy of the ANN is two-fold. First, it helps us avoid the problem of _overfitting_. Overfitting occurs when the network has a much higher accuracy on predicting the training set than the validation set. Overfitting tells us that the chosen training data does not generalize well enough and needs to be refined. Secondly, testing the accuracy of several different numbers of hidden layers and hidden nodes helps in choosing an optimal ANN size. An optimal ANN size will have enough hidden nodes and layers to make accurate predictions, but also as few nodes/connections as possible to reduce the computational overhead that may slow down training and predictions. Once the optimal size has been decided and the network has been trained, it's ready to make predictions!

## Design Decisions in a Simple OCR System
\label{sec.ocr.decisions}

In the last few paragraphs we've gone over some of the basics of feedforward ANNs and how to use them. Now it's time to talk about how we can build an OCR system.

First off, we must decide what we want our system to be able to do.
To keep things simple, let's allow users to draw a single digit, and then either train the OCR system with that drawn digit or ask the system to predict what was drawn. While an OCR system could run locally on a single machine, having a client-server setup gives much more flexibility. It makes crowd-sourced training of an ANN possible and allows powerful servers to handle intensive computations.

Our OCR system will consist of 5 main components, divided into 5 files. There will be:

- a client (`ocr.js`)
- a server (`server.py`)
- a simple user interface (`ocr.html`)
- an ANN trained via backpropagation (`ocr.py`)
- an ANN design script (`neural_network_design.py`)

The user interface will be simple: a canvas to draw digits on and buttons to either train the ANN or request a prediction. The client will gather the drawn digit, translate it into an array, and pass it to the server to be processed either as a training sample or as a prediction request. The server will simply route the training or prediction request by making API calls to the ANN module. The ANN module will train the network with an existing data set on its first initialization. It will then save the ANN weights to a file and re-load them on subsequent startups. This module is where the core of the training and prediction logic happens. Finally, the design script is for experimenting with different hidden node counts and deciding what works best. Together, these pieces give us a simple but functional OCR system.

Now that we've thought about how the system will work at a high level, it's time to put the concepts into code!

### A Simple Interface (`ocr.html`)

As mentioned earlier, the first step is to gather data for training the network. We could upload a sequence of hand-written digits to the server, but that would be awkward. Instead, we could have users actually handwrite the digits on the page using an HTML canvas. We could then give them a couple of options to either train or test the network, where training the network also involves specifying what digit was drawn. This way it is possible to easily outsource the data collection by pointing people to a website to receive their input. Here's some HTML to get us started.

```html
<html>
<head>
    <script src="ocr.js"></script>
</head>
<body onload="ocrDemo.onLoadFunction()">
    <div id="main-container" style="text-align: center;">
        <h1>OCR Demo</h1>
        <canvas id="canvas" width="200" height="200"></canvas>
        <form name="input">
            <p>Digit: <input id="digit" type="text"></p>
            <input type="button" value="Train" onclick="ocrDemo.train()">
            <input type="button" value="Test" onclick="ocrDemo.test()">
        </form>
    </div>
</body>
</html>
+ + +``` + +### An OCR Client (`ocr.js`) +Since a single pixel on an HTML canvas might be hard to see, we can represent a +single pixel for the ANN input as a square of 10x10 real pixels. Thus the real +canvas is 200x200 pixels and it is represented by a 20x20 canvas from the +perspective of the ANN. The variables below will help us keep track of these +measurements. + + +```javascript +var ocrDemo = { + CANVAS_WIDTH: 200, + TRANSLATED_WIDTH: 20, + PIXEL_WIDTH: 10, // TRANSLATED_WIDTH = CANVAS_WIDTH / PIXEL_WIDTH +``` + +We can then outline the pixels in the new representation so they are easier to +see. Here we have a blue grid generated by `drawGrid()`. + +```javascript + drawGrid: function(ctx) { + for (var x = this.PIXEL_WIDTH, y = this.PIXEL_WIDTH; + x < this.CANVAS_WIDTH; x += this.PIXEL_WIDTH, + y += this.PIXEL_WIDTH) { + ctx.strokeStyle = this.BLUE; + ctx.beginPath(); + ctx.moveTo(x, 0); + ctx.lineTo(x, this.CANVAS_WIDTH); + ctx.stroke(); + + ctx.beginPath(); + ctx.moveTo(0, y); + ctx.lineTo(this.CANVAS_WIDTH, y); + ctx.stroke(); + } + }, +``` + +We also need to store the data drawn on the grid in a form that can be sent to +the server. For simplicity, we can have an array called `data` which labels an +uncoloured, black pixel as `0` and a coloured white pixel as `1`. We also need +some mouse listeners on the canvas so we know when to call `fillSquare()` to +colour a pixel white while a user is drawing a digit. These listeners should +keep track of whether we are in a drawing state and then call `fillSquare()` to +do some simple math and decide which pixels need to be filled in. + +```javascript + onMouseMove: function(e, ctx, canvas) { + if (!canvas.isDrawing) { + return; + } + this.fillSquare(ctx, + e.clientX - canvas.offsetLeft, e.clientY - canvas.offsetTop); + }, + + onMouseDown: function(e, ctx, canvas) { + canvas.isDrawing = true; + this.fillSquare(ctx, + e.clientX - canvas.offsetLeft, e.clientY - canvas.offsetTop); + }, + + onMouseUp: function(e) { + canvas.isDrawing = false; + }, + + fillSquare: function(ctx, x, y) { + var xPixel = Math.floor(x / this.PIXEL_WIDTH); + var yPixel = Math.floor(y / this.PIXEL_WIDTH); + this.data[((xPixel - 1) * this.TRANSLATED_WIDTH + yPixel) - 1] = 1; + + ctx.fillStyle = '#ffffff'; + ctx.fillRect(xPixel * this.PIXEL_WIDTH, yPixel * this.PIXEL_WIDTH, + this.PIXEL_WIDTH, this.PIXEL_WIDTH); + }, +``` + +Now we’re getting closer to the juicy stuff! We need a function that prepares +training data to be sent to the server. Here we have a relatively straight +forward `train()` function that does some error checking on the data to be sent, +adds it to `trainArray` and sends it off by calling `sendData()`. An interesting +design worth noting here is the use of `trainingRequestCount`, `trainArray`, +and `BATCH_SIZE`. + +```javascript + train: function() { + var digitVal = document.getElementById("digit").value; + if (!digitVal || this.data.indexOf(1) < 0) { + alert("Please type and draw a digit value in order to train the network"); + return; + } + this.trainArray.push({"y0": this.data, "label": parseInt(digitVal)}); + this.trainingRequestCount++; + + // Time to send a training batch to the server. 
        if (this.trainingRequestCount == this.BATCH_SIZE) {
            alert("Sending training data to server...");
            var json = {
                trainArray: this.trainArray,
                train: true
            };

            this.sendData(json);
            this.trainingRequestCount = 0;
            this.trainArray = [];
        }
    },
```

What's happening here is that `BATCH_SIZE` is some pre-defined constant for how much training data a client will keep track of before it sends a batched request to the server to be processed by the OCR. The main reason to batch requests is to avoid overwhelming the server with many requests at once. If many clients exist (e.g. many users are on the `ocr.html` page training the system), or if another layer existed in the client that takes scanned drawn digits and translates them to pixels to train the network, a `BATCH_SIZE` of 1 would result in many unnecessary requests. This approach is good because it gives more flexibility to the client; however, in practice, batching should also take place on the server, when needed. A denial of service (DoS) attack could occur in which a malicious client purposely sends many requests to the server to overwhelm it so that it breaks down.

We will also need a `test()` function. Similar to `train()`, it should do a simple check on the validity of the data and send it off. For `test()`, however, no batching occurs since users should be able to request a prediction and get immediate results.

```javascript
    test: function() {
        if (this.data.indexOf(1) < 0) {
            alert("Please draw a digit in order to test the network");
            return;
        }
        var json = {
            image: this.data,
            predict: true
        };
        this.sendData(json);
    },
```

Finally, we will need some functions to make an HTTP POST request, receive a response, and handle any potential errors along the way.

```javascript
    receiveResponse: function(xmlHttp) {
        if (xmlHttp.status != 200) {
            alert("Server returned status " + xmlHttp.status);
            return;
        }
        var responseJSON = JSON.parse(xmlHttp.responseText);
        if (xmlHttp.responseText && responseJSON.type == "test") {
            alert("The neural network predicts you wrote a \'" +
                   responseJSON.result + '\'');
        }
    },

    onError: function(e) {
        alert("Error occurred while connecting to server: " + e.target.statusText);
    },

    sendData: function(json) {
        var xmlHttp = new XMLHttpRequest();
        xmlHttp.open('POST', this.HOST + ":" + this.PORT, false);
        xmlHttp.onload = function() { this.receiveResponse(xmlHttp); }.bind(this);
        xmlHttp.onerror = function() { this.onError(xmlHttp) }.bind(this);
        var msg = JSON.stringify(json);
        xmlHttp.setRequestHeader('Content-length', msg.length);
        xmlHttp.setRequestHeader("Connection", "close");
        xmlHttp.send(msg);
    }
```

### A Server (`server.py`)

Despite being a small server that simply relays information, we still need to consider how to receive and handle the HTTP requests. First we need to decide what kind of HTTP request to use. In the last section, the client is using POST, but why did we decide on this? Since data is being sent to the server, a PUT or POST request makes the most sense. We only need to send a JSON body and no URL parameters. So in theory, a GET request could have worked as well but would not make sense semantically. The choice between PUT and POST, however, is a long, on-going debate among programmers; KNPLabs summarizes the issues [with humour](https://knpuniversity.com/screencast/rest/put-versus-post).

Another consideration is whether to send the "train" vs.
"predict" requests to +different endpoints (e.g. `http://localhost/train` and `http://localhost/predict`) +or the same endpoint which then processes the data separately. In this case, we +can go with the latter approach since the difference between what is done with +the data in each case is minor enough to fit into a short if statement. In +practice, it would be better to have these as separate endpoints if the server +were to do any more detailed processing for each request type. This decision, +in turn impacted what server error codes were used when. For example, a 400 +"Bad Request" error is sent when neither "train" or "predict" is specified in +the payload. If separate endpoints were used instead, this would not be an +issue. The processing done in the background by the OCR system may fail for any +reason and if it's not handled correctly within the server, a 500 "Internal +Server Error" is sent. Again, if the endpoints were separated, there would have +been more room to go into detail to send more appropriate errors. For example, +identifying that an internal server error was actually caused by a bad request. + +Finally, we need to decide when and where to initialize the OCR system. A good +approach would be to initialize it within `server.py` but before the server is +started. This is because on first run, the OCR system needs to train the +network on some pre-existing data the first time it starts and this may take a +few minutes. If the server started before this processing was complete, any +requests to train or predict would throw an exception since the OCR object +would not yet have been initialized, given the current implementation. Another +possible implementation could create some inaccurate initial ANN to be used for +the first few queries while the new ANN is asynchronously trained in the +background. This alternative approach does allow the ANN to be used +immediately, but the implementation is more complex and it would only save on +time on server startup if the servers are reset. This type of implementation +would be more beneficial for an OCR service that requires high availability. + +Here we have the majority of our server code in one short function that handles +POST requests. + +```python + def do_POST(s): + response_code = 200 + response = "" + var_len = int(s.headers.get('Content-Length')) + content = s.rfile.read(var_len); + payload = json.loads(content); + + if payload.get('train'): + nn.train(payload['trainArray']) + nn.save() + elif payload.get('predict'): + try: + response = { + "type":"test", + "result":nn.predict(str(payload['image'])) + } + except: + response_code = 500 + else: + response_code = 400 + + s.send_response(response_code) + s.send_header("Content-type", "application/json") + s.send_header("Access-Control-Allow-Origin", "*") + s.end_headers() + if response: + s.wfile.write(json.dumps(response)) + return +``` + +### Designing a Feedforward ANN (`neural_network_design.py`) +\label{sec.ocr.feedforward} +When designing a feedforward ANN, there are a few factors we must consider. The +first is what activation function to use. We mentioned activation functions +earlier as the decision-maker for a node’s output. The type of the decision an +activation function makes will help us decide which one to use. In our case, we +will be designing an ANN that outputs a value between 0 and 1 for each digit +(0-9). Values closer to 1 would mean the ANN predicts this is the drawn digit +and values closer to 0 would mean it’s predicted to not be the drawn digit. 
Thus, we want an activation function that would have outputs either close to 0 or close to 1. We also need a function that is differentiable because we will need the derivative for our backpropagation computation. A commonly used function in this case is the sigmoid because it satisfies both these constraints. StatSoft provides a [nice list](http://www.fmi.uni-sofia.bg/fmi/statist/education/textbook/eng/glosa.html) of common activation functions and their properties.

A second factor to consider is whether we want to include biases. We've mentioned biases a couple of times before but haven't really talked about what they are or why we use them. Let's try to understand this by going back to how the output of a node is computed in \aosafigref{500l.ocr.ann}. Suppose we had a single input node and a single output node; our output formula would be $y = f(wx)$, where $y$ is the output, $f()$ is the activation function, $w$ is the weight for the link between the nodes, and $x$ is the variable input for the node. The bias is essentially a node whose output is always $1$. This would change the output formula to $y = f(wx + b)$ where $b$ is the weight of the connection between the bias node and the next node. If we consider $w$ and $b$ as constants and $x$ as a variable, then adding a bias adds a constant term to the linear input of $f(.)$.

Adding the bias therefore allows for a shift in the $y$-intercept and in general gives more flexibility for the output of a node. It's often good practice to include biases, especially for ANNs with a small number of inputs and outputs. Biases allow for more flexibility in the output of the ANN and thus provide the ANN with more room for accuracy. Without biases, we're less likely to make correct predictions with our ANN or would need more hidden nodes to make more accurate predictions.

Other factors to consider are the number of hidden layers and the number of hidden nodes per layer. For larger ANNs with many inputs and outputs, these numbers are decided by trying different values and testing the network's performance. In this case, the performance is measured by training an ANN of a given size and seeing what percentage of the validation set is classified correctly. In most cases, a single hidden layer is sufficient for decent performance, so we only experiment with the number of hidden nodes here.

```python
# Try various numbers of hidden nodes and see what performs best
for i in xrange(5, 50, 5):
    nn = OCRNeuralNetwork(i, data_matrix, data_labels, train_indices, False)
    performance = str(test(data_matrix, data_labels, test_indices, nn))
    print "{i} Hidden Nodes: {val}".format(i=i, val=performance)
```

Here we initialize ANNs with 5 to 45 hidden nodes in increments of 5. We then call the `test()` function.

```python
def test(data_matrix, data_labels, test_indices, nn):
    avg_sum = 0
    for j in xrange(100):
        correct_guess_count = 0
        for i in test_indices:
            test = data_matrix[i]
            prediction = nn.predict(test)
            if data_labels[i] == prediction:
                correct_guess_count += 1

        avg_sum += (correct_guess_count / float(len(test_indices)))
    return avg_sum / 100
```

The inner loop is counting the number of correct classifications, which is then divided by the number of attempted classifications at the end. This gives a ratio or percentage accuracy for the ANN.
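In other words, each pass of the outer loop computes

$$
\text{accuracy} = \frac{\text{number of correct predictions}}{\text{number of test samples}}
$$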
Since each time an ANN is trained its weights may be slightly different, we repeat this process 100 times in the outer loop so we can take an average of this particular ANN configuration's accuracy. In our case, a sample run of `neural_network_design.py` looks like the following:

```
PERFORMANCE
-----------
5 Hidden Nodes: 0.7792
10 Hidden Nodes: 0.8704
15 Hidden Nodes: 0.8808
20 Hidden Nodes: 0.8864
25 Hidden Nodes: 0.8808
30 Hidden Nodes: 0.888
35 Hidden Nodes: 0.8904
40 Hidden Nodes: 0.8896
45 Hidden Nodes: 0.8928
```

From this output we can conclude that 15 hidden nodes would be optimal. Adding 5 nodes from 10 to 15 gets us ~1% more accuracy, whereas improving the accuracy by another 1% would require adding another 20 nodes. Increasing the hidden node count also increases computational overhead, so networks with more hidden nodes take longer to be trained and to make predictions. Thus we choose to use the last hidden node count that resulted in a dramatic increase in accuracy. Of course, it's possible when designing an ANN that computational overhead is no problem and the top priority is to have the most accurate ANN possible. In that case it would be better to choose 45 hidden nodes instead of 15.

### Core OCR Functionality

In this section we'll talk about how the actual training occurs via backpropagation, how we can use the network to make predictions, and other key design decisions for core functionality.

#### Training via Backpropagation (`ocr.py`)

The backpropagation algorithm, briefly mentioned earlier, is used to train our ANN. It consists of four main steps that are repeated for every sample in the training set, updating the ANN weights each time.

First, we initialize the weights to small (between -1 and 1) random values. In our case, we initialize them to values between -0.06 and 0.06 and store them in matrices `theta1`, `theta2`, `input_layer_bias`, and `hidden_layer_bias`. Since every node in a layer links to every node in the next layer, we can create a matrix that has m rows and n columns, where n is the number of nodes in one layer and m is the number of nodes in the adjacent layer. This matrix would represent all the weights for the links between these two layers. Here `theta1` has 400 columns for our 20x20 pixel inputs and `num_hidden_nodes` rows. Likewise, `theta2` represents the links between the hidden layer and output layer. It has `num_hidden_nodes` columns and `NUM_DIGITS` (`10`) rows. The other two vectors (1 row), `input_layer_bias` and `hidden_layer_bias`, represent the biases.

```python
    def _rand_initialize_weights(self, size_in, size_out):
        return [((x * 0.12) - 0.06) for x in np.random.rand(size_out, size_in)]
```

```python
        self.theta1 = self._rand_initialize_weights(400, num_hidden_nodes)
        self.theta2 = self._rand_initialize_weights(num_hidden_nodes, 10)
        self.input_layer_bias = self._rand_initialize_weights(1,
                                                              num_hidden_nodes)
        self.hidden_layer_bias = self._rand_initialize_weights(1, 10)
```

The second step is _forward propagation_, which is essentially computing the node outputs as described in \aosasecref{sec.ocr.ann}, layer by layer starting from the input nodes. Here, `y0` is an array of size 400 with the inputs we wish to use to train the ANN. We multiply `theta1` by `y0` transposed so that we have two matrices with sizes `(num_hidden_nodes x 400) * (400 x 1)` and have a resulting vector of outputs for the hidden layer of size `num_hidden_nodes`.
+
+The second step is _forward propagation_, which is essentially computing the
+node outputs as described in \aosasecref{sec.ocr.ann}, layer by layer,
+starting from the input nodes. Here, `y0` is an array of size 400 with the
+inputs we wish to use to train the ANN. We multiply `theta1` by `y0`
+transposed so that we are multiplying two matrices of sizes
+`(num_hidden_nodes x 400) * (400 x 1)`, giving a vector of outputs for the
+hidden layer of size `num_hidden_nodes`. We then add the bias vector and apply
+the vectorized sigmoid activation function to this output vector, giving us
+`y1`. `y1` is the output vector of our hidden layer. The same process is
+repeated again to compute `y2` for the output nodes. `y2` is now our output
+layer vector, with values representing the likelihood that their index is the
+drawn digit. For example, if someone draws an 8, the value of `y2` at the 8th
+index will be the largest if the ANN has made the correct prediction.
+However, 6 may have a higher likelihood than 1 of being the drawn digit,
+since it looks more similar to 8 and shares more of the same pixels. `y2`
+becomes more accurate with each additional drawn digit the ANN is trained
+with.
+
+```python
+    # The sigmoid activation function. Operates on scalars.
+    def _sigmoid_scalar(self, z):
+        return 1 / (1 + math.e ** -z)
+```
+
+```python
+        y1 = np.dot(np.mat(self.theta1), np.mat(data['y0']).T)
+        sum1 = y1 + np.mat(self.input_layer_bias)  # Add the bias
+        y1 = self.sigmoid(sum1)
+
+        y2 = np.dot(np.array(self.theta2), y1)
+        y2 = np.add(y2, self.hidden_layer_bias)  # Add the bias
+        y2 = self.sigmoid(y2)
+```
+
+The third step is _back propagation_, which involves computing the errors at
+the output nodes and then at every intermediate layer, working back towards
+the input. Here we start by creating an expected output vector,
+`actual_vals`, with a `1` at the index of the digit that represents the value
+of the drawn digit and `0`s otherwise. The vector of errors at the output
+nodes, `output_errors`, is computed by subtracting the actual output vector,
+`y2`, from `actual_vals`. For every hidden layer afterwards, we compute two
+components. First, we have the next layer's transposed weight matrix
+multiplied by its output errors. Then we have the derivative of the
+activation function applied to the previous layer. We then perform an
+element-wise multiplication on these two components, giving a vector of
+errors for a hidden layer. Here we call this `hidden_errors`.
+
+```python
+        actual_vals = [0] * 10
+        actual_vals[data['label']] = 1
+        output_errors = np.mat(actual_vals).T - np.mat(y2)
+        hidden_errors = np.multiply(np.dot(np.mat(self.theta2).T, output_errors),
+                                    self.sigmoid_prime(sum1))
+```
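+
+One detail worth noting: this code calls `self.sigmoid_prime(sum1)`, whose
+definition is not shown in the excerpts above. Using the standard identity
+$\sigma'(z) = \sigma(z)(1 - \sigma(z))$, a minimal scalar sketch consistent
+with `_sigmoid_scalar` would be:
+
+```python
+    # Derivative of the sigmoid. Operates on scalars; like
+    # _sigmoid_scalar, it would be vectorized (e.g., with np.vectorize)
+    # so that it can be applied to a whole vector such as sum1.
+    def _sigmoid_prime_scalar(self, z):
+        return self._sigmoid_scalar(z) * (1 - self._sigmoid_scalar(z))
+```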
+
+The fourth and final step is the _weight update_, which adjusts the ANN
+weights based on the errors computed earlier. Weights are updated at each
+layer via matrix multiplication: the error matrix at each layer is multiplied
+by the output matrix of the previous layer. This product is then multiplied
+by a scalar called the learning rate and added to the weight matrix. The
+learning rate is a value between 0 and 1 that influences the speed and
+accuracy of learning in the ANN. Larger learning rate values will generate an
+ANN that learns quickly but is less accurate, while smaller values will
+generate an ANN that learns more slowly but is more accurate. In our case, we
+use a relatively small learning rate of 0.1. This works well, since we do not
+need the ANN to be immediately trained in order for a user to continue making
+train or predict requests. Biases are updated by simply multiplying the
+learning rate by the layer's error vector.
+
+```python
+        self.theta1 += self.LEARNING_RATE * np.dot(np.mat(hidden_errors),
+                                                   np.mat(data['y0']))
+        self.theta2 += self.LEARNING_RATE * np.dot(np.mat(output_errors),
+                                                   np.mat(y1).T)
+        self.hidden_layer_bias += self.LEARNING_RATE * output_errors
+        self.input_layer_bias += self.LEARNING_RATE * hidden_errors
+```
+
+#### Testing a Trained Network (`ocr.py`)
+
+Once an ANN has been trained via backpropagation, it is fairly straightforward
+to use it for making predictions. As we can see here, we start by computing
+the output of the ANN, `y2`, exactly the way we did in step 2 of
+backpropagation. Then we look for the index in the vector with the maximum
+value. This index is the digit predicted by the ANN.
+
+```python
+    def predict(self, test):
+        y1 = np.dot(np.mat(self.theta1), np.mat(test).T)
+        y1 = y1 + np.mat(self.input_layer_bias)  # Add the bias
+        y1 = self.sigmoid(y1)
+
+        y2 = np.dot(np.array(self.theta2), y1)
+        y2 = np.add(y2, self.hidden_layer_bias)  # Add the bias
+        y2 = self.sigmoid(y2)
+
+        results = y2.T.tolist()[0]
+        return results.index(max(results))
+```
+
+#### Other Design Decisions (`ocr.py`)
+
+Many resources are available online that go into greater detail on the
+implementation of backpropagation. One good resource is from a [course at
+Willamette
+University](http://www.willamette.edu/~gorr/classes/cs449/backprop.html). It
+goes over the steps of backpropagation and then explains how it can be
+translated into matrix form. While the amount of computation using matrices
+is the same as using loops, the benefit is that the code is simpler and
+easier to read with fewer nested loops. As we can see, the entire training
+process is written in under 25 lines of code using matrix algebra.
+
+As mentioned in the introduction of \aosasecref{sec.ocr.decisions}, persisting
+the weights of the ANN means we do not lose the progress made in training it
+when the server is shut down or goes down abruptly for any reason. We persist
+the weights by writing them as JSON to a file. On startup, the OCR loads the
+ANN's saved weights into memory. The save function is not called internally
+by the OCR; it is up to the server to decide when to perform a save. In our
+case, the server saves the weights after each update. This is a quick and
+simple solution, but it is not optimal, since writing to disk is
+time-consuming. It also prevents us from handling multiple concurrent
+requests, since there is no mechanism to prevent simultaneous writes to the
+same file. In a more sophisticated server, saves could perhaps be done on
+shutdown, or once every few minutes with some form of locking or a timestamp
+protocol to ensure no data loss.
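+
+As one sketch of such a refinement (our illustration, not part of the
+chapter's code), each save could at least be made atomic by writing the JSON
+to a temporary file and renaming it into place, so that a crash mid-write can
+never leave a corrupt weights file behind:
+
+```python
+import json
+import os
+import tempfile
+
+def save_weights_atomically(weights, path):
+    """Dump `weights` as JSON to `path` via a temp file and atomic rename."""
+    directory = os.path.dirname(path) or '.'
+    fd, tmp_path = tempfile.mkstemp(dir=directory)
+    with os.fdopen(fd, 'w') as tmp_file:
+        json.dump(weights, tmp_file)
+    os.rename(tmp_path, path)  # atomic on POSIX filesystems
+```
+
+The chapter's actual `save()` and `_load()` methods keep things simple and
+write the JSON directly: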
+
+```python
+    def save(self):
+        if not self._use_file:
+            return
+
+        json_neural_network = {
+            "theta1": [np_mat.tolist()[0] for np_mat in self.theta1],
+            "theta2": [np_mat.tolist()[0] for np_mat in self.theta2],
+            "b1": self.input_layer_bias[0].tolist()[0],
+            "b2": self.hidden_layer_bias[0].tolist()[0]
+        }
+        with open(OCRNeuralNetwork.NN_FILE_PATH, 'w') as nnFile:
+            json.dump(json_neural_network, nnFile)
+
+    def _load(self):
+        if not self._use_file:
+            return
+
+        with open(OCRNeuralNetwork.NN_FILE_PATH) as nnFile:
+            nn = json.load(nnFile)
+        self.theta1 = [np.array(li) for li in nn['theta1']]
+        self.theta2 = [np.array(li) for li in nn['theta2']]
+        self.input_layer_bias = [np.array(nn['b1'][0])]
+        self.hidden_layer_bias = [np.array(nn['b2'][0])]
+```
+
+## Conclusion
+
+Now that we've learned about AI, ANNs, backpropagation, and building an
+end-to-end OCR system, let's recap the highlights of this chapter and the big
+picture.
+
+We started off the chapter by giving background on AI, ANNs, and roughly what
+we would be implementing. We discussed what AI is and examples of how it's
+used. We saw that AI is essentially a set of algorithms or problem-solving
+approaches that can provide an answer to a question much as a human would. We
+then took a look at the structure of a feedforward ANN. We learned that
+computing the output at a given node was as simple as summing the products of
+the outputs of the previous nodes and their connecting weights. We talked
+about how to use an ANN by first formatting the input and partitioning the
+data into training and validation sets.
+
+Once we had some background, we started talking about creating a web-based,
+client-server system that would handle user requests to train or test the
+OCR. We then discussed how the client would interpret the drawn pixels into
+an array and perform an HTTP request to the OCR server to perform the
+training or testing. We discussed how our simple server read requests and how
+to design an ANN by testing the performance of several hidden node counts. We
+finished off by going through the core training and testing code for
+backpropagation.
+
+Although we've built a seemingly functional OCR system, this chapter simply
+scratches the surface of how a real OCR system might work. More sophisticated
+OCR systems could have pre-processed inputs, use hybrid ML algorithms, have
+more extensive design phases, or include other further optimizations.
diff --git a/sampler/sampler.markdown b/sampler/sampler.markdown
index e3c76ba01..a967b8ee6 100644
--- a/sampler/sampler.markdown
+++ b/sampler/sampler.markdown
@@ -1,6 +1,8 @@
 title: A Rejection Sampler
 author: Jessica B. Hamrick
 
+_Jess is a Ph.D. student at UC Berkeley where she studies human cognition by combining probabilistic models from machine learning with behavioral experiments from cognitive science. In her spare time, Jess is a core contributor to IPython and Jupyter. She also holds a B.S. and M.Eng. in Computer Science from MIT._
+
 ## Introduction
 
 Frequently, in computer science and engineering, we run into problems
@@ -452,7 +454,7 @@ implement the PMF or PDF from the beginning, anyway.
 Formally, the multinomial distribution has the following equation:
 
 $$
-p(\mathbf{x}; \mathbf{p}) = \frac{(\sum_{i=1}^k x_i)!}{x_1!\cdots{}x_k!}p_1^{x_1}\cdots{}p_k^{x_k},
+p(\mathbf{x}; \mathbf{p}) = \frac{(\sum_{i=1}^k x_i)!}{x_1!\cdots{}x_k!}p_1^{x_1}\cdots{}p_k^{x_k}
 $$
 
 where $\mathbf{x}=[x_1, \ldots{}, x_k]$ is a vector of length $k$
@@ -468,14 +470,14 @@ the gamma function rather than factorial, so we will rewrite the
 equation using $\Gamma$:
 
 $$
-p(\mathbf{x}; \mathbf{p}) = \frac{\Gamma((\sum_{i=1}^k x_i)+1)}{\Gamma(x_1+1)\cdots{}\Gamma(x_k+1)}p_1^{x_1}\cdots{}p_k^{x_k},
+p(\mathbf{x}; \mathbf{p}) = \frac{\Gamma((\sum_{i=1}^k x_i)+1)}{\Gamma(x_1+1)\cdots{}\Gamma(x_k+1)}p_1^{x_1}\cdots{}p_k^{x_k}
 $$
 
 #### Working with Log Values
 
 Before getting into the actual code needed to implement the equation
-above, I want to emphasize one of the *the most important design
-decisions* when writing code with probabilities: working with
+above, I want to emphasize one of the *most important design
+decisions* when writing code with probabilities: working with
 log values. What this means is that rather than working directly with
 probabilities $p(x)$, we should be working with *log*-probabilities,
 $\log{p(x)}$. This is because probabilities can get very small very
@@ -936,7 +938,7 @@ def _bonus_log_pmf(self, bonus):
     return self.bonus_dist.log_pmf(x)
 ```
 
-We can now create our distrbution as follows:
+We can now create our distribution as follows:
 
 ```python
 >>> import numpy as np
@@ -1094,7 +1096,7 @@ function in our `DamageDistribution`. This is because we actually do
 not know what the PMF should be! This would be the equation:
 
 $$
-\sum_{{item}_1, \ldots{}, {item}_m}p({damage}\ |\ {item}_1,\ldots{},{item}_m)p({item}_1)\cdots{}p({item}_m)
+\sum_{{item}_1, \ldots{}, {item}_m} p(\mathrm{damage} \vert \mathrm{item}_1,\ldots{},\mathrm{item}_m)p(\mathrm{item}_1)\cdots{}p(\mathrm{item}_m)
 $$
 
 What this equation says is that we would need to compute the
@@ -1163,7 +1165,7 @@ the general case:
     (e.g., using dictionaries as the output of
     `MagicItemDistribution.sample`) while still exposing the less
     clear but more efficient and purely numeric version of those functions
-    (e.g., `MagicItemDistribution._sample_stats`).
+    \linebreak (e.g., `MagicItemDistribution._sample_stats`).
 
 Additionally, we've seen how sampling from a probability distribution
 can be useful both for producing single random values (e.g.,
diff --git a/tex/500L.tex b/tex/500L.tex
index de6b61c59..b06ab1c7a 100644
--- a/tex/500L.tex
+++ b/tex/500L.tex
@@ -260,7 +260,7 @@
 
 \mainmatter
 
-\include{dagoba}
+\include{ocr}
 
 \bibliographystyle{alpha}
 
diff --git a/tex/contingent.tex b/tex/contingent.tex
new file mode 100644
index 000000000..ee128c5ec
--- /dev/null
+++ b/tex/contingent.tex
@@ -0,0 +1,1421 @@
+\begin{aosachapter}{Contingent: A Fully Dynamic Build System}{s:contingent}{Brandon Rhodes and Daniel Rocco}
+
+\emph{Brandon Rhodes started using Python in the late 1990s, and for 17
+years has maintained the PyEphem library for amateur astronomers. He
+works at Dropbox, has taught Python programming courses for corporate
+clients, consulted on projects like the New England Wildflower Society's
+``Go Botany'' Django site, and will be the chair of the PyCon conference
+in 2016 and 2017.
+Brandon believes that well-written code is a form of literature, that
+beautifully formatted code is a work of graphic design, and that correct
+code is one of the most transparent forms of thought.}
+
+\emph{Daniel Rocco loves Python, coffee, craft stout, object and system
+design, bourbon, teaching, trees, and Latin guitar. Thrilled that he
+gets to write Python for a living, he is always on the lookout for
+opportunities to learn from others in the community and to contribute
+back by sharing knowledge. He is a frequent speaker at PyAtl on
+introductory topics, testing, design, and shiny things; he loves seeing
+the spark of wonder and delight in people's eyes when someone shares a
+novel, surprising, or beautiful idea. Daniel lives in Atlanta with a
+microbiologist and four aspiring rocketeers.}
+
+\aosasecti{Introduction}\label{introduction}
+
+Build systems have long been a standard tool within computer
+programming.
+
+The standard \texttt{make} build system, for which its author won the
+ACM Software System Award, was first developed in~1976. It not only lets
+you declare that an output file depends upon one (or more) inputs, but
+lets you do this recursively. A~program, for example, might depend upon
+an object file which itself depends upon the corresponding source code:
+
+\begin{verbatim}
+    prog: main.o
+            cc -o prog main.o
+
+    main.o: main.c
+            cc -c -o main.o main.c
+\end{verbatim}
+
+Should \texttt{make} discover, upon its next invocation, that the
+\texttt{main.c} source code file now has a more recent modify time than
+\texttt{main.o}, then it will not only rebuild the \texttt{main.o}
+object file but will also rebuild \texttt{prog} itself.
+
+Build systems are a common semester project posed for undergraduate
+computer science students --- not only because build systems are used in
+nearly all software projects, but because their construction involves
+fundamental data structures and algorithms for directed graphs
+(which this chapter will later discuss in more detail). With decades of
+use and practice behind build systems, one might expect them to have
+become completely general-purpose and ready for even the most
+extravagant demands.
+
+But, in fact, one kind of common interaction between build artifacts ---
+the problem of dynamic cross-referencing --- is handled so poorly by
+most build systems that in this chapter we are inspired not only to
+rehearse the standard solution and data structures used classically to
+solve the \texttt{make} problem, but to extend that solution
+dramatically, to a far more demanding domain.
+
+The problem, again, is cross-referencing. Where do cross-references tend
+to emerge? In text documents, documentation, and printed books!
+
+\aosasecti{The Problem: Building Document
+Systems}\label{the-problem-building-document-systems}
+
+Systems to rebuild formatted documents from source texts always seem to
+do too much work, or too little.
+
+They do too much work when they respond to a minor edit by making you
+wait for unrelated chapters to be re-parsed and re-formatted. But they
+can also rebuild too little, leaving you with an inconsistent final
+product.
+
+Consider \href{http://sphinx-doc.org/}{Sphinx}, the document builder
+that is used for both the official Python language documentation and
+many other projects in the Python community. A~Sphinx project's
+\texttt{index.rst} will usually include a table of contents:
+
+\begin{verbatim}
+   Table of Contents
+   =================
+
+   .. 
toctree:: + + install.rst + tutorial.rst + api.rst +\end{verbatim} + +This list of chapter filenames tells Sphinx to include a link to each of +the three named chapters when it builds the \texttt{index.html} output +file. It will also include links to any sections within each chapter. +Stripped of its markup, the text that results from the above title and +\texttt{toctree} command might~be: + +\begin{verbatim} + Table of Contents + + • Installation + + • Newcomers Tutorial + • Hello, World + • Adding Logging + + • API Reference + • Handy Functions + • Obscure Classes +\end{verbatim} + +This table of contents, as you can see, is a mash-up of information from +four different files. While its basic order and structure come from +\texttt{index.rst}, the actual title of each chapter and section is +pulled from the three chapter source files themselves. + +If you later reconsider the tutorial's chapter title --- after all, the +word ``newcomer'' sounds so antique, as if your users are settlers who +have just arrived in pioneer Wyoming --- then you would edit the first +line of \texttt{tutorial.rst} and write something better: + +\begin{verbatim} + -Newcomers Tutorial + +Beginners Tutorial + ================== + + Welcome to the tutorial! + This text will take you through the basics of... +\end{verbatim} + +When you are ready to rebuild, Sphinx will do exactly the right thing! +It will rebuild both the tutorial chapter itself, and also rebuild the +index. (Piping the output into \texttt{cat} makes Sphinx announce each +rebuilt file on a separate line, instead of using bare carriage returns +to repeatedly overwrite a single line with these progress updates.) + +\begin{verbatim} + $ make html | cat + writing output... [ 50%] index + writing output... [100%] tutorial +\end{verbatim} + +Because Sphinx chose to rebuild both documents, not only will +\texttt{tutorial.html} now feature its new title up at the top, but the +output \texttt{index.html} will display the updated chapter title in the +table of contents. Sphinx has rebuilt everything so that the output is +consistent. + +What if your edit to \texttt{tutorial.rst} is more minor? + +\begin{verbatim} + Beginners Tutorial + ================== + + -Welcome to the tutorial! + +Welcome to our project tutorial! + This text will take you through the basics of... +\end{verbatim} + +In this case there is no need to rebuild \texttt{index.html} because +this minor edit to the interior of a paragraph does not change any of +the information in the table of contents. But it turns out that Sphinx +is not quite as clever as it might have at first appeared! It will go +ahead and perform the redundant work of rebuilding \texttt{index.html} +even though the resulting contents will be exactly the same. + +\begin{verbatim} + writing output... [ 50%] index + writing output... [100%] tutorial +\end{verbatim} + +You can run \texttt{diff} on the ``before'' and ``after'' versions of +\texttt{index.html} to confirm that your small edit has had zero effect +on the project front page --- yet Sphinx made you wait while it was +rebuilt anyway. + +You might not even notice the extra rebuild effort for small documents +that are easy to compile. But the delay to your workflow can become +significant when you are making frequent tweaks and edits to documents +that are long, complex, or that involve the generation of multimedia +like plots or animations. 
While Sphinx is at least making an effort not +to rebuild every chapter when you make a single change --- it has not, +for example, rebuilt \texttt{install.html} or \texttt{api.html} in +response to your \texttt{tutorial.rst} edit --- it is doing more than is +necessary. + +But it turns out that Sphinx does something even worse: it sometimes +does too little, leaving you with inconsistent output that could be +noticed by users. + +To see one of Sphinx's simplest failure modes, first add a cross +reference to the top of your API documentation: + +\begin{verbatim} + API Reference + ============= + + +Before reading this, try reading our :doc:`tutorial`! + + + The sections below list every function + and every single class and method offered... +\end{verbatim} + +With its usual caution as regards the table of contents, Sphinx will +dutifully rebuild both this API reference document as well as the +\texttt{index.html} home page of your project: + +\begin{verbatim} + writing output... [ 50%] api + writing output... [100%] index +\end{verbatim} + +In the \texttt{api.html} output file you can confirm that Sphinx has +included the attractive human-readable title of the tutorial chapter +into the cross reference's anchor tag: + +\begin{verbatim} +

+<p>Before reading this, try reading our
+<a class="reference internal" href="tutorial.html">
+<em>Beginners Tutorial</em></a>!</p>
+\end{verbatim}
+
+What if you now make another edit to the title at the top of the
+\texttt{tutorial.rst} file? You will have invalidated \emph{three}
+output files:
+
+\begin{aosaenumerate}
+\def\labelenumi{\arabic{enumi}.}
+\item
+  The title at the top of \texttt{tutorial.html} is now out of date, so
+  the file needs to be rebuilt.
+\item
+  The table of contents in \texttt{index.html} still has the old title,
+  so that document needs to be rebuilt.
+\item
+  The embedded cross reference in the first paragraph of
+  \texttt{api.html} still has the old chapter title, and also needs to
+  be rebuilt.
+\end{aosaenumerate}
+
+What does Sphinx do?
+
+\begin{verbatim}
+    writing output... [ 50%] index
+    writing output... [100%] tutorial
+\end{verbatim}
+
+Whoops.
+
+Only two files were rebuilt, not three. Sphinx has failed to correctly
+rebuild your documentation.
+
+If you now push your HTML to the web, users will see the old title in
+the cross reference at the top of \texttt{api.html} but then a different
+title --- the new one --- once the link has carried them to
+\texttt{tutorial.html} itself. This can happen for many kinds of cross
+reference that Sphinx supports: chapter titles, section titles,
+paragraphs, classes, methods, and functions.
+
+\aosasecti{Build Systems and
+Consistency}\label{build-systems-and-consistency}
+
+The problem outlined above is not specific to Sphinx. Not only does it
+haunt other document systems, like LaTeX, but it can even plague
+projects that are simply trying to direct compilation steps with the
+venerable \texttt{make} utility, if their assets happen to
+cross-reference in interesting ways.
+
+As the problem is ancient and universal, its solution is of equally long
+lineage:
+
+\begin{verbatim}
+    $ rm -r _build/
+    $ make html
+\end{verbatim}
+
+If you remove all of the output, you are guaranteed a complete rebuild!
+Some projects even alias \texttt{rm} \texttt{-r} to a target named
+\texttt{clean} so that only a quick \texttt{make} \texttt{clean} is
+necessary to wipe the slate.
+
+By eliminating every copy of every intermediate or output asset, a hefty
+\texttt{rm} \texttt{-r} is able to force the build to start over again
+with nothing cached --- with no memory of its earlier state that could
+possibly lead to a stale product!
+
+But could we develop a better approach?
+
+What if your build system were a persistent process that noticed every
+chapter title, every section title, and every cross-referenced phrase as
+it passed from the source code of one document into the text of another?
+Its decisions about whether to rebuild other documents after a change to
+a single source file could be precise, instead of mere guesses, and
+correct, instead of leaving the output in an inconsistent state.
+
+The result would be a system like the old static \texttt{make} tool, but
+which learned the dependencies between files as they were built --- that
+added and removed dependencies dynamically as cross references were
+added, updated, and then later deleted.
+
+In the sections that follow we will construct such a tool in Python,
+named Contingent, that guarantees correctness in the presence of dynamic
+dependencies while performing the fewest possible rebuild steps. While
+Contingent can be applied to any problem domain, we will run it against
+a small version of the problem outlined above.
+
+\aosasecti{Linking Tasks To Make a
+Graph}\label{linking-tasks-to-make-a-graph}
+
+Any build system needs a way to link inputs and outputs.
+The three markup texts in our discussion above, for example, each
+produce a corresponding HTML output file. The most natural way to
+express these relationships is as a collection of boxes and arrows ---
+or, in mathematician terminology, \emph{nodes} and \emph{edges} --- to
+form a \emph{graph} (\aosafigref{500l.contingent.graph}).
+
+\aosafigure[240pt]{contingent-images/figure1.png}{Three files generated by parsing three input texts.}{500l.contingent.graph}
+
+Each language in which a programmer might tackle writing a build system
+will offer various data structures with which such a graph of nodes and
+edges might be represented.
+
+How could we represent such a graph in Python?
+
+The Python language gives priority to four generic data structures by
+providing them with direct support in the language syntax. You can
+create new instances of these big-four data structures by simply typing
+their literal representation into your source code, and their four type
+objects are available as built-in symbols that can be used without being
+imported.
+
+The \textbf{tuple} is a read-only sequence used to hold heterogeneous
+data --- each slot in a tuple typically means something different. Here,
+a tuple holds together a hostname and port number, and would lose its
+meaning if the elements were re-ordered:
+
+\begin{verbatim}
+('dropbox.com', 443)
+\end{verbatim}
+
+The \textbf{list} is a mutable sequence used to hold homogeneous data
+--- each item usually has the same structure and meaning as its peers.
+Lists can be used either to preserve data's original input order, or can
+be rearranged or sorted to establish a new and more useful order.
+
+\begin{verbatim}
+['C', 'Awk', 'TCL', 'Python', 'JavaScript']
+\end{verbatim}
+
+The \textbf{set} does not preserve order. Sets remember only whether a
+given value has been added, not how many times, and are therefore the
+go-to data structure for removing duplicates from a data stream. For
+example, the following two sets, once the language has built them, will
+each have three elements:
+
+\begin{verbatim}
+{3, 4, 5}
+{3, 4, 5, 4, 4, 3, 5, 4, 5, 3, 4, 5}
+\end{verbatim}
+
+The \textbf{dict} is an associative data structure for storing values
+accessible by a key. Dicts let the programmer choose the key by which
+each value is indexed, instead of using automatic integer indexing like
+the tuple and list. The lookup is backed by a hash table, which means
+that dict key lookup runs at the same speed whether the dict has a dozen
+or a million keys!
+
+\begin{verbatim}
+{'ssh': 22, 'telnet': 23, 'domain': 53, 'http': 80}
+\end{verbatim}
+
+A key to Python's flexibility is that these four data structures are
+composable. The programmer can arbitrarily nest them inside each other
+to produce more complex data stores whose rules and syntax remain the
+simple ones of the underlying tuples, lists, sets, and dicts.
+
+Given that each of our graph edges needs to know at least its origin
+node and its destination node, the simplest possible representation
+would be a tuple. The top edge in Figure~1 might look like:
+
+\begin{verbatim}
+    ('tutorial.rst', 'tutorial.html')
+\end{verbatim}
+
+How can we store several edges? While our initial impulse might be to
+simply throw all of our edge tuples into a list, that would have
+disadvantages. A~list is careful to maintain order, but it is not
+meaningful to talk about an absolute order for the edges in a graph. 
And +a list would be perfectly happy to hold several copies of exactly the +same edge, even though we only want it to be possible to draw a single +arrow between \texttt{tutorial.rst} and \texttt{tutorial.html}. The +correct choice is thus the set, which would have us represent +\aosafigref{500l.contingent.graph} as: + +\begin{verbatim} + {('tutorial.rst', 'tutorial.html'), + ('index.rst', 'index.html'), + ('api.rst', 'api.html')} +\end{verbatim} + +This would allow quick iteration across all of our edges, fast insert +and delete operations for a single edge, and a quick way to check +whether a particular edge was present. + +Unfortunately, those are not the only operations we need. + +A build system like Contingent needs to understand the relationship +between a given node and all the nodes connected to it. For example, +when \texttt{api.rst} changes, Contingent needs to know which assets are +affected by that change, if any, in order to minimize the work performed +while also ensuring a complete build. To answer this question --- ``what +nodes are downstream from \texttt{api.rst}?'' --- we need to examine the +\emph{outgoing} edges from \texttt{api.rst}. But building the dependency +graph requires that Contingent be concerned with a node's \emph{inputs} +as well. What inputs were used, for example, when the build system +assembled the output document \texttt{tutorial.html}? It is by watching +the input to each node that Contingent can know that \texttt{api.html} +depends on \texttt{api.rst} but that \texttt{tutorial.html} does not. As +sources change and rebuilds occur, Contingent rebuilds the incoming +edges of each changed node to remove potentially stale edges and +re-learn which resources a task uses this time around. + +Our set-of-tuples does not make answering either of these questions +easy. If we needed to know the relationship between \texttt{api.html} +and the rest of the graph, we would need to traverse the entire set +looking for edges that start or end at the \texttt{api.html} node. + +An associative data structure like Python's dict would make these chores +easier by allowing direct lookup of all the edges from a particular +node: + +\begin{verbatim} + {'tutorial.rst': {('tutorial.rst', 'tutorial.html')}, + 'tutorial.html': {('tutorial.rst', 'tutorial.html')}, + 'index.rst': {('index.rst', 'index.html')}, + 'index.html': {('index.rst', 'index.html')}, + 'api.rst': {('api.rst', 'api.html')}, + 'api.html': {('api.rst', 'api.html')}} +\end{verbatim} + +Looking up the edges of a particular node would now be blazingly fast, +at the cost of having to store every edge twice: once in a set of +incoming edges, and once in a set of outgoing edges. But the edges in +each set would have to be examined manually to see which are incoming +and which are outgoing. It is also slightly redundant to keep naming the +node over and over again in its set of edges. + +The solution to both of these objections is to place incoming and +outgoing edges in their own separate data structures, which will also +absolve us of having to mention the node over and over again for every +one of the edges in which it is involved. 
+
+\begin{verbatim}
+    incoming = {
+        'tutorial.html': {'tutorial.rst'},
+        'index.html': {'index.rst'},
+        'api.html': {'api.rst'},
+        }
+
+    outgoing = {
+        'tutorial.rst': {'tutorial.html'},
+        'index.rst': {'index.html'},
+        'api.rst': {'api.html'},
+        }
+\end{verbatim}
+
+Notice that \texttt{outgoing} represents, directly in Python syntax,
+exactly what we drew in \aosafigref{500l.contingent.graph} earlier: the
+source documents on the left will be transformed by the build system
+into the output documents on the right. For this simple example each
+source points to only one output --- all the output sets have only one
+element --- but we will see examples shortly where a single input node
+has multiple downstream consequences.
+
+Every edge in this dictionary-of-sets data structure does get
+represented twice, once as an outgoing edge from one node
+(\texttt{tutorial.rst} → \texttt{tutorial.html}) and again as an
+incoming edge to the other (\texttt{tutorial.html} ←
+\texttt{tutorial.rst}). These two representations capture precisely the
+same relationship, just from the opposite perspectives of the two nodes
+at either end of the edge. But in return for this redundancy, the data
+structure supports the fast lookup that Contingent needs.
+
+\aosasecti{The Proper Use of Classes}\label{the-proper-use-of-classes}
+
+You may have been surprised by the absence of classes in the above
+discussion of Python data structures. After all, classes are a frequent
+mechanism for structuring applications and a hardly less frequent
+subject of heated debate among their adherents and detractors. Classes
+were once thought important enough that entire educational curricula
+were designed around them, and the majority of popular programming
+languages include dedicated syntax for defining and using them.
+
+But it turns out that classes are often orthogonal to the question of
+data structure design. Rather than offering us an entirely alternative
+data modeling paradigm, classes simply repeat data structures that we
+have already seen:
+
+\begin{aosaitemize}
+
+\item
+  A class instance is \emph{implemented} as a dict.
+\item
+  A class instance is \emph{used} like a mutable tuple.
+\end{aosaitemize}
+
+The class offers key lookup into its attribute dictionary through a
+prettier syntax, where you get to say \texttt{graph.incoming} instead of
+\texttt{graph{[}"incoming"{]}}. But, in practice, class instances are
+almost never used as generic key-value stores. Instead, they are used to
+organize related but heterogeneous data by attribute name, with
+implementation details encapsulated behind a consistent and memorable
+interface.
+
+So instead of putting a hostname and a port number together in a tuple
+and having to remember later which came first and which came second, you
+create an \texttt{Address} class whose instances each have a
+\texttt{host} and a \texttt{port} attribute. You can then pass
+\texttt{Address} objects around where otherwise you would have had
+anonymous tuples. Code becomes easier to read and easier to write. But
+using a class instance does not really change any of the questions we
+faced above when doing data design: it just provides a prettier and less
+anonymous container.
+
+The true value of classes, then, is not that they change the science of
+data design. The value of classes is that they let you \emph{hide} your
+data design from the rest of a program!
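+
+To make the \texttt{Address} example concrete, a minimal sketch of such
+a class --- our own illustration; Contingent itself defines no
+\texttt{Address} class --- might look like this:
+
+\begin{verbatim}
+class Address:
+    """A hostname and port, with named attributes instead of slots."""
+
+    def __init__(self, host, port):
+        self.host = host
+        self.port = port
+
+address = Address('dropbox.com', 443)
+print(address.host)   # clearer than remembering it was element 0
+\end{verbatim}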
+
+Successful application design hinges upon our ability to exploit the
+powerful built-in data structures Python offers us while minimizing the
+volume of details we are required to keep in our heads at any one time.
+Classes provide the mechanism for resolving this apparent quandary: used
+effectively, a class provides a \emph{facade} around some small subset
+of the system's overall design. When working within one subset --- a
+\texttt{Graph}, for example --- we can forget the implementation details
+of other subsets as long as we can remember their interfaces. In this
+way, programmers often find themselves navigating among several levels
+of abstraction in the course of writing a system, now working with the
+specific data model and implementation details for a particular
+subsystem, now connecting higher-level concepts through their
+interfaces.
+
+For example, from the outside, code can simply ask for a new
+\texttt{Graph} instance:
+
+\begin{verbatim}
+>>> from contingent import graphlib
+>>> g = graphlib.Graph()
+\end{verbatim}
+
+without needing to understand the details of how \texttt{Graph} works.
+Code that is simply using the graph sees only interface verbs --- the
+method calls --- when manipulating a graph, as when an edge is added or
+some other operation is performed:
+
+\begin{verbatim}
+>>> g.add_edge('index.rst', 'index.html')
+>>> g.add_edge('tutorial.rst', 'tutorial.html')
+>>> g.add_edge('api.rst', 'api.html')
+\end{verbatim}
+
+Careful readers will have noticed that we added edges to our graph
+without explicitly creating ``node'' and ``edge'' objects, and that the
+nodes themselves in these early examples are simply strings. Coming from
+other languages and traditions, one might have expected to see
+user-defined classes and interfaces for everything in the system:
+
+\begin{verbatim}
+    Graph g = new ConcreteGraph();
+    Node indexRstNode = new StringNode("index.rst");
+    Node indexHtmlNode = new StringNode("index.html");
+    Edge indexEdge = new DirectedEdge(indexRstNode, indexHtmlNode);
+    g.addEdge(indexEdge);
+\end{verbatim}
+
+The Python language and community explicitly and intentionally emphasize
+using simple, generic data structures to solve problems, instead of
+creating custom classes for every minute detail of the problem we want
+to tackle. This is one facet of the notion of ``Pythonic'' solutions
+that you may have read about. Pythonic solutions try to minimize
+syntactic overhead and leverage Python's powerful built-in tools and
+extensive standard library.
+
+With these considerations in mind, let's return to the \texttt{Graph}
+class, examining its design and implementation to see the interplay
+between data structures and class interfaces. When a new \texttt{Graph}
+instance is constructed, a pair of dictionaries has already been built
+to store edges using the logic we outlined in the previous section:
+
+\begin{verbatim}
+class Graph:
+    """A directed graph of the relationships among build tasks."""
+
+    def __init__(self):
+        self._inputs_of = defaultdict(set)
+        self._consequences_of = defaultdict(set)
+\end{verbatim}
+
+The leading underscore in front of the attribute names
+\texttt{\_inputs\_of} and \texttt{\_consequences\_of} is a common
+convention in the Python community to signal that an attribute is
+private. This convention is one way the community suggests that
+programmers pass messages and warnings through space and time to each
+other.
+Recognizing the need to signal the difference between public and
+internal object attributes, the community adopted the single leading
+underscore as a concise and fairly consistent indicator to other
+programmers, including our future selves, that the attribute is best
+treated as part of the invisible internal machinery of the class.
+
+Why are we using a \texttt{defaultdict} instead of a standard dict? A
+common problem when composing dicts with other data structures is
+handling missing keys. With a normal dict, retrieving a key that does
+not exist raises a \texttt{KeyError}:
+
+\begin{verbatim}
+>>> consequences_of = {}
+>>> consequences_of['index.rst'].add('index.html')
+Traceback (most recent call last):
+  ...
+KeyError: 'index.rst'
+\end{verbatim}
+
+Using a normal dict requires special checks throughout the code to
+handle this specific case, for example when adding a new edge:
+
+\begin{verbatim}
+    # Special case to handle "we have not seen this task yet":
+
+    if input_task not in self._consequences_of:
+        self._consequences_of[input_task] = set()
+
+    self._consequences_of[input_task].add(consequence_task)
+\end{verbatim}
+
+This need is so common that Python includes a special utility, the
+\texttt{defaultdict}, which lets you provide a function that returns a
+value for absent keys. When we ask about an edge that the \texttt{Graph}
+hasn't yet seen, we will get back an empty \texttt{set} instead of an
+exception:
+
+\begin{verbatim}
+>>> from collections import defaultdict
+>>> consequences_of = defaultdict(set)
+>>> consequences_of['api.rst']
+set()
+\end{verbatim}
+
+Structuring our implementation this way means that each key's first use
+can look identical to the second and subsequent uses of that key:
+
+\begin{verbatim}
+>>> consequences_of['index.rst'].add('index.html')
+>>> 'index.html' in consequences_of['index.rst']
+True
+\end{verbatim}
+
+Given these techniques, let's examine the implementation of
+\texttt{add\_edge}, which we earlier used to build the graph for
+\aosafigref{500l.contingent.graph}.
+
+\begin{verbatim}
+    def add_edge(self, input_task, consequence_task):
+        """Add an edge: `consequence_task` uses the output of `input_task`."""
+        self._consequences_of[input_task].add(consequence_task)
+        self._inputs_of[consequence_task].add(input_task)
+\end{verbatim}
+
+This method hides the fact that two, not one, storage steps are required
+for each new edge so that we know about it in both directions. And
+notice how \texttt{add\_edge()} does not know or care whether either
+node has been seen before. Because the inputs and consequences data
+structures are each a \texttt{defaultdict(set)}, the
+\texttt{add\_edge()} method remains blissfully ignorant as to the
+novelty of a node --- the \texttt{defaultdict} takes care of the
+difference by creating a new \texttt{set} object on the fly. As we saw
+above, \texttt{add\_edge()} would be three times longer had we not used
+\texttt{defaultdict}. More importantly, it would be more difficult to
+understand and reason about the resulting code. This implementation
+demonstrates a Pythonic approach to problems: simple, direct, and
+concise.
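+
+We can watch this double bookkeeping happen if --- purely for
+illustration, since real callers should stick to the public methods ---
+we peek at the two private dictionaries after a single call:
+
+\begin{verbatim}
+>>> g = graphlib.Graph()
+>>> g.add_edge('index.rst', 'index.html')
+>>> g._inputs_of['index.html']
+{'index.rst'}
+>>> g._consequences_of['index.rst']
+{'index.html'}
+\end{verbatim}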
+
+Callers should also be given a simple way to visit every edge without
+having to learn how to traverse our data structure:
+
+\begin{verbatim}
+    def edges(self):
+        """Return all edges as ``(input_task, consequence_task)`` tuples."""
+        return [(a, b) for a in self.sorted(self._consequences_of)
+                       for b in self.sorted(self._consequences_of[a])]
+\end{verbatim}
+
+The \texttt{Graph.sorted()} method, if you want to examine it later,
+makes an attempt to sort the nodes in case they have a natural sort
+order (such as alphabetical) that can provide a stable output order for
+the user.
+
+By using this traversal method we can see that, following our three
+``add'' method calls earlier, \texttt{g} now represents the same graph
+that we saw in Figure~1.
+
+\begin{verbatim}
+>>> from pprint import pprint
+>>> pprint(g.edges())
+[('api.rst', 'api.html'),
+ ('index.rst', 'index.html'),
+ ('tutorial.rst', 'tutorial.html')]
+\end{verbatim}
+
+Since we now have a real live Python object, and not just a figure, we
+can ask it interesting questions! For example, when Contingent is
+building a blog from source files, it will need to know things like
+``What depends on \texttt{api.rst}?'' when the content of
+\texttt{api.rst} changes:
+
+\begin{verbatim}
+>>> g.immediate_consequences_of('api.rst')
+['api.html']
+\end{verbatim}
+
+This \texttt{Graph} is telling Contingent that, when \texttt{api.rst}
+changes, \texttt{api.html} is now stale and must be rebuilt. How about
+\texttt{index.html}?
+
+\begin{verbatim}
+>>> g.immediate_consequences_of('index.html')
+[]
+\end{verbatim}
+
+An empty list has been returned, signalling that \texttt{index.html} is
+at the right edge of the graph and so nothing further needs to be
+rebuilt if it changes. This query can be expressed very simply thanks to
+the work that has already gone into laying out our data:
+
+\begin{verbatim}
+    def immediate_consequences_of(self, task):
+        """Return the tasks that use `task` as an input."""
+        return self.sorted(self._consequences_of[task])
+\end{verbatim}
+
+This, incidentally, is also how the figures in this chapter are
+produced --- by rendering the graph itself:
+
+\begin{verbatim}
+    >>> from contingent.rendering import as_graphviz
+    >>> open('figure1.dot', 'w').write(as_graphviz(g)) and None
+\end{verbatim}
+
+\aosafigref{500l.contingent.graph} ignored one of the most important
+relationships that we discovered in the opening section of our chapter:
+the way that document titles appear in the table of contents. Let's fill
+in this detail. We will create a node for each title string that needs
+to be generated by parsing an input file and then passed to one of our
+other routines:
+
+\begin{verbatim}
+>>> g.add_edge('api.rst', 'api-title')
+>>> g.add_edge('api-title', 'index.html')
+>>> g.add_edge('tutorial.rst', 'tutorial-title')
+>>> g.add_edge('tutorial-title', 'index.html')
+\end{verbatim}
+
+The result is a graph (\aosafigref{500l.contingent.graph2}) that could
+properly handle rebuilding the table of contents that we discussed in
+the opening of this chapter.
+
+\aosafigure[240pt]{contingent-images/figure2.png}{Being prepared to rebuild `index.html` whenever any title that it mentions gets changed.}{500l.contingent.graph2}
+
+This manual walk-through illustrates what we will eventually have
+Contingent do for us: the graph \texttt{g} captures the inputs and
+consequences for the various artifacts in our project's documentation.
+
+\aosasecti{Learning Connections}\label{learning-connections}
+
+We now have a way for Contingent to keep track of tasks and the
+relationships between them.
+If we look more closely at Figure 2, however, we see that it is actually
+a little hand-wavy and vague: \emph{how} is \texttt{api.html} produced
+from \texttt{api.rst}? How do we know that \texttt{index.html} needs the
+title from the tutorial? And how is this dependency resolved?
+
+Our intuitive notion of these ideas served us well when we were
+constructing consequences graphs by hand, but unfortunately computers
+are not terribly intuitive, so we will need to be more precise about
+what we want.
+
+What are the steps required to produce output from sources? How are
+these steps defined and executed? And how can Contingent know the
+connections between them?
+
+In Contingent, build tasks are modeled as functions plus arguments. The
+functions define actions that a particular project understands how to
+perform. The arguments provide the specifics: \emph{which} source
+document should be read, \emph{which} blog title is needed. As they are
+running, these functions may in turn invoke \emph{other} task functions,
+passing whatever arguments they need answers for.
+
+To see how this works, we will actually now implement the documentation
+builder described at the beginning of the chapter. In order to prevent
+ourselves from wallowing around in a bog of details, for this
+illustration we will work with simplified input and output document
+formats. Our input documents will consist of a title on the first line,
+with the remainder of the text forming the body. Cross references will
+simply be source file names enclosed in back ticks, which on output are
+replaced with the title of the corresponding document.
+
+Here is the content of our example \texttt{index.txt}, \texttt{api.txt},
+and \texttt{tutorial.txt}, illustrating titles, document bodies, and
+cross-references from our little document format:
+
+\begin{verbatim}
+>>> index = """
+... Table of Contents
+... -----------------
+... * `tutorial.txt`
+... * `api.txt`
+... """
+
+>>> tutorial = """
+... Beginners Tutorial
+... ------------------
+... Welcome to the tutorial!
+... We hope you enjoy it.
+... """
+
+>>> api = """
+... API Reference
+... -------------
+... You might want to read
+... the `tutorial.txt` first.
+... """
+\end{verbatim}
+
+Now that we have some source material to work with, what functions would
+a Contingent-based blog builder need?
+
+In the simplistic examples above, the HTML output files proceed directly
+from the source, but in a realistic system, turning source into markup
+involves several steps: reading the raw text from disk, parsing the text
+to a convenient internal representation, processing any directives the
+author may have specified, resolving cross-references or other external
+dependencies (such as include files), and applying one or more view
+transformations to convert the internal representation to its output
+form.
+
+Contingent manages tasks by grouping them into a \texttt{Project}, a
+sort of build system busybody that injects itself into the middle of the
+build process, noting every time one task talks to another to construct
+a graph of the relationships between all the tasks.
+
+\begin{verbatim}
+>>> from contingent.projectlib import Project, Task
+>>> project = Project()
+>>> task = project.task
+\end{verbatim}
+
+A build system for the example given at the beginning of the chapter
+might involve a few basic tasks.
+
+Our \texttt{read()} task will pretend to read the files from disk.
+Since we really defined the source text in variables, all it needs to do
+is convert from a filename to the corresponding text.
+
+\begin{verbatim}
+ >>> filesystem = {'index.txt': index,
+ ...               'tutorial.txt': tutorial,
+ ...               'api.txt': api}
+ ...
+ >>> @task
+ ... def read(filename):
+ ...     return filesystem[filename]
+\end{verbatim}
+
+The \texttt{parse()} task interprets the raw text of the file contents
+according to the specification of our document format. Our format is
+very simple: the title of the document appears on the first line, and
+the rest of the content is considered the document's body.
+
+\begin{verbatim}
+ >>> @task
+ ... def parse(filename):
+ ...     lines = read(filename).strip().splitlines()
+ ...     title = lines[0]
+ ...     body = '\n'.join(lines[2:])
+ ...     return title, body
+\end{verbatim}
+
+Because the format is so simple, the parser is a little silly,
+admittedly, but it illustrates the interpretive responsibilities that
+parsers are required to carry out. Parsing in general is a very
+interesting subject, and many books have been dedicated, in whole or in
+part, to it. In a system like Sphinx, the parser must understand the
+many markup tokens, directives, and commands defined by the system,
+transforming the input text into something the rest of the system can
+work with.
+
+Notice the connection point between \texttt{parse()} and \texttt{read()}
+--- the first task in parsing is to pass the filename it has been given
+to \texttt{read()}, which finds and returns the contents of that file.
+
+The \texttt{title\_of()} task, given a source file name, returns the
+document's title:
+
+\begin{verbatim}
+ >>> @task
+ ... def title_of(filename):
+ ...     title, body = parse(filename)
+ ...     return title
+\end{verbatim}
+
+This task nicely illustrates the separation of responsibilities between
+the parts of a document processing system. The \texttt{title\_of()}
+function works directly from an in-memory representation of a document
+--- in this case, a tuple --- instead of taking it upon itself to parse
+the entire document again just to find the title. The \texttt{parse()}
+function alone produces the in-memory representation, in accordance with
+the contract of the system specification, and the rest of the blog
+builder processing functions like \texttt{title\_of()} simply use its
+output as their authority.
+
+If you are coming from an orthodox object-oriented tradition, this
+function-oriented design may look a little weird. In an OO solution,
+\texttt{parse()} would return some sort of \texttt{Document} object that
+has \texttt{title\_of()} as a method or property. In fact, Sphinx works
+exactly this way: its \texttt{Parser} subsystem produces a ``Docutils
+document tree'' object for the other parts of the system to use.
+
+Contingent is not opinionated with regard to these differing design
+paradigms and supports either approach equally well. For this chapter we
+are keeping things simple.
+
+The final task, \texttt{render()}, turns the in-memory representation of
+a document into an output form. It is, in effect, the inverse of
+\texttt{parse()}. Whereas \texttt{parse()} takes an input document
+conforming to a specification and converts it to an in-memory
+representation, \texttt{render()} takes an in-memory representation and
+produces an output document conforming to some specification.
+
+\begin{verbatim}
+ >>> import re
+ >>>
+ >>> LINK = '<a href="{}">{}</a>'
+ >>> PAGE = '<h1>{}</h1>\n<p>\n{}\n<p>'

+ >>>
+ >>> def make_link(match):
+ ...     filename = match.group(1)
+ ...     return LINK.format(filename, title_of(filename))
+ ...
+ >>> @task
+ ... def render(filename):
+ ...     title, body = parse(filename)
+ ...     body = re.sub(r'`([^`]+)`', make_link, body)
+ ...     return PAGE.format(title, body)
+\end{verbatim}
+
+Here is an example run that will invoke every stage of the above logic
+--- rendering \texttt{tutorial.txt} to produce its output:
+
+\begin{verbatim}
+>>> print(render('tutorial.txt'))

+<h1>Beginners Tutorial</h1>
+<p>
+Welcome to the tutorial!
+We hope you enjoy it.
+<p>
+\end{verbatim}
+
+\aosafigref{500l.contingent.graph3} illustrates the task graph that
+transitively connects all the tasks required to produce the output,
+from reading the input file, through parsing and transforming the
+document, to rendering the result:
+
+\aosafigure[240pt]{contingent-images/figure3.png}{A task graph.}{500l.contingent.graph3}
+
+It turns out that \aosafigref{500l.contingent.graph3} was not hand-drawn
+for this chapter, but has been generated directly from Contingent!
+Building this graph is possible for the \texttt{Project} object because
+it maintains its own call stack, similar to the stack of live execution
+frames that Python maintains to remember which function to continue
+running when the current one returns.
+
+Every time that a new task is invoked, Contingent can assume that it has
+been called --- and that its output will be used --- by the task
+currently at the top of the stack. Maintaining the stack will require
+that several extra steps surround the invocation of a task~\emph{T}:
+
+\begin{aosaenumerate}
+\def\labelenumi{\arabic{enumi}.}
+
+\item
+  Push \emph{T} onto the stack.
+\item
+  Execute \emph{T}, letting it call any other tasks it needs.
+\item
+  Pop \emph{T} off the stack.
+\item
+  Return its result.
+\end{aosaenumerate}
+
+To intercept task calls, the \texttt{Project} leverages a key Python
+feature: \emph{function decorators}. A~decorator is allowed to process
+or transform a function at the moment that it is being defined. The
+\texttt{Project.task} decorator uses this opportunity to package every
+task inside another function, a \emph{wrapper}, which allows a clean
+separation of responsibilities between the wrapper --- which will worry
+about graph and stack management on behalf of the Project --- and our
+task functions that focus on document processing. Here is what the
+\texttt{task} decorator boilerplate looks like:
+
+\begin{verbatim}
+    from functools import wraps
+
+    def task(function):
+        @wraps(function)
+        def wrapper(*args):
+            # wrapper body, that will call function()
+        return wrapper
+\end{verbatim}
+
+This is an entirely typical Python decorator declaration. It can then be
+applied to a function by naming it after a \texttt{@} character atop the
+\texttt{def} that creates the function:
+
+\begin{verbatim}
+    @task
+    def title_of(filename):
+        title, body = parse(filename)
+        return title
+\end{verbatim}
+
+When this definition is complete, the name \texttt{title\_of} will refer
+to the wrapped version of the function. The wrapper can access the
+original version of the function via the name \texttt{function}, calling
+it at the appropriate time. The body of the Contingent wrapper runs
+something like this:
+
+\begin{verbatim}
+    def task(self, function):
+        @wraps(function)
+        def wrapper(*args):
+            task = Task(wrapper, args)
+
+            if self._task_stack:
+                self._graph.add_edge(task, self._task_stack[-1])
+
+            self._graph.clear_inputs_of(task)
+            self._task_stack.append(task)
+            try:
+                value = function(*args)
+            finally:
+                self._task_stack.pop()
+
+            return value
+        return wrapper
+\end{verbatim}
+
+This wrapper performs several crucial maintenance steps:
+
+\begin{aosaenumerate}
+\def\labelenumi{\arabic{enumi}.}
+\item
+  Packages the task --- a function plus its arguments --- into a small
+  object for convenience. The \texttt{wrapper} here names the wrapped
+  version of the task function.
+\item
+  If this task has been invoked by a current task that is already
+  underway, add an edge capturing the fact that this task is an input to
+  the already-running task.
+\item
+  Forget whatever we might have learned last time about the task, since
+  it might make new decisions this time --- if the source text of the
+  API guide no longer mentions the Tutorial, for example, then its
+  \texttt{render()} will no longer ask for the \texttt{title\_of()} of
+  the Tutorial document.
+\item
+  Push this task onto the top of the task stack in case it decides, in
+  its turn, to invoke further tasks in the course of doing its work.
+\item
+  Invoke the task inside of a \texttt{try...finally} block that ensures
+  we correctly remove the finished task from the stack even if it dies
+  by raising an exception.
+\item
+  Return the task's return value, so that callers of this wrapper will
+  not be able to tell that they have not simply invoked the plain task
+  function itself.
+\end{aosaenumerate}
+
+Steps 4 and 5 maintain the task stack itself, which is then used by step
+2 to perform the consequences tracking that is our whole reason for
+building a task stack in the first place.
+
+Since each task gets surrounded by its own copy of the wrapper function,
+the mere invocation and execution of the normal stack of tasks will
+produce a graph of relationships as an invisible side effect. That is
+why we were careful to use the wrapper around every one of the
+processing steps we defined:
+
+\begin{verbatim}
+    @task
+    def read(filename):
+        ...  # body of read
+
+    @task
+    def parse(filename):
+        ...  # body of parse
+
+    @task
+    def title_of(filename):
+        ...  # body of title_of
+
+    @task
+    def render(filename):
+        ...  # body of render
+\end{verbatim}
+
+Thanks to these wrappers, when we called \texttt{parse('tutorial.txt')}
+the decorator learned the connection between \texttt{parse} and
+\texttt{read}. We can ask about the relationship by building another
+\texttt{Task} tuple and asking what the consequences would be if its
+output value changed:
+
+\begin{verbatim}
+>>> task = Task(read, ('tutorial.txt',))
+>>> print(task)
+read('tutorial.txt')
+>>> project._graph.immediate_consequences_of(task)
+[parse('tutorial.txt')]
+\end{verbatim}
+
+The consequence of re-reading the \texttt{tutorial.txt} file and finding
+its contents have changed is that we need to re-execute the
+\texttt{parse()} routine for that document. What happens if we render
+the entire set of documents? Will Contingent be able to learn the entire
+build process with its interrelationships?
+
+\begin{verbatim}
+>>> for filename in 'index.txt', 'tutorial.txt', 'api.txt':
+...     print(render(filename))
+...     print('=' * 30)
+...
+Table of Contents
+
+* Beginners Tutorial
+* API Reference
+
+==============================
+
+Beginners Tutorial
+
+Welcome to the tutorial!
+We hope you enjoy it.
+
+==============================
+
+API Reference
+
+You might want to read
+the Beginners Tutorial first.
+
+==============================
+\end{verbatim}
+
+It worked! From the output, we can see that our transform substituted
+the document titles for the directives in our source documents,
+indicating that Contingent was able to discover the connections between
+the various tasks needed to build our documents.
+
+\aosafigure[240pt]{contingent-images/figure4.png}{The complete set of relationships
+  between our input files and our HTML outputs.}{500l.contingent.graph4}
+
+By watching one task invoke another through the \texttt{task} wrapper
+machinery, \texttt{Project} has automatically learned the graph of
+inputs and consequences. Since it has a complete consequences graph at
+its disposal, Contingent knows all the things to rebuild if the inputs
+to any tasks change.
+
+\aosasecti{Chasing Consequences}\label{chasing-consequences}
+
+Once the initial build has run to completion, Contingent needs to
+monitor the input files for changes. When the user finishes a new edit
+and runs ``Save,'' both the \texttt{read()} task and its consequences
+need to be invoked.
+
+This will require us to walk the graph in the opposite order from the
+one in which it was created. It was built, you will recall, by calling
+\texttt{render()} for the API Reference and having that call
+\texttt{parse()} which finally invoked the \texttt{read()} task. Now we
+go in the other direction: we know that \texttt{read()} will now return
+new content, and we need to figure out what consequences lie downstream.
+
+The process of compiling consequences is a recursive one, as each
+consequence can itself have further tasks that depended on it. We could
+perform this recursion manually through repeated calls to the graph
+(note that we are here taking advantage of the fact that the Python
+prompt saves the last value displayed under the name \texttt{\_} for use
+in the subsequent expression):
+
+\begin{verbatim}
+>>> task = Task(read, ('api.txt',))
+>>> project._graph.immediate_consequences_of(task)
+[parse('api.txt')]
+>>> t1, = _
+>>> project._graph.immediate_consequences_of(t1)
+[render('api.txt'), title_of('api.txt')]
+>>> t2, t3 = _
+>>> project._graph.immediate_consequences_of(t2)
+[]
+>>> project._graph.immediate_consequences_of(t3)
+[render('index.txt')]
+>>> t4, = _
+>>> project._graph.immediate_consequences_of(t4)
+[]
+\end{verbatim}
+
+This recursive task of looking repeatedly for immediate consequences and
+only stopping when we arrive at tasks with no further consequences is a
+basic enough graph operation that it is supported directly by a method
+on the \texttt{Graph} class:
+
+\begin{verbatim}
+>>> # Secretly adjust pprint to a narrower-than-usual width:
+>>> _pprint = pprint
+>>> pprint = lambda x: _pprint(x, width=40)
+>>> pprint(project._graph.recursive_consequences_of([task]))
+[parse('api.txt'),
+ render('api.txt'),
+ title_of('api.txt'),
+ render('index.txt')]
+\end{verbatim}
+
+In fact, \texttt{recursive\_consequences\_of()} tries to be a bit
+clever. If a particular task appears repeatedly as a downstream
+consequence of several other tasks, then it is careful to only mention
+it once in the output list, and to move it close to the end so that it
+appears only after the tasks that are its inputs. This intelligence is
+powered by the classic depth-first implementation of a topological sort,
+an algorithm which winds up being fairly easy to write in Python through
+a hidden recursive helper function. Check out the \texttt{graphlib.py}
+source code for the details.
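+
+Here is a rough sketch of the idea, written as a standalone function
+rather than as the actual \texttt{Graph} method. The structure below is
+illustrative, and the real implementation in \texttt{graphlib.py}
+differs in its details:
+
+\begin{verbatim}
+    def recursive_consequences_of(graph, tasks):
+        # Post-order depth-first walk that lists each downstream task
+        # exactly once, and only after the tasks that are its inputs.
+        result, seen = [], set()
+
+        def visit(task):
+            for consequence in graph.immediate_consequences_of(task):
+                if consequence not in seen:
+                    seen.add(consequence)
+                    visit(consequence)
+                    result.append(consequence)
+
+        for task in tasks:
+            visit(task)
+        result.reverse()
+        return result
+\end{verbatim}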
+
+If upon detecting a change we are careful to re-run every task in the
+recursive consequences, then Contingent will be able to avoid rebuilding
+too little. Our second challenge, however, was to avoid rebuilding too
+much. Refer again to \aosafigref{500l.contingent.graph4}. We want to
+avoid rebuilding all three documents every time that
+\texttt{tutorial.txt} is changed, since most edits will probably not
+affect its title but only its body. How can this be accomplished?
+
+The solution is to make graph recomputation dependent on caching. When
+stepping forward through the recursive consequences of a change, we will
+only invoke tasks whose inputs are different than last time.
+
+This optimization will involve a final data structure. We will give the
+\texttt{Project} a \texttt{\_todo} set with which to remember every task
+for which at least one input value has changed, and that therefore
+requires re-execution. Because only tasks in \texttt{\_todo} are
+out-of-date, the build process can skip running any other tasks unless
+they appear there.
+
+Again, Python's convenient and unified design makes these features very
+easy to code. Because task objects are hashable, \texttt{\_todo} can
+simply be a set that remembers task items by identity --- guaranteeing
+that a task never appears twice --- and the \texttt{\_cache} of return
+values from previous runs can be a dict with tasks as keys.
+
+More precisely, the rebuild step must keep looping as long as
+\texttt{\_todo} is non-empty. During each loop, it should:
+
+\begin{aosaitemize}
+\item
+  Call \texttt{recursive\_consequences\_of()} and pass in every task
+  listed in \texttt{\_todo}. The return value will be a list of not only
+  the \texttt{\_todo} tasks themselves, but also every task downstream
+  of them --- every task, in other words, that could possibly need
+  re-execution if the outputs come out different this time.
+\item
+  For each task in the list, check whether it is listed in
+  \texttt{\_todo}. If not, then we can skip running it, because none of
+  the tasks that we have re-invoked upstream of it has produced a new
+  return value that would require the task's recomputation.
+\item
+  But for any task that is indeed listed in \texttt{\_todo} by the time
+  we reach it, we need to ask it to re-run and re-compute its return
+  value. If the task wrapper function detects that this return value
+  does not match the old cached value, then its downstream tasks will be
+  automatically added to \texttt{\_todo} before we reach them in the
+  list of recursive consequences.
+\end{aosaitemize}
+
+By the time we reach the end of the list, every task that could possibly
+need to be re-run should in fact have been re-run. But just in case, we
+will check \texttt{\_todo} and try again if it is not yet empty. Even
+for very rapidly changing dependency trees, this should quickly settle
+out. Only a cycle --- where, for example, task \emph{A} needs the output
+of task \emph{B} which itself needs the output of task \emph{A} ---
+could keep the builder in an infinite loop, and only if their return
+values never stabilize. Fortunately, real-world build tasks are
+typically without cycles.
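+
+Put together, a rebuild loop along these lines might look like the
+following sketch. Beyond what we have seen so far, it assumes that a
+\texttt{Task} unpacks into its wrapped function and its argument tuple,
+and that the wrapper compares each new return value against
+\texttt{\_cache}, adding downstream tasks to \texttt{\_todo} when the
+value has changed:
+
+\begin{verbatim}
+    def rebuild(self):
+        while self._todo:
+            tasks = self._graph.recursive_consequences_of(list(self._todo))
+            for task in tasks:
+                if task not in self._todo:
+                    continue  # inputs unchanged; skip this task
+                self._todo.discard(task)
+                function, args = task
+                # Re-running the wrapper re-computes the value and, if
+                # it differs from the cached one, adds this task's
+                # consequences back into self._todo.
+                function(*args)
+\end{verbatim}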
+
+Let us trace the behavior of this system through an example.
+
+Suppose you edit \texttt{tutorial.txt} and change both the title and the
+body content. We can simulate this by modifying the value in our
+\texttt{filesystem} dict:
+
+\begin{verbatim}
+>>> filesystem['tutorial.txt'] = """
+... The Coder Tutorial
+... ------------------
+... This is a new and improved
+... introductory paragraph.
+... """
+\end{verbatim}
+
+Now that the contents have changed, we can ask the Project to re-run the
+\texttt{read()} task by using its \texttt{cache\_off()} context manager
+that temporarily disables its willingness to return its old cached
+result for a given task and argument:
+
+\begin{verbatim}
+>>> with project.cache_off():
+...     text = read('tutorial.txt')
+\end{verbatim}
+
+The new tutorial text has now been read into the cache. How many
+downstream tasks will need to be re-executed?
+
+To help us answer this question, the \texttt{Project} class supports a
+simple tracing facility that will tell us which tasks are executed in
+the course of a rebuild. Since the above change to \texttt{tutorial.txt}
+affects both its body and its title, everything downstream will need to
+be re-computed:
+
+\begin{verbatim}
+>>> project.start_tracing()
+>>> project.rebuild()
+>>> print(project.stop_tracing())
+calling parse('tutorial.txt')
+calling render('tutorial.txt')
+calling title_of('tutorial.txt')
+calling render('api.txt')
+calling render('index.txt')
+\end{verbatim}
+
+Looking back at \aosafigref{500l.contingent.graph4}, you can see that,
+as expected, this is every task that is an immediate or downstream
+consequence of \texttt{read('tutorial.txt')}.
+
+But what if we edit it again, this time leaving the title the same?
+
+\begin{verbatim}
+>>> filesystem['tutorial.txt'] = """
+... The Coder Tutorial
+... ------------------
+... Welcome to the coder tutorial!
+... It should be read top to bottom.
+... """
+>>> with project.cache_off():
+...     text = read('tutorial.txt')
+\end{verbatim}
+
+This small, limited change should have no effect on the other documents.
+
+\begin{verbatim}
+>>> project.start_tracing()
+>>> project.rebuild()
+>>> print(project.stop_tracing())
+calling parse('tutorial.txt')
+calling render('tutorial.txt')
+calling title_of('tutorial.txt')
+\end{verbatim}
+
+Success! Only one document got rebuilt. The fact that
+\texttt{title\_of()}, given a new input document, nevertheless returned
+the same value means that all further downstream tasks were insulated
+from the change and did not get re-invoked.
+
+\aosasecti{Conclusion}\label{conclusion}
+
+There exist languages and programming methodologies under which
+Contingent would be a suffocating forest of tiny classes giving useless
+and verbose names to every concept in the problem domain.
+
+When programming Contingent in Python, however, we skipped the creation
+of a dozen classes that could have existed, like \texttt{TaskArgument}
+and \texttt{CachedResult} and \texttt{ConsequenceList}. We instead drew
+upon Python's strong tradition of solving generic problems with generic
+data structures, resulting in code that repeatedly uses a small set of
+ideas from the core data structures: tuple, list, set, and dict.
+
+But does this not cause a problem?
+
+Generic data structures are also, by their nature, anonymous. Our
+\texttt{project.\_todo} is a set. So is every collection of upstream
+and downstream nodes inside the \texttt{Graph}. Are we in danger of
+seeing generic \texttt{set} error messages and not knowing whether to
+look in the project or the graph implementation for the error?
+
+In fact, we are not in danger!
+
+Thanks to the careful discipline of encapsulation --- of only allowing
+\texttt{Graph} code to touch the graph's sets, and \texttt{Project} code
+to touch the project's set --- there will never be ambiguity if a set
+operation returns an error during a later phase of the project.
+The name of the innermost executing method at the moment of the error
+will necessarily direct us to exactly the class, and set, involved in
+the mistake. There is no need to create a subclass of \texttt{set} for
+every possible application of the data type, so long as we put that
+conventional underscore in front of data structure attributes and then
+are careful not to touch them from code outside of the class.
+
+Contingent demonstrates how crucial the Facade pattern, from the epochal
+\emph{Design Patterns} book, is for a well-designed Python program. Not
+every data structure and fragment of data in a Python program gets to be
+its own class. Instead, classes are used sparingly, at conceptual pivots
+in the code where a big idea --- like the idea of a dependency graph ---
+can be wrapped up into a Facade that hides the details of the simple
+generic data structures that lie beneath it.
+
+Code outside of the Facade names the big concepts that it needs and the
+operations that it wants to perform. Inside of the Facade, the
+programmer manipulates the small and convenient moving parts of the
+Python programming language to make the operations happen.
+
+\end{aosachapter}
diff --git a/tex/ocr.tex b/tex/ocr.tex
new file mode 100644
index 000000000..818e730ba
--- /dev/null
+++ b/tex/ocr.tex
@@ -0,0 +1,851 @@
+\begin{aosachapter}{Optical Character Recognition (OCR)}{s:ocr}{Marina Samuel}
+
+\aosasecti{Introduction}\label{introduction}
+
+What if your computer could wash your dishes, do your laundry, cook you
+dinner, and clean your home? I think I can safely say that most people
+would be happy to get a helping hand! But what would it take for a
+computer to be able to perform these tasks, in exactly the same way that
+humans can?
+
+The famous computer scientist Alan Turing proposed the Turing Test as a
+way to identify whether a machine could have intelligence
+indistinguishable from that of a human being. The test involves a human
+posing questions to two hidden entities, one human, and the other a
+machine, and trying to identify which is which. If the interrogator is
+unable to identify the machine, then the machine is considered to have
+human-level intelligence.
+
+While there is a lot of controversy surrounding whether the Turing Test
+is a valid assessment of intelligence, and whether we can build such
+intelligent machines, there is no doubt that machines with some degree
+of intelligence already exist. There is currently software that helps
+robots navigate an office and perform small tasks, or helps those
+suffering from Alzheimer's. More common examples of Artificial
+Intelligence (AI) are the way that Google estimates what you're looking
+for when you search for some keywords, or the way that Facebook decides
+what to put in your news feed.
+
+One well-known application of AI is Optical Character Recognition
+(OCR). An OCR system is a piece of software that can take images of
+handwritten characters as input and interpret them into machine-readable
+text. While you may not think twice when depositing a handwritten cheque
+into a bank machine that confirms the deposit value, there is some
+interesting work going on in the background. This chapter will examine a
+working example of a simple OCR system that recognizes numerical digits
+using an Artificial Neural Network (ANN). But first, let's establish a
+bit more context.
+
+\aosasecti{What is Artificial
+Intelligence?}\label{what-is-artificial-intelligence}
+
+\label{sec.ocr.ai} While Turing's definition of intelligence sounds
+reasonable, at the end of the day what constitutes intelligence is
+fundamentally a philosophical debate. Computer scientists have, however,
+categorized certain types of systems and algorithms into branches of AI.
+Each branch is used to solve certain sets of problems. These branches
+include the following examples, as well as
+\href{http://www-formal.stanford.edu/jmc/whatisai/node2.html}{many
+others}:
+
+\begin{aosaitemize}
+
+\item
+  Logical and probabilistic deduction and inference based on some
+  predefined knowledge of a world. e.g.
+  \href{http://www.cs.princeton.edu/courses/archive/fall07/cos436/HIDDEN/Knapp/fuzzy004.htm}{Fuzzy
+  inference} can help a thermostat decide when to turn on the air
+  conditioning when it detects that the temperature is hot and the
+  atmosphere is humid
+\item
+  Heuristic search. e.g.~Searching can be used to find the best possible
+  next move in a game of chess by searching all possible moves and
+  choosing the one that most improves your position
+\item
+  Machine learning (ML) with feedback models. e.g.~Pattern-recognition
+  problems like OCR.
+\end{aosaitemize}
+
+In general, ML involves using large data sets to train a system to
+identify patterns. The training data sets may be labelled, meaning the
+system's expected outputs are specified for given inputs, or unlabelled,
+meaning expected outputs are not specified. Algorithms that train
+systems with unlabelled data are called \emph{unsupervised} algorithms
+and those that train with labelled data are called \emph{supervised}.
+Although many ML algorithms and techniques exist for creating OCR
+systems, ANNs are one simple approach.
+
+\aosasecti{Artificial Neural Networks}\label{artificial-neural-networks}
+
+\aosasectii{What Are ANNs?}\label{what-are-anns}
+
+\label{sec.ocr.ann} An ANN is a structure consisting of interconnected
+nodes that communicate with one another. The structure and its
+functionality are inspired by neural networks found in a biological
+brain.
+\href{http://www.nbb.cornell.edu/neurobio/linster/BioNB420/hebb.pdf}{Hebbian
+Theory} explains how these networks can learn to identify patterns by
+physically altering their structure and link strengths. Similarly, a
+typical ANN (shown in \aosafigref{500l.ocr.ann}) has connections between
+nodes that have a weight which is updated as the network learns. The
+nodes labelled ``+1'' are called \emph{biases}. The leftmost blue column
+of nodes contains \emph{input nodes}, the middle column contains
+\emph{hidden nodes}, and the rightmost column contains \emph{output
+nodes}. There may be many columns of hidden nodes, known as \emph{hidden
+layers}.
+
+\aosafigure[360pt]{ocr-images/ann.png}{An Artificial Neural Network}{500l.ocr.ann}
+
+The values inside all of the circular nodes in \aosafigref{500l.ocr.ann}
+represent the output of the node. If we call the output of the $n$th
+node from the top in layer $L$ as $a^{(L)}_n$, and the weight of the
+connection between the $i$th node in layer $L$ and the $j$th node in
+layer $L+1$ as $w^{(L)}_{ji}$, then the output of node $a^{(2)}_2$ is:
+
+\[
+a^{(2)}_2 = f(w^{(1)}_{21}x_1 + w^{(1)}_{22}x_2 + b^{(1)}_{2})
+\]
+
+where $f(.)$ is known as the \emph{activation function} and $b$ is the
+\emph{bias}. An activation function is the decision-maker for what type
+of output a node has. A bias is an additional node with a fixed output
+of 1 that may be added to an ANN to improve its accuracy. We'll see more
+details on both of these in \aosasecref{sec.ocr.feedforward}.
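+
+To make the formula concrete, here is a tiny numerical sketch of
+computing $a^{(2)}_2$. Every weight and input below is a made-up value,
+and the sigmoid is just one possible choice for $f(.)$; it happens to be
+the one we will settle on later in the chapter:
+
+\begin{verbatim}
+    import math
+
+    def sigmoid(z):
+        # One common activation function f(.)
+        return 1.0 / (1.0 + math.exp(-z))
+
+    w21, w22 = 0.5, -0.3   # made-up weights w^(1)_21 and w^(1)_22
+    x1, x2 = 1.0, 0.8      # made-up outputs of the two input nodes
+    b2 = 0.1               # made-up bias weight b^(1)_2
+
+    a2_2 = sigmoid(w21 * x1 + w22 * x2 + b2)   # about 0.589
+\end{verbatim}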
+
+This type of network topology is called a feedforward neural network
+because there are no cycles in the network. ANNs with nodes whose
+outputs feed into their inputs are called recurrent neural networks.
+There are many algorithms that can be applied to train feedforward ANNs;
+one commonly used algorithm is called \emph{backpropagation}. The OCR
+system we will implement in this chapter will use backpropagation.
+
+\aosasectii{How Do We Use ANNs?}\label{how-do-we-use-anns}
+
+Like most other ML approaches, the first step for using backpropagation
+is to decide how to transform or reduce our problem into one that can be
+solved by an ANN. In other words, how can we manipulate our input data
+so we can feed it into the ANN? For the case of our OCR system, we can
+use the positions of the pixels for a given digit as input. It is worth
+noting that, oftentimes, choosing the input data format is not this
+simple. If we were analyzing large images to identify shapes in them,
+for instance, we may need to pre-process the image to identify contours
+within it. These contours would be the input.
+
+Once we've decided on our input data format, what's next? Since
+backpropagation is a supervised algorithm, it will need to be trained
+with labelled data, as mentioned in \aosasecref{sec.ocr.ai}. Thus, when
+passing the pixel positions as training input, we must also pass the
+associated digit. This means that we must find or gather a large data
+set of drawn digits and associated values.
+
+The next step is to partition the data set into a training set and a
+validation set. The training data is used to run the backpropagation
+algorithm to set the weights of the ANN. The validation data is used to
+make predictions using the trained network and compute its accuracy. If
+we were comparing the performance of backpropagation vs.~another
+algorithm on our data, we would
+\href{http://www-group.slac.stanford.edu/sluo/Lectures/stat_lecture_files/sluo2006lec7.pdf}{split
+the data} into 50\% for training, 25\% for comparing performance of the
+two algorithms (validation set) and the final 25\% for testing accuracy
+of the chosen algorithm (test set). Since we're not comparing
+algorithms, we can group one of the 25\% sets as part of the training
+set and use 75\% of the data to train the network and 25\% for
+validating that it was trained well.
+
+The purpose of identifying the accuracy of the ANN is two-fold. First,
+it is to avoid the problem of \emph{overfitting}. Overfitting occurs
+when the network has a much higher accuracy on predicting the training
+set than the validation set. Overfitting tells us that the chosen
+training data does not generalize well enough and needs to be refined.
+Secondly, testing the accuracy of several different numbers of hidden
+layers and hidden nodes helps in designing an optimal ANN size. An
+optimal ANN size will have enough hidden nodes and layers to make
+accurate predictions but also as few nodes/connections as possible to
+reduce computational overhead that may slow down training and
+predictions. Once the optimal size has been decided and the network has
+been trained, it's ready to make predictions!
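+
+Later code in this chapter works with index lists such as
+\texttt{train\_indices} and \texttt{test\_indices}. As a minimal
+illustrative sketch (the helper name and sample count below are invented
+for this example), a random 75/25 split might be produced like this:
+
+\begin{verbatim}
+    import numpy as np
+
+    def partition_indices(sample_count, train_fraction=0.75):
+        # Shuffle the sample indices, then split them into a
+        # training portion and a validation portion.
+        indices = np.random.permutation(sample_count)
+        cutoff = int(sample_count * train_fraction)
+        return indices[:cutoff], indices[cutoff:]
+
+    train_indices, test_indices = partition_indices(5000)
+\end{verbatim}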
+
+\aosasecti{Design Decisions in a Simple OCR
+System}\label{design-decisions-in-a-simple-ocr-system}
+
+\label{sec.ocr.decisions} In the last few paragraphs we've gone over
+some of the basics of feedforward ANNs and how to use them. Now it's
+time to talk about how we can build an OCR system.
+
+First off, we must decide what we want our system to be able to do. To
+keep things simple, let's allow users to draw a single digit, and let
+them either train the OCR system with that drawn digit or request that
+the system predict what the drawn digit is. While an OCR system could
+run locally on a single machine, having a client-server setup gives much
+more flexibility. It makes crowd-sourced training of an ANN possible and
+allows powerful servers to handle intensive computations.
+
+Our OCR system will consist of 5 main components, divided into 5 files.
+There will be:
+
+\begin{aosaitemize}
+
+\item
+  a client (\texttt{ocr.js})
+\item
+  a server (\texttt{server.py})
+\item
+  a simple user interface (\texttt{ocr.html})
+\item
+  an ANN trained via backpropagation (\texttt{ocr.py})
+\item
+  an ANN design script (\texttt{neural\_network\_design.py})
+\end{aosaitemize}
+
+The user interface will be simple: a canvas to draw digits on and
+buttons to either train the ANN or request a prediction. The client will
+gather the drawn digit, translate it into an array, and pass it to the
+server to be processed either as a training sample or as a prediction
+request. The server will simply route the training or prediction request
+by making API calls to the ANN module. The ANN module will train the
+network with an existing data set on its first initialization. It will
+then save the ANN weights to a file and re-load them on subsequent
+startups. This module is where the core of training and prediction logic
+happens. Finally, the design script is for experimenting with different
+hidden node counts and deciding what works best. Together, these pieces
+give us a very simplistic, but functional OCR system.
+
+Now that we've thought about how the system will work at a high level,
+it's time to put the concepts into code!
+
+\aosasectii{A Simple Interface
+(\texttt{ocr.html})}\label{a-simple-interface-ocr.html}
+
+As mentioned earlier, the first step is to gather data for training the
+network. We could upload a sequence of hand-written digits to the
+server, but that would be awkward. Instead, we could have users actually
+handwrite the digits on the page using an HTML canvas. We could then
+give them a couple of options to either train or test the network, where
+training the network also involves specifying what digit was drawn. This
+way it is possible to easily outsource the data collection by pointing
+people to a website to receive their input. Here's some HTML to get us
+started.
+
+\begin{verbatim}
+<html>
+<head>
+    <script src="ocr.js"></script>
+    <link rel="stylesheet" type="text/css" href="ocr.css">
+</head>
+<body onload="ocrDemo.onLoadFunction()">
+    <div id="main-container" style="text-align: center;">
+        <h1>OCR Demo</h1>
+        <canvas id="canvas" width="200" height="200"></canvas>
+        <form name="input">
+            <p>Digit: <input id="digit" type="text"></p>
+            <input type="button" value="Train" onclick="ocrDemo.train()">
+            <input type="button" value="Test" onclick="ocrDemo.test()">
+            <input type="button" value="Reset" onclick="ocrDemo.resetCanvas();"/>
+        </form>
+    </div>
+</body>
+</html>
+\end{verbatim}
+
+\aosasectii{An OCR Client (\texttt{ocr.js})}\label{an-ocr-client-ocr.js}
+
+Since a single pixel on an HTML canvas might be hard to see, we can
+represent a single pixel for the ANN input as a square of 10x10 real
+pixels. Thus the real canvas is 200x200 pixels and it is represented by
+a 20x20 canvas from the perspective of the ANN. The variables below will
+help us keep track of these measurements.
+
+\begin{verbatim}
+var ocrDemo = {
+    CANVAS_WIDTH: 200,
+    TRANSLATED_WIDTH: 20,
+    PIXEL_WIDTH: 10, // TRANSLATED_WIDTH = CANVAS_WIDTH / PIXEL_WIDTH
+\end{verbatim}
+
+We can then outline the pixels in the new representation so they are
+easier to see. Here we have a blue grid generated by
+\texttt{drawGrid()}.
+
+\begin{verbatim}
+    drawGrid: function(ctx) {
+        for (var x = this.PIXEL_WIDTH, y = this.PIXEL_WIDTH;
+                 x < this.CANVAS_WIDTH; x += this.PIXEL_WIDTH,
+                 y += this.PIXEL_WIDTH) {
+            ctx.strokeStyle = this.BLUE;
+            ctx.beginPath();
+            ctx.moveTo(x, 0);
+            ctx.lineTo(x, this.CANVAS_WIDTH);
+            ctx.stroke();
+
+            ctx.beginPath();
+            ctx.moveTo(0, y);
+            ctx.lineTo(this.CANVAS_WIDTH, y);
+            ctx.stroke();
+        }
+    },
+\end{verbatim}
+
+We also need to store the data drawn on the grid in a form that can be
+sent to the server. For simplicity, we can have an array called
+\texttt{data} which labels an uncoloured, black pixel as \texttt{0} and
+a coloured white pixel as \texttt{1}. We also need some mouse listeners
+on the canvas so we know when to call \texttt{fillSquare()} to colour a
+pixel white while a user is drawing a digit. These listeners should keep
+track of whether we are in a drawing state and then call
+\texttt{fillSquare()} to do some simple math and decide which pixels
+need to be filled in.
+
+\begin{verbatim}
+    onMouseMove: function(e, ctx, canvas) {
+        if (!canvas.isDrawing) {
+            return;
+        }
+        this.fillSquare(ctx,
+            e.clientX - canvas.offsetLeft, e.clientY - canvas.offsetTop);
+    },
+
+    onMouseDown: function(e, ctx, canvas) {
+        canvas.isDrawing = true;
+        this.fillSquare(ctx,
+            e.clientX - canvas.offsetLeft, e.clientY - canvas.offsetTop);
+    },
+
+    onMouseUp: function(e, canvas) {
+        canvas.isDrawing = false;
+    },
+
+    fillSquare: function(ctx, x, y) {
+        var xPixel = Math.floor(x / this.PIXEL_WIDTH);
+        var yPixel = Math.floor(y / this.PIXEL_WIDTH);
+        this.data[((xPixel - 1) * this.TRANSLATED_WIDTH + yPixel) - 1] = 1;
+
+        ctx.fillStyle = '#ffffff';
+        ctx.fillRect(xPixel * this.PIXEL_WIDTH, yPixel * this.PIXEL_WIDTH,
+            this.PIXEL_WIDTH, this.PIXEL_WIDTH);
+    },
+\end{verbatim}
+
+Now we're getting closer to the juicy stuff! We need a function that
+prepares training data to be sent to the server. Here we have a
+relatively straightforward \texttt{train()} function that does some
+error checking on the data to be sent, adds it to \texttt{trainArray}
+and sends it off by calling \texttt{sendData()}. An interesting design
+worth noting here is the use of \texttt{trainingRequestCount},
+\texttt{trainArray}, and \texttt{BATCH\_SIZE}.
+
+\begin{verbatim}
+    train: function() {
+        var digitVal = document.getElementById("digit").value;
+        if (!digitVal || this.data.indexOf(1) < 0) {
+            alert("Please type and draw a digit value in order to train the network");
+            return;
+        }
+        this.trainArray.push({"y0": this.data, "label": parseInt(digitVal)});
+        this.trainingRequestCount++;
+
+        // Time to send a training batch to the server.
+        if (this.trainingRequestCount == this.BATCH_SIZE) {
+            alert("Sending training data to server...");
+            var json = {
+                trainArray: this.trainArray,
+                train: true
+            };
+
+            this.sendData(json);
+            this.trainingRequestCount = 0;
+            this.trainArray = [];
+        }
+    },
+\end{verbatim}
+
+What's happening here is that \texttt{BATCH\_SIZE} is some pre-defined
+constant for how much training data a client will keep track of before
+it sends a batched request to the server to be processed by the OCR. The
+main reason to batch requests is to avoid overwhelming the server with
+many requests at once. If many clients exist (e.g.~many users are on the
+\texttt{ocr.html} page training the system), or if another layer existed
+in the client that took scanned drawn digits and translated them to
+pixels to train the network, a \texttt{BATCH\_SIZE} of 1 would result in
+many unnecessary requests. This approach gives more flexibility to the
+client; in practice, however, batching should also take place on the
+server when needed. Otherwise, a denial of service (DoS) attack could
+occur in which a malicious client purposely sends many requests to the
+server to overwhelm it so that it breaks down.
+
+We will also need a \texttt{test()} function. Similar to
+\texttt{train()}, it should do a simple check on the validity of the
+data and send it off. For \texttt{test()}, however, no batching occurs
+since users should be able to request a prediction and get immediate
+results.
+
+\begin{verbatim}
+    test: function() {
+        if (this.data.indexOf(1) < 0) {
+            alert("Please draw a digit in order to test the network");
+            return;
+        }
+        var json = {
+            image: this.data,
+            predict: true
+        };
+        this.sendData(json);
+    },
+\end{verbatim}
+
+Finally, we will need some functions to make an HTTP POST request,
+receive a response, and handle any potential errors along the way.
+
+\begin{verbatim}
+    receiveResponse: function(xmlHttp) {
+        if (xmlHttp.status != 200) {
+            alert("Server returned status " + xmlHttp.status);
+            return;
+        }
+        var responseJSON = JSON.parse(xmlHttp.responseText);
+        if (xmlHttp.responseText && responseJSON.type == "test") {
+            alert("The neural network predicts you wrote a \'" +
+                responseJSON.result + '\'');
+        }
+    },
+
+    onError: function(e) {
+        alert("Error occurred while connecting to server: " + e.target.statusText);
+    },
+
+    sendData: function(json) {
+        var xmlHttp = new XMLHttpRequest();
+        xmlHttp.open('POST', this.HOST + ":" + this.PORT, false);
+        xmlHttp.onload = function() { this.receiveResponse(xmlHttp); }.bind(this);
+        xmlHttp.onerror = function() { this.onError(xmlHttp) }.bind(this);
+        var msg = JSON.stringify(json);
+        xmlHttp.setRequestHeader('Content-length', msg.length);
+        xmlHttp.setRequestHeader("Connection", "close");
+        xmlHttp.send(msg);
+    }
+\end{verbatim}
+
+\aosasectii{A Server (\texttt{server.py})}\label{a-server-server.py}
+
+Despite being a small server that simply relays information, we still
+need to consider how to receive and handle the HTTP requests. First we
+need to decide what kind of HTTP request to use. In the last section,
+the client used POST, but why did we decide on this? Since data is
+being sent to the server, a PUT or POST request makes the most sense. We
+only need to send a JSON body and no URL parameters. So in theory, a GET
+request could have worked as well but would not make sense semantically.
+The choice between PUT and POST, however, is a long, ongoing debate
+among programmers; KNPLabs summarizes the issues
+\href{https://knpuniversity.com/screencast/rest/put-versus-post}{with
+humour}.
+
+Another consideration is whether to send the ``train'' vs. ``predict''
+requests to different endpoints (e.g. \texttt{http://localhost/train}
+and \texttt{http://localhost/predict}) or the same endpoint which then
+processes the data separately. In this case, we can go with the latter
+approach since the difference between what is done with the data in each
+case is minor enough to fit into a short if statement. In practice, it
+would be better to have these as separate endpoints if the server were
+to do any more detailed processing for each request type. This decision,
+in turn, impacted which server error codes were used when. For example,
+a 400 ``Bad Request'' error is sent when neither ``train'' nor
+``predict'' is specified in the payload. If separate endpoints were used
+instead, this would not be an issue. The processing done in the
+background by the OCR system may fail for any reason and if it's not
+handled correctly within the server, a 500 ``Internal Server Error'' is
+sent. Again, if the endpoints were separated, there would have been more
+room to go into detail to send more appropriate errors, such as
+identifying that an internal server error was actually caused by a bad
+request.
+
+Finally, we need to decide when and where to initialize the OCR system.
+A good approach would be to initialize it within \texttt{server.py} but
+before the server is started. This is because the OCR system needs to
+train the network on some pre-existing data the first time it starts,
+and this may take a few minutes. If the server started before this
+processing was complete, any requests to train or predict would throw an
+exception since the OCR object would not yet have been initialized,
+given the current implementation. Another possible implementation could
+create some inaccurate initial ANN to be used for the first few queries
+while the new ANN is asynchronously trained in the background. This
+alternative approach does allow the ANN to be used immediately, but the
+implementation is more complex and it would only save time on server
+startup if the servers are reset. This type of implementation would be
+more beneficial for an OCR service that requires high availability.
+
+Here we have the majority of our server code in one short function that
+handles POST requests.
+
+\begin{verbatim}
+    def do_POST(s):
+        response_code = 200
+        response = ""
+        var_len = int(s.headers.get('Content-Length'))
+        content = s.rfile.read(var_len)
+        payload = json.loads(content)
+
+        if payload.get('train'):
+            nn.train(payload['trainArray'])
+            nn.save()
+        elif payload.get('predict'):
+            try:
+                response = {
+                    "type":"test",
+                    "result":nn.predict(str(payload['image']))
+                }
+            except:
+                response_code = 500
+        else:
+            response_code = 400
+
+        s.send_response(response_code)
+        s.send_header("Content-type", "application/json")
+        s.send_header("Access-Control-Allow-Origin", "*")
+        s.end_headers()
+        if response:
+            s.wfile.write(json.dumps(response))
+        return
+\end{verbatim}
+
+\aosasectii{Designing a Feedforward ANN
+(\texttt{neural\_network\_design.py})}\label{designing-a-feedforward-ann-neuralux5fnetworkux5fdesign.py}
+
+\label{sec.ocr.feedforward} When designing a feedforward ANN, there are
+a few factors we must consider. The first is what activation function to
+use. We mentioned activation functions earlier as the decision-maker for
+a node's output. The type of decision an activation function makes will
+help us decide which one to use. In our case, we will be designing an
+ANN that outputs a value between 0 and 1 for each digit (0-9). Values
+closer to 1 would mean the ANN predicts this is the drawn digit and
+values closer to 0 would mean it's predicted to not be the drawn digit.
+Thus, we want an activation function whose outputs are either close to 0
+or close to 1. We also need a function that is differentiable because we
+will need the derivative for our backpropagation computation. A commonly
+used function in this case is the sigmoid because it satisfies both
+these constraints. StatSoft provides a
+\href{http://www.fmi.uni-sofia.bg/fmi/statist/education/textbook/eng/glosa.html}{nice
+list} of common activation functions and their properties.
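+
+For reference, the sigmoid and its derivative, which the backpropagation
+computation will need later, are:
+
+\[
+f(z) = \frac{1}{1 + e^{-z}}, \qquad f'(z) = f(z)\,(1 - f(z))
+\]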
+
+A second factor to consider is whether we want to include biases. We've
+mentioned biases a couple of times before but haven't really talked
+about what they are or why we use them. Let's try to understand this by
+going back to how the output of a node is computed in
+\aosafigref{500l.ocr.ann}. If we had a single input node and a single
+output node, our output formula would be $y = f(wx)$, where $y$ is the
+output, $f()$ is the activation function, $w$ is the weight for the link
+between the nodes, and $x$ is the variable input for the node. The bias
+is essentially a node whose output is always $1$. This would change the
+output formula to $y = f(wx + b)$ where $b$ is the weight of the
+connection between the bias node and the next node. If we consider $w$
+and $b$ as constants and $x$ as a variable, then adding a bias adds a
+constant term to the linear input of $f(.)$.
+
+Adding the bias therefore allows for a shift in the $y$-intercept and in
+general gives more flexibility for the output of a node. It's often good
+practice to include biases, especially for ANNs with a small number of
+inputs and outputs. Biases allow for more flexibility in the output of
+the ANN and thus provide the ANN with more room for accuracy. Without
+biases, we're less likely to make correct predictions with our ANN or
+would need more hidden nodes to make more accurate predictions.
+
+Other factors to consider are the number of hidden layers and the number
+of hidden nodes per layer. For larger ANNs with many inputs and outputs,
+these numbers are decided by trying different values and testing the
+network's performance. In this case, the performance is measured by
+training an ANN of a given size and seeing what percentage of the
+validation set is classified correctly. In most cases, a single hidden
+layer is sufficient for decent performance, so we only experiment with
+the number of hidden nodes here.
+
+\begin{verbatim}
+# Try various numbers of hidden nodes and see what performs best
+for i in xrange(5, 50, 5):
+    nn = OCRNeuralNetwork(i, data_matrix, data_labels, train_indices, False)
+    performance = str(test(data_matrix, data_labels, test_indices, nn))
+    print "{i} Hidden Nodes: {val}".format(i=i, val=performance)
+\end{verbatim}
+
+Here we initialize ANNs with 5 to 45 hidden nodes, in increments of 5.
+We then call the \texttt{test()} function.
+
+\begin{verbatim}
+def test(data_matrix, data_labels, test_indices, nn):
+    avg_sum = 0
+    for j in xrange(100):
+        correct_guess_count = 0
+        for i in test_indices:
+            test = data_matrix[i]
+            prediction = nn.predict(test)
+            if data_labels[i] == prediction:
+                correct_guess_count += 1
+
+        avg_sum += (correct_guess_count / float(len(test_indices)))
+    return avg_sum / 100
+\end{verbatim}
+
+The inner loop is counting the number of correct classifications which
+are then divided by the number of attempted classifications at the end.
+This gives a ratio or percentage accuracy for the ANN. Since each time
+an ANN is trained, its weights may be slightly different, we repeat this
+process 100 times in the outer loop so we can take an average of this
+particular ANN configuration's accuracy. In our case, a sample run of
+\texttt{neural\_network\_design.py} looks like the following:
+
+\begin{verbatim}
+PERFORMANCE
+-----------
+5 Hidden Nodes: 0.7792
+10 Hidden Nodes: 0.8704
+15 Hidden Nodes: 0.8808
+20 Hidden Nodes: 0.8864
+25 Hidden Nodes: 0.8808
+30 Hidden Nodes: 0.888
+35 Hidden Nodes: 0.8904
+40 Hidden Nodes: 0.8896
+45 Hidden Nodes: 0.8928
+\end{verbatim}
+
+From this output we can conclude that 15 hidden nodes would be optimal.
+Adding 5 nodes from 10 to 15 gets us \textasciitilde{}1\% more accuracy,
+whereas improving the accuracy by another 1\% would require adding
+another 20 nodes. Increasing the hidden node count also increases
+computational overhead. So it would take networks with more hidden nodes
+longer to be trained and to make predictions. Thus we choose to use the
+last hidden node count that resulted in a dramatic increase in accuracy.
+Of course, it's possible when designing an ANN that computational
+overhead is no problem and the top priority is to have the most accurate
+ANN possible. In that case it would be better to choose 45 hidden nodes
+instead of 15.
+
+\aosasectii{Core OCR Functionality}\label{core-ocr-functionality}
+
+In this section we'll talk about how the actual training occurs via
+backpropagation, how we can use the network to make predictions, and
+other key design decisions for core functionality.
+
+\aosasectiii{Training via Backpropagation
+(\texttt{ocr.py})}\label{training-via-backpropagation-ocr.py}
+
+The backpropagation algorithm, briefly mentioned earlier, is used to
+train our ANN. It consists of 4 main steps that are repeated for every
+sample in the training set, updating the ANN weights each time.
+
+First, we initialize the weights to small (between -1 and 1) random
+values. In our case, we initialize them to values between -0.06 and 0.06
+and store them in matrices \texttt{theta1}, \texttt{theta2},
+\texttt{input\_layer\_bias}, and \texttt{hidden\_layer\_bias}. Since
+every node in a layer links to every node in the next layer, we can
+create a matrix that has $m$ rows and $n$ columns, where $n$ is the
+number of nodes in one layer and $m$ is the number of nodes in the
+adjacent layer. This matrix represents all the weights for the links
+between these two layers. Here \texttt{theta1} has 400 columns for our
+20x20 pixel inputs and \texttt{num\_hidden\_nodes} rows. Likewise,
+\texttt{theta2} represents the links between the hidden layer and output
+layer. It has \texttt{num\_hidden\_nodes} columns and
+\texttt{NUM\_DIGITS} (\texttt{10}) rows. The other two vectors (1 row),
+\texttt{input\_layer\_bias} and \texttt{hidden\_layer\_bias}, represent
+the biases.
+
+\begin{verbatim}
+    def _rand_initialize_weights(self, size_in, size_out):
+        return [((x * 0.12) - 0.06) for x in np.random.rand(size_out, size_in)]
+\end{verbatim}
+
+\begin{verbatim}
+    self.theta1 = self._rand_initialize_weights(400, num_hidden_nodes)
+    self.theta2 = self._rand_initialize_weights(num_hidden_nodes, 10)
+    self.input_layer_bias = self._rand_initialize_weights(1,
+                                                          num_hidden_nodes)
+    self.hidden_layer_bias = self._rand_initialize_weights(1, 10)
+\end{verbatim}
+
+The second step is \emph{forward propagation}, which is essentially
+computing the node outputs as described in \aosasecref{sec.ocr.ann},
+layer by layer starting from the input nodes. Here, \texttt{y0} is an
+array of size 400 with the inputs we wish to use to train the ANN. We
+multiply \texttt{theta1} by \texttt{y0} transposed so that we have two
+matrices with sizes \texttt{(num\_hidden\_nodes x 400) * (400 x 1)} and
+have a resulting vector of outputs for the hidden layer of size
+\texttt{num\_hidden\_nodes}. We then add the bias vector and apply the
+vectorized sigmoid activation function to this output vector, giving us
+\texttt{y1}. \texttt{y1} is the output vector of our hidden layer. The
+same process is repeated again to compute \texttt{y2} for the output
+nodes. \texttt{y2} is now our output layer vector with values
+representing the likelihood that their index is the drawn number. For
+example, if someone draws an 8, the value of \texttt{y2} at the 8th
+index will be the largest if the ANN has made the correct prediction.
+However, a 6 may have a higher likelihood than a 1 of being the drawn
+digit, since a 6 looks more similar to an 8 and is more likely to
+overlap the same drawn pixels as the 8. \texttt{y2} becomes more
+accurate with each additional drawn digit the ANN is trained with.
+
+\begin{verbatim}
+    # The sigmoid activation function. Operates on scalars.
+    def _sigmoid_scalar(self, z):
+        return 1 / (1 + math.e ** -z)
+\end{verbatim}
+
+\begin{verbatim}
+    y1 = np.dot(np.mat(self.theta1), np.mat(data['y0']).T)
+    sum1 = y1 + np.mat(self.input_layer_bias) # Add the bias
+    y1 = self.sigmoid(sum1)
+
+    y2 = np.dot(np.array(self.theta2), y1)
+    y2 = np.add(y2, self.hidden_layer_bias) # Add the bias
+    y2 = self.sigmoid(y2)
+\end{verbatim}
+
+The third step is \emph{back propagation}, which involves computing the
+errors at the output nodes then at every intermediate layer back towards
+the input. Here we start by creating an expected output vector,
+\texttt{actual\_vals}, with a \texttt{1} at the index of the digit that
+represents the value of the drawn digit and \texttt{0}s otherwise. The
+vector of errors at the output nodes, \texttt{output\_errors}, is
+computed by subtracting the actual output vector, \texttt{y2}, from
+\texttt{actual\_vals}. For every hidden layer afterwards, we compute two
+components. First, we have the next layer's transposed weight matrix
+multiplied by its output errors. Then we have the derivative of the
+activation function applied to the previous layer. We then perform an
+element-wise multiplication on these two components, giving a vector of
+errors for a hidden layer. Here we call this \texttt{hidden\_errors}.
+
+\begin{verbatim}
+    actual_vals = [0] * 10
+    actual_vals[data['label']] = 1
+    output_errors = np.mat(actual_vals).T - np.mat(y2)
+    hidden_errors = np.multiply(np.dot(np.mat(self.theta2).T, output_errors),
+                                self.sigmoid_prime(sum1))
+\end{verbatim}
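+
+In matrix form, writing $t$ for the expected output vector
+\texttt{actual\_vals}, $s_1$ for the pre-activation sum \texttt{sum1},
+and $\circ$ for element-wise multiplication, these two computations are:
+
+\[
+\delta_{\mathrm{out}} = t - y_2, \qquad
+\delta_{\mathrm{hidden}} = \left(\theta_2^{\mathsf{T}}\,\delta_{\mathrm{out}}\right) \circ f'(s_1)
+\]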
+
+The fourth and final step is the \emph{weight update}, which adjusts the
+ANN weights based on the errors computed earlier. Weights are updated at
+each layer via matrix multiplication. The error matrix at each layer is
+multiplied by the output matrix of the previous layer. This product is
+then multiplied by a scalar called the learning rate and added to the
+weight matrix. The learning rate is a value between 0 and 1 that
+influences the speed and accuracy of learning in the ANN. Larger
+learning rate values will generate an ANN that learns quickly but is
+less accurate, while smaller values will generate an ANN that learns
+more slowly but is more accurate. In our case, we have a relatively
+small learning rate of 0.1. This works well since we do not need the ANN
+to be immediately trained in order for a user to continue making train
+or predict requests. Biases are updated by simply multiplying the
+learning rate by the layer's error vector.
+
+\begin{verbatim}
+    self.theta1 += self.LEARNING_RATE * np.dot(np.mat(hidden_errors),
+                                               np.mat(data['y0']))
+    self.theta2 += self.LEARNING_RATE * np.dot(np.mat(output_errors),
+                                               np.mat(y1).T)
+    self.hidden_layer_bias += self.LEARNING_RATE * output_errors
+    self.input_layer_bias += self.LEARNING_RATE * hidden_errors
+\end{verbatim}
+
+\aosasectiii{Testing a Trained Network
+(\texttt{ocr.py})}\label{testing-a-trained-network-ocr.py}
+
+Once an ANN has been trained via backpropagation, it is fairly
+straightforward to use it for making predictions. As we can see here, we
+start by computing the output of the ANN, \texttt{y2}, exactly the way
+we did in step 2 of backpropagation. Then we look for the index in the
+vector with the maximum value. This index is the digit predicted by the
+ANN.
+
+\begin{verbatim}
+    def predict(self, test):
+        y1 = np.dot(np.mat(self.theta1), np.mat(test).T)
+        y1 = y1 + np.mat(self.input_layer_bias) # Add the bias
+        y1 = self.sigmoid(y1)
+
+        y2 = np.dot(np.array(self.theta2), y1)
+        y2 = np.add(y2, self.hidden_layer_bias) # Add the bias
+        y2 = self.sigmoid(y2)
+
+        results = y2.T.tolist()[0]
+        return results.index(max(results))
+\end{verbatim}
+
+\aosasectiii{Other Design Decisions
+(\texttt{ocr.py})}\label{other-design-decisions-ocr.py}
+
+Many resources are available online that go into greater detail on the
+implementation of backpropagation. One good resource is from a
+\href{http://www.willamette.edu/~gorr/classes/cs449/backprop.html}{course
+at Willamette University}. It goes over the steps of backpropagation and
+then explains how it can be translated into matrix form. While the
+amount of computation using matrices is the same as using loops, the
+benefit is that the code is simpler and easier to read with fewer nested
+loops. As we can see, the entire training process is written in under 25
+lines of code using matrix algebra.
+
+As mentioned in the introduction of \aosasecref{sec.ocr.decisions},
+persisting the weights of the ANN means we do not lose the progress made
+in training it when the server is shut down or abruptly goes down for
+any reason. We persist the weights by writing them as JSON to a file. On
+startup, the OCR loads the ANN's saved weights to memory. The save
+function is not called internally by the OCR; it is up to the server to
+decide when to perform a save. In our case, the server saves the weights
+after each update. This is a quick and simple solution, but it is not
+optimal since writing to disk is time-consuming. This also prevents us
+from handling multiple concurrent requests since there is no mechanism
+to prevent simultaneous writes to the same file.
+In a more sophisticated server, saves could perhaps be done on shutdown
+or once every few minutes with some form of locking or a timestamp
+protocol to ensure no data loss.
+
+\begin{verbatim}
+    def save(self):
+        if not self._use_file:
+            return
+
+        json_neural_network = {
+            "theta1": [np_mat.tolist()[0] for np_mat in self.theta1],
+            "theta2": [np_mat.tolist()[0] for np_mat in self.theta2],
+            "b1": self.input_layer_bias[0].tolist()[0],
+            "b2": self.hidden_layer_bias[0].tolist()[0]
+        }
+        with open(OCRNeuralNetwork.NN_FILE_PATH, 'w') as nnFile:
+            json.dump(json_neural_network, nnFile)
+
+    def _load(self):
+        if not self._use_file:
+            return
+
+        with open(OCRNeuralNetwork.NN_FILE_PATH) as nnFile:
+            nn = json.load(nnFile)
+            self.theta1 = [np.array(li) for li in nn['theta1']]
+            self.theta2 = [np.array(li) for li in nn['theta2']]
+            self.input_layer_bias = [np.array(nn['b1'][0])]
+            self.hidden_layer_bias = [np.array(nn['b2'][0])]
+\end{verbatim}
+
+\aosasecti{Conclusion}\label{conclusion}
+
+Now that we've learned about AI, ANNs, backpropagation, and building an
+end-to-end OCR system, let's recap the highlights of this chapter and
+the big picture.
+
+We started off the chapter by giving background on AI, ANNs, and roughly
+what we would be implementing. We discussed what AI is and examples of
+how it's used. We saw that AI is essentially a set of algorithms or
+problem-solving approaches that can provide an answer to a question in a
+similar manner as a human would. We then took a look at the structure of
+a feedforward ANN. We learned that computing the output at a given node
+was as simple as summing the products of the outputs of the previous
+nodes and their connecting weights. We talked about how to use an ANN by
+first formatting the input and partitioning the data into training and
+validation sets.
+
+Once we had some background, we started talking about creating a
+web-based, client-server system that would handle user requests to train
+or test the OCR. We then discussed how the client would interpret the
+drawn pixels into an array and perform an HTTP request to the OCR server
+to perform the training or testing. We discussed how our simple server
+read requests and how to design an ANN by testing performance of several
+hidden node counts. We finished off by going through the core training
+and testing code for backpropagation.
+
+Although we've built a seemingly functional OCR system, this chapter
+simply scratches the surface of how a real OCR system might work. More
+sophisticated OCR systems could pre-process their inputs, use hybrid ML
+algorithms, have more extensive design phases, or apply further
+optimizations.
+
+\end{aosachapter}
diff --git a/tex/sampler.tex b/tex/sampler.tex
index 001ed15d5..0f70ce69f 100644
--- a/tex/sampler.tex
+++ b/tex/sampler.tex
@@ -484,12 +484,12 @@
 \aosasectiii{Working with Log Values}\label{working-with-log-values}
 
 Before getting into the actual code needed to implement the equation
-above, I want to emphasize one of the \emph{the most important design
-decisions} when writing code with probabilities: working with log
-values. What this means is that rather than working directly with
-probabilities $p(x)$, we should be working with
-\emph{log}-probabilities, $\log{p(x)}$. This is because probabilities
-can get very small very quickly, resulting in underflow errors.
+above, I want to emphasize one of the most important design
+decisions when writing code with probabilities: working with log values.
+What this means is that rather than working directly with probabilities +$p(x)$, we should be working with \emph{log}-probabilities, +$\log{p(x)}$. This is because probabilities can get very small very +quickly, resulting in underflow errors. To motivate this, consider that probabilities must range between 0 and 1 (inclusive). NumPy has a useful function, \texttt{finfo}, that will tell @@ -1116,7 +1116,7 @@ the equation: \[ -\sum_{{item}_1, \ldots{}, {item}_m}p({damage}\ |\ {item}_1,\ldots{},{item}_m)p({item}_1)\cdots{}p({item}_m) +\sum_{{item}_1, \ldots{}, {item}_m} p(\mathrm{damage} \vert \mathrm{item}_1,\ldots{},\mathrm{item}_m)p(\mathrm{item}_1)\cdots{}p(\mathrm{item}_m) \] What this equation says is that we would need to compute the probability