-
The detection system makes a few assumptions about the image. It assumes that a column of questions has at most 39 questions with 5 possible answers each, and that there is at least 1 fully filled column. It also assumes that there is some amount of irrelevant information at the top of the form to crop out; if the form were nothing but question boxes from top to bottom, many would be cut off. These assumptions hold for all of the images given, but might not necessarily hold for others.
-
The detection process has several steps: image transformations to filter out irrelevant information and simplify shapes, line detection to find the borders of the question columns, corner detection using those lines to find the corners of the question columns, loop finding by traveling through lines and corners to identify the lines that make up the border of each question column, and finally using the assumed question count to estimate the size of a single question box and, from that, the location of every box.
-
The image transformation step is a combination of blur filters, followed by combinations of min and max filters, and then an edge detection filter. This specific combination of filters was found through experimentation to remove irrelevant artifacts in the image and to consolidate the question boxes into 3 large rectangles, 1 for each question column. The goal was to use min filters to remove artifacts, max filters to consolidate question boxes, then min filters again to shrink things back to near their original size. The idea is that anything removed completely from the image will not be restored by max filters, and empty spaces filled in between pixels by max filters will not be restored by min filters. After edge detection we are left with 3 rectangular boxes made up of 4 lines each. I then move on to line detection, where I detect vertical and horizontal lines separately.
Some of the images below are converted to binary to improve visibility:
Image After Blur:
Image After First Min Filter:
Image After Max Filter:
Image After Second Min Filter:
Image After Edge Detection:
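As a rough sketch of this preprocessing step, the chain might look like the following in Python with Pillow. The blur radius and min/max kernel sizes here are illustrative assumptions; the report does not list the values found through experimentation.

```python
from PIL import Image, ImageFilter, ImageOps

def preprocess(path):
    img = Image.open(path).convert("L")
    img = ImageOps.invert(img)                     # white marks on black
    img = img.filter(ImageFilter.GaussianBlur(2))  # soften scan noise
    img = img.filter(ImageFilter.MinFilter(5))     # erode away small artifacts
    img = img.filter(ImageFilter.MaxFilter(9))     # merge boxes into one blob
                                                   # per question column
    img = img.filter(ImageFilter.MinFilter(5))     # shrink blobs back toward
                                                   # their original size
    img = img.filter(ImageFilter.FIND_EDGES)       # keep only blob outlines
    return img.point(lambda p: 255 if p > 64 else 0)  # binarize for line detection
```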
-
Line detection is done by scanning through all of the pixels in the image until a white pixel is found (the background is black and the lines are white). After finding a white pixel I start looking for a line. Iteration through pixels continues as normal, but now I count the number of consecutive iterations in which I am unable to find a white pixel, and if this count exceeds some threshold the line has reached its end. This allows small gaps in the line to be bridged. If the number of pixels in the line exceeds some threshold, it is decided to be an actual line and added to the detected lines.

There are a couple of additional complexities added on to this general process. First, when looking for white pixels past the first, I allow a certain amount of forgiveness in the dimension opposite to the one I am searching along; horizontal lines, for example, can vary somewhat in the vertical dimension. This means the lines do not have to be perfectly straight. It works by saving the position of the last white pixel found and, when looking for the next white pixel, also checking a little way in both directions of the opposite dimension; if a white pixel is found there, the algorithm proceeds as if a white pixel were detected normally, but the saved position is updated to this new pixel's location. This lets lines drift quite a bit in the opposite dimension over their length, as long as there is a string of pixels loosely connecting them.

The second complexity is detecting whether a similar line was already seen. By comparing the start and end positions of two lines I detect whether they overlap. If so, I keep the longest among the two lines and a new line formed by combining them. It may seem like the combined line will always be the longest, but sometimes the second line lies completely within the bounds of the first, so a line formed from the first line's start position to the second line's end position is actually shorter. After all lines are detected I run this combining process once more over the full set to consolidate lines. Finally, any lines close to the edge of the image are ignored, because the edge of the image tends to be treated as an edge by the edge detection, producing spurious lines there.
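A minimal sketch of the horizontal scan under these rules, assuming a binary array with white (255) lines on a black (0) background; GAP, WOBBLE, and MIN_LEN are illustrative thresholds, not the tuned values:

```python
import numpy as np

GAP, WOBBLE, MIN_LEN = 5, 2, 100

def horizontal_lines(img):
    h, w = img.shape
    lines = []
    for y in range(h):
        x = 0
        while x < w:
            if img[y, x] != 255:
                x += 1
                continue
            start, end, cy, gap = x, x, y, 0
            while x < w and gap <= GAP:          # bridge small breaks
                for dy in sorted(range(-WOBBLE, WOBBLE + 1), key=abs):
                    if 0 <= cy + dy < h and img[cy + dy, x] == 255:
                        cy += dy                 # follow the drifting line
                        end, gap = x, 0
                        break
                else:
                    gap += 1
                x += 1
            if end - start >= MIN_LEN:
                lines.append((start, end, y))    # duplicates found on nearby
                                                 # rows are merged in a later pass
    return lines
```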
-
After the lines are detected I compare all horizontal lines to all vertical lines to see if they intersect. This intersection test has some forgiveness: if the lines would intersect were each extended a bit further in its own direction, it still counts as an intersection. These intersection points are the various corners of the question columns.
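A sketch of this forgiving test, assuming a horizontal line is stored as (x1, x2, y) and a vertical line as (y1, y2, x); SLACK is an illustrative value for how far each line is virtually extended:

```python
SLACK = 10

def corner(hline, vline):
    hx1, hx2, hy = hline
    vy1, vy2, vx = vline
    # Intersect if each line, stretched by SLACK, crosses the other.
    if hx1 - SLACK <= vx <= hx2 + SLACK and vy1 - SLACK <= hy <= vy2 + SLACK:
        return (vx, hy)            # corner at the (extended) crossing point
    return None
```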
-
As corners are detected they are added to 2 dictionaries that function as graphs: one maps edge keys to the corners they are part of, and the other maps corner keys to the edges that form them. After all corners are detected and these graphs are populated, a third graph is created mapping corners to adjacent corners. This is done by taking a corner, traversing its edges, and from those edges traversing the corners they connect to; all of these corners are directly adjacent to the start corner, since they are 1 edge away. This process repeats until all edges have been traversed, leaving a full graph from corners to adjacent corners. With this graph, loops can be detected by traversing from corner to corner, without going backwards, until the starting corner is reached again. If all corners are traversed without finding a loop, then there is no loop.
Lines and corners detected to be in loops:
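A sketch of the graph construction and loop walk under these definitions; here `corner_edges` maps each corner to the pair of lines that formed it (the second dictionary in the text), and the other graphs are derived from it:

```python
from collections import defaultdict

def build_adjacency(corner_edges):
    edge_corners = defaultdict(list)             # edge -> corners on it
    for corner, edges in corner_edges.items():
        for edge in edges:
            edge_corners[edge].append(corner)
    adjacent = defaultdict(set)                  # corner -> adjacent corners
    for corners in edge_corners.values():
        for c in corners:                        # corners one edge apart
            adjacent[c].update(o for o in corners if o != c)
    return adjacent

def find_loop(adjacent, start):
    # Walk corner to corner without reversing until we return to the start.
    path, prev, cur = [start], None, start
    while True:
        nxt = next((c for c in adjacent[cur] if c != prev), None)
        if nxt is None or nxt in path[1:]:
            return None                          # dead end or stray cycle
        if nxt == start:
            return path                          # closed loop found
        path.append(nxt)
        prev, cur = cur, nxt
```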
-
Now that I have these corner loops I can define boxes from their corners. These boxes are the columns of questions. The column with the largest height is the one assumed to hold 39 questions with 5 answers each, so dividing its dimensions gives an estimated question box size. Then, using the height of each column, I can find how many questions it has. With this information I estimate the center of every answer box by stepping the estimated width 4 times per question and stepping the estimated height once per question. These locations are then saved and output to be used in my teammates' work.
Detected Box Centers:
The color of the centers changes as the question number increases, showing that they are also detected in the correct order.
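A sketch of this estimation step, assuming each detected column box is an (x, y, width, height) tuple and the tallest column holds the full 39 questions of 5 answers each:

```python
FULL_QUESTIONS, ANSWERS = 39, 5

def box_centers(columns):
    tallest = max(columns, key=lambda box: box[3])
    q_height = tallest[3] / FULL_QUESTIONS       # one question's height
    a_width = tallest[2] / ANSWERS               # one answer box's width
    centers = []
    for x, y, w, h in columns:
        questions = round(h / q_height)          # questions in this column
        for q in range(questions):
            cy = y + (q + 0.5) * q_height
            centers.append([(x + (a + 0.5) * a_width, cy)
                            for a in range(ANSWERS)])
    return centers
```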
-
As a part of preprocessing, the image is first converted to an inverted binary image, and min and max filters (applied by my teammate to improve performance) filter out all image artifacts other than the filled boxes. The intuition behind detecting the marked answers is that the number of pixels in a box should exceed a certain threshold (150 pixels) for it to be considered marked. Using the locations provided by my teammates' work above, I count the number of pixels in each box of each question. If the box with the maximum number of pixels in a question exceeds 150, I consider the question marked and examine the boxes further; otherwise I assume the question is left unmarked. Once I establish that a question is marked, I take the box with the maximum pixel count as the first marked box. Then, for a possible second or third marked box, I check whether any boxes have a pixel count within plus or minus 30 (initially a larger number, which my teammate tuned for better performance) of the first marked box's count, or a pixel count above the threshold, indicating the presence of more than one marked answer. All boxes considered marked are stored in a list of marked answers and presented in the output.
Preprocessed Image:
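A minimal sketch of this test over the inverted binary image; the 150 and 30 thresholds are the values given above, while the box width and height are assumed to come from the detection step:

```python
import numpy as np

MARK_THRESHOLD, NEAR = 150, 30

def pixel_count(img, center, bw, bh):
    # Count white pixels in the box around a detected center.
    cx, cy = center
    x0, y0 = int(cx - bw / 2), int(cy - bh / 2)
    return int(np.count_nonzero(img[y0:y0 + int(bh), x0:x0 + int(bw)]))

def marked_boxes(img, question_centers, bw, bh):
    counts = [pixel_count(img, c, bw, bh) for c in question_centers]
    best = max(counts)
    if best <= MARK_THRESHOLD:
        return []                                # question left unmarked
    return [i for i, n in enumerate(counts)      # the top box, plus any box
            if n > MARK_THRESHOLD                # over the threshold or
            or abs(n - best) <= NEAR]            # within +/-30 of the top
```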
-
The intuition behind detecting handwritten characters is to check whether the number of pixels to the left of the first box detected for each question exceeds a certain threshold. By trial and error, I was able to roughly figure out the location of the handwritten characters in the image, which was 150 pixels to the left. The area obtained by shifting the view 150 pixels to the left of the first box can be treated as the area of any handwritten characters. Again, I count the pixels in this area, and if the number is greater than 85 pixels, I conclude that a handwritten character is present, which I record by adding the index of the question to a list. The threshold of 85 was obtained by trial and error. Finally, when displaying the output, questions with handwritten characters are identified with an X after the marked answers detected by the system.
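A sketch of this check; the 150 px shift and 85 px threshold are from the text, while reusing the answer box size for the window is an assumption, since the window dimensions are not given:

```python
import numpy as np

SHIFT, INK_THRESHOLD = 150, 85

def has_handwriting(img, first_box_center, bw, bh):
    cx, cy = first_box_center
    x0 = max(int(cx - SHIFT - bw / 2), 0)        # window left of the first box
    y0 = int(cy - bh / 2)
    region = img[y0:y0 + int(bh), x0:x0 + int(bw)]
    return int(np.count_nonzero(region)) > INK_THRESHOLD
```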
-
Once I obtain the marked answers from the above steps, the boxes considered answers are highlighted in the original image by drawing their perimeters in green with a thickness of 5 pixels.
Original image with marked answers highlighted:
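A minimal sketch of this highlighting step, using Pillow (an assumption; any drawing library would do) and the estimated box size:

```python
from PIL import Image, ImageDraw

def highlight(img, marked_centers, bw, bh):
    draw = ImageDraw.Draw(img)                   # img: the original RGB image
    for cx, cy in marked_centers:
        draw.rectangle([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2],
                       outline=(0, 255, 0), width=5)   # green, 5 px thick
    return img
```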
-
The major improvements that could be made would be to eliminate the assumptions. The assumption about the number of questions and answers could be removed by detecting one of the boxes within the columns and using its size to find the number of questions and answers. This could possibly be done by running an earlier edge detection, with the boxes still formed, after finding the bounds of the columns, and using those bounds to isolate a single answer box. The other assumption could potentially be removed by leaving the top of the form included and using the fact that the answer columns are more vertical than horizontal to ignore the artifacts at the top, which are more horizontal than vertical. This, however, introduces a new assumption. The last improvement that could be made is speed. The line detection is quite expensive and could probably be done more efficiently. The convolutions at the beginning are also fairly expensive, and perhaps the same effect could be achieved with fewer of them.
-
We decided on the simplest approach, which was to embed a barcode at the bottom of the answer sheet. Each answer is one-hot encoded; for example, if the answer is "AD", it is represented by the binary string 10010. Each answer sheet has 85 answers, which makes a binary string 425 (85 * 5) bits in length.
When embedding the binary as a barcode, each 0 is a black line and each 1 is a white line, and each barcode has '1010' on each side to help the extraction step identify it.
The extraction is very simple: since we know the exact coordinates of where the barcode should be in the image, we can ballpark where to look in the answer sheet and work on one single line of pixels to identify the barcode binary and convert it back to actual answers.
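A sketch of the encode/decode pair under this scheme; the one-hot encoding, 425-bit length, and '1010' markers are from the text, while BAR_W (pixels per bar) and the assumption that the sampled row starts exactly at the barcode's left edge are illustrative:

```python
LETTERS, QUESTIONS = "ABCDE", 85
MARKER, BAR_W = "1010", 3

def encode(answers):
    # answers: 85 strings like "AD" ("" for an unanswered question).
    bits = "".join("".join("1" if l in a else "0" for l in LETTERS)
                   for a in answers)             # one-hot: "AD" -> "10010"
    return MARKER + bits + MARKER

def decode(row):
    # row: one line of pixel values sampled across the embedded barcode;
    # white bars read as 1, black bars as 0.
    bits = "".join("1" if p > 127 else "0" for p in row[::BAR_W])
    core = bits[len(MARKER):len(MARKER) + QUESTIONS * len(LETTERS)]
    return ["".join(l for l, b in zip(LETTERS, core[i:i + len(LETTERS)])
                    if b == "1")
            for i in range(0, len(core), len(LETTERS))]
```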
-
Our program is able to achieve 100% accuracy on the images given. We wrote ground truths for all of the images other than the blank form, and also verified that the output for the blank form is in fact empty. This by no means makes it perfect. I anticipate that a form that slightly violates the assumptions, such as having 40 instead of 39 questions in a column, will break it. A form with artifacts that cause horizontal offsets, like a vertical line on the left side of the form near the question column, could also result in some inaccuracies, because the system for dealing with offset centers is skewed towards the vertical, since the offsets tend to be vertical. However, there are a few instances of large horizontal offsets that do work out, so it is possible it would work. The program takes approximately 1 minute to run per image; I tried to get the run time down as much as possible, but could not do any better.