CAPTCHAs (Completely Automated Public Turing test to tell Computers and Humans Apart) are a popular way of preventing bots from attempting to log on to systems by exhaustively searching the password space. In its traditional form, a CAPTCHA is an image containing a few characters (sometimes with some obfuscation thrown in). The challenge is to identify what those characters are and in what order. In this project, we wish to crack these challenges.
I have developed a computer program that automatically solves traditional CAPTCHA challenges by identifying characters in obfuscated images using image processing and machine learning.
- Initial image
- Find the most frequent colour among the corner pixels and take it as the background colour of the image
- Change the background to white
- Dilate the image to remove the stray lines
- Convert the image to grayscale
- Segment the image into 3 characters
- Iterate over the columns of the image
- Count the number of non-white pixels in each column
- This gives the start and end coordinates of each character
- We use a window size of 30 pixels to detect characters and ignore any noise (small stray lines) remaining after dilation
- Get the bounding box of the three characters
- For 37 of the 2000 images, this method was not able to segment the image into three characters.
- For those images, we divide the image equally into three segments of size 150x150, leaving a margin of 15 pixels at the beginning and 10 pixels for each of the following two.
- Extract each character from the image using the bounding boxes obtained above.
- Resize each character image from 150x150 to 30x30 pixels
- Flatten the image into a 1D array
- This is the feature vector of one character
- Using a dictionary, we encode each character as a numeric label from 0 to 23: {'ALPHA': 0, 'BETA': 1, 'CHI': 2, 'DELTA': 3, 'EPSILON': 4, 'ETA': 5, 'GAMMA': 6, 'IOTA': 7, 'KAPPA': 8, 'LAMDA': 9, 'MU': 10, 'NU': 11, 'OMEGA': 12, 'OMICRON': 13, 'PHI': 14, 'PI': 15, 'PSI': 16, 'RHO': 17, 'SIGMA': 18, 'TAU': 19, 'THETA': 20, 'UPSILON': 21, 'XI': 22, 'ZETA': 23}
- Using the inverse dictionary, we decode each numeric label back to its character: {0: 'ALPHA', 1: 'BETA', 2: 'CHI', 3: 'DELTA', 4: 'EPSILON', 5: 'ETA', 6: 'GAMMA', 7: 'IOTA', 8: 'KAPPA', 9: 'LAMDA', 10: 'MU', 11: 'NU', 12: 'OMEGA', 13: 'OMICRON', 14: 'PHI', 15: 'PI', 16: 'PSI', 17: 'RHO', 18: 'SIGMA', 19: 'TAU', 20: 'THETA', 21: 'UPSILON', 22: 'XI', 23: 'ZETA'}
- As a first attempt, we used Logistic Regression with 5000 iterations.
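
The background-detection and column-based segmentation steps above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the project's actual code. It makes several assumptions: the image is a list of rows of single-channel pixel values with white equal to 255, the "window size" is read as the minimum width of a run of non-empty columns, and the fallback margins are read as gaps placed before each 150-pixel segment. All function names (`background_colour`, `whiten_background`, `segment_columns`, `fallback_spans`) are hypothetical.

```python
from collections import Counter

WHITE = 255  # assumed value of a white pixel in a single-channel image


def background_colour(img):
    """Return the most frequent value among the four corner pixels."""
    h, w = len(img), len(img[0])
    corners = [img[0][0], img[0][w - 1], img[h - 1][0], img[h - 1][w - 1]]
    return Counter(corners).most_common(1)[0][0]


def whiten_background(img, bg):
    """Replace every pixel equal to the background colour with white."""
    return [[WHITE if px == bg else px for px in row] for row in img]


def column_counts(img):
    """Count the non-white pixels in each column."""
    width = len(img[0])
    return [sum(1 for row in img if row[x] != WHITE) for x in range(width)]


def segment_columns(img, min_width=30):
    """Return (start, end) column spans of runs of non-empty columns.

    Runs narrower than min_width are treated as leftover noise and dropped
    (an assumed reading of the 30-pixel window). `end` is exclusive.
    """
    spans, start = [], None
    # A trailing 0 acts as a sentinel that closes a run touching the right edge.
    for x, count in enumerate(column_counts(img) + [0]):
        if count > 0 and start is None:
            start = x
        elif count == 0 and start is not None:
            if x - start >= min_width:
                spans.append((start, x))
            start = None
    return spans


def fallback_spans(size=150, first_margin=15, gap=10, n=3):
    """Fixed spans used when segmentation does not yield n characters.

    Assumed layout: a 15-pixel margin, then three 150-pixel segments
    with a 10-pixel margin before the second and third.
    """
    spans, x = [], first_margin
    for _ in range(n):
        spans.append((x, x + size))
        x += size + gap
    return spans


def character_spans(img, n=3):
    """Try column-based segmentation; fall back to fixed spans otherwise."""
    spans = segment_columns(whiten_background(img, background_colour(img)))
    return spans if len(spans) == n else fallback_spans(n=n)
```

Each returned span can then be used to crop one character, which is resized and flattened into the feature vector as described in the list. In practice a library such as OpenCV would typically handle the dilation, cropping and resizing steps, which this sketch leaves out.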