2. CAPTCHA
Completely Automated Public Turing Test to tell Computers and Humans Apart
Why are they interesting?
o Harder than normal text recognition
On par with handwriting recognition and reading damaged text
o Techniques translate well to other problems
Facial recognition (Gonzaga, 2002)
Weed identification (Yang, 2000)
o Near-infinite data sets
Easier to avoid over-fitting
3. Hypothesis
CAPTCHA recognition can be
accomplished to a high degree
of accuracy using machine
learning methods with minimal
preprocessing of inputs.
4. Methods
Tools
o JCaptcha
o Image Processing
Learning Methods
o Feed-forward Neural Nets
o Self-Organizing Maps
o K-Means
o Cluster Classification
Segmentation Methods
o Overlapping
o Whitespace
o K-Means
5. JCaptcha
o Open-source CAPTCHA
generation software
o Highly configurable
Can produce CAPTCHAs of
many levels of difficulty
o Check it out at:
http://jcaptcha.sourceforge.net
6. Image Processing
Sparse Image
Represents images as an unbounded set of pixels
Each pixel is a value between 0 and 1 plus a coordinate pair
Center each image before turning it into a matrix of 0s and 1s
(Figure: original image vs. after transformation)
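The centering step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name, the 20 x 20 grid size, and the 0.5 threshold are assumptions.

```python
def center_to_matrix(pixels, size=20, threshold=0.5):
    """Sketch of the slide's preprocessing: shift a sparse image so its
    centroid sits at the middle of a fixed-size grid, then threshold
    each pixel value to 0 or 1.

    pixels: iterable of (x, y, value) with value in [0, 1].
    """
    pixels = list(pixels)
    # Centroid of the sparse pixel set.
    cx = sum(x for x, _, _ in pixels) / len(pixels)
    cy = sum(y for _, y, _ in pixels) / len(pixels)
    grid = [[0] * size for _ in range(size)]
    for x, y, v in pixels:
        # Translate so the centroid lands at the grid center.
        gx = int(round(x - cx + size / 2))
        gy = int(round(y - cy + size / 2))
        if 0 <= gx < size and 0 <= gy < size:
            grid[gy][gx] = 1 if v >= threshold else 0
    return grid
```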
8. Self-Organizing Maps
Training
Initialize N buckets to random values
For each input
Find the bucket that is “closest” to the input
Adjust the “closest” bucket to more closely match the input using an exponential average
Collection
For many inputs
Sort each input into the bucket it most closely matches
For each bucket and each character
Calculate the probability of that character going into that bucket
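The two phases above can be sketched roughly like this. The names and parameters are hypothetical; the slides do not specify the distance metric or the averaging rate, so squared Euclidean distance and a fixed rate are assumed here.

```python
import random
from collections import Counter, defaultdict


def _closest(buckets, x):
    # Squared Euclidean distance (an assumption; the slide only says "closest").
    return min(range(len(buckets)),
               key=lambda b: sum((buckets[b][i] - x[i]) ** 2
                                 for i in range(len(x))))


def train_som(inputs, n_buckets, dim, rate=0.1, seed=0):
    """Training phase: random buckets, each input pulls its closest
    bucket toward itself via an exponential moving average."""
    rng = random.Random(seed)
    buckets = [[rng.random() for _ in range(dim)] for _ in range(n_buckets)]
    for x in inputs:
        best = _closest(buckets, x)
        buckets[best] = [(1 - rate) * buckets[best][i] + rate * x[i]
                         for i in range(dim)]
    return buckets


def label_buckets(buckets, labeled_inputs):
    """Collection phase: sort labeled inputs into their closest bucket
    and estimate P(character | bucket) from the counts."""
    counts = defaultdict(Counter)
    for x, char in labeled_inputs:
        counts[_closest(buckets, x)][char] += 1
    return {b: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for b, cnt in counts.items()}
```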
10. Overlapping Segmentation
• Divide the image into a fixed number of overlapping tiles of the same size
• In our case, 20 x 20 pixels with a 50% overlap
• Discard chunks under a certain size and chunks that are all white
(Figure note: This is a B with part of it cut off, not an E. Therein lies the rub.)
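The tiling step above can be sketched as follows, assuming the image is a 2D list of 0/1 pixels (function name and representation are illustrative; only the all-white check is shown, not the size filter):

```python
def overlapping_tiles(image, tile=20, overlap=0.5):
    """Sketch of overlapping segmentation: slide a tile-sized window
    across the image, stepping tile * (1 - overlap) pixels each time,
    and keep only chunks that contain some ink."""
    step = int(tile * (1 - overlap))  # 10-pixel step for 50% overlap
    h, w = len(image), len(image[0])
    chunks = []
    for y in range(0, h - tile + 1, step):
        for x in range(0, w - tile + 1, step):
            chunk = [row[x:x + tile] for row in image[y:y + tile]]
            if any(any(row) for row in chunk):  # discard all-white chunks
                chunks.append((x, y, chunk))
    return chunks
```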
11. Whitespace Segmentation
• Iterate through the image from left to right; segment when a full column of whitespace is encountered
• Works perfectly for well-spaced text
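The column scan above can be sketched like this, again assuming a 2D list of 0/1 pixels (hypothetical names; the real implementation is not shown in the slides):

```python
def whitespace_segments(image):
    """Sketch of whitespace segmentation: scan columns left to right and
    cut wherever a full column contains no ink.

    Returns a list of (start_col, end_col) half-open column spans.
    """
    w = len(image[0])
    segments, start = [], None
    for x in range(w):
        col_has_ink = any(row[x] for row in image)
        if col_has_ink and start is None:
            start = x                    # entering a character
        elif not col_has_ink and start is not None:
            segments.append((start, x))  # full white column: cut here
            start = None
    if start is not None:
        segments.append((start, w))      # character runs to the edge
    return segments
```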
17. Experiment 2
ML Method:
Neural Net
Topology:
Fully connected
400 inputs
50 node hidden layer
7 outputs
Outputs (“Contains … ?”):
A: 0 or 1
B: 0 or 1
C: 0 or 1
D: 0 or 1
E: 0 or 1
F: 0 or 1
G: 0 or 1
Inputs:
Single letter CAPTCHAs
Random fonts
Letters A-G
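A forward pass through this 400-50-7 topology might look like the sketch below. This is an illustration only: the weight initialization and sigmoid activation are assumptions, since the slides do not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# 400 inputs (a flattened 20 x 20 binary image), a fully connected
# 50-node hidden layer, and 7 outputs: one "contains A?" ... "contains G?"
# score each. Weights are small random values (an assumption).
W1 = rng.normal(scale=0.1, size=(400, 50))
b1 = np.zeros(50)
W2 = rng.normal(scale=0.1, size=(50, 7))
b2 = np.zeros(7)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x):
    """x: length-400 vector of 0s and 1s; returns 7 scores in (0, 1)."""
    hidden = sigmoid(x @ W1 + b1)
    return sigmoid(hidden @ W2 + b2)
```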
19. Experiment 2 Results
Past a certain number of nodes in the hidden layer, the topology ceases to have a huge impact on accuracy.
(Chart: Neural Net Accuracy vs. Size of Hidden Layer)
20. Experiment 3
ML Method: SOM
Topology:
500 buckets
ML Method: Neural Net
Topology:
Fully connected
400 inputs
1000 node hidden layer
7 outputs
Inputs:
4 letter CAPTCHAs
Random fonts
Letters A-G
26. What it all means
• Increasing the number of characters dramatically decreases total accuracy because segmentation quality decreases
• The true positive rate goes down when segmentation quality decreases
• Hence, better segmentation is the key
27. Future Work
Improved Segmentation
o Wirescreen segmentation
o Ensemble techniques
Improved True Positive Rates with Current
System
o Ensemble techniques
New problems
o Handwriting recognition
o Bot net of doom