You’ve probably come across the image above quite a number of times. It’s usually shown when you fill out a form, to validate that you are a human being and not a spam bot. But did you know that it is also a service that helps to digitise books, newspapers and old-time radio shows? Well, neither did I.
About 200 million CAPTCHAs are solved by humans around the world every day. Each one takes roughly ten seconds of human time. Individually, that’s not a lot of time, but in aggregate these little puzzles consume more than 150,000 hours of work each day. reCAPTCHA channels that effort into reading books.
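As a rough check on those figures, the arithmetic can be worked through in a few lines of Python. The inputs are just the round numbers quoted above, so treat the result as a ballpark and not a measurement; the variable names are made up for this example.

```python
# Back-of-the-envelope check using the round figures quoted in the text.
CAPTCHAS_PER_DAY = 200_000_000   # "about 200 million", as quoted
SECONDS_PER_CAPTCHA = 10         # "roughly ten seconds", as quoted
SECONDS_PER_HOUR = 60 * 60

# Total human time spent per day, converted from seconds to hours.
hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / SECONDS_PER_HOUR

print(f"{hours_per_day:,.0f} hours per day")  # prints "555,556 hours per day"
```

With these inputs the total comes out well above the 150,000-hour mark, so "more than 150,000 hours" is, if anything, a cautious way of putting it.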
Physical books that are being digitised are scanned, and the scanned pages are then read using “Optical Character Recognition” (OCR). However, OCR is not perfect, so the words it can’t identify are handed to reCAPTCHA for a human to identify.
So how does it detect a bot if it can’t identify the word itself? Well, it does so by presenting you with two words: the word it doesn’t know and a word which it does know. If the user correctly solves the word the system knows, the system treats the user as human and assumes the answer for the new word is correct too. Pretty neat, huh?
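To make the two-word trick concrete, here is a minimal sketch in Python. It is not reCAPTCHA’s actual code: the function names, the way answers are normalised, and the idea of tallying guesses in a counter are all assumptions made purely for illustration.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lower-case the text and collapse whitespace (an assumed matching rule)."""
    return " ".join(text.lower().split())


def check_challenge(known_word: str, control_answer: str,
                    unknown_answer: str, guesses: Counter) -> bool:
    """Hypothetical two-word check.

    known_word     -- the word the system already knows
    control_answer -- what the user typed for the known word
    unknown_answer -- what the user typed for the word OCR could not read
    guesses        -- tally of answers recorded for the unknown word
    """
    # Fail the whole challenge if the known word is wrong: likely a bot.
    if normalize(control_answer) != normalize(known_word):
        return False
    # The user got the known word right, so trust their reading of the
    # unknown word and record it.
    guesses[normalize(unknown_answer)] += 1
    return True


guesses = Counter()
passed = check_challenge("morning", "Morning", "upon", guesses)
failed = check_challenge("morning", "mourning", "xyz", guesses)
print(passed, failed, dict(guesses))
```

In this sketch an answer for the unknown word is only recorded when the known word checks out, which is the whole point: the known word vouches for the person typing.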
Play a part in digitising the books
If you run a website that suffers from spam, you can put reCAPTCHA on your site. For some applications (such as WordPress and MediaWiki), there are plugins that allow you to use reCAPTCHA without writing any code. There is also easy-to-use code for common web programming languages such as PHP.