Free Republic
Browse · Search
News/Activism
Topics · Post Article

Antispam Word Jumbles to Help Digitize Books [putting spamchecks to good use]
TechNewsWorld ^ | 05/25/07 11:25 AM PT | Katherine Noyes

Posted on 05/25/2007 5:40:47 PM PDT by Gondring

Antispam Word Jumbles to Help Digitize Books

A Carnegie Mellon University project is using CAPTCHA -- or Completely Automated Public Turing Test to Tell Computers and Humans Apart -- tests to digitize books. Three hundred Web sites have already signed up to use the technology. About 60 million CAPTCHA tests are solved every day.

Web surfers all too familiar with the distorted-letter tests that accompany so many site registration forms today can now take heart -- the time they spend on those tests is being put to good use.

Thanks to a project at Carnegie Mellon University, a new version of those pesky CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) tests makes the technology work double-duty: Not only does it continue to distinguish between legitimate human users and malevolent spam programs, it also uses the results to aid in the digitization of books for the Internet Archive.

A Carnegie Mellon team led by Luis von Ahn, an assistant professor of computer science and recipient of a MacArthur Foundation genius grant, developed the new tests, dubbed "reCAPTCHAs," which were launched on Wednesday.

Helping OCR

Optical character recognition (OCR) technology used to digitize printed text is often confounded by underlined text, scribbles and fuzzy or otherwise poorly printed letters.

ReCAPTCHA tests work by asking users to type in one distorted but known word along with one that has stumped an optical character recognition (OCR) system working on a digitization project. If the user inputs the known word correctly, then the system has greater confidence that he or she has deciphered the problematic word correctly too.

Each unknown word is submitted to multiple users; if several enter the same transcription, the system assumes it is correct.

In this way, the new tests continue to distinguish between humans and machines because they use text that OCR systems have already failed to read. However, they also contribute to book digitization projects by helping OCR systems convert printed text into computer-readable letters.
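The control-word check and the agreement step described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the case-insensitive comparison, and the threshold of three matching answers are all assumptions (the article says only that "several" users must agree).

```python
from collections import Counter


def record_response(control_word, typed_control, typed_unknown, votes):
    """Count a vote for the unknown word only if the known (control) word was typed correctly."""
    # Assumption: answers are compared ignoring case and surrounding spaces.
    passed = typed_control.strip().lower() == control_word.strip().lower()
    if passed:
        votes[typed_unknown.strip().lower()] += 1
    return passed


def agreed_reading(votes, min_agreement=3):
    """Return the leading reading once at least min_agreement users entered it, else None."""
    # Assumption: min_agreement=3 stands in for the article's "several".
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_agreement else None


# Hypothetical responses: (what the user typed for the known word, for the unknown word)
votes = Counter()
responses = [("morning", "upon"), ("morning", "upon"), ("mourning", "open"), ("morning", "upon")]
for typed_control, typed_unknown in responses:
    record_response("morning", typed_control, typed_unknown, votes)

print(agreed_reading(votes))  # prints "upon"
```

The third response is ignored because the user got the known word wrong, so the guess "open" never counts toward the unknown word.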

Wasting 150,000 Hours a Day

Von Ahn worked on the original CAPTCHA technology for Yahoo (Nasdaq: YHOO) and was astounded to later learn that 60 million of the tests are solved every day by people around the world. "When I first found this out, I was quite proud of myself and the impact my research has had," von Ahn told TechNewsWorld.

"But then I started feeling bad: Each time a CAPTCHA is solved, 10 seconds of human time are basically wasted," von Ahn explained. "If you multiply that by 60 million, you get that humanity as a whole wastes about 150,000 hours every day solving CAPTCHAs. That's a lot of time!"
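As a quick check of that arithmetic, using only the two figures in the quote (the variable names are just for illustration):

```python
CAPTCHAS_PER_DAY = 60_000_000   # figure quoted by von Ahn
SECONDS_PER_CAPTCHA = 10        # figure quoted by von Ahn

# Total human time per day, converted from seconds to hours.
hours_per_day = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA / 3600
print(round(hours_per_day))  # prints 166667

# Seconds per test that would give exactly 150,000 hours a day.
implied_seconds = 150_000 * 3600 / CAPTCHAS_PER_DAY
print(implied_seconds)  # prints 9.0
```

Ten seconds per test works out to roughly 167,000 hours a day, so the "about 150,000" in the quote reads as a loose round figure; it would be exact at nine seconds per test.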

Inspired to come up with additional ways the technology could do something useful for humanity, von Ahn then had the idea of helping to digitize books.

By Thursday night, about 300 Web sites had signed up to use the technology and 20,000 words had been digitized, von Ahn said. One of the first books being tackled is John Dewey's Psychology, he added.

Strength in Numbers

By tapping into the collective power of thousands of computer users worldwide, reCAPTCHA technology is similar to the distributed computing SETI@home project, through which users donate their computers' spare processing time to help process the enormous volumes of radio signals from space that get recorded by radio telescopes around the globe.

With support from Intel (Nasdaq: INTC), von Ahn's team has developed a free, Web-based service that allows individual webmasters to install reCAPTCHAs to protect their sites. Individuals can also use the technology to protect their own e-mail addresses.

'The Spirit of Web 2.0'

"ReCAPTCHA is a brilliant idea and implementation," Jason Dowdell, operator of media and technology blog MarketingShift, told TechNewsWorld.

"Far too many entrepreneurs have built applications that solve only one problem," Dowdell added. "Von Ahn has built a platform that is incredibly simple at its core yet provides the opportunity to meet some very large challenges -- that's the spirit of Web 2.0."


TOPICS: Culture/Society; Miscellaneous; Technical

"But then I started feeling bad: Each time a CAPTCHA is solved, 10 seconds of human time are basically wasted," von Ahn explained. "If you multiply that by 60 million, you get that humanity as a whole wastes about 150,000 hours every day solving CAPTCHAs. That's a lot of time!"

Although this isn't spammail, it's an indicator of how much waste is generated by deceptive/unwanted techniques on the 'net.

This highlights one of the big examples of how ineffective the Republican administration/legislative branches have been, with their "Opt-Out" ideas on spam, etc., rather than sticking to conservative principles and requiring opt-in, etc.

Thank goodness for some of the waste being put to good use now, at least.


BTW, I hire illegal aliens to type my CAPTCHA tests. There, I hijacked my own thread! :-)

1 posted on 05/25/2007 5:40:49 PM PDT by Gondring

To: Gondring

So if I understand this correctly, people solve the little tests, jumbled letters, and now those solved tests are actually going to be whole digital books down the road?


2 posted on 05/25/2007 5:49:02 PM PDT by padre35 (we are surrounded that simplifies things-Chesty Puller)

To: padre35

As I understand it, they are scanning in old books, and then using the words that the scanner can’t read as test words. When lots of responses are the same for the word, then they know what the word was that the OCR scanner couldn’t read.


3 posted on 05/25/2007 5:57:40 PM PDT by Gondring (I'll give up my right to die when hell freezes over my dead body!)

To: Gondring
"...humanity as a whole wastes about 150,000 hours every day solving CAPTCHAs. That's a lot of time!"

Yeah, but think of the number of hours wasted every day by all the people watching Oprah. Now that's depressing!

4 posted on 05/25/2007 5:58:02 PM PDT by randog (What the...?!)

To: Gondring

Thanks for the clue cause I had none! Now are these unknown words going to be added to various sites to enhance what the OCR can’t read now?


5 posted on 05/25/2007 6:05:48 PM PDT by jwh_Denver (Press 1 for English. Not me, I press 2 and say I can't speak Spanish.)

To: randog
I have no time for Oprah.






I FReep. :-)
6 posted on 05/25/2007 6:06:26 PM PDT by Gondring (I'll give up my right to die when hell freezes over my dead body!)

To: Gondring

Bump for later reading on CAPTCHA


7 posted on 05/25/2007 6:06:35 PM PDT by Kevmo (Duncan Hunter just needs one Rudy G Campaign Video: http://www.youtube.com/watch?v=RVBtPIrEleM)

To: Gondring
So, let me see if I get this right. The amount of spam is lower today because computer robots cannot read these things, only humans. So we are going to take the human effort in reading these things and use that to teach computers how to read them, albeit for the good reason of digitizing books.

Umm, what's to stop this "learning" from passing along to the spam robots, thus allowing them to pass the test and send even more spam?

8 posted on 05/25/2007 6:47:38 PM PDT by T. P. Pole

To: padre35
So if I understand this correctly, people solve the little tests, jumbled letters, and now those solved tests are actually going to be whole digital books down the road?

You're close. The little tests (CAPTCHAs) are how Web logins are handled now. This proposal would substitute images of scannable book text for the puzzles. Each scannable passage would be repeated to several different users, so their answers could be compared. Instead of having to have a computer scan the text, the logon users' answers would "vote" on the correct text corresponding to the image.

This technique would be used to input Gutenberg bibles and other texts that would be difficult to scan with conventional means.

9 posted on 05/25/2007 6:49:14 PM PDT by BlazingArizona

To: BlazingArizona

I have entered CAPTCHA verifiers many times in the past, but I would be seriously peeved if I was presented with text that was truly unclear as a requirement for me to place an order, pay a bill, etc. I would also assume they will not clutter up snippets from Gutenberg bibles with wavy lines and varied colors, so if I am a smart CAPTCHA respondent I will just ignore anything that isn’t cluttered up like that.


10 posted on 05/25/2007 7:15:41 PM PDT by drlevy88

To: Gondring; padre35

Very interesting article! What they are doing is taking the human readings of distorted letters and weighing in on what most people thought the letters were - thereby establishing a pattern for interpreting distorted letters.

I’m a translator and do some paleography, so I have the occasion to spend time with writing that is very difficult to read (from the point of view of my century). I have always wondered how that problem was going to be solved technologically, because I knew it would be, someday.

I think this could have considerable impact on many areas. It’s a brilliant idea.


11 posted on 05/25/2007 7:23:16 PM PDT by livius

To: T. P. Pole

Spammers already know how to exploit this technique and have been doing so for at least 6 years.

They program their bots to find captchas, say the ones to get into Yahoo chat rooms, then copy the captcha image and repost it on their porn websites. To get to the free porn, the users of the spammers’ porn websites solve the captchas; the spammers then take the answer the porn surfer gave and plug it into the target website, which accepts it as a correct answer.

When the spammers have several thousand porn sites simultaneously doing this, captcha technology can be defeated with an over 90% efficiency rate.


12 posted on 05/25/2007 10:27:52 PM PDT by JerseyHighlander

Disclaimer: Opinions posted on Free Republic are those of the individual posters and do not necessarily represent the opinion of Free Republic or its management. All materials posted herein are protected by copyright law and the exemption for fair use of copyrighted works.

FreeRepublic, LLC, PO BOX 9771, FRESNO, CA 93794
FreeRepublic.com is powered by software copyright 2000-2008 John Robinson