This is a pretty old posting from 2009 I just recently discovered in my “drafts” directory. Nowadays there are probably easier and more elegant ways of defeating a captcha, but for old times sake, here is my simple approach.
———————–
Eclectic and Marko were so kind as to “provide” me a captcha to play around with. Took me a few days of poking around and googling but in the end it was easier than I had thought. As long as there aren’t and logic errors in the code (e.g. bad or no session handling) you probably won’t get around some kind of OCR. As OCR software I decided to use gocr because it is free, runs under linux, and it is fairly easy to train to specific needs. Because I knew which libraries were being used to create the captcha images, it was possible for me to build a testing area. This just speeds things up a bit, the process would have worked just as well off the original website. First off: the spambot in action -> http://captcha.dopefish.de/spambot.php, and the website it accesses: http://captcha.dopefish.de/
Now I’ll describe the steps I took to defeat the captcha. Look at what happens on failed and successful inputs, first write a script that works if you enter the solution manually. I used the following 2 php functions for getting and posting stuff (and keeping the session intact)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | $ch = curl_init(); function get_url( $ch, $url, $timeout = 5 ) { $url = str_replace( "&", "&", urldecode(trim($url)) ); $cookie = tempnam ("/tmp", "CURLCOOKIE"); curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" ); curl_setopt( $ch, CURLOPT_URL, $url ); curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookie ); curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie ); curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true ); curl_setopt( $ch, CURLOPT_ENCODING, "" ); curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ); curl_setopt( $ch, CURLOPT_AUTOREFERER, true ); curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # required for https urls curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout ); curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout ); curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 ); $content = curl_exec( $ch ); $response = curl_getinfo( $ch ); return array( $content, $response ); } function post_url( $ch, $url, $postvars, $timeout = 5 ) { $url = str_replace( "&", "&", urldecode(trim($url)) ); $cookie = tempnam ("/tmp", "CURLCOOKIE"); curl_setopt( $ch, CURLOPT_POST ,1); curl_setopt( $ch, CURLOPT_POSTFIELDS , $postvars); curl_setopt( $ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; rv:1.7.3) Gecko/20041001 Firefox/0.10.1" ); curl_setopt( $ch, CURLOPT_URL, $url ); curl_setopt( $ch, CURLOPT_COOKIEFILE, $cookie ); curl_setopt( $ch, CURLOPT_COOKIEJAR, $cookie ); curl_setopt( $ch, CURLOPT_FOLLOWLOCATION, true ); curl_setopt( $ch, CURLOPT_ENCODING, "" ); curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true ); curl_setopt( $ch, CURLOPT_AUTOREFERER, true ); curl_setopt( $ch, CURLOPT_SSL_VERIFYPEER, false ); # required for https urls curl_setopt( $ch, CURLOPT_CONNECTTIMEOUT, $timeout ); curl_setopt( $ch, CURLOPT_TIMEOUT, $timeout ); curl_setopt( $ch, CURLOPT_MAXREDIRS, 10 ); $content = curl_exec( $ch ); $response = curl_getinfo( $ch ); return array( $content, $response ); } |
Now train a gocr database for the images. Obviously it get’s better the more you train it.
Since curl is taking care of session handling, we can use the get_url() function for downloading the captcha image. I pipe it through this shell command to make it easier for gocr to read:
1 2 3 4 5 6 7 8 9 10 | convert \ -quality 100 \ -blur 0.7 \ -level 9%,60%,100.0 \ -monochrome \ -negate \ -blur 0.1 \ -gamma 1 \ -depth 8 \ -compress none \ |
It turnes this:
into this:
Since the valid captcha result is always the same length, we can check if gocr matched all the chars. If it looks good we can use post_url() to continue our session and throw all the fields at the form and submit it. See, wasn’t that hard. Most of the time is spent training gocr and converting the image into something easier to read. It doesn’t solve 100% of the images, more like 80-90%, but still better than nothing ;-).
Yeah ‘provide’ you with a captcha… My mailbox exploded with spam :p