
Web Scraping: Recognition (Short Post)

lvcham · Posted 2022-03-24 00:33:59 · Study Notes

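Captcha Recognition

What is the love-hate relationship between captchas and scrapers?

Anti-scraping mechanism: captchas
Countermeasure: recognize the data in the captcha image and use it to simulate the login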

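Ways to recognize a captcha:

Manual recognition by eye (not recommended; some captchas are simply impossible to make out)
Automatic recognition with a third-party tool (recommended; ddddocr is used here, and you are free to pick another one if you have the means)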

About ddddocr (带带弟弟OCR):

Install: pip install ddddocr

PS: In case of network problems, here are the same install commands using mirrors

Douban mirror: pip install -i http://pypi.douban.com/simple --trusted-host pypi.douban.com ddddocr
Tsinghua mirror: pip install -i https://pypi.tuna.tsinghua.edu.cn/simple --trusted-host pypi.tuna.tsinghua.edu.cn ddddocr

Quick steps: https://blog.csdn.net/jiahuiandxuehui/article/details/119089944
More details: https://github.com/sml2h3/ddddocr

PS: In theory it can handle most captchas currently in use, but whether a given recognition succeeds is a matter of luck
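Since a single recognition attempt can miss, one option is to fetch a fresh captcha and try again when the result looks implausible. Below is a minimal sketch of that idea, not part of ddddocr itself. The helper name `recognize_with_retry`, its parameters, and the default pattern (four letters or digits) are all assumptions made for illustration; adjust the pattern to the site you are working with.

```python
import re


def recognize_with_retry(fetch_image, recognize, max_attempts=3,
                         pattern=r"[0-9A-Za-z]{4}"):
    """Retry recognition until the text matches the expected captcha shape.

    fetch_image: callable returning fresh captcha image bytes (hypothetical)
    recognize:   callable mapping image bytes to a string
    pattern:     assumed captcha format; four alphanumerics is only a guess
    Returns (text, attempts_used), or (None, max_attempts) if every try failed.
    """
    for attempt in range(1, max_attempts + 1):
        img_bytes = fetch_image()            # a new image on every attempt
        text = recognize(img_bytes).strip()  # drop stray whitespace
        if re.fullmatch(pattern, text):      # plausible result: accept it
            return text, attempt
    return None, max_attempts
```

With ddddocr, `recognize` could be the `classification` method of a `ddddocr.DdddOcr()` instance, and `fetch_image` a function that downloads the captcha image again. A plausible-looking result can still be wrong, so the real check is whether the login succeeds.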

Hands-on: recognizing the captcha on the 古诗文网 (gushiwen.cn) login page

Download the captcha image to local disk
Call ddddocr to recognize the image data

Code for this post:

import requests
from lxml import etree
import ddddocr
import time

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'
    }
    url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
    # Fetch the login page
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Build the captcha image URL from the src attribute of the element with id "imgCode"
    code_img_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
    print(code_img_src)
    # Download the image data
    img_data = requests.get(url=code_img_src, headers=headers).content
    # Save it to local disk
    with open('./code.jpg', 'wb') as fp:
        fp.write(img_data)
    # Recognize the saved image with ddddocr and time the whole step
    begin = time.time()
    ocr = ddddocr.DdddOcr()
    with open('./code.jpg', 'rb') as f:
        img_bytes = f.read()
    res = ocr.classification(img_bytes)
    finish = time.time()
    print("Result:")
    print(res)
    print("Elapsed: %s seconds" % str(finish - begin))
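The script above only prints the recognized text. If the result is later submitted in a login request, the captcha is usually tied to the server-side session, so the page request, the image download, and the login POST would need to share cookies. The sketch below shows one way to arrange that. The helper names `captcha_url` and `download_captcha` are made up for illustration, and `session` is assumed to be a `requests.Session()` or anything with a compatible `.get()` method.

```python
from urllib.parse import urljoin


def captcha_url(page_url, img_src):
    """Resolve the image src against the page URL.

    urljoin handles both a relative src ("/x.png") and an absolute one,
    which is a bit safer than concatenating the host by hand.
    """
    return urljoin(page_url, img_src)


def download_captcha(session, page_url, img_src, headers=None):
    """Download the captcha through the given session (hypothetical helper).

    Reusing one session keeps the cookies set by the login page, so the
    image fetched here belongs to the same session as a later login POST.
    """
    resp = session.get(captcha_url(page_url, img_src), headers=headers)
    return resp.content
```

In use, the same session object would first request the login page, then be passed to `download_captcha`, and finally send the login form together with the recognized text.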

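PS:

The captcha image on 封神台 is fetched from a cloud API on the fly when you click login, and the image URL is fairly random each time. Since I could not work out the pattern, I was unable to download the requested image correctly.
In addition, the 封神台 captcha consists of an original image, a large shuffled image with a gap, and a small fragment image. The large image has to be reassembled correctly before it can be used, and ddddocr currently cannot recognize this type.

This post is my personal study notes for parts 29P-31P of the course 2020年Python爬虫全套课程（学完可做项目） (2020 Python web scraping full course). The material is fairly simple: knowing how to use a captcha recognition library is enough.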


1 reply | as of 2022-4-1 | 863 views

criys
Posted 2022-4-1

Recognition really is quite useful
