Dashixiong's Python Study Notes (25): Web Crawlers (6)

Dashixiong's Python Study Notes (24): Web Crawlers (5)
Dashixiong's Python Study Notes (26): Web Crawlers (7)

VII. Recognizing CAPTCHAs

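1. Recognizing simple image CAPTCHAs
  • These usually consist of four letters or digits.
  • Reading them requires OCR, done here with the tesserocr library.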
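1.1 Installing tesserocr on Windows
  • On Windows, tesserocr depends on tesseract, so download and install tesseract first.
  • Add the tesseract install directory to the PATH environment variable.
  • Download the tesserocr wheel that matches your Python version and architecture.
  • Install it with pip install ./tesserocr-2.4.0-cp37-cp37m-win_amd64.whl, adjusting the path to wherever you saved the file.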
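1.2 Recognition methods

1) tesserocr.image_to_text(image)

  • Recognizes the text in a PIL Image object.

2) tesserocr.file_to_text(filename)

  • Recognizes the text in an image file given its path; the results are generally not as good as with image_to_text(image).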
>>>import tesserocr
>>>import os
>>>from PIL import Image

>>>path = os.path.join("d:\\","sample_code","Graphical_verification_code.jpg")
>>>image = Image.open(path)
>>>image = image.convert('L') # convert to grayscale
>>>result = tesserocr.image_to_text(image)
>>>print(result)
D5Qe
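
If the grayscale image still gives noisy results, a common next step is to binarize it before OCR. The sketch below is not part of the original notes: the helper name build_threshold_table and the threshold of 127 are illustrative assumptions, and a workable threshold usually has to be found by experiment for each CAPTCHA style.

```python
def build_threshold_table(threshold=127):
    """Return a 256-entry lookup table mapping gray levels to 0 (black) or 1 (white).

    `threshold` is a hypothetical starting value; tune it per CAPTCHA style.
    """
    return [0 if level < threshold else 1 for level in range(256)]


table = build_threshold_table(127)
print(table[126], table[127])  # gray level 126 maps to 0, 127 maps to 1
```

With Pillow, a table like this can be applied to the grayscale image via image.point(table, '1'), and the result passed to tesserocr.image_to_text as before.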
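2. Recognizing Geetest CAPTCHAs
  • Geetest CAPTCHAs are the verification method used by many websites today: https://www.geetest.com/
  • Verification mainly takes the form of clicking, sliding, picking characters, picking images, or recognizing characters to form words.
  • You can try to pass the verification by simulating page interaction with Selenium.
  • Geetest CAPTCHAs are upgraded constantly, so this is a never-ending arms race...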
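2.1 Test environment

Taking the Geetest console login page at https://auth.geetest.com/login as an example, three cases need to be handled:

  • Simulating a click
  • Sliding the puzzle piece into the gap
  • Simulating dragging the slider block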
2.2 Simulating a click
  • Simulating the click is fairly simple: locate the button with Selenium and click it.


>>>from selenium import webdriver
>>>from selenium.webdriver.support.wait import WebDriverWait
>>>from selenium.webdriver.support import expected_conditions as EC
>>>from selenium.webdriver.common.by import By

>>>class Geetest_sample():
>>>    def __init__(self,user_name,password,url='https://auth.geetest.com/login'):
>>>        self.url = url
>>>        self.user_name = user_name
>>>        self.password = password
>>>        self.browser = webdriver.Firefox()
>>>        self.wait = WebDriverWait(self.browser,10)

>>>    def get_button(self):
>>>        # Locate the click-to-verify button
>>>        button = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'geetest_radar_tip_content')))
>>>        return button

>>>    def sort_username_password(self):
>>>        # Enter the username and password
>>>        input_username = self.browser.find_element(By.CSS_SELECTOR,'.ivu-input')
>>>        input_password = self.browser.find_element(By.CSS_SELECTOR,'[placeholder~=请输入密码]')
>>>        input_username.send_keys(self.user_name)
>>>        input_password.send_keys(self.password)

>>>    def crack_geek(self):
>>>        # Open the page, fill in the form, and click the button
>>>        self.browser.get(self.url)
>>>        self.sort_username_password()
>>>        button = self.get_button()
>>>        button.click()

>>>if __name__ == '__main__':
>>>    gs = Geetest_sample('test','test')
>>>    gs.crack_geek()
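2.3 Sliding the puzzle piece into the gap
  • The key is to locate the gap, for example with an edge-detection algorithm, and then move the slider to that position; the code in this section uses a simpler pixel-by-pixel comparison of two images.
  • The slider motion has to imitate a human, for example by accelerating first and then decelerating.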
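
The pixel-comparison idea can be tried in isolation on synthetic data. The following is only a sketch under stated assumptions: plain nested lists of RGB tuples stand in for PIL pixel data, the helper name find_gap and the 100x20 test images are made up for illustration, and the function returns None when nothing differs.

```python
THRESHOLD = 60  # same per-channel tolerance as used in these notes


def is_pixel_equal(pixel1, pixel2, threshold=THRESHOLD):
    """True when every RGB channel differs by less than `threshold`."""
    return all(abs(c1 - c2) < threshold for c1, c2 in zip(pixel1[:3], pixel2[:3]))


def find_gap(full, gapped, start_x=60):
    """Scan columns left to right from `start_x` and return the x of the first
    column containing a differing pixel, or None if the images match."""
    height, width = len(full), len(full[0])
    for x in range(start_x, width):
        for y in range(height):
            if not is_pixel_equal(full[y][x], gapped[y][x]):
                return x
    return None


# Synthetic 100x20 "images": light background, dark square gap at x = 70..79.
WIDTH, HEIGHT = 100, 20
full = [[(200, 200, 200)] * WIDTH for _ in range(HEIGHT)]
gapped = [row[:] for row in full]
for y in range(5, 15):
    for x in range(70, 80):
        gapped[y][x] = (40, 40, 40)

print(find_gap(full, gapped))  # 70
```

Both inputs must have the same dimensions for the comparison to be meaningful, which is why the two images should be cropped to the same region.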


>>>from selenium import webdriver
>>>from selenium.webdriver import ActionChains
>>>from selenium.webdriver.support.wait import WebDriverWait
>>>from selenium.webdriver.support import expected_conditions as EC
>>>from selenium.webdriver.common.by import By
>>>from PIL import Image
>>>from io import BytesIO
>>>import time

>>>class Geetest_sample():
>>>    def __init__(self,user_name,password,url='https://auth.geetest.com/login'):
>>>        self.url = url
>>>        self.user_name = user_name
>>>        self.password = password
>>>        self.browser = webdriver.Firefox()
>>>        self.wait = WebDriverWait(self.browser,10)

>>>    def get_button(self):
>>>        button = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'geetest_radar_tip_content')))
>>>        return button

>>>    def sort_username_password(self):
>>>        input_username = self.browser.find_element(By.CSS_SELECTOR,'.ivu-input')
>>>        input_password = self.browser.find_element(By.CSS_SELECTOR,'[placeholder~=请输入密码]')
>>>        input_username.send_keys(self.user_name)
>>>        input_password.send_keys(self.password)

>>>    def get_screenshot(self):
>>>        # Take a screenshot of the whole page
>>>        screenshot = self.browser.get_screenshot_as_png()
>>>        screenshot = Image.open(BytesIO(screenshot))
>>>        return screenshot

>>>    def get_img_position(self):
>>>        # Get the position of the CAPTCHA image on the page
>>>        img = self.wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME,'geetest_canvas_img'))) # wait for the CAPTCHA image to appear
>>>        time.sleep(2) # simulate human reaction time
>>>        location = img[0].location
>>>        size = img[0].size
>>>        top,bottom,left,right = location.get('y'),location.get('y') + size.get('height'),location.get('x'),location.get('x') + size.get('width')
>>>        return top,bottom,left,right

>>>    def get_geetest_image(self):
>>>        # Crop the CAPTCHA image out of the full-page screenshot
>>>        top,bottom,left,right = self.get_img_position()
>>>        screenshot = self.get_screenshot()
>>>        captcha = screenshot.crop((left,top,right,bottom))
>>>        return captcha

>>>    def get_slider(self):
>>>        # Get the slider button
>>>        slider = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'geetest_slider_button')))
>>>        return slider

>>>    def is_pixel_equal(self,image1,image2,x,y):
>>>        # Check whether the pixels at (x, y) in the two images are close enough
>>>        pixel1 = image1.load()[x,y]
>>>        pixel2 = image2.load()[x,y]
>>>        threshold = 60
>>>        if abs(pixel1[0] - pixel2[0]) < threshold and abs(pixel1[1]-pixel2[1]) < threshold and abs(pixel1[2]-pixel2[2]) < threshold:
>>>            return True
>>>        else:
>>>            return False

>>>    def get_gap(self,image1,image2):
>>>        # Find the x offset of the gap
>>>        left = 60
>>>        for i in range(left,image1.size[0]):
>>>            for j in range(image1.size[1]):
>>>                if not self.is_pixel_equal(image1,image2,i,j):
>>>                    left = i
>>>                    return left
>>>        return left

>>>    def get_track(self,distance):
>>>        # Compute the movement track
>>>        track = [] # per-step offsets
>>>        current = 0 # current displacement
>>>        mid = distance*4/5  # threshold at which to start decelerating
>>>        t = 0.2  # time step
>>>        v = 0 # initial velocity
>>>        while current < distance:
>>>            if current < mid:
>>>                a = 2 # acceleration
>>>            else:
>>>                a = -3 # deceleration
>>>            v0 = v # velocity at the start of this step
>>>            v = v0 + a*t # velocity at the end of this step
>>>            move = v0*t + (1/2)*a*t*t # distance moved in this step
>>>            current += move
>>>            track.append(round(move))
>>>        return track

>>>    def move_to_gap(self,slider,tracks):
>>>        # Drag the slider along the track
>>>        ActionChains(self.browser).click_and_hold(slider).perform() # press and hold the slider
>>>        for x in tracks:
>>>            ActionChains(self.browser).move_by_offset(xoffset=x,yoffset=0).perform()
>>>        time.sleep(0.5)
>>>        ActionChains(self.browser).release().perform()

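>>>    def sort_random_geetest(self):
>>>        # Handle the second (slider) verification step.
>>>        # Both images are cropped to the CAPTCHA so they have the same size;
>>>        # this assumes the gap only becomes visible after the slider is clicked.
>>>        image1 = self.get_geetest_image()
>>>        slider = self.get_slider()
>>>        slider.click()
>>>        image2 = self.get_geetest_image()
>>>        gap = self.get_gap(image1=image1, image2=image2)
>>>        track = self.get_track(gap)
>>>        self.move_to_gap(slider, track)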

>>>    def crack_geek(self):
>>>        # Step 1: enter the username and password
>>>        self.browser.get(self.url)
>>>        self.sort_username_password()

>>>        # Step 2: complete the click verification
>>>        button = self.get_button()
>>>        button.click()

>>>        # Step 3: complete the randomly chosen second verification
>>>        self.sort_random_geetest()

>>>if __name__ == '__main__':
>>>    gs = Geetest_sample('test','test')
>>>    gs.crack_geek()
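
Because get_track rounds each step independently, the offsets it returns may not add up exactly to the requested distance. One possible variant, shown here only as a sketch, carries the rounding remainder forward so the integer steps always sum to the target. The function name build_track, its keyword parameters, and the stall guard are assumptions added for illustration; the kinematics and default values follow the listing.

```python
def build_track(distance, t=0.2, accel=2.0, decel=-3.0):
    """Return integer x-offsets that accelerate, then decelerate, and sum to `distance`.

    `distance` is rounded to a whole number of pixels first.
    """
    distance = round(distance)
    track = []
    current = 0.0  # exact displacement so far
    emitted = 0    # integer displacement already handed out
    v = 0.0
    mid = distance * 4 / 5  # switch to deceleration after 4/5 of the way
    while emitted < distance:
        a = accel if current < mid else decel
        v0 = v
        v = v0 + a * t
        move = v0 * t + 0.5 * a * t * t
        if move <= 0:
            # Guard against stalling if other parameters stop the slider early.
            move = distance - current
        current = min(current + move, distance)
        step = round(current) - emitted  # carry the rounding remainder forward
        track.append(step)
        emitted += step
    return track


track = build_track(120)
print(sum(track))  # 120
```

The early steps come out as 0 while the simulated slider is still slow, which keeps the accelerate-then-decelerate shape without losing pixels to rounding.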
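2.4 Click-based CAPTCHAs
  • This kind of CAPTCHA has to be solved with the help of a third-party API.
  • The Chaojiying CAPTCHA recognition platform is used here.
  • Pay attention to adjusting the size of the CAPTCHA screenshot.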
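
One reason the screenshot size may need adjusting is that Selenium reports element positions in CSS pixels, while on a scaled (high-DPI) display the screenshot can contain more pixels than that. The sketch below shows how a crop box could be scaled; the helper name crop_box, the scale parameter, and the sample numbers are all hypothetical.

```python
def crop_box(location, size, scale=1.0):
    """Convert an element's location/size into a (left, top, right, bottom) box
    in screenshot pixels. `scale` is a hypothetical device pixel ratio."""
    left = location['x'] * scale
    top = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    bottom = (location['y'] + size['height']) * scale
    return tuple(round(v) for v in (left, top, right, bottom))


# Made-up values in the shape Selenium returns for element.location and element.size.
box = crop_box({'x': 100, 'y': 200}, {'width': 260, 'height': 160}, scale=1.5)
print(box)  # (150, 300, 540, 540)
```

If the image is scaled before upload, the coordinates returned by the recognition service refer to the scaled image and would need to be divided by the same factor before clicking.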
>>>from selenium import webdriver
>>>from selenium.webdriver import ActionChains
>>>from selenium.webdriver.support.wait import WebDriverWait
>>>from selenium.webdriver.support import expected_conditions as EC
>>>from selenium.webdriver.common.by import By
>>>from PIL import Image
>>>from io import BytesIO
>>>from hashlib import md5
>>>from selenium.common.exceptions import TimeoutException
>>>import time
>>>import requests

>>>CHAOJIYING_USERNAME = "yourusername"
>>>CHAOJIYING_PASSWORD = "yourpassword"
>>>CHAOJIYING_SOFT_ID = "yoursoftid"
>>>CHAOJIYING_TYPE = "9004" # coordinate selection (four points); response format: x1,y1|x2,y2|x3,y3|x4,y4

>>>class Chaojiying(object):
>>>    """
>>>    Client for the Chaojiying CAPTCHA recognition service
>>>    """
>>>    def __init__(self, username, password, soft_id):
>>>        self.username = username
>>>        password = password.encode('utf8')
>>>        self.password = md5(password).hexdigest()
>>>        self.soft_id = soft_id
>>>        self.base_params = {
>>>            'user': self.username,
>>>            'pass2': self.password,
>>>            'softid': self.soft_id,
>>>        }
>>>        self.headers = {
>>>            'Connection': 'Keep-Alive',
>>>            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
>>>        }

>>>    def PostPic(self, im, codetype):
>>>        """
>>>        im: image bytes
>>>        codetype: CAPTCHA type code; see http://www.chaojiying.com/price.html
>>>        """
>>>        params = {
>>>            'codetype': codetype,
>>>        }
>>>        params.update(self.base_params)
>>>        files = {'userfile': ('ccc.jpg', im)}
>>>        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files,
>>>                          headers=self.headers)
>>>        return r.json()

>>>    def ReportError(self, im_id):
>>>        """
>>>        im_id: ID of the image whose recognition result was wrong
>>>        """
>>>        params = {
>>>            'id': im_id,
>>>        }
>>>        params.update(self.base_params)
>>>        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
>>>        return r.json()

>>>class Geetest_sample():
>>>    def __init__(self,user_name,password,url='https://auth.geetest.com/login'):
>>>        self.url = url
>>>        self.user_name = user_name
>>>        self.password = password
>>>        self.browser = webdriver.Firefox()
>>>        self.wait = WebDriverWait(self.browser,10)
>>>        self.chaojiying = Chaojiying(CHAOJIYING_USERNAME,CHAOJIYING_PASSWORD,CHAOJIYING_SOFT_ID)

>>>    def get_button(self,cls):
>>>        # Wait until the element with the given class name is clickable
>>>        button = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,cls)))
>>>        return button

>>>    def sort_username_password(self):
>>>        input_username = self.browser.find_element(By.CSS_SELECTOR,'.ivu-input')
>>>        input_password = self.browser.find_element(By.CSS_SELECTOR,'[placeholder~=请输入密码]')
>>>        input_username.send_keys(self.user_name)
>>>        input_password.send_keys(self.password)

>>>    def get_screenshot(self):
>>>        # Take a screenshot of the whole page
>>>        screenshot = self.browser.get_screenshot_as_png()
>>>        screenshot = Image.open(BytesIO(screenshot))
>>>        return screenshot

>>>    def get_img_position(self,cls):
>>>        # Get the position of the element with the given class name
>>>        img = self.wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME,cls))) # wait for the CAPTCHA image to appear
>>>        time.sleep(2) # simulate human reaction time
>>>        location = img[0].location
>>>        size = img[0].size
>>>        top,bottom,left,right = location.get('y'),location.get('y') + size.get('height'),location.get('x'),location.get('x') + size.get('width')
>>>        return top,bottom,left,right

>>>    def get_geetest_image(self,cls):
>>>        # Crop the CAPTCHA image out of the full-page screenshot
>>>        top,bottom,left,right = self.get_img_position(cls)
>>>        print(top,bottom,left,right)
>>>        screenshot = self.get_screenshot()
>>>        captcha = screenshot.crop((left,top,right,bottom))
>>>        return captcha

>>>    def is_pixel_equal(self,image1,image2,x,y):
>>>        # Check whether the pixels at (x, y) in the two images are close enough
>>>        pixel1 = image1.load()[x,y]
>>>        pixel2 = image2.load()[x,y]
>>>        threshold = 60
>>>        if abs(pixel1[0] - pixel2[0]) < threshold and abs(pixel1[1]-pixel2[1]) < threshold and abs(pixel1[2]-pixel2[2]) < threshold:
>>>            return True
>>>        else:
>>>            return False

>>>    def get_gap(self,image1,image2):
>>>        # Find the x offset of the gap
>>>        left = 60
>>>        for i in range(left,image1.size[0]):
>>>            for j in range(image1.size[1]):
>>>                if not self.is_pixel_equal(image1,image2,i,j):
>>>                    left = i
>>>                    return left
>>>        return left

>>>    def get_track(self,distance):
>>>        # Compute the movement track
>>>        track = [] # per-step offsets
>>>        current = 0 # current displacement
>>>        mid = distance*4/5  # threshold at which to start decelerating
>>>        t = 0.2  # time step
>>>        v = 0 # initial velocity
>>>        while current < distance:
>>>            if current < mid:
>>>                a = 2 # acceleration
>>>            else:
>>>                a = -3 # deceleration
>>>            v0 = v # velocity at the start of this step
>>>            v = v0 + a*t # velocity at the end of this step
>>>            move = v0*t + (1/2)*a*t*t # distance moved in this step
>>>            current += move
>>>            track.append(round(move))
>>>        return track

>>>    def move_to_gap(self,slider,tracks):
>>>        # Drag the slider along the track
>>>        ActionChains(self.browser).click_and_hold(slider).perform() # press and hold the slider
>>>        for x in tracks:
>>>            ActionChains(self.browser).move_by_offset(xoffset=x,yoffset=0).perform()
>>>        time.sleep(0.5) # pause once after the drag, as in section 2.3
>>>        ActionChains(self.browser).release().perform()

>>>    def get_points(self,result):
>>>        # Parse the recognition result into a list of [x, y] coordinates
>>>        groups = result.get('pic_str').split('|')
>>>        print(groups)
>>>        locations = [[int(number) for number in group.split(',')] for group in groups]
>>>        return locations

>>>    def click_locations(self,locations):
>>>        # Click each recognized point in the CAPTCHA image, then click the confirm button
>>>        element = self.browser.find_element(By.CLASS_NAME,"geetest_widget")
>>>        for location in locations:
>>>            action = ActionChains(self.browser).move_to_element_with_offset(element,location[0],location[1]).click()
>>>            action.perform()
>>>            time.sleep(1)
>>>        button = self.get_button('geetest_commit_tip')
>>>        button.click()

>>>    def resize_img(self,image):
>>>        # Resize the image (as written this keeps the original size; change (x, y) as needed)
>>>        (x, y) = image.size
>>>        image = image.resize((x, y), Image.LANCZOS)
>>>        return image

>>>    def sort_by_chaojiying(self,image):
>>>        image = self.resize_img(image)

>>>        # Send the CAPTCHA image to Chaojiying for recognition
>>>        bytes_array = BytesIO()
>>>        image.save(bytes_array,format="PNG")
>>>        result = self.chaojiying.PostPic(
>>>            bytes_array.getvalue(),CHAOJIYING_TYPE
>>>        )
>>>        return result

>>>    def sort_random_geetest(self):
>>>        # Detect which kind of second-step CAPTCHA appeared
>>>        try:
>>>            image1 = self.get_geetest_image('geetest_canvas_img')
>>>        except TimeoutException:
>>>            image1 = None
>>>        try:
>>>            tip = self.get_img_position("geetest_tip_img")
>>>        except TimeoutException:
>>>            tip = None

>>>        if image1:
>>>            # Slider puzzle: compare same-size crops taken before and after clicking the slider
>>>            # (assumes the gap only becomes visible after the click)
>>>            slider = self.get_button('geetest_slider_button')
>>>            slider.click()
>>>            image2 = self.get_geetest_image('geetest_canvas_img')

>>>            gap = self.get_gap(image1=image1, image2=image2)
>>>            track = self.get_track(gap)
>>>            print(track)
>>>            self.move_to_gap(slider, track)
>>>        elif tip:
>>>            # Click-based CAPTCHA: send the widget image to Chaojiying and click the returned points
>>>            git_img = self.get_geetest_image("geetest_widget")
>>>            result = self.sort_by_chaojiying(image=git_img)
>>>            locations = self.get_points(result)
>>>            self.click_locations(locations)

>>>    def crack_geek(self):
>>>        # Step 1: enter the username and password
>>>        self.browser.get(self.url)
>>>        self.sort_username_password()

>>>        # Step 2: complete the click verification
>>>        button = self.get_button('geetest_radar_tip_content')
>>>        button.click()

>>>        # Step 3: complete the randomly chosen second verification
>>>        self.sort_random_geetest()

>>>if __name__ == '__main__':
>>>    gs = Geetest_sample('test','test')
>>>    gs.crack_geek()
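
The coordinate string returned for type 9004 has the form x1,y1|x2,y2|..., so the parsing step can be checked without calling the service. This is a minimal sketch: the function name parse_points and the sample string are made up, and skipping empty groups is an added assumption about how to treat an empty result.

```python
def parse_points(pic_str):
    """Parse 'x1,y1|x2,y2|...' into a list of (x, y) integer tuples."""
    points = []
    for group in pic_str.split('|'):
        group = group.strip()
        if not group:
            continue  # tolerate an empty result or a trailing separator
        x, y = (int(number) for number in group.split(','))
        points.append((x, y))
    return points


print(parse_points('108,133|227,143|58,70'))  # [(108, 133), (227, 143), (58, 70)]
```

Returning an empty list for an empty string lets the caller decide whether to retry or report the failed recognition.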

Author: 大师兄 (superkmi)
