Scraping Google Translate's Language List with Regular Expressions

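This time, let's look at how to scrape the list of languages that Google Translate supports.
Start with the usual first step:

Open Google Translate: https://translate.google.cn/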

This is the page you see when it loads. Clicking the area marked in red in the screenshot expands the list of languages.

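The page URL does not change when the list opens. My first idea was to use selenium to simulate the button click and then extract the data, but matching with xpath and with bs4 both came up empty. Looking at the returned page again, I noticed that the response is just a string and that the language information is embedded in it, so I tried matching it directly with a regular expression, and that worked. The code is below.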

import requests
import re

# The Google Translate page opened above
url = 'https://translate.google.cn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
}
res = requests.get(url=url, headers=headers)
req = res.text  # the returned page as a string
req

The returned page looks like this:
……
</script> </head> <body> <script>(function(){var mobileWebapp={display_language:'zh-CN',source_code_name:[{code:'auto',name:'检测语言'},{code:'sq',name:'阿尔巴尼亚语'},……,{code:'yo',name:'约鲁巴语'},{code:'vi',name:'越南语'},{code:'zh-TW',name:'中文(繁体)'},{code:'zh-CN',name:'中文(简体)'}],body_direction:'ltr',maybe_default_target_code:'zh-CN'
……
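A likely reason the xpath and bs4 attempts found nothing is that the language names sit inside JavaScript source in a script tag, not in regular page elements. Here is a minimal sketch of that idea using only the standard library. SAMPLE is a shortened, hand-closed stand-in for the response shown above, and ScriptCollector is a hypothetical helper name, not something from the original code.

```python
from html.parser import HTMLParser

# Shortened stand-in for the response above (closing braces added by hand).
SAMPLE = (
    "<html><head></head><body><script>(function(){var mobileWebapp="
    "{display_language:'zh-CN',source_code_name:[{code:'auto',name:'检测语言'},"
    "{code:'vi',name:'越南语'}],body_direction:'ltr'};})();</script></body></html>"
)


class ScriptCollector(HTMLParser):
    """Collect text inside <script> tags separately from all other text."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_text = []
        self.other_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        # Route each text chunk to the bucket matching where it was found.
        target = self.script_text if self.in_script else self.other_text
        target.append(data)


parser = ScriptCollector()
parser.feed(SAMPLE)
parser.close()
script_source = "".join(parser.script_text)
visible_text = "".join(parser.other_text)

print("in script source:", "越南语" in script_source)
print("in element text:", "越南语" in visible_text)
```

If the real page behaves like this sample, element-based selectors have nothing to match, and treating the response as plain text is the simpler route.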

Next, we use a regular expression to pull out the part we need.

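# The '.' after the colon skips '[', and the '..' before body_direction skips '],'.
# '.' does not match newlines, so this relies on the whole block being on one line.
content = re.findall(r'.*source_code_name:.(.*)..body_direction:.*', req)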

The regex returns:
["{code:'auto',name:'检测语言'},{code:'sq',name:'阿尔巴尼亚语'},……{code:'yo',name:'约鲁巴语'},{code:'vi',name:'越南语'},{code:'zh-TW',name:'中文(繁体)'},{code:'zh-CN',name:'中文(简体)'}"]

We then split this string on commas and run a second regex over each piece to get the language names.

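# Split on commas; only the "name:'...'}" pieces contain a language name
words = content[0].split(',')
for i in words:
    # 'name..' skips ":'" and the trailing '..' drops "'}"; code:'...' pieces do not match
    word = re.findall(r'name..(.*)..', i)
    if word:
        print(word[0])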

检测语言
阿尔巴尼亚语
……
约鲁巴语
越南语
中文(繁体)
中文(简体)
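As a variation on the split-then-match step, the code and name of each entry can be captured in a single pass. This is only a sketch: the sample string is a shortened copy of the matched text shown earlier, and the variable names matched, pairs and languages are made up for illustration.

```python
import re

# Shortened copy of the matched string from above (four entries only).
matched = (
    "{code:'auto',name:'检测语言'},{code:'sq',name:'阿尔巴尼亚语'},"
    "{code:'zh-TW',name:'中文(繁体)'},{code:'zh-CN',name:'中文(简体)'}"
)

# Each entry has the shape {code:'...',name:'...'}.
# [^']* stops at the closing quote, so entries cannot run into each other.
pairs = re.findall(r"\{code:'([^']*)',name:'([^']*)'\}", matched)

# Map language code -> display name for later lookups.
languages = dict(pairs)

for code, name in pairs:
    print(code, name)
```

Keeping the code next to the name is handy if the list is later used to build translation requests, since those are keyed by language code, not by display name.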

Complete code:

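import requests
import re

# The Google Translate page opened above
url = 'https://translate.google.cn/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36',
}
res = requests.get(url=url, headers=headers)
req = res.text  # the returned page as a string

# Grab everything between 'source_code_name:[' and '],body_direction:'
content = re.findall(r'.*source_code_name:.(.*)..body_direction:.*', req)

# Split on commas and keep the quoted value of each "name:'...'}" piece
words = content[0].split(',')
for i in words:
    word = re.findall(r'name..(.*)..', i)
    if word:
        print(word[0])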
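One caveat with the code above: content[0] raises an IndexError if the page layout changes and the first regex finds nothing. A more defensive sketch is shown here, assuming the block keeps the source_code_name:[...],body_direction: shape seen in the response. The function name extract_language_names and the sample string are hypothetical.

```python
import re


def extract_language_names(page_text):
    """Return the language names found in page_text, or [] if the block is absent."""
    # Non-greedy capture between the two keys; DOTALL lets '.' span line breaks.
    block = re.search(
        r"source_code_name:\[(.*?)\],body_direction:", page_text, re.DOTALL
    )
    if block is None:
        return []
    # Take the quoted value that follows each name: key.
    return re.findall(r"name:'([^']*)'", block.group(1))


# Shortened sample in the shape of the response shown earlier.
sample = (
    "var mobileWebapp={display_language:'zh-CN',source_code_name:"
    "[{code:'auto',name:'检测语言'},{code:'vi',name:'越南语'}],"
    "body_direction:'ltr',maybe_default_target_code:'zh-CN'"
)

print(extract_language_names(sample))
print(extract_language_names("<html>no language data here</html>"))
```

Returning an empty list keeps the caller simple: it can check for an empty result and report that the page format may have changed, without crashing.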
