bs4 Parsing (Part 2)

The select() Method

We can also extract data with CSS selectors. Note that this approach requires some familiarity with CSS selector syntax; a reference:
https://www.w3school.com.cn/cssref/css_selectors.asp
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

  • Find the title tag
soup = BeautifulSoup(html_doc,'lxml')
print(soup.select('title'))
Output:
[<title>The Dormouse's story</title>]
  • Get the title tag's text
soup = BeautifulSoup(html_doc,'lxml')
tie = soup.select('title')[0].string
print(tie)
Output:
The Dormouse's story
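  • A note on .string vs get_text() (a general bs4 behavior, added as an aside): .string returns a tag's single text child and is None when the tag contains more than one child; get_text() concatenates all nested text instead.
soup = BeautifulSoup(html_doc,'lxml')
story = soup.select('p.story')[0]
print(story.string)      # None -- this tag has several children
print(story.get_text())  # the full sentence, links' text included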
  • Find the element with class="sister" and read its id attribute (here "link1")
    In CSS, class="sister" is written as '.sister'
soup = BeautifulSoup(html_doc,'lxml')
tie = soup.select('.sister')[0].get('id')
print(tie)
Output:
link1
  • Find the text of the element with id="link2"
    In CSS, id="link2" is written as '#link2'
soup = BeautifulSoup(html_doc,'lxml')
tie = soup.select('#link2')[0].string
print(tie)
Output:
Lacie
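  • Combined selectors: select() accepts any CSS selector combination, not just a single tag, class, or id. A small sketch against the same html_doc (the selectors shown are standard CSS, added here for illustration):
soup = BeautifulSoup(html_doc,'lxml')
# descendant selector: every <a> inside a <p class="story">
print(soup.select('p.story a'))
# attribute selector: links whose href starts with a given prefix
print(soup.select('a[href^="http://example.com"]'))
# child selector: the <title> that is a direct child of <head>
print(soup.select('head > title'))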

Extended select() Example

from bs4 import BeautifulSoup
html = """
<table class="tablelist" cellpadding="0" cellspacing="0">
    <tbody>
        <tr class="h">
            <td class="l" width="374">职位名称</td>
            <td>职位类别</td>
            <td>人数</td>
            <td>地点</td>
            <td>发布时间</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=33824&keywords=python&tid=87&lid=2218">22989-金融云区块链高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=29938&keywords=python&tid=87&lid=2218">22989-金融云高级后台开发</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31236&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐运营开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>2</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31235&keywords=python&tid=87&lid=2218">SNG16-腾讯音乐业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-25</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34531&keywords=python&tid=87&lid=2218">TEG03-高级研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=34532&keywords=python&tid=87&lid=2218">TEG03-高级图像算法研发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=31648&keywords=python&tid=87&lid=2218">TEG11-高级AI开发工程师(深圳)</a></td>
            <td>技术类</td>
            <td>4</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32218&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="even">
            <td class="l square"><a target="_blank" href="position_detail.php?id=32217&keywords=python&tid=87&lid=2218">15851-后台开发工程师</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
        <tr class="odd">
            <td class="l square"><a id="test" class="test" target='_blank' href="position_detail.php?id=34511&keywords=python&tid=87&lid=2218">SNG11-高级业务运维工程师(深圳)</a></td>
            <td>技术类</td>
            <td>1</td>
            <td>深圳</td>
            <td>2017-11-24</td>
        </tr>
    </tbody>
</table>
"""
soup = BeautifulSoup(html,'lxml')
# skip the first <tr>, which is the table header row
trs = soup.select('tr')[1:]
for tr in trs:
    # the first <td> of each row holds the job title
    jobs = tr.select('td')[0]
    jobs_work = list(jobs.stripped_strings)[0]
    print(jobs_work)
Output:
22989-金融云区块链高级研发工程师(深圳)
22989-金融云高级后台开发
SNG16-腾讯音乐运营开发工程师(深圳)
SNG16-腾讯音乐业务运维工程师(深圳)
TEG03-高级研发工程师(深圳)
TEG03-高级图像算法研发工程师(深圳)
TEG11-高级AI开发工程师(深圳)
15851-后台开发工程师
15851-后台开发工程师
SNG11-高级业务运维工程师(深圳)
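The loop above keeps only the job title. A possible extension (illustrative, not from the original) that pulls every column of each row, reusing the soup object from above:
trs = soup.select('tr')[1:]
for tr in trs:
    tds = tr.select('td')
    # stripped_strings yields a tag's text fragments with surrounding whitespace removed
    row = [list(td.stripped_strings)[0] for td in tds]
    print(row)   # [job title, category, headcount, location, date posted]
    # the detail-page link sits on the <a> inside the first <td>
    print(tds[0].select('a')[0].get('href'))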

bs4 Case Study

  • Goal: scrape the job listings from 'https://pt.597.com/zhaopin', extracting fields such as job title (职位), company (发布公司), salary (薪资), location (求职地区), and experience requirements (经验要求).
  • URL structure analysis:
    https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=1  page 1
    https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=2  page 2
    https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=3  page 3
    and so on. Use input() to take a custom job keyword, then loop to fetch multiple pages (see the encoding aside after this list):
    workjob = input('请你输入要查找的工作岗位:')
    for x in range(31):
        url = 'https://pt.597.com/zhaopin/?q=%s&page={}'.format(x+1) % workjob
  • Inspect the page source. The raw HTML matches what the Elements panel shows, so the page is static and can be parsed with regular expressions, XPath, or bs4; this case study uses bs4.
  • Open 'https://pt.597.com/zhaopin/?q=%E6%96%87%E5%91%98&page=2', right-click the job title '仓库文员' and choose Inspect. This reveals <a href="/job-4b2a4d5132951.html" data-jid="4b2a4d5132951" data-act="1" target="blank" class="fb des_title" style="" rel="">仓库文员</a>. Its ancestor <li class="firm-l"> corresponds to one complete listing row, and <div class="firm_box" id="firm_box"> wraps every listing on the page, so that is the node to locate with bs4:
    soup = BeautifulSoup(res,'lxml')
    firm_box = soup.find('div', class_="firm_box")
    Narrow the scope further by finding all the <div class="firm-item"> elements and keeping every row except the header:
    firm_items = firm_box.find_all('div', class_="firm-item")[1:]
    for firm_item in firm_items:
        lis = firm_item.find_all('li')
    The job details sit in each firm_item's <li> tags: lis is a list, the company name is in lis[1].string, and so on.
  • Collect the parsed strings for each row into a list and append() it to the results list job_lst.
  • Write the collected rows to a CSV file:
    def WriteData(self,job_lst):
        hr = ['职位','发布公司','薪资','求职地区','经验要求']
        with open('招聘统计表.csv','w',encoding='utf-8',newline='') as f:
            writer = csv.writer(f)
            writer.writerow(hr)
            writer.writerows(job_lst)
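  • Aside on the %E6%96%87%E5%91%98 in the URLs above: it is simply the percent-encoded form of '文员'. A minimal sketch (urllib.parse.quote and the requests params argument are standard features; this block is illustrative and not part of the original walkthrough):
    from urllib.parse import quote
    print(quote('文员'))   # -> %E6%96%87%E5%91%98
    # requests can also build and encode the query string itself:
    import requests
    response = requests.get('https://pt.597.com/zhaopin/', params={'q': '文员', 'page': 1})
Putting it all together, the complete script: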
import requests
import csv
from bs4 import BeautifulSoup
class Sprider():
    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'}

    def UrlSourse(self,url):
        # fetch the page and return its decoded HTML text
        response = requests.get(url, headers=self.headers)
        response.encoding = 'utf-8'
        res = response.text
        return res

    def parserurl(self,res):
        soup = BeautifulSoup(res,'lxml')
        # the div with class "firm_box" wraps every listing on the page
        firm_box = soup.find('div', class_="firm_box")
        # skip the first firm-item, which is the header row
        firm_items = firm_box.find_all('div', class_="firm-item")[1:]
        job_lst = []
        for firm_item in firm_items:
            lis = firm_item.find_all('li')
            target = lis[1].string  # company
            sary = lis[2].string    # salary
            diqu = lis[3].string    # location
            jyan = lis[4].string    # experience requirement
            # the job title is the text of the first <a> in the row
            jobs = firm_item.find_all('a')
            job = jobs[0].string
            lst = [job, target, sary, diqu, jyan]
            job_lst.append(lst)
        return job_lst

    def WriteData(self,job_lst):
        hr=['职位','发布公司','薪资','求职地区','经验要求']
        with open('招聘统计表.csv','w',encoding='utf-8',newline='')as f:
            writer = csv.writer(f)
            writer.writerow(hr)
            writer.writerows(job_lst)

    def main(self):
        workjob = input('请你输入要查找的工作岗位:')
        job_lst = []
        # fetch pages 1 through 31 for the requested keyword
        for x in range(31):
            url = 'https://pt.597.com/zhaopin/?q=%s&page={}'.format(x+1) % workjob
            res = self.UrlSourse(url)
            job_lst += self.parserurl(res)
        self.WriteData(job_lst)


if __name__ == '__main__':

    s = Sprider()
    s.main()
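A practical note (an addition, not from the original): a CSV written with encoding='utf-8' can show garbled Chinese when opened directly in Excel. Writing with encoding='utf-8-sig' adds a byte-order mark that Excel uses to detect UTF-8:
with open('招聘统计表.csv','w',encoding='utf-8-sig',newline='') as f:
    writer = csv.writer(f)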

Advanced bs4 Case Study
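This example scrapes city names and temperatures from the regional forecast pages of the China Weather Network (weather.com.cn). Note the parser choice below: html5lib is slower than lxml but much more tolerant of malformed markup, which matters here because at least one of the regional pages (reportedly the Hong Kong/Macau/Taiwan page, gat.shtml) is not well-formed HTML.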

import requests
import csv
from bs4 import BeautifulSoup

class Sprider():

    def __init__(self):
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36'}

    def urlsourse(self,url):
        response = requests.get(url, headers=self.headers)
        response.encoding = 'utf-8'
        res = response.text
        return res

    def parseurl(self,res):
        # html5lib tolerates the malformed markup on some of these pages
        soup = BeautifulSoup(res, 'html5lib')
        conMidtab = soup.find('div', class_="conMidtab")
        tables = conMidtab.find_all('table')
        weather_list = []
        for table in tables:
            # the first two <tr> rows are table headers
            trs = table.find_all('tr')[2:]
            for index, tr in enumerate(trs):
                tds = tr.find_all('td')
                city = list(tds[0].stripped_strings)[0]
                # in the first data row of each table the province name occupies
                # tds[0] (via rowspan), so the city sits in tds[1] instead
                if index == 0:
                    city = list(tds[1].stripped_strings)[0]
                lst_dict = {}
                # the second-to-last column holds the temperature
                temp = list(tds[-2].stripped_strings)[0]
                lst_dict['city'] = city
                lst_dict['temp'] = temp
                weather_list.append(lst_dict)
        return weather_list

    def writerdata(self,weather_list):
        hr = ['city','temp']
        with open('天气预报.csv','w',encoding='utf-8',newline='') as f:
            # DictWriter maps each dict's keys to the fieldnames in hr
            writer = csv.DictWriter(f,hr)
            writer.writeheader()
            writer.writerows(weather_list)

    def main(self):
        # regional forecast pages: North, Northeast, East, Central, South,
        # Northwest, Southwest China, and Hong Kong/Macau/Taiwan
        urls = ['http://www.weather.com.cn/textFC/hb.shtml','http://www.weather.com.cn/textFC/db.shtml','http://www.weather.com.cn/textFC/hd.shtml',
                'http://www.weather.com.cn/textFC/hz.shtml','http://www.weather.com.cn/textFC/hn.shtml','http://www.weather.com.cn/textFC/xb.shtml',
                'http://www.weather.com.cn/textFC/xn.shtml','http://www.weather.com.cn/textFC/gat.shtml']
        lst =[]
        for url in urls:
            res = self.urlsourse(url)
            lst += self.parseurl(res)
        self.writerdata(lst)

if __name__ == '__main__':
    s = Sprider()
    s.main()
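Why the index == 0 special case? A minimal sketch with a hypothetical table mimicking the assumed structure of the weather pages, where the province cell spans rows via rowspan:
from bs4 import BeautifulSoup

html = '''
<table>
  <tr><td rowspan="2">广东</td><td>广州</td><td>26</td></tr>
  <tr><td>深圳</td><td>25</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html5lib')
for index, tr in enumerate(soup.find_all('tr')):
    tds = tr.find_all('td')
    # first row: tds[0] is the province (rowspan), the city is tds[1];
    # later rows: the rowspan cell is absent, so tds[0] is the city
    city = tds[1] if index == 0 else tds[0]
    print(index, city.string)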

New Concepts

  • The enumerate() function

Sometimes we need to single out one particular element with a conditional check; enumerate() yields each item's index along with the item itself, which tells us exactly which element needs special handling (this is the index == 0 case in the weather example above).
trs = [1,2,3]
for index, tr in enumerate(trs):
    print(index, tr)
Output:
0 1
1 2
2 3

lst = [4,2,25,35,24,39]

for index,n in enumerate(lst):
    if index == 0:
        n = 100
    print(index,n)
Output:
0 100
1 2
2 25
3 35
4 24
5 39
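enumerate() also takes an optional second argument, start, which sets the first index (a standard library feature, added here as a small aside):
lst = ['a','b','c']
for index, item in enumerate(lst, start=1):
    print(index, item)
Output:
1 a
2 b
3 c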