Using XPath, and how to use list comprehensions

1. XPath

Alibaba's pages are brutal: the id, class and name attributes of the HTML tags change dynamically. Impressive!

Using regular expressions with XPath
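
One way to combine regular expressions with XPath in lxml is the EXSLT regular-expressions extension, which lxml exposes through a namespace. This is an lxml extension rather than part of XPath 1.0 itself. The sketch below is only an illustration: the HTML fragment and its ids are made up, modelled on the `mx_299` style ids that appear in the page.

```python
from lxml import etree

# Namespace URI of the EXSLT regular-expressions extension supported by lxml
EXSLT_RE = "http://exslt.org/regular-expressions"

# Hypothetical fragment with dynamically numbered ids
html = '<div><span id="mx_299">a</span><span id="mx_300">b</span><span id="other">c</span></div>'
root = etree.HTML(html)

# re:test(string, pattern) is true when the pattern matches the string
matched = root.xpath(
    r"//span[re:test(@id, '^mx_\d+$')]/text()",
    namespaces={"re": EXSLT_RE},
)
print(matched)  # ['a', 'b']
```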

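That covers combining regular expressions with XPath, but there are still pitfalls of all sizes when using XPath.

For example, in the HTML below, taking span/text() directly returns only a single ".", so you have to loop over the inner span tags.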

a = """
<div mxa="feedsb1:a" class="feedspN clearfix"><div style="width: 20%;" class="feedspO   feedspU   "><div mxa="feedsb1:b" class="feedspP">消耗(元) <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E9%80%89%E5%AE%9A%E6%97%B6%E9%97%B4%E5%86%85%E7%9A%84%E4%BF%A1%E6%81%AF%E6%B5%81%E5%B9%BF%E5%91%8A%E6%80%BB%E8%8A%B1%E8%B4%B9%E3%80%82" id="mx_299"></i></div><div mxa="feedsb1:c" class="feedspR"><span class="xh-highlight"><span class="fontsize-20 font-tahoma bold">4,309</span>.<span class="fontsize-14 font-tahoma bold">72</span></span></div></div><div style="width: 20%;" class="feedspO  "><div mxa="feedsb1:b" class="feedspP">展现量 <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E9%80%89%E5%AE%9A%E6%97%B6%E9%97%B4%E5%86%85%E7%9A%84%E4%BF%A1%E6%81%AF%E6%B5%81%E5%B9%BF%E5%91%8A%E5%B1%95%E7%8E%B0%E6%80%BB%E9%87%8F%E3%80%82" id="mx_300"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">155,219</span></span></div></div><div style="width: 20%;" class="feedspO  "><div mxa="feedsb1:b" class="feedspP">点击量 <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E9%80%89%E5%AE%9A%E6%97%B6%E9%97%B4%E5%86%85%E7%9A%84%E4%BF%A1%E6%81%AF%E6%B5%81%E5%B9%BF%E5%91%8A%E7%82%B9%E5%87%BB%E6%80%BB%E9%87%8F%E3%80%82" id="mx_301"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">5,996</span></span></div></div><div style="width: 20%;" class="feedspO  "><div mxa="feedsb1:b" class="feedspP">千次展现成本(元) <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E5%8D%83%E6%AC%A1%E5%B1%95%E7%8E%B0%E6%88%90%E6%9C%AC%20%3D%20%E6%B6%88%E8%80%97%20%2F%20%E5%B1%95%E7%8E%B0%E9%87%8F%20%2A%201000%E3%80%82" id="mx_302"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">27</span>.<span class="fontsize-14 font-tahoma bold">77</span></span></div></div><div style="width: 20%;" class="feedspO  
"><div mxa="feedsb1:b" class="feedspP" data-spm-anchor-id="a2et4.11816906.88888888.i3.56841f56QHkB09">点击成本(元) <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E7%82%B9%E5%87%BB%E6%88%90%E6%9C%AC%20%3D%20%E6%B6%88%E8%80%97%20%2F%20%E7%82%B9%E5%87%BB%E9%87%8F%E3%80%82" id="mx_303"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">0</span>.<span class="fontsize-14 font-tahoma bold">72</span></span></div></div></div>



from lxml import etree

s = etree.HTML(a)

# Take the text() of the second-level spans, relative to each outer span
result = [i.xpath("span/text()") for i in s.xpath("//div[starts-with(@class, 'feedsp')]/div[1]/div[2]/span")][0]
print(result)

>>>
['4,309', '72']

When looping with XPath, the second-level expression must not start with /: write it relative to the current node.
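
To illustrate why the leading slash matters, here is a minimal sketch on a simplified, made-up fragment (the `row` class is an assumption, not taken from the real page). A relative path is evaluated from each node in the loop, while a path starting with `//` is evaluated from the document root, so every iteration returns the same full list. `string(.)` is shown as an alternative that concatenates all the text inside a node.

```python
from lxml import etree

# Simplified, hypothetical fragment with nested spans
html = """
<div class="row"><span><span>4,309</span>.<span>72</span></span></div>
<div class="row"><span><span>155,219</span></span></div>
"""
root = etree.HTML(html)
outer_spans = root.xpath("//div[@class='row']/span")

# Relative path: evaluated from each outer span
relative = [s.xpath("span/text()") for s in outer_spans]
# Leading "//": evaluated from the document root on every iteration
absolute = [s.xpath("//span/span/text()") for s in outer_spans]
# string(.) joins all descendant text of the current node
joined = [s.xpath("string(.)") for s in outer_spans]

print(relative)  # [['4,309', '72'], ['155,219']]
print(absolute)  # [['4,309', '72', '155,219'], ['4,309', '72', '155,219']]
print(joined)    # ['4,309.72', '155,219']
```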

For pattern-style matching, XPath 1.0 has no ends-with(); only starts-with() and contains() are available, so the match is written as follows:

    js = """
            document.evaluate("//a[contains(@id,'adStrategyDkx')]", document).iterateNext().click();
            setTimeout(function () {
                document.evaluate("//a[contains(@href,'exportOverProductCampaignReportList')]", document).iterateNext().click();
            }, 5000);
    """

    driver1.execute_script(js)
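
Where a true suffix match is needed, ends-with() can be emulated in XPath 1.0 with substring() and string-length(). The sketch below tries this with lxml; the two ids are made up for the example, and the suffix is passed in as an XPath variable (`$suffix`) through a keyword argument.

```python
from lxml import etree

# Hypothetical links: only the first id really ends with the suffix
html = """
<a id="mx_17_adStrategyDkx" href="/a">strategy</a>
<a id="adStrategyDkx_mx_18" href="/b">other</a>
"""
root = etree.HTML(html)

# Take the last len(suffix) characters of @id and compare them with the suffix
ends_with = (
    "//a[substring(@id, string-length(@id) - string-length($suffix) + 1)"
    " = $suffix]/text()"
)
ends = root.xpath(ends_with, suffix="adStrategyDkx")

# contains() matches the substring anywhere in the attribute
anywhere = root.xpath("//a[contains(@id, $suffix)]/text()", suffix="adStrategyDkx")

print(ends)      # ['strategy']
print(anywhere)  # ['strategy', 'other']
```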

2. List comprehensions

Code without a comprehension

a =(['2,621', '35'],['210,852'],['2,398'],['12', '43'],['1', '09'])

for i in a:
    if len(i)==1:
        print(i[0].replace(',', ''))
    else:
        print(((i[0] + '.' + i[1]).replace(',', '')))


>>>
2621.35
210852
2398
12.43
1.09

Using a list comprehension

a =(['2,621', '35'],['210,852'],['2,398'],['12', '43'],['1', '09'])

a = [i[0].replace(',', '') if len(i) == 1 else (i[0] + '.' + i[1]).replace(',', '') for i in a]
print(a)

>>>
['2621.35', '210852', '2398', '12.43', '1.09']
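
As a possible simplification (not the original approach), `'.'.join` handles both the one-part and the two-part case, so the conditional is not needed for this particular data. The variable names below are chosen for the example.

```python
rows = (['2,621', '35'], ['210,852'], ['2,398'], ['12', '43'], ['1', '09'])

# Joining with '.' leaves a single-element list unchanged,
# then the thousands separators are removed
cleaned = ['.'.join(parts).replace(',', '') for parts in rows]
print(cleaned)  # ['2621.35', '210852', '2398', '12.43', '1.09']
```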

Replacing a field and rebuilding the tuples with a list comprehension

import re
a =[ ('636710425400', 318.09, 1, 5, '2021-04-08'), ('616873233546;\n617383035002;\n585098358905', 62.54, 1, 5, '2021-04-08'), ('39856905008', 120.5, 1, 5, '2021-04-08'), ('610474183283', 84.6, 1, 5, '2021-04-08'), ('625046034602;\n625333579394', 93.01, 1, 5, '2021-04-08'), ('608727393051;\n616119541298;\n610494565658;\n633320454564', 122.55, 1, 5, '2021-04-08')]
ccc = [(re.findall(r"(\d+);", str(i[0]))[0], i[1], i[2], i[3], i[4]) if ';' in str(i[0]) else i for i in a]
for i in ccc:
    print(i)

>>>
('636710425400', 318.09, 1, 5, '2021-04-08')
('616873233546', 62.54, 1, 5, '2021-04-08')
('39856905008', 120.5, 1, 5, '2021-04-08')
('610474183283', 84.6, 1, 5, '2021-04-08')
('625046034602', 93.01, 1, 5, '2021-04-08')
('608727393051', 122.55, 1, 5, '2021-04-08')
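
The same result can probably be reached without a regular expression. The sketch below is an alternative, not the original method: it keeps the text before the first semicolon with `str.split` and rebuilds each tuple with star-unpacking. It uses two rows from the data above, and `rows` and `first_ids` are names chosen for the example.

```python
rows = [
    ('636710425400', 318.09, 1, 5, '2021-04-08'),
    ('616873233546;\n617383035002;\n585098358905', 62.54, 1, 5, '2021-04-08'),
]

# split(';')[0] returns the whole string when there is no semicolon,
# so no conditional is needed; *rest carries the other fields unchanged
first_ids = [(ids.split(';')[0], *rest) for ids, *rest in rows]
for row in first_ids:
    print(row)
# ('636710425400', 318.09, 1, 5, '2021-04-08')
# ('616873233546', 62.54, 1, 5, '2021-04-08')
```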

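Summary: write the for loop last, put the conditions in logical order, and put the result for the first condition at the front.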
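
A small sketch of that ordering, using made-up numbers: a conditional expression (`x if cond else y`) goes before `for` and keeps every element, while a filtering `if` goes after `for` and drops elements. The two can be combined.

```python
values = [3, -1, 0, 8, -5]

# Conditional expression before "for": every element is kept, some are replaced
clamped = [v if v >= 0 else 0 for v in values]
# Filter after "for": elements that fail the test are dropped
positives = [v for v in values if v > 0]
# Both at once: filter first, then choose a label for what remains
labels = ['big' if v > 5 else 'small' for v in values if v > 0]

print(clamped)    # [3, 0, 0, 8, 0]
print(positives)  # [3, 8]
print(labels)     # ['small', 'big']
```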

3. More XPath

Direct children: /
To get the first-level, direct child nodes under the li tags, use the following code:

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)

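Note: this matches direct children only; nodes two or more levels down are not matched.

All descendants: //

Get every a link under ul, including deeper descendants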

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)
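
Since test.html is not shown, here is a self-contained sketch of the difference between `/` and `//`. The markup is a made-up stand-in for that file, with one link nested one level deeper than the other.

```python
from lxml import etree

# Hypothetical stand-in for test.html
html = """
<ul>
  <li class="item-0"><a href="link1.html">first</a></li>
  <li class="item-1"><span><a href="link2.html">second</a></span></li>
</ul>
"""
root = etree.HTML(html)

# "/" only follows direct children: the a inside the span is skipped
direct = root.xpath("//li/a/@href")
# "//" follows descendants at any depth
nested = root.xpath("//ul//a/@href")

print(direct)  # ['link1.html']
print(nested)  # ['link1.html', 'link2.html']
```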

Parent node: ..

Select the a node whose href attribute is link4.html, then get the class of its parent node

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

Getting text: text()

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)

Getting attributes: @

Suppose we want the href attribute of every a node under every li node

from lxml import etree
 
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
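
The parent, text and attribute queries above can be tried on an inline document. The markup below is an assumption standing in for test.html, which is not shown in the post.

```python
from lxml import etree

# Hypothetical stand-in for test.html
html = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link4.html">fourth item</a></li>
</ul>
"""
root = etree.HTML(html)

# ".." steps up to the parent li, then @class reads its attribute
parent_class = root.xpath('//a[@href="link4.html"]/../@class')
# text() returns the text nodes directly inside the a element
texts = root.xpath('//li[@class="item-0"]/a/text()')
# @href returns the attribute values as strings
hrefs = root.xpath('//li/a/@href')

print(parent_class)  # ['item-1']
print(texts)         # ['first item']
print(hrefs)         # ['link1.html', 'link4.html']
```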

Matching multi-valued attributes: contains()

In web pages, an attribute of a node often holds several values

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result) 

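With the expression above the result is []: the class attribute of the li node has two values, li and li-first, and @class="li" is compared against the whole attribute string, so it does not match.
In this case, use contains()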

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

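With contains(), the first argument is the attribute and the second is the value to look for; the node matches as long as the attribute contains that value as a substring.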
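
Because contains() is a plain substring test, "li" also matches a class such as list-item. A common XPath 1.0 idiom for matching one whole class token is to pad the attribute with spaces and search for the padded token. A minimal sketch, where the second li is made up to show the difference:

```python
from lxml import etree

html = """
<ul>
  <li class="li li-first"><a href="link.html">first item</a></li>
  <li class="list-item"><a href="other.html">other item</a></li>
</ul>
"""
root = etree.HTML(html)

# Substring test: "list-item" also contains "li"
loose = root.xpath('//li[contains(@class, "li")]/a/text()')
# Token test: wrap the class list in spaces and look for " li "
exact = root.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)  # ['first item', 'other item']
print(exact)  # ['first item']
```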

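Matching on multiple attributes
Another case is identifying a node by several attributes at once, which means matching them all together. The conditions can be joined with the and operator, as in this example: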

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

XPath reference documentation
