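XPath in practice, plus list comprehension usage
1. XPath
Alibaba really goes overboard: the id and class name of the tags in its HTML change dynamically! Impressive!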
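Regex-style matching in XPath
That completes the pairing of regex-style matching with XPath, but XPath still has pitfalls large and small.
For example, with the HTML below, fetching span/text() directly returns only a single ".", so you have to loop over the span tags nested inside.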
a = """
<div mxa="feedsb1:a" class="feedspN clearfix"><div style="width: 20%;" class="feedspO feedspU "><div mxa="feedsb1:b" class="feedspP">消耗(元) <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E9%80%89%E5%AE%9A%E6%97%B6%E9%97%B4%E5%86%85%E7%9A%84%E4%BF%A1%E6%81%AF%E6%B5%81%E5%B9%BF%E5%91%8A%E6%80%BB%E8%8A%B1%E8%B4%B9%E3%80%82" id="mx_299"></i></div><div mxa="feedsb1:c" class="feedspR"><span class="xh-highlight"><span class="fontsize-20 font-tahoma bold">4,309</span>.<span class="fontsize-14 font-tahoma bold">72</span></span></div></div><div style="width: 20%;" class="feedspO "><div mxa="feedsb1:b" class="feedspP">展现量 <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E9%80%89%E5%AE%9A%E6%97%B6%E9%97%B4%E5%86%85%E7%9A%84%E4%BF%A1%E6%81%AF%E6%B5%81%E5%B9%BF%E5%91%8A%E5%B1%95%E7%8E%B0%E6%80%BB%E9%87%8F%E3%80%82" id="mx_300"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">155,219</span></span></div></div><div style="width: 20%;" class="feedspO "><div mxa="feedsb1:b" class="feedspP">点击量 <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E9%80%89%E5%AE%9A%E6%97%B6%E9%97%B4%E5%86%85%E7%9A%84%E4%BF%A1%E6%81%AF%E6%B5%81%E5%B9%BF%E5%91%8A%E7%82%B9%E5%87%BB%E6%80%BB%E9%87%8F%E3%80%82" id="mx_301"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">5,996</span></span></div></div><div style="width: 20%;" class="feedspO "><div mxa="feedsb1:b" class="feedspP">千次展现成本(元) <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E5%8D%83%E6%AC%A1%E5%B1%95%E7%8E%B0%E6%88%90%E6%9C%AC%20%3D%20%E6%B6%88%E8%80%97%20%2F%20%E5%B1%95%E7%8E%B0%E9%87%8F%20%2A%201000%E3%80%82" id="mx_302"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">27</span>.<span class="fontsize-14 font-tahoma bold">77</span></span></div></div><div style="width: 20%;" class="feedspO "><div 
mxa="feedsb1:b" class="feedspP" data-spm-anchor-id="a2et4.11816906.88888888.i3.56841f56QHkB09">点击成本(元) <i class="feeds_ feedspQ" mx-view="feeds/gallery/mx-popover/index?content=%E7%82%B9%E5%87%BB%E6%88%90%E6%9C%AC%20%3D%20%E6%B6%88%E8%80%97%20%2F%20%E7%82%B9%E5%87%BB%E9%87%8F%E3%80%82" id="mx_303"></i></div><div mxa="feedsb1:c" class="feedspR"><span><span class="fontsize-20 font-tahoma bold">0</span>.<span class="fontsize-14 font-tahoma bold">72</span></span></div></div></div>
from lxml import etree
s = etree.HTML(a)
a = [i.xpath("span/text()") for i in s.xpath("//div[starts-with(@class, 'feedsp')]/div[1]/div[2]/span")][0]
print(a)  # text() of the second-level span tags
>>>
['4,309', '72']
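When looping with XPath, the expression used in the second-level loop must not start with /; it has to stay relative to the current node.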
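To see why the leading slash matters, here is a minimal sketch on a made-up two-row snippet (the markup and the names `sample`, `rows`, `relative`, and `absolute` are assumptions for illustration, not the page above). A relative path is evaluated from each row, while a path starting with // goes back to the document root.

```python
from lxml import etree

# Hypothetical two-row sample, not the page from the example above.
sample = """
<div>
  <p class="row"><span>1</span><span>2</span></p>
  <p class="row"><span>3</span></p>
</div>
"""
tree = etree.HTML(sample)
rows = tree.xpath("//p[@class='row']")

# Relative path: evaluated from each row, so only that row's spans match.
relative = [row.xpath("span/text()") for row in rows]

# Leading //: evaluated from the document root, so every row sees all spans.
absolute = [row.xpath("//span/text()") for row in rows]

print(relative)  # [['1', '2'], ['3']]
print(absolute)  # [['1', '2', '3'], ['1', '2', '3']]
```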
For regex-style matching, XPath 1.0 has no ends-with; only starts-with and contains are available, so match like this:
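# driver1 is assumed to be an already-created Selenium WebDriver instance.
js = """
document.evaluate("//a[contains(@id,'adStrategyDkx')]", document).iterateNext().click();
// Pass a function rather than a code string, so the single quotes inside the XPath need no escaping.
setTimeout(function () {
    document.evaluate("//a[contains(@href,'exportOverProductCampaignReportList')]", document).iterateNext().click();
}, 5000);
"""
driver1.execute_script(js)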
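If you really do need an ends-with test in XPath 1.0, a common workaround is to compare the tail of the string using substring() and string-length(). The sketch below uses lxml and made-up ids; `sample`, `suffix`, and `expr` are names chosen for illustration only.

```python
from lxml import etree

# Hypothetical markup; the ids are made up for illustration.
sample = """
<div>
  <a id="mx_12_export">export</a>
  <a id="export_mx_13">other</a>
</div>
"""
tree = etree.HTML(sample)

suffix = "_export"
# substring(s, start) keeps the tail of s; XPath string positions start at 1.
expr = (
    "//a[substring(@id, string-length(@id) - string-length($suffix) + 1)"
    " = $suffix]/text()"
)
# lxml binds $suffix from the keyword argument of the same name.
matches = tree.xpath(expr, suffix=suffix)
print(matches)  # ['export']
```

Passing the suffix as an XPath variable avoids quoting problems when the value itself contains quotes.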
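2. List comprehensions
Code without a comprehension
a = (['2,621', '35'], ['210,852'], ['2,398'], ['12', '43'], ['1', '09'])
for i in a:
    if len(i) == 1:
        # Integer value: only strip the thousands separator
        print(i[0].replace(',', ''))
    else:
        # Integer and decimal parts: join with '.', then strip the separator
        print((i[0] + '.' + i[1]).replace(',', ''))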
>>>
2621.35
210852
2398
12.43
1.09
Using a list comprehension
a = (['2,621', '35'], ['210,852'], ['2,398'], ['12', '43'], ['1', '09'])
a = [i[0].replace(',', '') if len(i) == 1 else (i[0] + '.' + i[1]).replace(',', '') for i in a]
print(a)
>>>
['2621.35', '210852', '2398', '12.43', '1.09']
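As a side note, this particular case can probably be written without the conditional at all, because '.'.join() leaves a one-element list unchanged. A minimal sketch on the same data (the names `rows`, `cleaned`, and `numbers` are mine, not from the code above):

```python
rows = (['2,621', '35'], ['210,852'], ['2,398'], ['12', '43'], ['1', '09'])

# '.'.join() on a one-element list returns that element unchanged,
# so the len(i) == 1 branch is not needed.
cleaned = ['.'.join(parts).replace(',', '') for parts in rows]
print(cleaned)  # ['2621.35', '210852', '2398', '12.43', '1.09']

# Convert to numbers in a second pass if arithmetic is needed.
numbers = [float(value) for value in cleaned]
print(numbers)  # [2621.35, 210852.0, 2398.0, 12.43, 1.09]
```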
Using a list comprehension to replace a field and rebuild each tuple
import re
a =[ ('636710425400', 318.09, 1, 5, '2021-04-08'), ('616873233546;\n617383035002;\n585098358905', 62.54, 1, 5, '2021-04-08'), ('39856905008', 120.5, 1, 5, '2021-04-08'), ('610474183283', 84.6, 1, 5, '2021-04-08'), ('625046034602;\n625333579394', 93.01, 1, 5, '2021-04-08'), ('608727393051;\n616119541298;\n610494565658;\n633320454564', 122.55, 1, 5, '2021-04-08')]
# Keep only the first ID when the first field holds several IDs separated by ';'
ccc = [(re.findall(r"(\d+);", str(i[0]))[0], i[1], i[2], i[3], i[4]) if ';' in str(i[0]) else i for i in a]
for i in ccc:
    print(i)
>>>
('636710425400', 318.09, 1, 5, '2021-04-08')
('616873233546', 62.54, 1, 5, '2021-04-08')
('39856905008', 120.5, 1, 5, '2021-04-08')
('610474183283', 84.6, 1, 5, '2021-04-08')
('625046034602', 93.01, 1, 5, '2021-04-08')
('608727393051', 122.55, 1, 5, '2021-04-08')
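The same rebuild can be sketched without a regex, assuming the first field is always a string: str.split(';') returns the whole string when there is no ';', so no condition is needed. The two rows below are copied from the data above; `rows` and `firsts` are illustrative names.

```python
rows = [
    ('636710425400', 318.09, 1, 5, '2021-04-08'),
    ('616873233546;\n617383035002;\n585098358905', 62.54, 1, 5, '2021-04-08'),
]

# Unpack the first field as ids and the remaining fields as rest,
# then rebuild the tuple with only the first ID kept.
firsts = [(ids.split(';')[0], *rest) for ids, *rest in rows]
for row in firsts:
    print(row)
# ('636710425400', 318.09, 1, 5, '2021-04-08')
# ('616873233546', 62.54, 1, 5, '2021-04-08')
```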
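Summary: the for loop goes last, the conditions come before it in logical order, and the value for the first (true) branch goes at the very front.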
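One thing worth keeping apart here is the conditional expression, which sits before the loop, and the filter, which sits after it. A small sketch with made-up numbers (`values`, `clipped`, and `positives` are illustrative names):

```python
values = [3, -1, 4, -5]

# Conditional expression: written before `for`; every item yields an output.
clipped = [v if v >= 0 else 0 for v in values]

# Filter: written after the loop; items failing the test are dropped.
positives = [v for v in values if v >= 0]

print(clipped)    # [3, 0, 4, 0]
print(positives)  # [3, 4]
```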
3. More on XPath
Getting direct child nodes: /
To get the a nodes that are first-level, direct children of li tags:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a')
print(result)
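Note: this matches direct children only; nodes nested two or more levels deep are not matched.
Getting all descendant nodes: //
Get every a link under ul, including grandchildren and deeper: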
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//ul//a')
print(result)
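Since test.html is not shown here, the difference between / and // can be sketched on an inline snippet instead. The markup below is a made-up stand-in, and `child_only` and `descendants` are illustrative names.

```python
from lxml import etree

# Hypothetical stand-in for test.html, which is not shown in this post.
sample = """
<ul>
  <li><a href="link1.html">one</a></li>
  <li><span><a href="link2.html">two</a></span></li>
</ul>
"""
html = etree.HTML(sample)

# / only reaches a nodes that are direct children of li.
child_only = html.xpath('//li/a/@href')
# // also reaches a nodes nested deeper inside li.
descendants = html.xpath('//li//a/@href')

print(child_only)   # ['link1.html']
print(descendants)  # ['link1.html', 'link2.html']
```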
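Getting the parent node: ..
Select the a node whose href attribute is link4.html, then step up to its parent and read the parent's class attribute: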
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)
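Besides .., the parent can also be reached with the parent:: axis or, on the Python side, with lxml's getparent(). A sketch on made-up markup (the snippet and variable names are assumptions, not the contents of test.html):

```python
from lxml import etree

# Hypothetical stand-in for test.html.
sample = """
<ul>
  <li class="item-0"><a href="link3.html">third</a></li>
  <li class="item-1"><a href="link4.html">fourth</a></li>
</ul>
"""
html = etree.HTML(sample)

via_dots = html.xpath('//a[@href="link4.html"]/../@class')
via_axis = html.xpath('//a[@href="link4.html"]/parent::*/@class')

# getparent() returns the parent element object; get() reads one attribute.
link = html.xpath('//a[@href="link4.html"]')[0]
via_api = link.getparent().get('class')

print(via_dots, via_axis, via_api)  # ['item-1'] ['item-1'] item-1
```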
Getting text: text()
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/a/text()')
print(result)
Getting attributes: @
Suppose we want the href attribute of every a node under every li node:
from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li/a/@href')
print(result)
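When both the link text and the href are needed, one option is to select the a elements once and read both values in Python, which ties in nicely with the comprehension section. A sketch on made-up markup (`sample` and `links` are illustrative names):

```python
from lxml import etree

# Hypothetical stand-in for test.html.
sample = """
<ul>
  <li class="item-0"><a href="link1.html">first item</a></li>
  <li class="item-1"><a href="link2.html">second item</a></li>
</ul>
"""
html = etree.HTML(sample)

# Element.get() reads an attribute; Element.text is the node's leading text.
links = {a.get('href'): a.text for a in html.xpath('//li/a')}
print(links)  # {'link1.html': 'first item', 'link2.html': 'second item'}
```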
Matching multi-valued attributes: contains()
On real pages, an attribute of a node often carries several values:
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[@class="li"]/a/text()')
print(result)
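The approach above returns []. The class attribute of the li node has two values, li and li-first, so the exact comparison @class="li" does not match.
This is where contains() comes in: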
from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)
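With contains(), the first argument is the attribute (here @class) and the second is the value to look for; the node matches as long as the attribute's string contains that value.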
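One caveat: contains() is a plain substring test, so "li" would also match a class such as list-last. A commonly used XPath 1.0 idiom pads the attribute with spaces and looks for the whole token. The second li in this sketch is made up to show the difference; `loose` and `strict` are illustrative names.

```python
from lxml import etree

text = '''
<ul>
  <li class="li li-first"><a href="link.html">first item</a></li>
  <li class="list-last"><a href="other.html">last item</a></li>
</ul>
'''
html = etree.HTML(text)

# Substring test: "list-last" also contains "li".
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Token test: pad with spaces so only the whole class name "li" matches.
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)   # ['first item', 'last item']
print(strict)  # ['first item']
```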
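Matching on multiple attributes
Sometimes a node can only be pinned down by several attributes together, so they all have to match at once. Join the conditions with the and operator, for example: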
from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)
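For comparison, XPath also has an or operator. The sketch below adds two made-up li nodes to the example so the two operators give different results; `both` and `either` are illustrative names.

```python
from lxml import etree

text = '''
<ul>
  <li class="li li-first" name="item"><a href="link.html">first item</a></li>
  <li class="li" name="other"><a href="link2.html">second item</a></li>
  <li class="plain" name="item"><a href="link3.html">third item</a></li>
</ul>
'''
html = etree.HTML(text)

# and: both conditions must hold for the same li node.
both = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
# or: one matching condition is enough.
either = html.xpath('//li[contains(@class, "li") or @name="item"]/a/text()')

print(both)    # ['first item']
print(either)  # ['first item', 'second item', 'third item']
```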