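Preface: For work I needed to check from Python whether the data in an XML file matches a JSON file, so this post records how I solved it.
0 Before you start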
- An XML file (manager.xml)
<?xml version="1.0" encoding="UTF-8"?>
<package version="1.0.0.0">
    <supported_operating_systems>
        <!-- release is read with uname -r -->
        <supported_operating_system platform="64" major="0" minor="0" sr="OEM" release="3.10.0-957.el7.x86_64 " name="Red Hat Enterprise Linux Server release 7.6 (Maipo)"/>
        <supported_operating_system platform="64" major="0" minor="0" sr="OEM" release="3.10.0-957.el7.x86_64" name="CentOS Linux release 7.6.1810 (Core)"/>
    </supported_operating_systems>
    <component_name_cn>Driver_Redhat7.6</component_name_cn>
    <component_support_system>
        <supported_operating_system platform="64" major="0" minor="0" sr="OEM" release="3.10.0-957.el7.x86_64" name="Red Hat Enterprise Linux Server release 7.6 (Maipo)"/>
        <supported_operating_system platform="64" major="0" minor="0" sr="OEM" release="3.10.0-957.el7.x86_64" name="CentOS Linux release 7.6.1810 (Core)"/>
    </component_support_system>
</package>
- A JSON file (new_operating.json)
{
    "Redhat7.6": [
        {
            "platform": "64",
            "major": "0",
            "minor": "0",
            "sr": "OEM",
            "release": "3.10.0-957.el7.x86_64",
            "name": "Red Hat Enterprise Linux Server release 7.6 (Maipo)"
        },
        {
            "platform": "64",
            "major": "0",
            "minor": "0",
            "sr": "OEM",
            "release": "3.10.0-957.el7.x86_64",
            "name": "CentOS Linux release 7.6.1810 (Core)"
        }
    ]
}
1 Source code
# coding=utf-8
import xml.dom.minidom
import operator
import json

dom = xml.dom.minidom.parse('manager.xml')
root = dom.documentElement
# Collect the supported_operating_system child nodes under both
# supported_operating_systems and component_support_system so they can be compared.
supported_operating_systems = root.getElementsByTagName('supported_operating_systems')[0].getElementsByTagName(
    'supported_operating_system')
component_support_system = root.getElementsByTagName('component_support_system')[0].getElementsByTagName(
    'supported_operating_system')
# The sample manager.xml above stores this value in <component_name_cn>,
# so that is the tag read here.
component_description_cn = root.getElementsByTagName('component_name_cn')[0].childNodes[0]
file_path = r'new_operating.json'
def xml_self_compare(supported_operating_systems, component_support_system):
    lis1 = []
    lis2 = []
    for i in supported_operating_systems:
        lis1.append(sorted(i.attributes.items()))
    for j in component_support_system:
        lis2.append(sorted(j.attributes.items()))
    return operator.eq(lis1, lis2)

# a = xml_self_compare(supported_operating_systems, component_support_system)
# print(a)
# If the return value is False, supported_operating_systems and component_support_system in the XML
# do not match; check for stray spaces, entries listed in a different order, or extra entries.
def xml_json_compare(supported_operating_systems, component_description_cn):
    lis3 = []
    lis4 = []
    # component_description_cn is a text node; the text inside the tag is its data attribute
    get_os = component_description_cn.data.split('_')[-1]
    # file_path is the JSON path defined above
    with open(file_path, encoding='utf-8') as fp:
        data = json.load(fp)
    for x in supported_operating_systems:
        lis3.append(sorted(x.attributes.items()))
    for y in data[get_os]:
        lis4.append(sorted(y.items()))
    return operator.eq(lis3, lis4)

# b = xml_json_compare(supported_operating_systems, component_description_cn)
# print(b)
# If the return value is False, the supported_operating_system entries in the JSON file and the
# supported_operating_systems block in the XML do not match; check for stray spaces,
# entries listed in a different order, or extra entries.
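2 Problems encountered along the way
Q1: After reading the relevant data from the XML, I wanted to compare it with the JSON data. Using root.getElementsByTagName and a loop to collect the attribute values gives the following:
os_support = root.getElementsByTagName('supported_operating_systems')[0].getElementsByTagName('supported_operating_system')
for i in os_support:
    print(i.attributes.items())
# Output when reading the XML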
[('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64 '), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)')]
[('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64'), ('name', 'CentOS Linux release 7.6.1810 (Core)')]
#-------------------- separator -------------------------
file_path = r'new_operating.json'
with open(file_path, encoding='utf-8') as fp:
    data = json.load(fp)
print(data)
# Output when reading the JSON
{'Redhat7.6': [{'platform': '64', 'major': '0', 'minor': '0', 'sr': 'OEM', 'release': '3.10.0-957.el7.x86_64', 'name': 'Red Hat Enterprise Linux Server release 7.6 (Maipo)'}, {'platform': '64', 'major': '0', 'minor': '0', 'sr': 'OEM', 'release': '3.10.0-957.el7.x86_64', 'name': 'CentOS Linux release 7.6.1810 (Core)'}]}
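The two code blocks show that the data comes back in different shapes. My first idea was to convert what was read from the XML into the shape of the JSON output and then compare the two dictionaries, but after several attempts I could not get the two sides into the same form. So I turned the problem around: convert the JSON data into a shape close to what the XML read produces, then handle both the same way.
os_support = root.getElementsByTagName('supported_operating_systems')[0].getElementsByTagName('supported_operating_system')
lis1 = []
for i in os_support:
    lis1.append(i.attributes.items())
# XML data after processing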
[[('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64 '), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)')], [('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64'), ('name', 'CentOS Linux release 7.6.1810 (Core)')]]
#---------------------- separator ----------------------
file_path = r'new_operating.json'
lis2 = []
with open(file_path, encoding='utf-8') as fp:
    data = json.load(fp)
for x in data['Redhat7.6']:  # see the full source above for where the Redhat7.6 key comes from
    lis2.append(list(x.items()))
# JSON data after processing
[[('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64'), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)')], [('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64'), ('name', 'CentOS Linux release 7.6.1810 (Core)')]]
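The two lists now appear to have exactly the same format (to the naked eye, at least, haha), so the next step is to compare them; a machine comparison is the most reliable.
import operator
operator.eq(lis1, lis2)
# Returns True (identical) or False (different)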
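The first idea mentioned above, turning each XML element into a dictionary and comparing it with the JSON objects, may also be workable, because dict equality ignores key order. The following is only a minimal sketch under stated assumptions: the XML and JSON are given as shortened inline strings instead of the real files, and XML_TEXT, JSON_TEXT and xml_entries_as_dicts are names made up for this example.

```python
import json
import xml.dom.minidom

# Hypothetical inline samples standing in for manager.xml and new_operating.json.
XML_TEXT = """<package>
  <supported_operating_systems>
    <supported_operating_system platform="64" sr="OEM" name="CentOS Linux release 7.6.1810 (Core)"/>
  </supported_operating_systems>
</package>"""

JSON_TEXT = """{"Redhat7.6": [
  {"name": "CentOS Linux release 7.6.1810 (Core)", "sr": "OEM", "platform": "64"}
]}"""


def xml_entries_as_dicts(xml_text, parent_tag, child_tag):
    """Return one {attribute: value} dict per child_tag element under parent_tag."""
    root = xml.dom.minidom.parseString(xml_text).documentElement
    parent = root.getElementsByTagName(parent_tag)[0]
    return [dict(node.attributes.items()) for node in parent.getElementsByTagName(child_tag)]


xml_dicts = xml_entries_as_dicts(XML_TEXT, "supported_operating_systems", "supported_operating_system")
json_dicts = json.loads(JSON_TEXT)["Redhat7.6"]

# dict equality ignores key order, so the attributes do not need to be sorted here.
# The order of the entries in the two lists still matters.
print(xml_dicts == json_dicts)
```

With this shape the attribute order inside each entry no longer matters, although the order of the entries themselves still does.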
Q2: Using zip(lis1, lis2) leads to what I call a "data loss" problem (my own name for it); see the code below.
a = [1, 2, 3]
b = [1, 2, 3, 4]
for i, j in zip(a, b):
    print(i, j)
# Actual output
1 1
2 2
3 3
# Output I intended
1 1
2 2
3 3
4
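My intention was to print all the data from both lists in a single loop to cut down on code, but zip stops as soon as the shorter list is exhausted, so the extra elements of the longer list are never printed, hence the "data loss". I have not found a good solution for this yet, so for now I iterate over the two lists separately.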
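One standard-library option that may fit this case is itertools.zip_longest, which keeps iterating until the longer input is exhausted and pads the shorter one with a fill value. A small sketch using the same two lists (the empty-string fill value is just a choice made for this example):

```python
from itertools import zip_longest

a = [1, 2, 3]
b = [1, 2, 3, 4]

# zip_longest runs until the longer input is exhausted and fills the
# missing positions with fillvalue (None if not specified).
pairs = list(zip_longest(a, b, fillvalue=""))

for i, j in pairs:
    print(i, j)
```

The last pair is ("", 4), so the fourth element of b is no longer dropped.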
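Q3: What if the attributes and their values read from the XML are in a different order? The data itself is identical, but because the order differs operator.eq() returns False, which does not fit the real-world scenario (different people may write the attributes in a different order without the data being wrong, and enforcing a fixed order would be inflexible).
A3: Sort the data read from the XML before comparing. Whatever order each person used, the data is the same, so read it first and then sort it. See the code below.
os_support = root.getElementsByTagName('supported_operating_systems')[0].getElementsByTagName('supported_operating_system')
lis1 = []
lis2 = []
for i in os_support:
    lis1.append(i.attributes.items())
    lis2.append(sorted(i.attributes.items()))
print(lis1)
print(lis2)
# Output of lis1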
[[('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64 '), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)')],[('platform', '64'), ('major', '0'), ('minor', '0'), ('sr', 'OEM'), ('release', '3.10.0-957.el7.x86_64'), ('name', 'CentOS Linux release 7.6.1810 (Core)')]]
# Output of lis2
[[('major', '0'), ('minor', '0'), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)'), ('platform', '64'), ('release', '3.10.0-957.el7.x86_64'), ('sr', 'OEM')], [('major', '0'), ('minor', '0'), ('name', 'CentOS Linux release 7.6.1810 (Core)'), ('platform', '64'), ('release', '3.10.0-957.el7.x86_64'), ('sr', 'OEM')]]
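sorted() puts the data in order (here the (name, value) tuples are ordered by attribute name, character by character, which is ASCII order for these names). Even if the attributes were written in a different order, they end up in the same order after sorting, so the order each person used no longer matters and only the completeness and correctness of the data is checked. The comparison is then run on the sorted lists.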
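To see the effect in isolation, here is a small self-contained sketch. The XML snippet, the os tag and the shortened attribute values are invented for this example; it compares two elements that carry the same attributes written in a different order.

```python
import xml.dom.minidom

# Hypothetical snippet: the same three attributes, written in a different order.
XML_TEXT = """<root>
  <os platform="64" sr="OEM" name="CentOS"/>
  <os name="CentOS" sr="OEM" platform="64"/>
</root>"""

first, second = xml.dom.minidom.parseString(XML_TEXT).getElementsByTagName("os")

# Comparing the raw item lists depends on the order the attributes were written in.
raw_equal = first.attributes.items() == second.attributes.items()
# Sorting each list of (name, value) tuples first removes that dependency.
sorted_equal = sorted(first.attributes.items()) == sorted(second.attributes.items())

print(raw_equal, sorted_equal)
```

Only the sorted comparison is independent of how each person ordered the attributes.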
3 Shortcomings
N1: Should the mismatched values be written out to Excel? Or corrected directly?
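One possible direction for N1, sketched here under assumptions rather than taken from the script above: collect the entries that appear on only one side and write them to a CSV file, which Excel can open. The helper names find_mismatches and write_report, the column layout and the shortened sample data are all made up for this example.

```python
import csv
import io


def find_mismatches(xml_entries, json_entries):
    """Return (source, entry_dict) rows for entries found on one side only."""
    # Each entry is a list of (attribute, value) tuples; sorting it and freezing
    # it into a tuple makes it hashable. Note that sets drop duplicate entries.
    xml_set = {tuple(sorted(entry)) for entry in xml_entries}
    json_set = {tuple(sorted(entry)) for entry in json_entries}
    rows = [("xml_only", dict(entry)) for entry in sorted(xml_set - json_set)]
    rows += [("json_only", dict(entry)) for entry in sorted(json_set - xml_set)]
    return rows


def write_report(rows, stream):
    """Write one CSV row per mismatching entry to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(["source", "name", "release"])
    for source, entry in rows:
        # repr() makes stray whitespace such as a trailing space visible.
        writer.writerow([source, entry.get("name", ""), repr(entry.get("release", ""))])


# Hypothetical, shortened sample data in the list-of-tuples shape used above.
xml_entries = [
    [("name", "Red Hat"), ("release", "3.10.0-957.el7.x86_64 ")],  # trailing space
    [("name", "CentOS"), ("release", "3.10.0-957.el7.x86_64")],
]
json_entries = [
    [("name", "Red Hat"), ("release", "3.10.0-957.el7.x86_64")],
    [("name", "CentOS"), ("release", "3.10.0-957.el7.x86_64")],
]

rows = find_mismatches(xml_entries, json_entries)
report = io.StringIO()  # with a real file: open("report.csv", "w", newline="")
write_report(rows, report)
print(report.getvalue())
```

Correcting the files automatically is left out on purpose, since deciding which side is right probably still needs a person.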
4 Follow-up
2020-11-30: Attribute order was already ignored, but the order of the elements themselves needs to be ignored too; see the code below.
os_support = root.getElementsByTagName('supported_operating_systems')[0].getElementsByTagName('supported_operating_system')
lis1 = []
for i in os_support:
    lis1.append(sorted(i.attributes.items()))
print(lis1)
lis1.sort()
print(lis1)
# First print
[[('major', '0'), ('minor', '0'), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)'), ('platform', '64'), ('release', '3.10.0-957.el7.x86_64'), ('sr', 'OEM')], [('major', '0'), ('minor', '0'), ('name', 'CentOS Linux release 7.6.1810 (Core)'), ('platform', '64'), ('release', '3.10.0-957.el7.x86_64'), ('sr', 'OEM')]]
# Second print
[[('major', '0'), ('minor', '0'), ('name', 'CentOS Linux release 7.6.1810 (Core)'), ('platform', '64'), ('release', '3.10.0-957.el7.x86_64'), ('sr', 'OEM')], [('major', '0'), ('minor', '0'), ('name', 'Red Hat Enterprise Linux Server release 7.6 (Maipo)'), ('platform', '64'), ('release', '3.10.0-957.el7.x86_64'), ('sr', 'OEM')]]
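This covers the following case: some people list Red Hat Enterprise Linux Server release 7.6 (Maipo) on the first line and others on the second. Sorting the outer list once more before comparing means the XML no longer needs a fixed order for either the operating-system entries or their attributes.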
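Both sorting steps can be folded into one small helper and applied to each side before comparing. This is a sketch only: normalize is a name invented here, and the two blocks are shortened, hypothetical stand-ins for the lists built from attributes.items() and from the JSON objects.

```python
def normalize(entries):
    """Sort the attributes inside each entry, then sort the entries themselves."""
    return sorted(sorted(entry) for entry in entries)


# Same data, but both the entries and their attributes are in a different order.
block_a = [
    [("name", "Red Hat"), ("sr", "OEM")],
    [("sr", "OEM"), ("name", "CentOS")],
]
block_b = [
    [("name", "CentOS"), ("sr", "OEM")],
    [("sr", "OEM"), ("name", "Red Hat")],
]

same_raw = block_a == block_b
same_normalized = normalize(block_a) == normalize(block_b)
print(same_raw, same_normalized)
```

The raw lists compare as different, while the normalized ones compare as equal.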
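Note:
sorted: returns a new sorted list built from the elements and leaves the original list untouched
sort: sorts the list in place, permanently changing the order of its elements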
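A short illustration of that difference, using a made-up two-element list:

```python
entries = ["Red Hat", "CentOS"]

copy_sorted = sorted(entries)  # builds a new list; entries is not modified
before_sort = list(entries)    # snapshot taken before the in-place sort

result = entries.sort()        # sorts entries in place and returns None

print(copy_sorted, before_sort, entries, result)
```

Because list.sort() returns None, writing lis1 = lis1.sort() would throw the list away; call lis1.sort() on its own line, as in the code above.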