使用pandas的read_excel()读取xls文件时遇到一个奇怪的报错:
VauleError: Excel file format cannot be determined, you must specify an engine manually
使用excel工具可以正常打开,后来发现这个文件本身是xml格式的,使用编辑器打开如下:
<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
<Worksheet ss:Name="Report">
<ss:Table>
<Row>
<Cell>
<Data ss:Index="1">Hello!World!</Data></Cell>
</Cell>
<Cell>
<Data ss:Index="3">Skip!</Data></Cell>
</Cell>
</Row>
</ss:Table>
</Worksheet>
</Workbook>
读取这个xml文件,存成pandas的dataframe,由于我的xml文件存在使用index实现合并单元格的问题,所以我先读取了行数和列数声明了dataframe,再对其进行赋值
import pandas
import xml.etree.ElementTree as ET
import os
def parse_xml(file_path, file_name):
tree = ET.parse(os.path.join(file_path, file_name))
worksheet = tree.find('{urn:schemas-microsoft-com:office:spreadsheet}Worksheet')
table = worksheet[0]
# 读取行数和列数
row_num = 0
column_num = 0
for row in table:
if row.tag != '{urn:schemas-microsoft-com:office:spreadsheet}Row':
continue
row_num += 1
for cell in row:
if cell.tag != '{urn:schemas-microsoft-com:office:spreadsheet}Cell':
continue
index = int(cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}Index'])
column_num = index if index > column_num else column_num
# 读取数据到dataframe
sheet_data = pd.DataFrame(index=range(row_num), column=range(column_num))
row_index = 0
for row in table:
if row.tag != '{urn:schemas-microsoft-com:office:spreadsheet}Row':
continue
for cell in row:
if cell.tag != '{urn:schemas-microsoft-com:office:spreadsheet}Cell':
continue
for data in cell:
if data.tag == '{urn:schemas-microsoft-com:office:spreadsheet}Data' and data.text is not None:
column_index = int(cell.attrib['{urn:schemas-microsoft-com:office:spreadsheet}Index']) - 1
sheet_data.loc[row_index][column_index] = data.text
row_index += 1
return sheet_data
注意:该dataframe的column列是需要重置的
另外,该dataframe没有数据格式的,可以通过如下方式重置:
float_list =[]
sheet_data[float_list] = sheet_data[float_list].astype('float')
如果结尾处有一些空白行影响到后续处理,可以进行清除:
sheet_data = sheet_data.drop(sheet_data[sheet_data[sheet_data.isna(sheet_data['列名'])]].index)