3.1 内省Excel工作表
.XLS
import sys
from xlrd import open_workbook
input_file = sys.argv[1]
workbook = open_workbook(input_file)
print('Number of worksheets:', workbook.nsheets)
for worksheet in workbook.sheets():
print("Worksheet name:", worksheet.name, "\tRow:", worksheet.nrows, \
"\tColumns:", worksheet.ncols)
-
xldr
模块不再支持.xlsx格式,xlwt
模块不支持.xlsx格式。 -
workbook.nsheets
,worksheet.name
,worksheet.nrows
,worksheet.ncols
属性分别储存了文件内工作表个数、工作表名、工作表行数、工作表列数。
.XLSX
尝试使用openpyxl
模块读取.xlsx格式,但其并不支持.xls格式。
import sys
from openpyxl import load_workbook
input_file = sys.argv[1]
wb = load_workbook(input_file, read_only=True, data_only=True)
print(f"Number of worksheets: {len(wb.sheetnames)}")
for worksheet in wb.worksheets:
print("Worksheet name:", worksheet.title , "\tRow:", worksheet.max_row, \
"\tColumns:", worksheet.max_column)
wb.close()
There are several flags that can be used in
load_workbook
.
data_only
controls whether cells with formulae have either the formula (default) or the value stored the last time Excel read the sheet.
keep_vba
controls whether any Visual Basic elements are preserved or not (default). If they are preserved they are still not editable.
load_workbook
一定要添加read_only=True
的参数,否则会报错>>>UserWarning: Unknown extension is not supported and will be removed
。与普通工作簿不同,只读工作簿将使用延迟加载,结尾一定要有wb.close()
。相应的,
wb.sheetnames
属性储存了文件内所有工作表名,wb.active
属性,会获取当前活动工作表,worksheet.title
,worksheet.max_row
,worksheet.max_column
属性则分别储存了工作表名、工作表行数、工作表列数。只写工作簿
wb = Workbook(write_only=True)
与普通工作簿不同,新创建的只写工作簿不包含任何工作表;必须使用该create_sheet()
方法专门创建工作表。常规
openpyxl.worksheet.worksheet.Worksheet
已被更快的替代品openpyxl.worksheet._write_only.WriteOnlyWorksheet
. 当您想转储大量数据时,请确保已安装lxml。在只写工作簿中,写入行只能用
append()
,写单元格只能用from openpyxl.cell import WriteOnlyCell
。在任意位置读写不能使用cell()
或iter_rows()
。例如:ws.append([‘Hello world!’, 3.14, None])
,使用列来写入。只写文档能够导出无限量的数据(甚至超过 Excel 实际可以处理的数据量),同时将内存使用量保持在 10Mb 以下。
只写工作簿只能保存一次。之后每次尝试储存工作簿或 append() 到现有工作表都会引发
openpyxl.utils.exceptions.WorkbookAlreadySaved
异常。在实际单元格数据之前出现在文件中的所有内容都必须在添加单元格之前创建,因为它必须在此之前写入文件。例如,应在添加单元格之前设置freeze_panes(冻结窗格)。
3.2 处理单个工作表
读写Excel文件
基础读写
.XLS
import sys
from xlrd import open_workbook
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
workbook = open_workbook(input_file)
worksheet = workbook.sheet_by_name('january_2013')
for row_index in range(worksheet.nrows):
for column_index in range(worksheet.ncols):
output_worksheet.write(row_index, column_index, worksheet.cell_value(row_index, column_index))
output_workbook.save(output_file)
-
output_workbook = Workbook()
实例化一个xlwt的Workbook对象,可以将结果写入用于输出的excel文件。 -
output_worksheet = output_workbook.add_sheet('jan_2013_output')
为输出工作簿添加一个工作表,新建文件则为当前活动表。 -
worksheet.cell_value()
接受两个参数,分别为行、列值的索引(从0开始)。 -
output_worksheet.write()
接收三个参数,分别为行、列值的索引(从0开始)和值。
.XLSX
import sys
from openpyxl import load_workbook
from openpyxl import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
worksheet1 = output_workbook.active
worksheet1.title = "jan_2013_output"
input_workbook = load_workbook(input_file, read_only=True)
input_worksheet = input_workbook['january_2013']
for i in range(1, input_worksheet.max_row+1):
for j in range(1, input_worksheet.max_column+1):
worksheet1.cell(row = i, column = j, value = input_worksheet.cell(row = i, column = j).value)
output_workbook.save(output_file)
input_workbook.close()
- 使用
worksheet1 = output_workbook.active
获取当前活动表属性并利用worksheet1.title()
修改名称。 - 如果直接使用
worksheet1 = output_workbook.create_sheet("jan_2013_output")
则会在“Sheet 1”后新建名为"jan_2013_output"的工作表。 - Cell 对象的 value, row, column 和 coordinate 属性,存放的是该单元格的所保存的值,行列索引(从1开始)和位置信息(如 'A1')。也可以通过 sheet 的 cell 方法直接指定行(row)与列(column),来获取单元格 Cell 对象。
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, sheet_name='january_2013')
writer = pd.ExcelWriter(output_file)
data_frame.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()
pandas.read_excel(io, sheet_name=0, header=0, names=None, index_col=None, usecols=None, squeeze=False, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=None, thousands=None, comment=None, skipfooter=0, convert_float=None, mangle_dupe_cols=True, storage_options=None)
-
pd.read_excel()
方法读取excel文件中的某工作表。 - 通过
writer
变量储存pd.ExcelWriter(output_file)
,用于将 DataFrame 对象写入 Excel 工作表的类,使用data_frame.to_excel()
将对象写入 Excel 工作表。 - 由于pandas1.3.0以上的版本使用openpyxl模块读取.xlsx文件,因此同样会报错
UserWarning: Unknown extension is not supported and will be removed
,并且不能添加read_only=True
参数来防止报错,但对输出结果没有影响。
格式化日期数据
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
for row_index in range(worksheet.nrows):
row_list_output = []
for col_index in range(worksheet.ncols):
if worksheet.cell_type(row_index, col_index) == 3:
date_cell = xldate_as_tuple(worksheet.cell_value(row_index, col_index), workbook.datemode)
date_cell = date(*date_cell[0: 3]).strftime('%m/%d/%y')
row_list_output.append(date_cell)
output_worksheet.write(row_index, col_index, date_cell)
else:
non_date_cell = worksheet.cell_value(row_index, col_index)
row_list_output.append(non_date_cell)
output_worksheet.write(row_index, col_index, non_date_cell)
output_workbook.save(output_file)
-
xldate_as_tuple
函数可以将Excel中代表日期时间的数值转化为元组,只要成为元组便可以提取出具体时间元素,并格式化为不同格式。 -
worksheet.cell_type(i, j)
储存了单元格类型,具体类型数字为下表:
Type symbol | Type number | Python value |
---|---|---|
XL_CELL_EMPTY | 0 | empty string '' |
XL_CELL_TEXT | 1 | a Unicode string |
XL_CELL_NUMBER | 2 | float |
XL_CELL_DATE | 3 | float |
XL_CELL_BOOLEAN | 4 | int; 1 means TRUE, 0 means FALSE |
XL_CELL_ERROR | 5 | int representing internal Excel codes; for a text representation, refer to the supplied dictionary error_text_from_code |
XL_CELL_BLANK | 6 | empty string ''. Note: this type will appear only when open_workbook(..., formatting_info=True) is used. |
-
xldate_as_tuple(worksheet.cell_value(row_index, col_index), workbook.datemode)
中workbook.datemode
参数是必须的,用以确定日期基于1900还是1904年并转化为元组(year, month, day, hour, minute, nearest_second)。 - 元组或者列表前面加
*
,就可以将列表或元组中的元素拆分成独立元素,不用进行遍历。
.XLSX
import sys
from openpyxl import load_workbook
from openpyxl import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.active
output_worksheet.title = 'jan_2013_output'
input_workbook = load_workbook(input_file, read_only=True)
input_worksheet = input_workbook['january_2013']
for i in range(1, input_worksheet.max_row+1):
for j in range(1, input_worksheet.max_column+1):
if input_worksheet.cell(row=i, column=j).is_date:
date_cell = input_worksheet.cell(row=i, column=j).value.strftime('%m/%d/%Y')
output_worksheet.cell(i, j, date_cell)
else:
output_worksheet.cell(row = i, column = j, value = input_worksheet.cell(row = i, column = j).value)
output_workbook.save(output_file)
input_workbook.close()
人麻了=。=,查了好久到底怎么才能让openpyxl自动识别单元格的时间格式,查来查去全是怎么处理时间的问题,这些老外不知道strptime()和strftime()吗怎么比我还菜,最后查看模块文档中cell的属性发现有一个is_date的布尔值属性,绕了半天其实人都给你做好了,气死。
- 使用
cell().is_date
的布尔值来区分日期与其他类型的单元格,使用strftime()
格式化日期,最后添加到输出表格的对应位置。 -
cell.get_type()
函数可以获取单元格类型,数字返回'n'
,字符串则返回's'
,时间则返回'd'
。
查到可以使用
NamedStyle
模块定义一种格式并赋予某些单元格这个格式,当然excel会识别到你用的格式化后的格式(DD/MM/YYYY)但不会识别出这是时间。date_style = NamedStyle(name='datetime', number_format='DD/MM/YYYY') ws.cell(i, j).style = date_style
筛选特定行
行中的值满足某个条件
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
sale_amount_value_index = 3
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
data = []
header = worksheet.row_values(0)
data.append(header)
for row_index in range(1, worksheet.nrows):
sale = worksheet.cell_value(row_index, sale_amount_value_index)
sale_amount = float(str(sale).strip('$').replace(',', ''))
if sale_amount > 1_400.0:
row_list = []
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value(row_index, column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value, workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m%d%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
先通过定义
sale_amount_value_index=3
储存'sale amount'列所在的索引值。在遍历表格中每一行时,新建一个row_list
列表用于储存每一列的值,同时取出每个'sale amount'列中单元格的值,转化为浮点数储存在sale_amount
中,因为想筛选出大于$1400.00的行,使用这个变量作为if语句的验证条件。在if语句中,使用for遍历满足条件的行的每一项(列),并且获取该列单元格的值和类型,检查单元格类型是否为日期,是则将其格式化为
'%m%d%Y'
格式添加至row_list
,不是则直接添加。而后通过if语句判断
row_list
是否为空,并将不为空的列添加至data
列表中。(但我发现作者搞复杂了,可以直接在if sale_amount > 1400.0:
后新建row_list
,在for循环结束后添加至data
列表,省去了判断列表是否为空的步骤。因此我直接修改代码简化。)最后再次使用
for
对data
列表本身和内部各个列表进行迭代,根据列表内的索引依次写入表格文件,如果直接在前面的for遍历里写入excel的话,会使用原有表格中的行索引值导致出现缺口。
.XLSX
import sys
from openpyxl import WorkBook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_sheet = output_workbook.active
output_sheet.title = 'jan_2013_output'
input_workbook = load_workbook(input_file, read_only=True)
input_sheet = input_workbook['january_2013']
sale_amount_value_index = 4
data = []
header = [cell.value for cell in input_sheet[1]]
data.append(header)
for i in range(2, input_sheet.max_row+1):
sale = float(str(input_sheet.cell(i, sale_amount_value_index).value).strip('$').replace(',', ''))
if sale > 1400.00:
row_list = []
for j in range(1, input_sheet.max_column+1):
if input_sheet.cell(i, j).is_date:
date_cell = input_sheet.cell(i, j).value.strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
cell_value = input_sheet.cell(i, j).value
row_list.append(cell_value)
data.append(row_list)
for i, row_list in enumerate(data):
for j, cell_value in enumerate(row_list):
output_sheet.cell(i+1, j+1, cell_value)
output_workbook.save(output_file)
input_workbook.close()
事实上,在提取'Sale Amount‘值的时候可以直接用
float()
函数将字符串中的货币符号,逗号等非数字内容化掉。因此,第18行可以简化为:sale = float(input_sheet.cell(i, sale_amount_value_index).value)
使用openpyxl模块时,不能像xlrd模块那样,直接用
worksheet.row_values(0)
函数获取首行值,应当使用列表解析形式将首行值存储在header
中。注意xlrd和xlwt行列索引从0开始,openpyxl行列索引从1开始。
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013')
data_frame_value_meets_condition = data_frame[data_frame['Sale Amount'].astype(float) > 1400.0]
writer = pd.ExcelWriter(output_file)
data_frame_value_meets_condition.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()
行中的值属于某个集合
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
important_dates = ['01/24/2013', '01/31/2013']
purchase_date_column_index = 4
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
data = []
header = worksheet.row_values(0)
data.append(header)
for row_index in range(1, worksheet.nrows):
purchase_datetime = xldate_as_tuple(worksheet.cell_value(row_index, purchase_date_column_index),workbook.datemode)
purchase_date = date(*purchase_datetime[0:3]).strftime('%m/%d/%Y')
row_list = []
if purchase_date in important_dates:
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value,workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
if row_list:
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook()
ows = owb.active
ows.title = 'jan_2013_output'
iwb = load_workbook(input_file, read_only=True)
iws = iwb['january_2013']
important_dates = ['01/24/2013', '01/31/2013']
purchase_date_column_index = 5
data = []
header = [cell.value for cell in iws[1]]
data.append(header)
for i in range(2, iws.max_row+1):
pdate = iws.cell(i, purchase_date_column_index).value.strftime('%m/%d/%Y')
if pdate in important_dates:
r_list = []
for j in range(1, iws.max_column+1):
if iws.cell(i, j).is_date:
r_list.append(pdate)
else:
r_list.append(iws.cell(i, j).value)
data.append(r_list)
for i, row in enumerate(data):
for j, value in enumerate(row):
ows.cell(i+1, j+1, value)
owb.save(output_file)
iwb.close()
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
important_dates = ['01/24/2013', '01/31/2013']
data_frame = pd.read_excel(input_file, 'january_2013')
data_frame_value_in_set = data_frame[data_frame['Purchase Date'].isin(important_dates)]
writer = pd.ExcelWriter(output_file)
data_frame_value_in_set.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()
- 使用Pandas处理表格基本流程为读取工作表并储存至
data_frame
变量,然后根据条件筛选,data_frame['Purchase Date'].isin(important_dates)
返回值则为除了列标题以外其他各行的索引值和条件筛选后的布尔值。再创建一个wirter
变量用于储存输出内容与路径pd.ExcelWriter(output_file)
,最后使用to_excel()
函数输出文件,writer, sheet_name='jan_13_output', index=False
参数为输出内容,工作表名和是否保留行索引。结尾一定要有writer.save()
否则python无法保存并关闭文件。 - 条件筛选依旧可以使用
data_frame.loc[(data_frame['Purchase Date'].isin(important_dates)), :]
选中满足条件的行。
行中值匹配特定模式
.XLS件
import re
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
pattern = re.compile(r'(?P<my_pattern>^J.*)')
customer_name_column_index = 1
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
data = []
header = worksheet.row_values(0)
data.append(header)
for row_index in range(1, worksheet.nrows):
if pattern.search(worksheet.cell_value(row_index, customer_name_column_index)):
row_list = []
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value, workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX件
import re
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook()
ows = owb.active
ows.title = 'jan_2013_output'
iwb = load_workbook(input_file, read_only=True)
iws = iwb['january_2013']
pattern = re.compile(r'(?P<my_pattern>^J.*)')
data = []
header = [cell.value for cell in iws[1]]
data.append(header)
for i in range(2, iws.max_row+1):
if pattern.search(iws.cell(i, 2).value):
r_list = []
for j in range(1, iws.max_column+1):
cell_value = iws.cell(i, j).value
if iws.cell(i, j).is_date:
date_value = cell_value.strftime('%m/%d/%Y')
r_list.append(date_value)
else:
r_list.append(cell_value)
data.append(r_list)
for i, row in enumerate(data):
for j, cell_value in enumerate(row):
ows.cell(i+1, j+1, cell_value)
owb.save(output_file)
iwb.close()
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013')
data_frame_value_matches_pattern1 = data_frame[data_frame['Customer Name'].str.startswith('J')]
data_frame_value_matches_pattern = data_frame_value_matches_pattern1.copy()
data_frame_value_matches_pattern['Purchase Date'] = data_frame_value_matches_pattern1['Purchase Date'].dt.strftime('%m/%d/%Y')
writer = pd.ExcelWriter(output_file)
data_frame_value_matches_pattern.to_excel(writer, sheet_name='jan_2013_output', index=False)
writer.save()
- 匹配模式本身不难,格式化时间花费了我很多时间来完成。值得注意的是,如果将某一列处理后重新赋值给本身,会出现
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
错误,==目前理解为,pandas会重复引用这个值对本身赋值==,因此创建一个副本引用原始dataframe便可以解决问题,此外,使用pd.options.mode.chained_assignment = None
将链式赋值关闭也可以解决问题。
选取特定列
列索引值
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
my_columns = [1, 4]
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
data = []
for row_index in range(worksheet.nrows):
row_list = []
for column_index in my_columns:
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value,workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook()
ows = owb.active
ows.title = 'jan_2013_output'
iwb = load_workbook(input_file, read_only=True)
iws = iwb['january_2013']
my_columns = [2, 5]
data = []
for i in range(1, iws.max_row+1):
r_list = []
for j in my_columns:
cell_value = iws.cell(i, j).value
if iws.cell(i, j).is_date:
date_value = cell_value.strftime('%m/%d/%Y')
r_list.append(date_value)
else:
r_list.append(cell_value)
data.append(r_list)
for i, row in enumerate(data):
for j, cell_value in enumerate(row):
ows.cell(i+1, j+1, cell_value)
owb.save(output_file)
iwb.close()
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
my_columns = [1, 4]
data_frame = pd.read_excel(input_file, 'january_2013')
data_frame_column_by_index = data_frame.iloc[:, my_columns]
writer = pd.ExcelWriter(output_file)
data_frame_column_by_index.to_excel(writer, sheet_name='jan_2013_output', index=False)
writer.save()
列标题
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('jan_2013_output')
my_columns = ['Customer ID', 'Purchase Date']
with open_workbook(input_file) as workbook:
worksheet = workbook.sheet_by_name('january_2013')
data = [my_columns]
header_list = worksheet.row_values(0)
header_index_list = []
for header_index in range(len(header_list)):
if header_list[header_index] in my_columns:
header_index_list.append(header_index)
for row_index in range(1,worksheet.nrows):
row_list = []
for column_index in header_index_list:
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value,workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook()
ows = owb.active
ows.title = 'jan_2013_output'
iwb = load_workbook(input_file, read_only=True)
iws = iwb['january_2013']
data = []
my_columns = ['Customer ID', 'Purchase Date']
data.append(my_columns)
header = [cell.value for cell in iws[1]]
header_indexes = []
for header_index in range(1, iws.max_column+1):
if header[header_index-1] in my_columns:
header_indexes.append(header_index)
for i in range(2, iws.max_row+1):
r_list = []
for j in header_indexes:
cell_value = iws.cell(i, j).value
if iws.cell(i, j).is_date:
date_value = cell_value.strftime("%m/%d/%Y")
r_list.append(date_value)
else:
r_list.append(cell_value)
data.append(r_list)
for i, row in enumerate(data):
for j, cell_value in enumerate(row):
ows.cell(i+1, j+1, cell_value)
owb.save(output_file)
iwb.close()
- openpyxl的难点在于行列索引值与excel一致,从1开始,因此获取列号需要注意是否正确。
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, 'january_2013', index_col=None)
data_frame_column_by_name = data_frame.loc[:, ['Customer ID', 'Purchase Date']]
writer = pd.ExcelWriter(output_file)
data_frame_column_by_name.to_excel(writer, sheet_name='jan_13_output', index=False)
writer.save()
3.3 读取工作簿中的所有工作表
在所有工作表中筛选行
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('filtered_rows_all_worksheets')
sales_column_index = 3
threshold = 2000.0
first_worksheet = True
with open_workbook(input_file) as workbook:
data = []
for worksheet in workbook.sheets():
if first_worksheet:
header_row = worksheet.row_values(0)
data.append(header_row)
first_worksheet = False
for row_index in range(1,worksheet.nrows):
row_list = []
sale_amount = worksheet.cell_value(row_index, sales_column_index)
sale_amount = float(sale_amount)
if sale_amount > threshold:
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value,workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
if row_list:
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook(write_only=True)
ows = owb.create_sheet(title='filtered_rows_all_worksheets')
iwb = load_workbook(input_file, read_only=True)
first_sheet = True
sc_column_index = 4
threshold = 2000.0
for sheet in iwb.worksheets:
if first_sheet:
header = [cell.value for cell in sheet[1]]
ows.append(header)
first_sheet = False
for i in range(2, sheet.max_row+1):
sale_amount = float(sheet.cell(i, sc_column_index).value)
if sale_amount > threshold:
r_list = []
for j in range(1, sheet.max_column+1):
cell_value = sheet.cell(i, j).value
if sheet.cell(i, j).is_date:
date_value = cell_value.strftime('%m/%d/%Y')
r_list.append(date_value)
else:
r_list.append(cell_value)
ows.append(r_list)
owb.save(output_file)
iwb.close()
- 在这个部分使用了只写型工作簿,直接使用
append()
函数将各列添加到输出工作表内。值得注意的是,小心if语句结束时for语句的位置,否则可能只遍历了第一个工作表就结束了。
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, sheet_name=None, index_col=None)
row_output = []
for worksheet_name, data in data_frame.items():
row_output.append(data[data['Sale Amount'].astype(float) > 2000.0])
filtered_rows = pd.concat(row_output, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
filtered_rows.to_excel(writer, sheet_name='sale_amount_gt2000', index=False)
writer.save()
-
data_frame=pd.read_excel(input_file, sheet_name=None, index_col=None)
参数sheet_name=None
可以读取所有工作表。 -
concat()
方法用于连接两个或多个数组。该方法不会改变现有的数组,而仅仅会返回被连接数组的一个副本。代码中将所有数据框连接成为一个数据框,axis
参数用于设置方法,0表示垂直堆叠,1表示并排平行堆叠。 - 返回
data_frame
如下(为了看清格式,在输出内容的每个字典键值对后加了换行符),pandas将每个工作表和所有内容作为键值对储存在了一个字典里:
{'january_2013': Customer ID Customer Name Invoice Number Sale Amount Purchase Date
0 1234 John Smith 100-0002 1200 2013-01-01
1 2345 Mary Harrison 100-0003 1425 2013-01-06
2 3456 Lucy Gomez 100-0004 1390 2013-01-11
3 4567 Rupert Jones 100-0005 1257 2013-01-18
4 5678 Jenny Walters 100-0006 1725 2013-01-24
5 6789 Samantha Donaldson 100-0007 1995 2013-01-31,'february_2013': Customer ID Customer Name Invoice Number Sale Amount Purchase Date
0 9876 Daniel Farber 100-0008 1115 2013-02-02
1 8765 Laney Stone 100-0009 1367 2013-02-08
2 7654 Roger Lipney 100-0010 2135 2013-02-15
3 6543 Thomas Haines 100-0011 1346 2013-02-17
4 5432 Anushka Vaz 100-0012 1560 2013-02-21
5 4321 Harriet Cooper 100-0013 1852 2013-02-25,'march_2013': Customer ID Customer Name Invoice Number Sale Amount Purchase Date
0 1234 John Smith 100-0014 1350 2013-03-04
1 8765 Tony Song 100-0015 1167 2013-03-08
2 2345 Mary Harrison 100-0016 1789 2013-03-17
3 6543 Rachel Paz 100-0017 2042 2013-03-22
4 3456 Lucy Gomez 100-0018 1511 2013-03-28
5 4321 Susan Wallace 100-0019 2280 2013-03-30}
在所有工作表中选取特定列
.XLS
import sys
from xlrd import open_workbook
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('selected_columns_all_worksheets')
my_columns = ['Customer Name', 'Sale Amount']
first_worksheet = True
with open_workbook(input_file) as workbook:
data = [my_columns]
index_of_cols_to_keep = []
for worksheet in workbook.sheets():
if first_worksheet:
header = worksheet.row_values(0)
for column_index in range(len(header)):
if header[column_index] in my_columns:
index_of_cols_to_keep.append(column_index)
first_worksheet = False
for row_index in range(1, worksheet.nrows):
row_list = []
for column_index in index_of_cols_to_keep:
cell_value = worksheet.cell_value(row_index, column_index)
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook(write_only=True)
ows = owb.create_sheet(title='selected_columns_all_worksheets')
iwb = load_workbook(input_file, read_only=True)
my_columns = ['Customer Name', 'Sale Amount']
header_indexes = []
ows.append(my_columns)
first_sheet = True
for sheet in iwb.worksheets:
if first_sheet:
header = [cell.value for cell in sheet[1]]
for col_index in range(1, sheet.max_column+1):
if header[col_index-1] in my_columns:
header_indexes.append(col_index)
first_sheet = False
for i in range(2, sheet.max_row+1):
r_list = []
for j in header_indexes:
r_list.append(sheet.cell(i, j).value)
ows.append(r_list)
owb.save(output_file)
iwb.close()
- 通过设置
first_sheet=True
存储布尔值从而仅获取第一个表格的列标题,然后获取所需列对应的列索引值,最后根据索引值将表格内容添加到输出表格。
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
data_frame = pd.read_excel(input_file, sheet_name=None, index_col=None)
column_output = []
for data in data_frame.values():
column_output.append(data.loc[:, ['Customer Name', 'Sale Amount']])
all_column_output = pd.concat(column_output, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
all_column_output.to_excel(writer, sheet_name='selected_columns_all_worksheets', index=False)
writer.save()
3.4 在EXCEL工作簿中读取一组工作表
在一组工作表中筛选特定行
.XLS
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('set_of_worksheets')
my_sheets = [0,1]
threshold = 1900.0
sales_column_index = 3
first_worksheet = True
with open_workbook(input_file) as workbook:
data = []
for sheet_index in range(workbook.nsheets):
if sheet_index in my_sheets:
worksheet = workbook.sheet_by_index(sheet_index)
if first_worksheet:
header_row = worksheet.row_values(0)
data.append(header_row)
first_worksheet = False
for row_index in range(1, worksheet.nrows):
sale_amount = worksheet.cell_value(row_index, sales_column_index)
if sale_amount > threshold:
row_list = []
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value,workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
- .xls如果不处理时间格式,输出文件中日期将会被转化为数字。
.XLSX
import sys
from openpyxl import Workbook, load_workbook
input_file = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook(write_only=True)
ows = owb.create_sheet(title='set_of_worksheets', index=None)
iwb = load_workbook(input_file, read_only=True)
my_sheets = [0, 1]
threshold = 1900.0
sales_column_index = 4
first_worksheet = True
for sheet_index, stname in enumerate(iwb.iwb.worksheets):
if sheet_index in my_sheets:
my_sheet = iwb.worksheets[sheet_index]
if first_worksheet:
header = [cell.value for cell in my_sheet[1]]
ows.append(header)
first_worksheet = False
for i in range(2, my_sheet.max_row+1):
sales_value = float(my_sheet.cell(i, sales_column_index).value)
if sales_value > threshold:
r_list = []
for j in range(1, my_sheet.max_column+1):
cell_value = my_sheet.cell(i, j).value
r_list.append(cell_value)
ows.append(r_list)
owb.save(output_file)
iwb.close()
- openpyxl中,
wb.worksheets[]
可以通过从0开始的索引值读取工作表,而wb[sheet_name]
通过工作表名读取。 - 主要是卡在了第14-16行,遍历和储存工作表部分一直有误。
Pandas
import pandas as pd
import sys
input_file = sys.argv[1]
output_file = sys.argv[2]
my_sheets = [0,1]
threshold = 1900.0
data_frame = pd.read_excel(input_file, sheet_name=my_sheets, index_col=None)
row_list = []
for worksheet_name, data in data_frame.items():
row_list.append(data[data['Sale Amount'].astype(float) > threshold])
filtered_rows = pd.concat(row_list, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
filtered_rows.to_excel(writer, sheet_name='set_of_worksheets', index=False)
writer.save()
- Pandas读取excel时,
sheet_name
既可以使用字符串形式的工作表名,也可以使用从0开始的索引值,None则表示所有工作表。
3.5 处理多个工作簿
工作表计数以及每个工作表中的行列计算
.XLS
import glob
import os
import sys
from xlrd import open_workbook
input_directory = sys.argv[1]
workbook_counter = 0
for input_file in glob.glob(os.path.join(input_directory, '*.xls*')):
workbook = open_workbook(input_file)
print('Workbook: {}'.format(os.path.basename(input_file)))
print('Number of worksheets: {}'.format(workbook.nsheets))
for worksheet in workbook.sheets():
print('Worksheet name:', worksheet.name, '\tRows:',\
worksheet.nrows, '\tColumns:', worksheet.ncols)
workbook_counter += 1
print('Number of Excel workbooks: {}'.format(workbook_counter))
.XLSX
import glob
import os
import sys
from openpyxl import load_workbook
input_directory = sys.argv[1]
workbook_counter = 0
for input_file in glob.glob(os.path.join(input_directory, r"*.xlsx")):
iwb = load_workbook(input_file, read_only=True)
print(f"Workbook: {os.path.basename(input_file)}")
print(f"Number of worksheets: {len(iwb.worksheets)}")
for worksheet in iwb.worksheets:
details = f"Worksheet name: {worksheet.title}, \t"
details += f"Rows: {worksheet.max_row}, \t"
details += f"Columns: {worksheet.max_column}"
print(details)
workbook_counter += 1
print(f"Number of Excel workbooks: {workbook_counter}.")
从多个工作簿中连接数据
.XLS
import glob
import os
import sys
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook
input_folder = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('all_data_all_workbooks')
data = []
first_worksheet = True
for input_file in glob.glob(os.path.join(input_folder, '*.xls*')):
print (os.path.basename(input_file))
with open_workbook(input_file) as workbook:
for worksheet in workbook.sheets():
if first_worksheet:
header_row = worksheet.row_values(0)
data.append(header_row)
first_worksheet = False
for row_index in range(1,worksheet.nrows):
row_list = []
for column_index in range(worksheet.ncols):
cell_value = worksheet.cell_value(row_index,column_index)
cell_type = worksheet.cell_type(row_index, column_index)
if cell_type == 3:
date_cell = xldate_as_tuple(cell_value, workbook.datemode)
date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
row_list.append(date_cell)
else:
row_list.append(cell_value)
data.append(row_list)
for list_index, output_list in enumerate(data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
.XLSX
import glob
import os
import sys
from openpyxl import Workbook, load_workbook
input_folder = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook(write_only=True)
ows = owb.create_sheet(title='all_data_all_workbooks', index=None)
input_files = glob.glob(os.path.join(input_folder, r"*.xlsx*"))
first_sheet = True
for iwb_path in input_files:
print(os.path.basename(iwb_path))
iwb = load_workbook(iwb_path, read_only=True)
for iws in iwb.worksheets:
if first_sheet:
header = [cell.value for cell in iws[1]]
ows.append(header)
first_sheet = False
for i in range(2, iws.max_row+1):
r_list = []
for j in range(1, iws.max_column+1):
cell_value = iws.cell(i, j).value
r_list.append(cell_value)
ows.append(r_list)
iwb.close()
owb.save(output_file)
- 一定注意
iwb.close()
的位置,防止显式关闭提前导致无法继续遍历,靠后会导致打开多个表格,无法正确关闭。
Pandas
import pandas as pd
import glob
import os
import sys
input_path = sys.argv[1]
output_file = sys.argv[2]
all_workbooks = glob.glob(os.path.join(input_path,'*.xls*'))
data_frames = []
for workbook in all_workbooks:
all_worksheets = pd.read_excel(workbook, sheet_name=None, index_col=None)
for worksheet_name, data in all_worksheets.items():
data_frames.append(data)
all_data_concatenated = pd.concat(data_frames, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
all_data_concatenated.to_excel(writer, sheet_name='all_data_all_workbooks', index=False)
writer.save()
为每个工作簿和工作表计算总数和均值
.XLS
import glob
import os
import sys
from xlrd import open_workbook
from xlwt import Workbook
input_folder = sys.argv[1]
output_file = sys.argv[2]
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('sums_and_averages')
all_data = []
sales_column_index = 3
header = ['workbook', 'worksheet', 'worksheet_total', 'worksheet_average',\
'workbook_total', 'workbook_average']
all_data.append(header)
for input_file in glob.glob(os.path.join(input_folder, '*.xls*')):
with open_workbook(input_file) as workbook:
list_of_totals = []
list_of_numbers = []
workbook_output = []
for worksheet in workbook.sheets():
total_sales = 0
number_of_sales = 0
worksheet_list = []
worksheet_list.append(os.path.basename(input_file))
worksheet_list.append(worksheet.name)
for row_index in range(1,worksheet.nrows):
try:
total_sales += float(str(worksheet.cell_value(row_index, sales_column_index)).strip('$').replace(',',''))
number_of_sales += 1.
except:
total_sales += 0.
number_of_sales += 0.
average_sales = '%.2f' % (total_sales / number_of_sales)
worksheet_list.append(total_sales)
worksheet_list.append(float(average_sales))
list_of_totals.append(total_sales)
list_of_numbers.append(float(number_of_sales))
workbook_output.append(worksheet_list)
workbook_total = sum(list_of_totals)
workbook_average = sum(list_of_totals)/sum(list_of_numbers)
for list_element in workbook_output:
list_element.append(workbook_total)
list_element.append(workbook_average)
all_data.extend(workbook_output)
for list_index, output_list in enumerate(all_data):
for element_index, element in enumerate(output_list):
output_worksheet.write(list_index, element_index, element)
output_workbook.save(output_file)
Pandas
import glob
import os
import sys
from openpyxl import Workbook, load_workbook
input_folder = sys.argv[1]
output_file = sys.argv[2]
owb = Workbook(write_only=True)
ows = owb.create_sheet(title='sums_and_averages', index=None)
sales_column_index = 4
header = ['workbook', 'worksheet', 'worksheet_total', 'worksheet_average',\
'workbook_total', 'workbook_average']
ows.append(header)
input_files = glob.glob(os.path.join(input_folder, "*.xlsx*"))
for file_path in input_files:
iwb = load_workbook(file_path, read_only=True)
list_of_totals = []
list_of_numbers = []
workbook_output = []
for iws in iwb.worksheets:
total_sales = 0
number_of_sales = 0
worksheet_list = []
worksheet_list.append(os.path.basename(file_path))
worksheet_list.append(iws.title)
for i in range(2, iws.max_row+1):
try:
total_sales += float(iws.cell(i, sales_column_index).value)
number_of_sales += 1
except:
total_sales += 0
number_of_sales += 0
average_sales = total_sales / number_of_sales
worksheet_list.append(total_sales)
worksheet_list.append(average_sales)
list_of_totals.append(total_sales)
list_of_numbers.append(float(number_of_sales))
workbook_output.append(worksheet_list)
workbook_total = sum(list_of_totals)
workbook_average = workbook_total / sum(list_of_numbers)
for list_element in workbook_output:
list_element.append(workbook_total)
list_element.append(workbook_average)
ows.append(list_element)
iwb.close()
owb.save(output_file)
- 搞清楚要添加的元素,一个一个添加进工作表的列表,然后添加进工作表所在的大列表,最后算出总值和平均值之后再遍历大列表将值添加进去并存入ows。
Pandas
import pandas as pd
import glob
import os
import sys
input_path = sys.argv[1]
output_file = sys.argv[2]
all_wbs = glob.glob(os.path.join(input_path, "*.xlsx*"))
data_frames = []
for wb in all_wbs:
all_wss = pd.read_excel(wb, sheet_name=None, index_col=None)
wb_total_sales = []
wb_number_of_sales = []
ws_data_frame = []
wss_data_frame = None
wb_data_frame = None
for ws_name, data in all_wss.items():
total_sales = pd.DataFrame(float(value) for value in data.loc[:, "Sale Amount"]).sum()
number_of_sales = len(data.loc[:, "Sale Amount"])
average_sales = total_sales / number_of_sales
wb_total_sales.append(total_sales)
wb_number_of_sales.append(number_of_sales)
data = {'Workbook': os.path.basename(wb),
'Worksheet': ws_name,
'Worksheet_total': total_sales,
'Worksheet_average': average_sales}
ws_data_frame.append(pd.DataFrame(data, columns=['Workbook', 'Worksheet', 'Worksheet_total', 'Worksheet_average']))
wss_data_frame = pd.concat(ws_data_frame, axis=0, ignore_index=True)
wb_total = pd.DataFrame(wb_total_sales).sum()
wb_total_number_of_sales = pd.DataFrame(wb_number_of_sales).sum()
wb_average = wb_total / wb_total_number_of_sales
wb_stats = {'Workbook': os.path.basename(wb),
'Workbook_total': wb_total,
'Workbook_average': wb_average}
wb_stats = pd.DataFrame(wb_stats, columns=['Workbook', 'Workbook_total', 'Workbook_average'])
wb_data_frame = pd.merge(wss_data_frame, wb_stats, on='Workbook', how='left')
data_frames.append(wb_data_frame)
all_data_concatenated = pd.concat(data_frames, axis=0, ignore_index=True)
writer = pd.ExcelWriter(output_file)
all_data_concatenated.to_excel(writer, sheet_name='sums_and_averages', index=False)
writer.save()