Python Study Notes 10 (CSV Files)

Created by westfallon on 8/20

The files used in these notes are in the exercise_data folder.

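Reading a plain file in Python

Traditional approach (four steps)

Specify the path
Open the file with open()
Process the file
Close the resource

Example (open input.txt and print its contents)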

input_path = "input.txt"  # specify the path
file_reader = open(input_path, 'r')  # open the file
for row in file_reader:  # process the file line by line
    print(row)
file_reader.close()  # close the resource

# Output: 代亚群是世界上最好看的人！

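with statement (three steps)

Specify the path
Open the file with with
Process the file

Advantage: the file is closed automatically, and the code is shorter

Example (same behavior as above)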

input_path = "input.txt"
with open(input_path, 'r') as file_reader:  # open the file with with
    for row in file_reader:  # process the file
        print(row)  # print the line itself; row.split(',') would print a list instead

# Output: 代亚群是世界上最好看的人！
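
To make the "closed automatically" point concrete, here is a minimal sketch. It creates its own small sample file in a temporary folder (the file contents are made up for illustration) and checks the reader's closed attribute inside and after the with block.

```python
import os
import tempfile

# Hypothetical sample file, created here so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
input_path = os.path.join(tmp_dir, "input.txt")
with open(input_path, "w", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

lines = []
with open(input_path, "r", encoding="utf-8") as file_reader:
    for row in file_reader:
        lines.append(row.rstrip("\n"))  # drop the trailing newline
    closed_inside = file_reader.closed  # still open inside the block

closed_after = file_reader.closed  # closed once the block has ended
print(lines, closed_inside, closed_after)
```

The same guarantee holds if the loop raises an exception, which is the main reason to prefer with over a manual close() call.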

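Writing to a plain file in Python (learn it alongside reading)

Traditional approach: four steps, just like reading a file

Specify the path
Open the file with open()
Write to the file
Close the resource

Example (given the list values, write the list to output.txt)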

values = ['a', 'b', 'c', 'd', 'e']
output_path = "output.txt"  # specify the path
file_writer = open(output_path, 'w')  # open the file with open()
file_writer.write(str(values))  # write the list's string form to the file
file_writer.close()  # close the resource

# output.txt contains: ['a', 'b', 'c', 'd', 'e']

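with statement (three steps)

Specify the path
Open the file with with
Write the content

Example (same values as above, written one at a time)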

values = ['a', 'b', 'c', 'd', 'e']
output_path = "output.txt"
with open(output_path, 'w') as file_writer:
    for value in values:
        file_writer.write(value + ',')  # each value is followed by a comma

# output.txt contains: a,b,c,d,e,
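
Writing value + ',' in a loop leaves a trailing comma and no line ending. As a possible alternative, the sketch below uses the standard csv module to write the same values as one CSV row. The temporary output location is an assumption made only so the example runs anywhere.

```python
import csv
import os
import tempfile

values = ['a', 'b', 'c', 'd', 'e']
# Hypothetical location: a temporary folder instead of the working directory.
output_path = os.path.join(tempfile.mkdtemp(), "output.csv")

# newline='' lets the csv module control line endings itself.
with open(output_path, 'w', newline='') as output_file:
    csv.writer(output_file).writerow(values)

# Read the file back to see exactly what was written.
with open(output_path, 'r', newline='') as input_file:
    written = input_file.read()
print(repr(written))
```

csv.writer also quotes values that contain commas, which a hand-written join would not do.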

Reading a CSV file with pandas (important)

import pandas as pd

input_path = "input.csv"
data_frame = pd.read_csv(input_path)
print(data_frame)
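
To see what read_csv returns without needing a file on disk, here is a small sketch that reads from an in-memory string. The rows are made up; only the column names are taken from these notes.

```python
import io

import pandas as pd

# Made-up rows; the column names match the ones used later in these notes.
csv_text = """Supplier Name,Invoice Number,Part Number,Cost,Purchase Date
Supplier X,001-1001,2341,$500.00,1/20/14
Supplier Y,50-9501,7009,$250.00,1/30/14
Supplier Z,920-4803,3321,$615.00,2/3/14
"""

data_frame = pd.read_csv(io.StringIO(csv_text))
print(data_frame.shape)          # (number of rows, number of columns)
print(list(data_frame.columns))  # the header row becomes the column labels
print(data_frame.head(2))        # the first two data rows
```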

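Three ways to filter specific rows from the input file

1. A value in the row satisfies a condition

Traditional method (you should be able to read it)

After opening the file, check the rows one at a time in a for loop; if a row satisfies the condition, write it to the output file

Example (read supplier_data.csv and write the rows where supplier == "Supplier Z" or cost > 600.0 to output.csv)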

# You do not need to be able to write this code, but you should be able to read it
import csv

input_path = "supplier_data.csv"
output_path = "output.csv"
with open(input_path, 'r', newline='') as input_file:
    with open(output_path, 'w', newline='') as output_file:
        file_reader = csv.reader(input_file)
        file_writer = csv.writer(output_file)
        header = next(file_reader)  # the first row is the header
        file_writer.writerow(header)
        for row_list in file_reader:
            supplier = str(row_list[0]).strip()
            cost = str(row_list[3]).lstrip('$')  # remove the leading dollar sign
            if supplier == "Supplier Z" or float(cost) > 600.0:
                file_writer.writerow(row_list)
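
The same loop can be tried on in-memory data. The sketch below is only an illustration: the rows are invented, and parse_cost is a hypothetical helper (not part of the csv module) that also removes thousands separators, which float() would otherwise reject.

```python
import csv
import io

def parse_cost(text):
    """Turn a string such as '$1,250.00' into the float 1250.0 (hypothetical helper)."""
    return float(text.strip().lstrip('$').replace(',', ''))

# Made-up rows; the quoted field contains a comma inside the value.
csv_text = """Supplier Name,Invoice Number,Part Number,Cost,Purchase Date
Supplier X,001-1001,2341,"$1,250.00",1/20/14
Supplier Y,50-9501,7009,$250.00,1/30/14
Supplier Z,920-4803,3321,$90.00,2/3/14
"""

file_reader = csv.reader(io.StringIO(csv_text))
header = next(file_reader)  # skip the header row
kept = [row for row in file_reader
        if row[0].strip() == "Supplier Z" or parse_cost(row[3]) > 600.0]
for row in kept:
    print(row)
```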

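pandas method (you should be able to use it)

Read the file into a data_frame and operate on it directly with loc
loc stands for selection by label. It is an indexer used with square brackets and takes two selectors: the first picks rows (index), the second picks columns, and : means "all". The row selector can also be a boolean condition, as in the example below

Example (same requirement as above)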

import pandas as pd

input_path = "supplier_data.csv"
output_path = "output.csv"
data_frame = pd.read_csv(input_path)
# Remove the dollar sign so Cost can be compared as a number
data_frame['Cost'] = data_frame['Cost'].str.strip('$').astype(float)
# str.contains('Z') matches any supplier name that contains the letter Z
output_data_frame = data_frame.loc[(data_frame['Supplier Name'].str.contains('Z')) |
                                   (data_frame['Cost'] > 600.0), :]
output_data_frame.to_csv(output_path, index=False)
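
For readers who want to try loc without the data file, here is a minimal sketch on a tiny hand-made DataFrame (the values are invented). It uses an exact == comparison instead of str.contains, which is one way to match "Supplier Z" and nothing else.

```python
import pandas as pd

# Invented data with the same two columns the filter uses.
data_frame = pd.DataFrame({
    'Supplier Name': ['Supplier X', 'Supplier Y', 'Supplier Z'],
    'Cost': ['$500.00', '$750.00', '$90.00'],
})
data_frame['Cost'] = data_frame['Cost'].str.strip('$').astype(float)

is_z = data_frame['Supplier Name'] == 'Supplier Z'  # exact match on the name
is_expensive = data_frame['Cost'] > 600.0
# Rows where either condition is True, all columns.
selected = data_frame.loc[is_z | is_expensive, :]
print(selected)
```

Naming the two conditions first keeps the loc expression short and makes each condition easy to check on its own.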

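2. A value in the row belongs to a set

Traditional method (you should be able to read it)

After reading the data, check the rows one at a time in a for loop, using if .. in .. to test whether a value belongs to the set

Example (keep the rows of supplier_data.csv whose date is in ['1/20/14', '1/30/14'])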

import csv

important_date = ['1/20/14', '1/30/14']
input_path = "supplier_data.csv"
output_path = "output.csv"
with open(input_path, 'r', newline='') as input_file:
    with open(output_path, 'w', newline='') as output_file:
        file_reader = csv.reader(input_file)
        file_writer = csv.writer(output_file)
        header = next(file_reader)
        file_writer.writerow(header)
        for row_list in file_reader:
            if row_list[4] in important_date:  # column 4 is the purchase date
                file_writer.writerow(row_list)

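pandas method (you should be able to use it)

Again use data_frame and loc, with isin to test whether a value belongs to the set

Example (same requirement as above)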

# The key point of this example is how to use isin
import pandas as pd

important_date = ['1/20/14', '1/30/14']
input_path = "supplier_data.csv"
output_path = "output.csv"
data_frame = pd.read_csv(input_path)
output_data_frame = data_frame.loc[
    data_frame['Purchase Date'].isin(important_date), :]
output_data_frame.to_csv(output_path, index=False)
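
A small self-contained sketch of isin follows, using invented rows. It also shows the complement with ~, which selects the rows whose date is not in the set.

```python
import pandas as pd

# Invented rows for illustration.
data_frame = pd.DataFrame({
    'Invoice Number': ['001-1001', '50-9501', '920-4803'],
    'Purchase Date': ['1/20/14', '1/30/14', '2/3/14'],
})
important_date = ['1/20/14', '1/30/14']

in_set = data_frame['Purchase Date'].isin(important_date)  # boolean Series
matching = data_frame.loc[in_set, :]       # dates in the set
not_matching = data_frame.loc[~in_set, :]  # ~ inverts the mask
print(matching)
print(not_matching)
```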

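3. A value in the row matches a pattern

The traditional method is not required; the pandas method follows

Again use data_frame and loc, with str.startswith to test whether a string begins with a given prefix

Example (keep the rows of supplier_data.csv whose Invoice Number starts with 001-)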

# The key point of this example is str.startswith
import pandas as pd

input_path = "supplier_data.csv"
output_path = "output.csv"
data_frame = pd.read_csv(input_path)
output_data_frame = data_frame.loc[
    data_frame['Invoice Number'].str.startswith("001-"), :]
output_data_frame.to_csv(output_path, index=False)  # index=False, as in the other examples
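
The sketch below tries str.startswith and its counterpart str.endswith on invented invoice numbers read from an in-memory string. Passing dtype=str for the column is a precaution assumed here, so the values stay strings and the .str methods are available.

```python
import io

import pandas as pd

# Invented rows for illustration.
csv_text = """Invoice Number,Cost
001-1001,500
001-1002,750
50-9501,250
"""

# Keep Invoice Number as text so the .str accessor can be used on it.
data_frame = pd.read_csv(io.StringIO(csv_text), dtype={'Invoice Number': str})

starts = data_frame.loc[data_frame['Invoice Number'].str.startswith('001-'), :]
ends = data_frame.loc[data_frame['Invoice Number'].str.endswith('01'), :]
print(starts)
print(ends)
```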

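Selecting specific columns

Traditional method (select by column position)

After reading the data, go through the rows one at a time in a for loop and copy the values at the chosen column positions into the output

Example (select columns 0 and 3 of supplier_data.csv)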

import csv

my_columns = [0, 3]
input_path = "supplier_data.csv"
output_path = "output.csv"
with open(input_path, 'r', newline='') as input_file:
    with open(output_path, 'w', newline='') as output_file:
        file_reader = csv.reader(input_file)
        file_writer = csv.writer(output_file)
        for row_list in file_reader:  # the header row is handled like any other row
            row_list_output = []
            for column in my_columns:
                row_list_output.append(row_list[column])
            file_writer.writerow(row_list_output)

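pandas method (select by column position)

Read the data into a data_frame and select with iloc
iloc stands for selection by position: it picks data by integer row and column position (row n, column n) and accepts only integer positions
Difference between iloc and loc: iloc selects by integer position, while loc selects by label (loc also accepts a boolean condition, as in the row-filtering examples above)
Note: with iloc, a slice x:y is closed on the left and open on the right, so it includes x and excludes y. With loc, a label slice includes both endpoints

Example (same requirement as above)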

import pandas as pd

my_columns = [0, 3]
input_path = "supplier_data.csv"
output_path = "output.csv"
data_frame = pd.read_csv(input_path)
output_data_frame = data_frame.iloc[:, my_columns]  # all rows, columns 0 and 3
output_data_frame.to_csv(output_path, index=False)
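
The slicing difference between iloc and loc is easy to get wrong, so here is a minimal sketch on an invented DataFrame with the default integer index. The same slice 0:2 selects two rows with iloc and three rows with loc.

```python
import pandas as pd

# Invented data; the default index labels are 0, 1, 2, 3.
data_frame = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [1, 2, 3, 4],
    'c': [5, 6, 7, 8],
})

# iloc: positions, end excluded (rows 0 and 1, columns 0 and 1).
by_position = data_frame.iloc[0:2, 0:2]
# loc: labels, end included (rows labelled 0 to 2, columns 'a' to 'b').
by_label = data_frame.loc[0:2, 'a':'b']

print(by_position)
print(by_label)
```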

pandas method (select by column name) (important, must be mastered)

Read the data into a data_frame and select with loc

Example (select the Supplier Name and Cost columns of supplier_data.csv)

import pandas as pd

my_columns = ['Supplier Name', 'Cost']
input_path = "supplier_data.csv"
output_path = "output.csv"
data_frame = pd.read_csv(input_path)
output_data_frame = data_frame.loc[:, my_columns]  # all rows, the named columns
output_data_frame.to_csv(output_path, index=False)

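Deleting rows or columns

The traditional method is not required; the pandas method follows

Read the data into a data_frame and use drop to delete specific rows or columns
drop takes a label or a list of labels. By default (axis=0) they are row index labels, which are integers when the index is the default one. To delete columns, pass axis=1 or use the columns= argument with the column names. drop returns a new data_frame and leaves the original unchanged

Example (delete rows 0, 1, 2, 16, 17, 18 of supplier_data_unnecessary_header_footer.csv)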

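import pandas as pd

input_path = "supplier_data_unnecessary_header_footer.csv"
output_path = "output.csv"
data_frame = pd.read_csv(input_path, header=None)  # no header: rows are labelled 0, 1, 2, ...
data_frame = data_frame.drop([0, 1, 2, 16, 17, 18])  # remove the unwanted top and bottom rows
data_frame.columns = data_frame.iloc[0]  # the first remaining row holds the column names
data_frame = data_frame.drop(3)  # that row (label 3) is now duplicated as the header, so remove it
data_frame.to_csv(output_path, index=False)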
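
To show the row and column forms of drop side by side, here is a short sketch on invented data. It also checks that drop returns a new DataFrame rather than changing the original.

```python
import pandas as pd

# Invented rows for illustration.
data_frame = pd.DataFrame({
    'Supplier Name': ['Supplier X', 'Supplier X', 'Supplier Y', 'Supplier Z'],
    'Part Number': [2341, 2341, 7009, 3321],
    'Cost': [500.0, 500.0, 250.0, 615.0],
})

# Default axis=0: the list holds row index labels.
without_rows = data_frame.drop([0, 3])
# Columns are dropped by name with columns=... (or equivalently axis=1).
without_column = data_frame.drop(columns=['Part Number'])
same_result = data_frame.drop(['Part Number'], axis=1)

print(without_rows)
print(without_column)
print(data_frame.shape)  # the original is unchanged
```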

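Adding column names with pandas

When to use: some data has no header row, so the column names must be added by hand
How: pass the column names directly to pd.read_csv when reading the file

Example (give supplier_data_no_header_row.csv its column names)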

import pandas as pd

input_path = "supplier_data_no_header_row.csv"
output_path = "output.csv"
my_columns = ['Supplier Name', 'Invoice Number', 'Part Number', 'Cost', 'Purchase Date']
# header=None: the file has no header row; names= supplies the column names
data_frame = pd.read_csv(input_path, header=None, names=my_columns)
data_frame.to_csv(output_path, index=False)
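
As a quick check of what names= changes, the sketch below reads the same headerless text twice, once with header=None only and once with names= as well. The two data rows are invented for illustration.

```python
import io

import pandas as pd

# Invented headerless data: the first line is already a data row.
csv_text = (
    "Supplier X,001-1001,2341,$500.00,1/20/14\n"
    "Supplier Y,50-9501,7009,$250.00,1/30/14\n"
)
my_columns = ['Supplier Name', 'Invoice Number', 'Part Number', 'Cost', 'Purchase Date']

# Without names=, pandas labels the columns 0, 1, 2, ...
default_labels = pd.read_csv(io.StringIO(csv_text), header=None)
# With names=, the given labels are used and no data row is lost.
named = pd.read_csv(io.StringIO(csv_text), header=None, names=my_columns)

print(list(default_labels.columns))
print(list(named.columns))
print(len(named))
```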