import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
tips = sns.load_dataset('tips')
print(tips.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
dtype: object
print(tips.head())
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
astype() converts data types and works on both Series and DataFrame. The target can be a Python built-in type (str, float, int, complex, bool) or any dtype supported by NumPy.
# Convert the sex column to strings and store it in a new column
tips['sex_str'] = tips['sex'].astype(str)
print(tips.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
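The note above says astype() also accepts NumPy dtypes and works on a whole DataFrame. A minimal sketch of both, using a small hand-built frame (`df` and its values are a hypothetical stand-in for a few rows of tips, so the example runs without downloading the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of the tips data
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'size': [2, 3, 3],
    'sex': ['Female', 'Male', 'Male'],
})

# Series.astype with a NumPy dtype
size_small = df['size'].astype(np.int8)

# DataFrame.astype with a dict converts several columns in one call
converted = df.astype({'total_bill': 'float32', 'size': str})

print(size_small.dtype)
print(converted.dtypes)
```

astype() returns a new object each time, so `df` itself keeps its original dtypes unless the result is assigned back.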
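Converting to numeric types
Some numeric columns use a string such as 'missing' or 'null' in place of missing values, which turns the whole column into a string (object) column.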
# Take an explicit copy so the assignment below modifies only the subset
tips_sub_miss = tips.head(10).copy()
tips_sub_miss.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'
print(tips_sub_miss)
total_bill tip sex smoker day time size sex_str
0 16.99 1.01 Female No Sun Dinner 2 Female
1 missing 1.66 Male No Sun Dinner 3 Male
2 21.01 3.50 Male No Sun Dinner 3 Male
3 missing 3.31 Male No Sun Dinner 2 Male
4 24.59 3.61 Female No Sun Dinner 4 Female
5 missing 4.71 Male No Sun Dinner 4 Male
6 8.77 2.00 Male No Sun Dinner 2 Male
7 missing 3.12 Male No Sun Dinner 4 Male
8 15.04 1.96 Male No Sun Dinner 2 Male
9 14.78 3.23 Male No Sun Dinner 2 Male
print(tips_sub_miss.dtypes)
total_bill object
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
# astype cannot convert the placeholder string 'missing' to float
tips_sub_miss['total_bill'].astype(float)  # ValueError: could not convert string to float: 'missing'
# to_numeric fails as well with its default errors='raise'
pd.to_numeric(tips_sub_miss['total_bill'])  # ValueError: Unable to parse string "missing" at position 1
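Both calls above stop the script when run as written. One way to observe the failures without halting is to wrap each conversion in try/except, sketched here on a hypothetical `bills` Series that mimics the total_bill column:

```python
import pandas as pd

# Hypothetical column mixing numbers with the placeholder string 'missing'
bills = pd.Series([16.99, 'missing', 21.01, 'missing'], dtype=object)

errors_seen = []
conversions = [
    ('astype', lambda s: s.astype(float)),
    ('to_numeric', lambda s: pd.to_numeric(s)),
]
for label, convert in conversions:
    try:
        convert(bills)
    except ValueError as exc:
        # Both conversions raise ValueError on the unparsable string
        errors_seen.append(label)
        print(f'{label} failed: {exc}')
```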
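to_numeric() converts values to a numeric type
Parameters:
- errors: what to do when a value cannot be converted to a number
  - raise: raise an exception (default)
  - coerce: replace unconvertible values with NaN (the right choice here)
  - ignore: give up and return the input unchanged (not suitable here; deprecated since pandas 2.2)
- downcast: after conversion, shrink the result to the smallest numeric dtype that can hold the data, to save memory (default None)
  - integer: smallest signed integer dtype
  - signed: same as integer
  - unsigned: smallest unsigned integer dtype
  - float: smallest float dtype (no smaller than float32)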
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'],
                                            errors='coerce')
print(tips_sub_miss.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
print(tips_sub_miss)
total_bill tip sex smoker day time size sex_str
0 16.99 1.01 Female No Sun Dinner 2 Female
1 NaN 1.66 Male No Sun Dinner 3 Male
2 21.01 3.50 Male No Sun Dinner 3 Male
3 NaN 3.31 Male No Sun Dinner 2 Male
4 24.59 3.61 Female No Sun Dinner 4 Female
5 NaN 4.71 Male No Sun Dinner 4 Male
6 8.77 2.00 Male No Sun Dinner 2 Male
7 NaN 3.12 Male No Sun Dinner 4 Male
8 15.04 1.96 Male No Sun Dinner 2 Male
9 14.78 3.23 Male No Sun Dinner 2 Male
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'],
                                            errors='coerce',
                                            downcast='float')
print(tips_sub_miss.dtypes)
total_bill float32
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
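The run above only exercises downcast='float'. The other downcast values can be sketched the same way; `counts` and `bills` below are hypothetical Series made up for illustration, not columns from the notebook:

```python
import pandas as pd

counts = pd.Series([2, 3, 3, 2, 4])   # small non-negative integers, stored as int64
bills = pd.Series([16.99, 10.34])     # stored as float64

as_signed = pd.to_numeric(counts, downcast='integer')    # 'signed' behaves the same
as_unsigned = pd.to_numeric(counts, downcast='unsigned')
as_float = pd.to_numeric(bills, downcast='float')

print(as_signed.dtype, as_unsigned.dtype, as_float.dtype)

# Compare the bytes used by the values before and after downcasting
print(counts.memory_usage(index=False), as_signed.memory_usage(index=False))
```

Downcasting to float32 keeps roughly seven significant digits, so it is worth checking that the reduced precision is acceptable before applying it to monetary columns.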
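Categorical data
The category dtype encodes categorical values. It is useful because:
- it saves memory and speeds up operations
- it can record an order among the values, which plain strings cannot
- some Python libraries handle categorical data directly (for example when fitting statistical models)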
tips['sex'] = tips['sex'].astype('str')
print(tips.dtypes)
total_bill float64
tip float64
sex object
smoker category
day category
time category
size int64
sex_str object
dtype: object
tips['sex'] = tips['sex'].astype('category')
print(tips.dtypes)
total_bill float64
tip float64
sex category
smoker category
day category
time category
size int64
sex_str object
dtype: object
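The first two benefits listed for categorical data (memory and ordering) can be checked with a short sketch. The `days` Series and the chosen category order are assumptions made for illustration, loosely modeled on tips['day']:

```python
import pandas as pd

# Hypothetical column with a few labels repeated many times
days = pd.Series(['Thur', 'Fri', 'Sat', 'Sun'] * 1000)
days_cat = days.astype('category')

# deep=True includes the memory held by the string values themselves
print(days.memory_usage(deep=True), days_cat.memory_usage(deep=True))

# An ordered categorical compares by category order, not alphabetically
day_type = pd.CategoricalDtype(['Thur', 'Fri', 'Sat', 'Sun'], ordered=True)
days_ordered = days.astype(day_type)
print(days_ordered.min(), days_ordered.max())

# Comparisons against a category label are allowed once the dtype is ordered
weekend = days_ordered[days_ordered >= 'Sat']
print(len(weekend))
```

The memory saving depends on how often labels repeat; a column where nearly every value is unique gains little from the category dtype.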
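Operations on a categorical Series
- Series.cat.categories: the categories
- Series.cat.ordered: whether the categories are ordered
- Series.cat.codes: the integer code of each value's category
- Series.cat.rename_categories: rename categories
- Series.cat.reorder_categories: put the categories in a new order
- Series.cat.add_categories: add new categories
- Series.cat.remove_categories: remove categories
- Series.cat.remove_unused_categories: remove categories that no value uses
- Series.cat.set_categories: set the categories to a new list
- Series.cat.as_ordered: mark the categorical as ordered
- Series.cat.as_unordered: mark the categorical as unordered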
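A few of the accessors listed above, tried on a small hypothetical Series (`sex` here is built by hand rather than taken from tips, and the labels 'F', 'M', and 'Unknown' are made up for the example):

```python
import pandas as pd

sex = pd.Series(['Female', 'Male', 'Male', 'Female'], dtype='category')

print(sex.cat.categories)   # the category labels
print(sex.cat.ordered)      # whether the categories are ordered
print(sex.cat.codes)        # integer code for each value

# Each method returns a new Series; sex itself is left unchanged
renamed = sex.cat.rename_categories({'Female': 'F', 'Male': 'M'})
extended = sex.cat.add_categories(['Unknown'])
trimmed = extended.cat.remove_unused_categories()
reordered = sex.cat.reorder_categories(['Male', 'Female'], ordered=True)

print(renamed.tolist())
print(extended.cat.categories)
print(trimmed.cat.categories)
print(reordered.cat.ordered)
```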