Pandas - 7. Data Types

import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
tips = sns.load_dataset('tips')
print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
dtype: object
print(tips.head())
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

astype() converts data types and works on both Series and DataFrame. The target can be a Python built-in type (str, float, int, complex, bool) or any dtype supported by NumPy.

# Convert the sex column to strings
tips['sex_str'] = tips['sex'].astype(str)
print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
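
As a small sketch of the DataFrame form of astype(), the example below converts several columns in one call by passing a {column: dtype} mapping. The toy DataFrame is a hypothetical stand-in for the tips dataset so the snippet runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the tips dataset
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "size": [2, 3, 3],
    "sex": ["Female", "Male", "Male"],
})

# On a DataFrame, astype() accepts a {column: dtype} mapping,
# mixing NumPy dtypes, Python built-in types, and pandas dtype names
converted = df.astype({
    "total_bill": np.float32,
    "size": str,
    "sex": "category",
})
print(converted.dtypes)
```

Columns not named in the mapping keep their original dtype, and astype() returns a new object rather than modifying df in place.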

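Converting to numeric types

Some numeric columns use a placeholder such as "missing" or "null" for absent values, which forces the whole column to be stored as strings.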

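# Copy the slice so the assignments below do not touch tips
tips_sub_miss = tips.head(10).copy()
# Cast to object first; newer pandas versions warn about or reject writing strings into a float column
tips_sub_miss['total_bill'] = tips_sub_miss['total_bill'].astype(object)
tips_sub_miss.loc[[1, 3, 5, 7], 'total_bill'] = 'missing'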
print(tips_sub_miss)
  total_bill   tip     sex smoker  day    time  size sex_str
0      16.99  1.01  Female     No  Sun  Dinner     2  Female
1    missing  1.66    Male     No  Sun  Dinner     3    Male
2      21.01  3.50    Male     No  Sun  Dinner     3    Male
3    missing  3.31    Male     No  Sun  Dinner     2    Male
4      24.59  3.61  Female     No  Sun  Dinner     4  Female
5    missing  4.71    Male     No  Sun  Dinner     4    Male
6       8.77  2.00    Male     No  Sun  Dinner     2    Male
7    missing  3.12    Male     No  Sun  Dinner     4    Male
8      15.04  1.96    Male     No  Sun  Dinner     2    Male
9      14.78  3.23    Male     No  Sun  Dinner     2    Male
print(tips_sub_miss.dtypes)
total_bill      object
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
# astype cannot convert the 'missing' placeholder strings to float
tips_sub_miss['total_bill'].astype(float) # ValueError: could not convert string to float: 'missing'
# to_numeric fails as well with its default errors='raise'
pd.to_numeric(tips_sub_miss['total_bill']) # ValueError: Unable to parse string "missing" at position 1

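to_numeric(): convert to a numeric type
Parameters:

  • errors: what to do when a value cannot be converted to a number
    1. raise: raise an error (default)
    2. coerce: replace unconvertible values with NaN (the right choice here)
    3. ignore: give up and return the input unchanged (not useful here; deprecated since pandas 2.2)
  • downcast: after conversion, shrink the result to the smallest numeric dtype that can hold the values, to save memory (default None)
    1. integer: smallest signed integer dtype
    2. signed: same as integer
    3. unsigned: smallest unsigned integer dtype
    4. float: smallest float dtype (float32 at minimum)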
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'],
                                           errors='coerce')
print(tips_sub_miss.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
print(tips_sub_miss)
   total_bill   tip     sex smoker  day    time  size sex_str
0       16.99  1.01  Female     No  Sun  Dinner     2  Female
1         NaN  1.66    Male     No  Sun  Dinner     3    Male
2       21.01  3.50    Male     No  Sun  Dinner     3    Male
3         NaN  3.31    Male     No  Sun  Dinner     2    Male
4       24.59  3.61  Female     No  Sun  Dinner     4  Female
5         NaN  4.71    Male     No  Sun  Dinner     4    Male
6        8.77  2.00    Male     No  Sun  Dinner     2    Male
7         NaN  3.12    Male     No  Sun  Dinner     4    Male
8       15.04  1.96    Male     No  Sun  Dinner     2    Male
9       14.78  3.23    Male     No  Sun  Dinner     2    Male
tips_sub_miss['total_bill'] = pd.to_numeric(tips_sub_miss['total_bill'],
                                           errors='coerce',
                                           downcast='float')
print(tips_sub_miss.dtypes)
total_bill     float32
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
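
The behavior of errors and downcast can be reproduced on a tiny Series without the tips dataset. The following is a minimal sketch; the values in raw are made up for illustration.

```python
import pandas as pd

# Hypothetical column where absent values were written as text placeholders
raw = pd.Series(["16.99", "missing", "21.01", "null"])

# Default errors='raise': the first unparseable value raises ValueError
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

# errors='coerce': unparseable values become NaN, the result is float64
coerced = pd.to_numeric(raw, errors="coerce")

# downcast='float': additionally shrink the result to float32
small = pd.to_numeric(raw, errors="coerce", downcast="float")

print(raised)
print(coerced.dtype, small.dtype)
print(coerced.isna().sum())
```

Coercing first and counting the resulting NaN values is a quick way to check how many entries were placeholders before deciding how to fill or drop them.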

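Categorical data

The category dtype encodes categorical values and has the following advantages:

  1. It saves memory and speeds up operations
  2. When the values have a meaningful order, a categorical can record that order
  3. Some Python libraries handle categorical data directly (for example, when fitting statistical models)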
tips['sex'] = tips['sex'].astype('str')
print(tips.dtypes)
total_bill     float64
tip            float64
sex             object
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
tips['sex'] = tips['sex'].astype('category')
print(tips.dtypes)
total_bill     float64
tip            float64
sex           category
smoker        category
day           category
time          category
size             int64
sex_str         object
dtype: object
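
The first two advantages can be checked directly. The sketch below uses a made-up Series of day names (not the tips data) to compare memory use before and after conversion, and pd.CategoricalDtype to declare an explicit order.

```python
import pandas as pd

# Hypothetical column with few distinct values repeated many times
days = pd.Series(["Thur", "Fri", "Sat", "Sun"] * 1000)

# Compare memory use of the string column and its categorical version
as_string_bytes = days.memory_usage(deep=True)
as_category_bytes = days.astype("category").memory_usage(deep=True)
print(as_string_bytes, as_category_bytes)

# Declare an explicit order so comparisons and min/max follow it
day_type = pd.CategoricalDtype(
    categories=["Thur", "Fri", "Sat", "Sun"], ordered=True
)
ordered_days = days.astype(day_type)
print(ordered_days.min(), ordered_days.max())

# Count the rows that fall after Friday in the declared order
after_friday = int((ordered_days > "Fri").sum())
print(after_friday)
```

The saving comes from storing each row as a small integer code plus one copy of each distinct label, so it is largest when a column has few distinct values relative to its length.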

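Operations on a categorical Series

  • Series.cat.categories: the categories
  • Series.cat.ordered: whether the categories are ordered
  • Series.cat.codes: the integer code of each value's category
  • Series.cat.rename_categories: rename categories
  • Series.cat.reorder_categories: reorder the categories
  • Series.cat.add_categories: add new categories
  • Series.cat.remove_categories: remove categories
  • Series.cat.remove_unused_categories: remove categories that no value uses
  • Series.cat.set_categories: set the categories to a new list
  • Series.cat.as_ordered: mark the categories as ordered
  • Series.cat.as_unordered: mark the categories as unordered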
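
A few of these accessors are exercised in the sketch below. The sizes Series and its labels are hypothetical and chosen only to make the integer codes easy to follow.

```python
import pandas as pd

# Hypothetical categorical Series; categories are inferred and sorted lexically
sizes = pd.Series(["small", "large", "medium", "small"], dtype="category")
print(list(sizes.cat.categories))   # ['large', 'medium', 'small']
print(sizes.cat.ordered)            # False
print(sizes.cat.codes.tolist())     # [2, 0, 1, 2]

# Reorder the categories and mark them as ordered; the codes follow the new order
ordered = sizes.cat.reorder_categories(
    ["small", "medium", "large"], ordered=True
)
print(ordered.cat.codes.tolist())   # [0, 2, 1, 0]

# Add a category that no value uses, then drop unused categories again
extended = ordered.cat.add_categories(["huge"])
trimmed = extended.cat.remove_unused_categories()

# Rename categories with a mapping; the values change accordingly
renamed = trimmed.cat.rename_categories(
    {"small": "S", "medium": "M", "large": "L"}
)
print(renamed.tolist())             # ['S', 'L', 'M', 'S']
```

Each of these methods returns a new Series, so the result has to be assigned back if the change should persist.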