Binning discretizes a continuous variable, which reduces the influence of outliers on the model.
Data preparation
import pandas as pd
Age = [0,10,20,25,31,35,40,62,90]
pd.qcut() splits the data so that every bin holds the same number of elements
# Split Age into three bins with 3 elements each
pd.qcut(Age,3,labels=['Teen','Middle-age','Elder'])
<<[Teen, Teen, Teen, Middle-age, Middle-age, Middle-age, Elder, Elder, Elder]
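To see where the quantile cut points actually fall, here is a minimal sketch that assumes the same Age list and asks pd.qcut for the edges through its retbins parameter. The variable names binned, edges, and counts are illustrative only.

```python
import pandas as pd

Age = [0, 10, 20, 25, 31, 35, 40, 62, 90]

# retbins=True additionally returns the quantile-based bin edges
binned, edges = pd.qcut(Age, 3, labels=['Teen', 'Middle-age', 'Elder'], retbins=True)

# Count how many values landed in each bin, in category order
counts = pd.Series(binned).value_counts(sort=False)

print(edges)
print(counts)
```

The edges are unevenly spaced because they follow the data's quantiles, while the counts per bin stay equal.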
pd.cut splits the data so that every bin has the same width
# Split Age into three equal-width bins covering roughly 0-30, 30-60, and 60-90
pd.cut(Age,3,labels=['Teen','Middle-age','Elder'])
<<[Teen, Teen, Teen, Teen, Middle-age, Middle-age, Middle-age, Elder, Elder]
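For contrast with pd.qcut, the following sketch (again assuming the same Age list, with illustrative variable names) retrieves the equal-width edges through retbins and counts the members of each bin.

```python
import pandas as pd

Age = [0, 10, 20, 25, 31, 35, 40, 62, 90]

# With an integer bin count, pd.cut divides the range min..max into equal widths
binned, edges = pd.cut(Age, 3, labels=['Teen', 'Middle-age', 'Elder'], retbins=True)

# Equal width does not imply equal counts
counts = pd.Series(binned).value_counts(sort=False)

print(edges)
print(counts)
```

The lowest edge comes back slightly below 0 because pd.cut widens the range a little so that the minimum value is included in the first bin.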
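Specify custom intervals and labels for Age
# include_lowest=True closes the first interval on the left, so age 0 gets a label instead of NaN
pd.cut(Age, [0,5,20,30,50,100], labels=['Infant','Youth','Middle-aged','Prime','Elderly'], include_lowest=True)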
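One detail worth checking when passing explicit edges: pd.cut intervals are right-closed by default, so a value equal to the lowest edge (0 here) falls outside the first interval. The sketch below assumes the Age list from the data preparation step and uses English translations of the labels. The names bins, labels, default, and closed_low are illustrative.

```python
import pandas as pd

Age = [0, 10, 20, 25, 31, 35, 40, 62, 90]
bins = [0, 5, 20, 30, 50, 100]
labels = ['Infant', 'Youth', 'Middle-aged', 'Prime', 'Elderly']

# Default: intervals are (0, 5], (5, 20], ... so the value 0 matches no bin
default = pd.cut(Age, bins, labels=labels)

# include_lowest=True makes the first interval include its left edge
closed_low = pd.cut(Age, bins, labels=labels, include_lowest=True)

print(pd.isna(default[0]))
print(list(closed_low))
```

Passing right=False is another option if left-closed intervals such as [0, 5) suit the data better.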