1、Correlation
Correlation analysis measures the relationship between variables by computing correlation coefficients.
The two most common coefficients are the Pearson correlation coefficient and the Spearman rank correlation coefficient. Pearson measures the linear relationship between continuous variables; it ranges from -1 to 1, and the closer its absolute value is to 1, the stronger the relationship. Spearman is computed on ranks, so it applies to ordinal as well as continuous variables; it also ranges from -1 to 1 and captures monotonic relationships even when they are nonlinear.
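To make the Pearson/Spearman distinction concrete, here is a small pure-Python sketch (not part of the original note; in practice `scipy.stats.pearsonr`/`spearmanr` or pandas `DataFrame.corr` would be used). The data is invented: y = x³ is monotonic but nonlinear, so Spearman is exactly 1 while Pearson is not.

```python
import math

def pearson(xs, ys):
    # Pearson r = covariance / (std_x * std_y)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman = Pearson computed on ranks (no ties in this toy data)
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    return pearson(ranks(xs), ranks(ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [v ** 3 for v in x]  # monotonic but nonlinear

print(round(pearson(x, y), 3))   # below 1.0: the relationship is not linear
print(round(spearman(x, y), 3))  # 1.0: the ranks are perfectly aligned
```

This is why Spearman is often preferred for features with heavy skew or nonlinear but monotonic effects.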
IV (Information Value) measures a variable's predictive power and is mainly used for binary classification targets. It quantifies, bin by bin, how differently a variable is distributed between the positive and negative classes: the larger the IV, the stronger the variable's ability to separate the target. In correlation analysis, a coefficient close to 1 or -1 indicates a strong relationship between two variables. Note, however, that correlation does not imply causation; it only indicates that some association exists.
2、Calculating IV
(Reference: https://blog.csdn.net/kevin7658/article/details/50780391)
The formula, consistent with the SQL implementation below, is computed per bin i of a variable:

    WOE_i = ln( (pos_i / pos_total) / (neg_i / neg_total) )
    IV_i  = (pos_i / pos_total - neg_i / neg_total) * WOE_i
    IV    = sum of IV_i over all bins

where pos_i / neg_i are the positive / negative sample counts in bin i, and pos_total / neg_total are the overall positive / negative counts.

3、Rules of thumb for IV:
As a rule of thumb: an IV below 0.02 indicates essentially no predictive power; 0.02 to 0.1 is weak; 0.1 to 0.3 is medium; above 0.3 is strong.
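A tiny worked example in pure Python, tying the formula in section 2 to the thresholds above. The 3-bin counts are invented purely for illustration:

```python
import math

def iv_strength(iv):
    """Rule-of-thumb label using this note's thresholds."""
    if iv < 0.02:
        return "useless"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"

# hypothetical 3-bin split of one variable: (positives, negatives) per bin
bins = [(10, 90), (30, 60), (60, 50)]
pos_total = sum(p for p, _ in bins)   # 100
neg_total = sum(n for _, n in bins)   # 200

iv = 0.0
for pos_i, neg_i in bins:
    pos_rate = pos_i / pos_total
    neg_rate = neg_i / neg_total
    woe_i = math.log(pos_rate / neg_rate)   # WOE of the bin
    iv += (pos_rate - neg_rate) * woe_i     # per-bin IV contribution

print(round(iv, 4), iv_strength(iv))  # 0.8328 strong
```

The middle bin contributes nothing (its positive and negative rates are equal, so WOE = 0); the two skewed bins drive the IV.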

4、Hive SQL code backup
with base_info as (
--analyze the correlation of 3 features with y
select
vcn
,case when sum(cash_cost_td) >=200 then 1 else 0 end as y --label: 1 if spend-to-date >= 200
--spend in the 30 days before entering the group
,sum(cash_cost_bef_30d) as cash_cost_bef_30d
--spend after entering the group
,sum(cash_cost_td) as cash_cost_30d
--merchant attribute
,max(pay_order_gmv_30d) as pay_order_gmv_30d
from dm_ad.ds_dm_shop_test_0830_info
where date = '${date}'
and min_date = '2024-08-30'
and vcn_type = 'A.自助-企业'
group by
vcn
having (cash_cost_bef_30d>0 or cash_cost_30d>0 or pay_order_gmv_30d>0)
)
,key_value_info as (
--unpivot: turn the three feature columns into key-value rows
select
vcn
,y
,score
,cast(score_value as double) as score_value
from
(
select
vcn
,y
,cash_cost_bef_30d
,cash_cost_30d
,pay_order_gmv_30d
from base_info
) t0 --derived-table alias (required by Hive)
lateral view outer explode
(
map
(
'cash_cost_bef_30d',cash_cost_bef_30d
,'cash_cost_30d',cash_cost_30d
,'pay_order_gmv_30d',pay_order_gmv_30d
)
) t as score,score_value
where score_value > 0
)
,score_group_info as (
--equal-frequency binning (quintile cut points)
select
score
,score_group
,min_score_value
,max_score_value
,sample_cnt
,group_pos_i
,group_neg_i
,sum(group_pos_i) over(partition by score) as group_pos_n
,sum(group_neg_i) over(partition by score) as group_neg_n
from
(
select
aa.score
,case
when aa.score_value = 0 then 'r0' --note: unreachable here, since key_value_info keeps only score_value > 0
when aa.score_value <= cast(score_array[0] as double) then 'r1'
when aa.score_value <= cast(score_array[1] as double) then 'r2'
when aa.score_value <= cast(score_array[2] as double) then 'r3'
when aa.score_value <= cast(score_array[3] as double) then 'r4'
else 'r5' end as score_group
,min(aa.score_value) as min_score_value
,max(aa.score_value) as max_score_value
,count(distinct vcn) as sample_cnt
,count(distinct if(y=1, vcn, null)) as group_pos_i
,count(distinct if(y=0, vcn, null)) as group_neg_i
from key_value_info aa
left join
(
select
score
,percentile_approx(score_value, array(0.2, 0.4, 0.6, 0.8, 1.0), 999999) as score_array
from key_value_info
group by
score
) bb
on aa.score = bb.score
group by
aa.score
,case
when aa.score_value = 0 then 'r0'
when aa.score_value <= cast(score_array[0] as double) then 'r1'
when aa.score_value <= cast(score_array[1] as double) then 'r2'
when aa.score_value <= cast(score_array[2] as double) then 'r3'
when aa.score_value <= cast(score_array[3] as double) then 'r4'
else 'r5' end
) tmp
)
select
score
,min_score_value
,max_score_value
,sample_cnt
,sum(iv_i) over(partition by score) as iv_total
,woe_i
,iv_i
,score_group
from
(
select
score
,score_group
,min_score_value
,max_score_value
,sample_cnt
,(ln(group_pos_i/group_pos_n)-ln(group_neg_i/group_neg_n))*(group_pos_i/group_pos_n - group_neg_i/group_neg_n) as iv_i --per-bin IV contribution: (pos% - neg%) * WOE
,ln(group_pos_i/group_pos_n)-ln(group_neg_i/group_neg_n) as woe_i --WOE = ln(pos%) - ln(neg%)
from score_group_info
) tmp
order by iv_total desc,score_group
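As an offline sanity check of the Hive logic, the same pipeline (quantile cut points, per-bin positive/negative counts, WOE and IV) can be sketched in pure Python on synthetic data. Everything below is invented for illustration (the 0.3/0.7 positive rates, the value range, the row count); only the binning and WOE/IV arithmetic mirror the query.

```python
import math
import random

random.seed(7)

# synthetic rows mirroring key_value_info: (y, score_value) for one feature;
# the positive rate jumps from 0.3 to 0.7 above 50, so the feature is predictive
rows = [(1 if random.random() < (0.7 if v > 50 else 0.3) else 0, v)
        for v in (random.uniform(1, 100) for _ in range(1000))]

# equal-frequency binning into 5 groups via 0.2/0.4/0.6/0.8 quantile cut points
# (mirrors percentile_approx in the query)
values = sorted(v for _, v in rows)
cuts = [values[int(len(values) * q) - 1] for q in (0.2, 0.4, 0.6, 0.8)]

def bin_of(v):
    for i, c in enumerate(cuts, start=1):
        if v <= c:
            return "r%d" % i
    return "r5"

# per-bin positive/negative counts (mirrors group_pos_i / group_neg_i)
pos, neg = {}, {}
for y, v in rows:
    b = bin_of(v)
    if y == 1:
        pos[b] = pos.get(b, 0) + 1
    else:
        neg[b] = neg.get(b, 0) + 1

pos_n = sum(pos.values())
neg_n = sum(neg.values())

iv_total = 0.0
for b in sorted(set(pos) | set(neg)):
    pr = pos.get(b, 0) / pos_n
    nr = neg.get(b, 0) / neg_n
    if pr > 0 and nr > 0:          # guard against ln(0)
        woe = math.log(pr / nr)
        iv_total += (pr - nr) * woe

print(round(iv_total, 4))
```

Note the `pr > 0 and nr > 0` guard: the Hive query has no such guard, and ln(0) evaluates to NULL there, so a bin with zero positives or zero negatives silently drops out of iv_total. If that situation can occur in real data, consider adding a smoothing term or merging sparse bins.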