1、Correlation
Correlation analysis measures the relationship between variables by computing correlation coefficients.
The two most common coefficients are the Pearson correlation coefficient and the Spearman rank correlation coefficient. Pearson measures the linear relationship between continuous variables; it ranges from -1 to 1, and the closer its absolute value is to 1, the stronger the relationship. Spearman is computed on ranks, so it applies to ordinal as well as continuous variables; it also ranges from -1 to 1 and captures monotonic relationships even when they are nonlinear.
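To make the Pearson/Spearman distinction concrete, here is a small pure-Python sketch (not part of the original note; in practice `scipy.stats.pearsonr`/`spearmanr` or pandas `DataFrame.corr` would be used). The data is invented: y = x³ is monotonic but nonlinear, so Spearman is exactly 1 while Pearson is not.

```python
import math

def pearson(xs, ys):
    # Pearson r = covariance / (std_x * std_y)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    # Spearman = Pearson computed on ranks (no ties in this toy data)
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank + 1)
        return r
    return pearson(ranks(xs), ranks(ys))

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [v ** 3 for v in x]  # monotonic but nonlinear

print(round(pearson(x, y), 3))   # below 1.0: the relationship is not linear
print(round(spearman(x, y), 3))  # 1.0: the ranks are perfectly aligned
```

This is why Spearman is often preferred for features with heavy skew or nonlinear but monotonic effects.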
IV (Information Value) measures a variable's predictive power and is mainly used for binary classification targets. It quantifies, bin by bin, how differently a variable is distributed between the positive and negative classes: the larger the IV, the stronger the variable's ability to separate the target. In correlation analysis, a coefficient close to 1 or -1 indicates a strong relationship between two variables. Note, however, that correlation does not imply causation; it only indicates that some association exists.
2、Calculating IV
(Reference: https://blog.csdn.net/kevin7658/article/details/50780391)
The formula, consistent with the SQL implementation below, is computed per bin i of a variable:

    WOE_i = ln( (pos_i / pos_total) / (neg_i / neg_total) )
    IV_i  = (pos_i / pos_total - neg_i / neg_total) * WOE_i
    IV    = sum of IV_i over all bins

where pos_i / neg_i are the positive / negative sample counts in bin i, and pos_total / neg_total are the overall positive / negative counts.

3、Rules of thumb for IV:
As a rule of thumb: an IV below 0.02 indicates essentially no predictive power; 0.02 to 0.1 is weak; 0.1 to 0.3 is medium; above 0.3 is strong.
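A tiny worked example in pure Python, tying the formula in section 2 to the thresholds above. The 3-bin counts are invented purely for illustration:

```python
import math

def iv_strength(iv):
    """Rule-of-thumb label using this note's thresholds."""
    if iv < 0.02:
        return "useless"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    return "strong"

# hypothetical 3-bin split of one variable: (positives, negatives) per bin
bins = [(10, 90), (30, 60), (60, 50)]
pos_total = sum(p for p, _ in bins)   # 100
neg_total = sum(n for _, n in bins)   # 200

iv = 0.0
for pos_i, neg_i in bins:
    pos_rate = pos_i / pos_total
    neg_rate = neg_i / neg_total
    woe_i = math.log(pos_rate / neg_rate)   # WOE of the bin
    iv += (pos_rate - neg_rate) * woe_i     # per-bin IV contribution

print(round(iv, 4), iv_strength(iv))  # 0.8328 strong
```

The middle bin contributes nothing (its positive and negative rates are equal, so WOE = 0); the two skewed bins drive the IV.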

4、Hive SQL code backup
with base_info as (
--analyze the correlation of 3 features with y
select
vcn
,case when sum(cash_cost_td) >=200 then 1 else 0 end as y --label: 1 if spend-to-date >= 200
--spend in the 30 days before entering the group
,sum(cash_cost_bef_30d) as cash_cost_bef_30d
--spend after entering the group
,sum(cash_cost_td) as cash_cost_30d
--merchant attribute
,max(pay_order_gmv_30d) as pay_order_gmv_30d
from dm_ad.ds_dm_shop_test_0830_info
where date = '${date}'
and min_date = '2024-08-30'
and vcn_type = 'A.自助-企业'
group by
vcn
having (cash_cost_bef_30d>0 or cash_cost_30d>0 or pay_order_gmv_30d>0)
)
,key_value_info as (
--unpivot: turn the three feature columns into key-value rows
select
vcn
,y
,score
,cast(score_value as double) as score_value
from
(
select
vcn
,y
,cash_cost_bef_30d
,cash_cost_30d
,pay_order_gmv_30d
from base_info
) t0 --derived-table alias (required by Hive)
lateral view outer explode
(
map
(
'cash_cost_bef_30d',cash_cost_bef_30d
,'cash_cost_30d',cash_cost_30d
,'pay_order_gmv_30d',pay_order_gmv_30d
)
) t as score,score_value
where score_value > 0
)
,score_group_info as (
--equal-frequency binning (quintile cut points)
select
score
,score_group
,min_score_value
,max_score_value
,sample_cnt
,group_pos_i
,group_neg_i
,sum(group_pos_i) over(partition by score) as group_pos_n
,sum(group_neg_i) over(partition by score) as group_neg_n
from
(
select
aa.score
,case
when aa.score_value = 0 then 'r0' --note: unreachable here, since key_value_info keeps only score_value > 0
when aa.score_value <= cast(score_array[0] as double) then 'r1'
when aa.score_value <= cast(score_array[1] as double) then 'r2'
when aa.score_value <= cast(score_array[2] as double) then 'r3'
when aa.score_value <= cast(score_array[3] as double) then 'r4'
else 'r5' end as score_group
,min(aa.score_value) as min_score_value
,max(aa.score_value) as max_score_value
,count(distinct vcn) as sample_cnt
,count(distinct if(y=1, vcn, null)) as group_pos_i
,count(distinct if(y=0, vcn, null)) as group_neg_i
from key_value_info aa
left join
(
select
score
,percentile_approx(score_value, array(0.2, 0.4, 0.6, 0.8, 1.0), 999999) as score_array
from key_value_info
group by
score
) bb
on aa.score = bb.score
group by
aa.score
,case
when aa.score_value = 0 then 'r0'
when aa.score_value <= cast(score_array[0] as double) then 'r1'
when aa.score_value <= cast(score_array[1] as double) then 'r2'
when aa.score_value <= cast(score_array[2] as double) then 'r3'
when aa.score_value <= cast(score_array[3] as double) then 'r4'
else 'r5' end
) tmp
)
select
score
,min_score_value
,max_score_value
,sample_cnt
,sum(iv_i) over(partition by score) as iv_total
,woe_i
,iv_i
,score_group
from
(
select
score
,score_group
,min_score_value
,max_score_value
,sample_cnt
,(ln(group_pos_i/group_pos_n)-ln(group_neg_i/group_neg_n))*(group_pos_i/group_pos_n - group_neg_i/group_neg_n) as iv_i --per-bin IV contribution: (pos% - neg%) * WOE
,ln(group_pos_i/group_pos_n)-ln(group_neg_i/group_neg_n) as woe_i --WOE = ln(pos%) - ln(neg%)
from score_group_info
) tmp
order by iv_total desc,score_group
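As an offline sanity check of the Hive logic, the same pipeline (quantile cut points, per-bin positive/negative counts, WOE and IV) can be sketched in pure Python on synthetic data. Everything below is invented for illustration (the 0.3/0.7 positive rates, the value range, the row count); only the binning and WOE/IV arithmetic mirror the query.

```python
import math
import random

random.seed(7)

# synthetic rows mirroring key_value_info: (y, score_value) for one feature;
# the positive rate jumps from 0.3 to 0.7 above 50, so the feature is predictive
rows = [(1 if random.random() < (0.7 if v > 50 else 0.3) else 0, v)
        for v in (random.uniform(1, 100) for _ in range(1000))]

# equal-frequency binning into 5 groups via 0.2/0.4/0.6/0.8 quantile cut points
# (mirrors percentile_approx in the query)
values = sorted(v for _, v in rows)
cuts = [values[int(len(values) * q) - 1] for q in (0.2, 0.4, 0.6, 0.8)]

def bin_of(v):
    for i, c in enumerate(cuts, start=1):
        if v <= c:
            return "r%d" % i
    return "r5"

# per-bin positive/negative counts (mirrors group_pos_i / group_neg_i)
pos, neg = {}, {}
for y, v in rows:
    b = bin_of(v)
    if y == 1:
        pos[b] = pos.get(b, 0) + 1
    else:
        neg[b] = neg.get(b, 0) + 1

pos_n = sum(pos.values())
neg_n = sum(neg.values())

iv_total = 0.0
for b in sorted(set(pos) | set(neg)):
    pr = pos.get(b, 0) / pos_n
    nr = neg.get(b, 0) / neg_n
    if pr > 0 and nr > 0:          # guard against ln(0)
        woe = math.log(pr / nr)
        iv_total += (pr - nr) * woe

print(round(iv_total, 4))
```

Note the `pr > 0 and nr > 0` guard: the Hive query has no such guard, and ln(0) evaluates to NULL there, so a bin with zero positives or zero negatives silently drops out of iv_total. If that situation can occur in real data, consider adding a smoothing term or merging sparse bins.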