1. 第一层ODS层
1 ODS层存放您从业务系统获取的最原始的数据,是其他上层数据的源数据。
2 本项目中使用的ODS层数据主要包括: 用户基本信息、商品分类信息、商品信息、店铺信息、 订单数据、订单支付信息、活动信息、行为日志信息
1.1 业务数据表
数仓ods层和业务数据表的对应关系
中文含义 | MYSQL表名 | HIVE表名 |
---|---|---|
用户基本信表 | nshop.customer | ods_nshop.ods_nshop_02_customer |
用户关注表 | nshop.customer_attention | ods_nshop.ods_nshop_02_customer_attention |
收货地址管理表 | nshop.customer_consignee | ods_nshop.ods_nshop_02_customer_consignee |
订单表 | nshop.orders | ods_nshop.ods_nshop_02_orders |
订单详情表 | nshop.order_detail | ods_nshop.ods_nshop_02_order_detail |
订单支付记录表 | nshop.orders_pay_records | ods_nshop.ods_nshop_02_orders_pay_records |
1.1.1 MySQL数据库建表
创建数据库:
create database if not exists ods_nshop;
创建表:
下载数据:
https://download.csdn.net/download/qq_28286027/12529873
解压后为nshop.sql,将数据直接执行source nshop.sql命令,导入mysql的ods_nshop表即可。
导入后数据表如下:
1.1.2 Hive数据库建表
创建Hive库并进入:
create database if not exists ods_nshop;
use ods_nshop;
创建表:
1.创建用户基本信息表:
CREATE TABLE
IF NOT EXISTS ods_nshop.ods_02_customer (
customer_id string COMMENT '用户ID',
customer_login string COMMENT '用户登录名',
customer_nickname string COMMENT '用户名(昵称)',
customer_name string COMMENT '用户真实姓名',
customer_pass string COMMENT '用户密码',
customer_mobile string COMMENT '用户手机',
customer_idcard string COMMENT '身份证',
customer_gender TINYINT COMMENT '性别:1男 0女',
customer_birthday string COMMENT '出生年月',
customer_email string COMMENT '用户邮箱',
customer_natives string COMMENT '所在地区',
customer_ctime BIGINT COMMENT '创建时间',
customer_utime BIGINT COMMENT '修改时间'
) location '/data/nshop/ods/ods_02_customer/';
从MySQL导入数据:
sqoop import \
--connect jdbc:mysql://master:3306/nshop \
--username root \
--password 111111 \
--table customer \
--hive-import \
--hive-overwrite \
--hive-table ods_nshop.ods_02_customer \
-m 1
用户关注表:
CREATE TABLE
IF NOT EXISTS ods_nshop.ods_02_customer_attention (
customer_id string COMMENT '用户ID',
attention_id string COMMENT '关注对象ID',
attention_type TINYINT COMMENT '关注类型:3店铺 4商品',
attention_status BIGINT COMMENT '关注状态:1关注 0取消',
attention_ctime BIGINT COMMENT '生成时间'
) location '/data/nshop/ods/ods_02_customer_attention/';
收货地址管理表:
CREATE TABLE
IF NOT EXISTS ods_nshop.ods_02_customer_consignee (
consignee_id string COMMENT '收货地址ID',
customer_id string COMMENT '用户ID',
consignee_name string COMMENT '收货人',
consignee_mobile string COMMENT '收货人电话',
consignee_zipcode string COMMENT '收货人地区',
consignee_addr string COMMENT '收货人详细地址',
consignee_tag string COMMENT '标签:1家 2公司 3学校',
ctime BIGINT COMMENT '创建时间'
) location '/data/nshop/ods/ods_02_customer_consignee/'
从MySQL导入数据:
sqoop import \
--connect jdbc:mysql://master:3306/nshop \
--username root \
--password 111111 \
--table customer_consignee \
--hive-import \
--hive-overwrite \
--hive-table ods_nshop.ods_02_customer_consignee \
-m 1
订单表:
CREATE TABLE
IF NOT EXISTS ods_nshop.ods_02_orders (
order_id string COMMENT '订单ID(时间+商品ID+4位随机)',
customer_id string COMMENT '下单用户ID',
order_status TINYINT COMMENT '订单状态',
customer_ip string COMMENT '下单用户IP',
user_longitude string COMMENT '用户地理:经度',
user_latitude string COMMENT '用户地理:纬度',
user_areacode string COMMENT '用户所在地区',
consignee_name string COMMENT '收货人',
consignee_mobile string COMMENT '收货人电话',
consignee_zipcode string COMMENT '收货人地址',
pay_type string COMMENT '支付类型:线上支付 10 网上银行 11 微信 12 支付宝 | 线下支付 (货到付款) 20 ',
pay_code string COMMENT '支付对应唯一标识,如微信号、支付宝号',
pay_nettype string COMMENT '支付网络方式:0 wifi | 1 4g | 2 3g |3 线下支付',
district_money DECIMAL (8, 1) COMMENT '优惠金额',
shipping_money DECIMAL (8, 1) COMMENT '运费金额',
payment_money DECIMAL (10, 1) COMMENT '支付金额',
order_ctime BIGINT COMMENT '创建时间',
shipping_time BIGINT COMMENT '发货时间',
receive_time BIGINT COMMENT '收货时间'
) location '/data/nshop/ods/ods_02_orders/';
从MySQL导入数据:
sqoop import \
--connect jdbc:mysql://master:3306/nshop \
--username root \
--password 111111 \
--table orders \
--hive-import \
--hive-overwrite \
--hive-table ods_nshop.ods_02_orders \
-m 1
订单详情表:
CREATE TABLE
IF NOT EXISTS ods_nshop.ods_02_order_detail (
order_detail_id string COMMENT '订单详情表ID',
order_id string COMMENT '订单表ID',
product_id string COMMENT '订单商品ID',
product_name string COMMENT '商品名称',
product_remark string COMMENT '商品描述',
product_cnt INT COMMENT '购买商品数量',
product_price DECIMAL (5, 1) COMMENT '购买商品单价',
weighing_cost DECIMAL (2, 1) COMMENT '商品加权价格',
district_money DECIMAL (4, 1) COMMENT '优惠金额',
is_activity BIGINT COMMENT '1:参加活动|0:没有参加活动',
order_detail_ctime BIGINT COMMENT '下单时间'
) location '/data/nshop/ods/ods_02_order_detail/';
从MySQL导入数据:
sqoop import \
--connect jdbc:mysql://master:3306/nshop \
--username root \
--password 111111 \
--table order_detail \
--hive-import \
--hive-overwrite \
--hive-table ods_nshop.ods_02_order_detail \
-m 1
订单支付记录表:
CREATE TABLE
IF NOT EXISTS ods_nshop.ods_02_orders_pay_records (
pay_id string COMMENT '支付记录ID',
order_id string COMMENT '订单ID',
customer_id string COMMENT '用户ID',
pay_status string COMMENT '支付状态:0 支付失败| 1 支付成功',
pay_type string COMMENT '支付类型:线上支付 10 网上银行 11 微信 12 支付宝 | 线下支付 (货到付款) 20 ',
pay_code string COMMENT '支付对应唯一标识,如微信号、支付宝号',
pay_nettype string COMMENT '支付网络方式:1 wifi | 2 4g | 3 3g |4 线下支付',
pay_amount DOUBLE COMMENT '支付金额',
pay_ctime BIGINT COMMENT '创建时间'
) location '/data/nshop/ods/ods_02_orders_pay_records/';
从MySQL导入数据:
sqoop import \
--connect jdbc:mysql://master:3306/nshop \
--username root \
--password 111111 \
--table orders_pay_records \
--hive-import \
--hive-overwrite \
--hive-table ods_nshop.ods_02_orders_pay_records \
-m 1
1.2 埋点日志
电商平台在网页版或APP版终端上进行业务埋点,采集相关用户行为数据发送给服务器进行用户行为分析。
中文含义 | HIVE表名 |
---|---|
用户行为日志表 | ods_nshop.ods_nshop_01_useractlog |
用户行为日志表创建:
CREATE external TABLE
IF NOT EXISTS ods_nshop.ods_nshop_01_useractlog (
action string COMMENT '行为类型:install安装|launch启动|interactive交 互|page_enter_h5页面曝光|page_enter_native页面进入|exit退出',
event_type string COMMENT '行为类型:click点击|view浏览|slide滑动|input输入',
customer_id string COMMENT '用户id',
device_num string COMMENT '设备号',
device_type string COMMENT '设备类型',
os string COMMENT '手机系统',
os_version string COMMENT '手机系统版本',
manufacturer string COMMENT '手机制造商',
carrier string COMMENT '电信运营商',
network_type string COMMENT '网络类型',
area_code string COMMENT '地区编码',
longitude string COMMENT '经度',
latitude string COMMENT '纬度',
extinfo string COMMENT '扩展信息(json格式)',
ct BIGINT COMMENT '创建时间'
) partitioned BY (bdp_day string) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' STORED AS TEXTFILE location '/data/nshop/ods/user_action_log/';
注意:json格式数据表需要通过serde机制处理
- 在hive-site.xml中设置三方jar包
<property>
<name>hive.aux.jars.path</name>
<value>/usr/local/src/apache-hive-3.1.0-bin/lib/</value>
</property>
- 在hive.aux.jars.path设置的路径中增加hive-hcatalog-core-2.1.1.jar 到/usr/local/src/apache-hive-3.1.0-bin/lib/中
执行:
]# /usr/local/src/apache-hive-3.1.0-bin/hcatalog/share/hcatalog/hive-hcatalog-core-2.1.1.jar /usr/local/src/apache-hive-3.1.0-bin/lib/
导入用户行为数据到hive中:
数据下载地址:https://download.csdn.net/download/qq_28286027/12532059
用户行为日志文件000000_0前15行如下:
{"action":"05","event_type":"01","customer_id":"20101000324999676","device_num":"586344","device_type":"9","os":"2","os_version":"2.2","manufacturer":"05","carrier":"2","network_type":"2","area_code":"41092","longitude":"116.35636","latitude":"40.06919","extinfo":"{\"target_type\":\"4\",\"target_keys\":\"20402\",\"target_order\":\"31\",\"target_ids\":\"[\\\"4320402595801\\\",\\\"4320402133801\\\",\\\"4320402919201\\\",\\\"4320402238501\\\"]\"}","ct":1567896035000}
{"action":"05","event_type":"02","customer_id":"20101000405999595","device_num":"956236","device_type":"9","os":"2","os_version":"4.3","manufacturer":"08","carrier":"2","network_type":"3","area_code":"10015","longitude":"116.35985","latitude":"40.069590000000005","extinfo":"{\"target_type\":\"3\",\"target_id\":\"4320406544401\",\"target_action\":\"01\"}","ct":1567896035000}
{"action":"05","event_type":"02","customer_id":"20101000648999352","device_num":"876263","device_type":"2","os":"2","os_version":"6.2","manufacturer":"07","carrier":"2","network_type":"2","area_code":"42058","longitude":"116.34721","latitude":"40.06685","extinfo":"{\"target_type\":\"3\",\"target_id\":\"4320701402201\",\"target_action\":\"01\"}","ct":1567896035000}
{"action":"07","event_type":"02","customer_id":"20101000729999271","device_num":"451481","device_type":"9","os":"2","os_version":"4.5","manufacturer":"09","carrier":"3","network_type":"2","area_code":"37162","longitude":"116.35737","latitude":"40.06962","extinfo":"{\"target_id\":\"4320207200701\"}","ct":1567896035000}
{"action":"06","event_type":"","customer_id":"20101000810999190","device_num":"114649","device_type":"9","os":"2","os_version":"8.2","manufacturer":"02","carrier":"3","network_type":"0","area_code":"35062","longitude":"116.38244999999999","latitude":"40.06447","extinfo":"{}","ct":1567896035000}
{"action":"07","event_type":"01","customer_id":"20101000972999028","device_num":"332835","device_type":"2","os":"2","os_version":"7.5","manufacturer":"02","carrier":"2","network_type":"0","area_code":"15082","longitude":"116.33457","latitude":"40.06449","extinfo":"{\"target_id\":\"4320701224301\"}","ct":1567896035000}
{"action":"05","event_type":"02","customer_id":"20101000972999028","device_num":"332835","device_type":"2","os":"2","os_version":"7.5","manufacturer":"02","carrier":"2","network_type":"0","area_code":"15082","longitude":"116.33457","latitude":"40.06449","extinfo":"{\"target_type\":\"3\",\"target_id\":\"4320701224301\",\"target_action\":\"01\"}","ct":1567896035000}
{"action":"08","event_type":"02","customer_id":"20101001377998623","device_num":"269526","device_type":"9","os":"2","os_version":"2.2","manufacturer":"05","carrier":"2","network_type":"3","area_code":"13053","longitude":"116.36325","latitude":"40.06456","extinfo":"{\"target_id\":\"4320901702901\"}","ct":1567896035000}
{"action":"08","event_type":"01","customer_id":"20101001458998542","device_num":"252612","device_type":"9","os":"2","os_version":"7.5","manufacturer":"03","carrier":"2","network_type":"2","area_code":"10013","longitude":"116.39949","latitude":"40.069250000000004","extinfo":"{\"target_id\":\"4320901657101\"}","ct":1567896035000}
{"action":"07","event_type":"02","customer_id":"20101001701998299","device_num":"314141","device_type":"9","os":"2","os_version":"4.8","manufacturer":"06","carrier":"2","network_type":"2","area_code":"45102","longitude":"116.37762","latitude":"40.06812","extinfo":"{\"target_id\":\"4320206030201\"}","ct":1567896035000}
{"action":"05","event_type":"02","customer_id":"20101001782998218","device_num":"273141","device_type":"9","os":"2","os_version":"2.5","manufacturer":"04","carrier":"3","network_type":"3","area_code":"19621","longitude":"116.39714","latitude":"40.06693","extinfo":"{\"target_type\":\"3\",\"target_id\":\"4320506927501\",\"target_action\":\"01\"}","ct":1567896035000}
{"action":"05","event_type":"02","customer_id":"20101001944998056","device_num":"962792","device_type":"9","os":"2","os_version":"2.3","manufacturer":"07","carrier":"2","network_type":"2","area_code":"10008","longitude":"116.39224","latitude":"40.067510000000006","extinfo":"{\"target_type\":\"3\",\"target_id\":\"4320102333401\",\"target_action\":\"01\"}","ct":1567896035000}
{"action":"05","event_type":"01","customer_id":"20101002025997975","device_num":"133595","device_type":"2","os":"2","os_version":"2.2","manufacturer":"04","carrier":"3","network_type":"3","area_code":"37091","longitude":"116.37975","latitude":"40.06192","extinfo":"{\"target_type\":\"4\",\"target_keys\":\"20206\",\"target_order\":\"20\",\"target_ids\":\"[\\\"4320206871201\\\",\\\"4320206738601\\\",\\\"4320206632601\\\",\\\"4320206465801\\\"]\"}","ct":1567896035000}
{"action":"08","event_type":"01","customer_id":"20101002187997813","device_num":"648218","device_type":"2","os":"2","os_version":"8.2","manufacturer":"03","carrier":"2","network_type":"0","area_code":"13018","longitude":"116.38544","latitude":"40.06662","extinfo":"{\"target_id\":\"4320701967401\"}","ct":1567896035000}
{"action":"08","event_type":"02","customer_id":"20101002187997813","device_num":"648218","device_type":"2","os":"2","os_version":"8.2","manufacturer":"03","carrier":"2","network_type":"0","area_code":"13018","longitude":"116.38544","latitude":"40.06662","extinfo":"{\"target_id\":\"4320701967401\"}","ct":1567896035000}
导入:
]# load data local inpath '/usr/local/user_behavior_log/000000_0' into table ods_nshop_01_useractlog partition(bdp_day='20200618');
行为日志解析:
1.3 外部广告投放相关数据
外部数据来源主要有: 1 企业间的接口调用(如其他企业平台上的广告投放营销) 2 公共数据的爬虫数据(公共数据、友商数据、社区|社交平台公开数据)
中文含义 | HIVE表名 |
---|---|
广告投放信息表 | ods_nshop.ods_nshop_01_releasedatas |
创建广告投放信息表:
CREATE external TABLE
IF NOT EXISTS ods_nshop.ods_01_releasedatas (
customer_id string COMMENT '用户id',
device_num string COMMENT '设备号',
device_type string COMMENT '设备类型',
os string COMMENT '手机系统',
os_version string COMMENT '手机系统版本',
manufacturer string COMMENT '手机制造商',
area_code string COMMENT '地区编码',
release_sid string COMMENT '投放请求id',
release_session string COMMENT '投放会话id',
release_sources string COMMENT '投放渠道',
release_params string COMMENT '投放请求参数',
ct BIGINT COMMENT '创建时间'
) partitioned BY (bdp_day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' stored AS textfile location '/data/nshop/ods/release/';
导入数据:
数据下载地址:https://download.csdn.net/download/qq_28286027/12529938
数据文件1.csv前10行:
customer_id,device_num,device_type,os,os_version,manufacturer,area_code,release_sid,release_session,release_sources,release_params,release_ctime,nbdp_day
20103000002999998,433229,9,2,9.4,03,51152,1567291235000fjdtkp,1567291235000433229742121,douyin,ip=200.95.79.96&deviceNum=433229&lon=null&lat=null&aid=null&ctime=1567291302000&sources=douyin&session=1567291235000433229742121&productPage=4320309458301,1567896035000,20190908
20105000004999996,121417,2,2,4.2,09,44160,1567291235000sjibrq,1567291235000121417572782,liebao,ip=65.200.95.194&deviceNum=121417&lon=null&lat=null&aid=null&ctime=1567291346000&sources=liebao&session=1567291235000121417572782&productPage=4320904071301,1567896035000,20190908
20302000019999981,349828,2,2,1.7,02,21142,1567291235000gkpgvs,1567291235000349828624925,netyex,ip=70.183.190.208&deviceNum=349828&lon=null&lat=null&aid=null&ctime=1567291385000&sources=netyex&session=1567291235000349828624925&productPage=4320309376701,1567896035000,20190908
20506000041999959,969247,2,2,8.7,10,21132,1567291235000qgijuq,1567291235000969247224224,ximalaya,ip=157.66.156.74&deviceNum=969247&lon=null&lat=null&aid=null&ctime=1567291421000&sources=ximalaya&session=1567291235000969247224224&productPage=4320103859601,1567896035000,20190908
20508000043999957,865977,9,2,8.9,04,62052,1567291235000gppiwp,1567291235000865977661164,baidu,ip=58.151.213.230&deviceNum=865977&lon=null&lat=null&aid=null&ctime=1567291485000&sources=baidu&session=1567291235000865977661164&productPage=4320909655701,1567896035000,20190908
20705000058999942,752814,9,2,4.1,10,43022,1567291235000fftdsv,1567291235000752814286721,tencent,ip=57.131.111.229&deviceNum=752814&lon=null&lat=null&aid=null&ctime=1567291419000&sources=tencent&session=1567291235000752814286721&productPage=4320404173901,1567896035000,20190908
20708000061999939,466888,2,2,5.3,10,14092,1567291235000kesjrt,1567291235000466888256571,netyex,ip=72.29.250.178&deviceNum=466888&lon=null&lat=null&aid=null&ctime=1567291324000&sources=netyex&session=1567291235000466888256571&productPage=4320408025201,1567896035000,20190908
20106000086999914,794259,9,2,1.8,03,51342,1567291235000wjsqjs,1567291235000794259134297,liebao,ip=241.84.18.204&deviceNum=794259&lon=null&lat=null&aid=null&ctime=1567291318000&sources=liebao&session=1567291235000794259134297&productPage=4320902876101,1567896035000,20190908
20107000087999913,225827,2,2,5.9,09,65010,1567291235000sgxjgn,1567291235000225827843226,liebao,ip=75.29.250.157&deviceNum=225827&lon=null&lat=null&aid=null&ctime=1567291304000&sources=liebao&session=1567291235000225827843226&productPage=4320603519801,1567896035000,20190908
导入:
load data local inpath '/usr/local/user_behavior_log/1.csv' into table ods_01_releasedatas partition(bdp_day='20200618');
三方数据ODS层的建表检查:
检查,访问50070端口查看:
1.4 数据清洗
数据在采集或进行外部接口调用过程中可能会产生不符合要求的“脏数据”,脏数据产 生原因及处理如下:
1.4.1 格式内容问题产生的原因:
1)不同数据源采集而来的数据内容和格式定义不一致
2)时间、日期格式不一致清洗,根据实际情况,把时间/日期数据库转换成统一 的表示方式。
3)数据类型不符清洗
1.4.2 逻辑错误清洗
1)数据重复清洗
2)数据不完全相同,但从业务角度看待数据是同一个数据,如页面埋点时,进入 页面都会上报一次数据,只有时间不一样,其他字段相同,在统计pv/uv时应该进 行去重。
3)矛盾内容的修正,如身份证号可以验证我们的年龄,然后我们可以根据字段的 数据来源,判断哪个字段提供的信息更为可靠来做修正。
1.4.3 缺失值的清洗
1)数据值缺失是数据分析中经常遇到的问题之一。
2)造成原因:有些对象的某个或某些属性是不可用的,信息被遗漏,或者实时性 高还未来得及做出判断。
3)数据填充:用一定的值去填充空值,从而使信息表完备化。
4)去空处理.
1.4.4 不符合业务需求(或挖掘分析需求)的数据
在这里,为了减少复杂,就不进行数据清洗了。