adventure part3 02: Sqoop basics and pitfalls

sqoop

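Sqoop has two core functions:

Import (migrating data in)

Export (migrating data out)

Import: move data from relational databases such as MySQL and Oracle into Hadoop storage systems such as HDFS, Hive, and HBase.

Export: move data from the Hadoop file system into a relational database such as MySQL.

At its core Sqoop is still a command-line tool; compared with HDFS and Hive, there is no deep theory behind it.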

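sqoop:
A tool whose job is to migrate data. It does so by translating each Sqoop migration command into a MapReduce (MR) program.

hive:
A tool whose job is to run computations. It relies on HDFS to store the data and translates SQL into MR programs.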

[https://www.cnblogs.com/qingyunzong/p/8707885.html]

Sqoop command for importing a table from MySQL

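Write the command into a script and run the script, or join all lines into a single line before pasting it into the Linux shell; do not run the lines one at a time.
--target-dir /tmp/mdx \ : do not point this at a directory that holds data you need (--delete-target-dir removes it), or at one you have no permission to use.
-m 1 imports with a single map task, which is slow. Seeing map > map in the output is not an error; the data is still being imported.
If the number after -m is greater than 1, you must add --split-by 'column' so the work can be divided among parallel map tasks. The column is usually a numeric key; it does not have to be unique, but its values should be spread evenly, otherwise the tasks get unbalanced loads.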
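
To make the --split-by advice concrete, here is a simplified sketch of how a numeric split column can be carved into ranges, one per parallel task. Each range is the kind of condition that takes the place of $CONDITIONS in --query. The function name `split_ranges` and its arguments are hypothetical, and this only illustrates the idea; it is not Sqoop's actual splitter code.

```shell
#!/bin/sh
# Print one WHERE range per task for a numeric split column.
# Usage: split_ranges COLUMN MIN MAX TASKS   (hypothetical helper)
split_ranges() {
    col=$1; lo=$2; hi=$3; m=$4
    step=$(( (hi - lo) / m ))
    i=0
    while [ "$i" -lt "$m" ]; do
        start=$(( lo + i * step ))
        if [ "$i" -eq $(( m - 1 )) ]; then
            # The last range is closed so the maximum value is included.
            echo "$col >= $start AND $col <= $hi"
        else
            echo "$col >= $start AND $col < $(( start + step ))"
        fi
        i=$(( i + 1 ))
    done
}

split_ranges id 0 1000 4
```

If most rows fall inside one of these ranges, a single task ends up doing most of the work, which is why an evenly distributed column matters more than a unique one.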

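# Drop the old Hive table first
hive -e "drop table if exists ods.dim_date_df"

# Option notes (a comment cannot follow a trailing backslash, so they are listed here):
#   --connect                  JDBC URL used to connect to MySQL
#   --driver                   JDBC driver class
#   --username / --password    MySQL user name and password
#   --query                    SQL statement to run; it must contain $CONDITIONS
#   --fetch-size               number of rows read from the database at a time
#   --hive-table               target table dim_date_df (created automatically by default)
#   --hive-drop-import-delims  strip \n, \r and \01 from string fields when importing into Hive
#   --delete-target-dir        delete the target directory if it already exists
#   --target-dir               after the import this directory holds no data files; they are
#                              placed under Hive's default location, /user/hadoop/sqoop/dim_date_df
#   -m 1                       use one map task for the migration
sqoop import \
--hive-import \
--connect "jdbc:mysql://127.0.0.1:3306/datafrog05_adventure?useUnicode=true&characterEncoding=utf-8&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false&dontTrackOpenResources=true&defaultFetchSize=50000&useCursorFetch=true" \
--driver com.mysql.jdbc.Driver \
--username root \
--password root \
--query 'select * from dim_date_df where $CONDITIONS' \
--fetch-size 50000 \
--hive-table ods.dim_date_df \
--hive-drop-import-delims \
--delete-target-dir \
--target-dir /tmp/mdx \
-m 1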
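
A backslash continues a shell line only when it is the very last character on that line, so an annotated multi-line command cannot keep a comment after each option. One way around this, sketched below, is to build the argument list step by step so every option can carry its own comment. The helper names `run` and `import_dim_date` and the `DRY_RUN` switch are hypothetical, and the JDBC URL is shortened for readability.

```shell
#!/bin/sh
# Print the command when DRY_RUN is 1 (the default in this sketch); run it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

import_dim_date() {
    set -- sqoop import --hive-import
    set -- "$@" --connect "jdbc:mysql://127.0.0.1:3306/datafrog05_adventure?useUnicode=true&characterEncoding=utf-8"  # JDBC URL, shortened
    set -- "$@" --driver com.mysql.jdbc.Driver             # JDBC driver class
    set -- "$@" --username root --password root            # MySQL credentials
    set -- "$@" --query 'select * from dim_date_df where $CONDITIONS'  # single quotes keep $CONDITIONS literal
    set -- "$@" --hive-table ods.dim_date_df               # target Hive table
    set -- "$@" --delete-target-dir --target-dir /tmp/mdx  # HDFS staging directory
    set -- "$@" -m 1                                       # single map task
    run "$@"
}

import_dim_date
```

With `DRY_RUN=0` the same function executes the command instead of printing it, so the script can be reviewed first and then run unchanged.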

Aggregating the data imported from MySQL into a daily summary table with Hive

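This part is mostly SQL and Hive, mainly hive -e statements; Sqoop is not involved.
Check every statement carefully to avoid mistakes.
- A WITH clause must be combined with another SQL statement (it works like temporary storage).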

with table_name1 as (select statement),
  table_name2 as (select statement)
insert into table_name3
select * from table_name1 a
inner join table_name2 b
on a.field = b.field

-- WITH clause example
WITH tmp AS (
        SELECT customer_key, cpzl_zw, order_num, cpzl_zw1
        FROM (
            SELECT customer_key, cpzl_zw
                , row_number() OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS order_num
                , lag(cpzl_zw, 1, NULL) OVER (PARTITION BY customer_key ORDER BY create_date ASC) AS cpzl_zw1
            FROM ods_sales_orders
        ) a
        WHERE cpzl_zw != cpzl_zw1
    ), 
    tmp2 AS (
        SELECT customer_key, cpzl_zw, order_num, cpzl_zw1
            , row_number() OVER (PARTITION BY customer_key ORDER BY order_num) AS cpzl_zw_num
        FROM tmp
    )
SELECT concat(customer_key, '-', concat_ws('-', collect_set(cpzl_zw)))
FROM tmp2
WHERE cpzl_zw_num < 3
GROUP BY customer_key;

# Drop the table if it already exists
hive -e "drop table if exists ods.dw_order_by_day"
# Create the table
hive -e "
CREATE TABLE ods.dw_order_by_day(
  create_date string,
  is_current_year bigint,
  is_last_year bigint,
  is_yesterday bigint,
  is_today bigint,
  is_current_month bigint,
  is_current_quarter bigint,
  sum_amount double,
  order_count bigint)
"
# dim_date: date flags from ods.dim_date_df (aliased b in the join)
# sum_day: daily totals from ods.ods_sales_orders (aliased a in the join)
# The joined rows are inserted into ods.dw_order_by_day.
hive -e "
with dim_date as
(select create_date,
            is_current_year,
            is_last_year,
            is_yesterday,
            is_today,
            is_current_month,
            is_current_quarter
            from ods.dim_date_df),
sum_day as
(select create_date,
        sum(unit_price) as sum_amount,
        count(customer_key) as order_count
        from ods.ods_sales_orders
        group by create_date)
insert into ods.dw_order_by_day
    select b.create_date,
    b.is_current_year,
    b.is_last_year,
    b.is_yesterday,
    b.is_today,
    b.is_current_month,
    b.is_current_quarter,
    a.sum_amount,
    a.order_count
    from sum_day as a
    inner join dim_date as b
    on a.create_date=b.create_date
"

Writing the daily summary table produced by Hive back to MySQL

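(Create the corresponding table in MySQL before exporting.)
The column order in the Hive CREATE TABLE statement must match the column order of the MySQL table; otherwise the export fails.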
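
As an illustration of matching column order, the sketch below prints a MySQL CREATE TABLE statement whose columns follow the same order as ods.dw_order_by_day in Hive. The MySQL table name, the column types, and the VARCHAR length are assumptions made for this example; `mysql_ddl` is a hypothetical helper.

```shell
#!/bin/sh
# Print MySQL DDL with the same column order as the Hive table ods.dw_order_by_day.
# Types and the VARCHAR length are assumptions for this sketch.
mysql_ddl() {
    cat <<'SQL'
CREATE TABLE IF NOT EXISTS dw_order_by_day (
  create_date        VARCHAR(32),
  is_current_year    BIGINT,
  is_last_year       BIGINT,
  is_yesterday       BIGINT,
  is_today           BIGINT,
  is_current_month   BIGINT,
  is_current_quarter BIGINT,
  sum_amount         DOUBLE,
  order_count        BIGINT
);
SQL
}

mysql_ddl
# To apply it: mysql_ddl | mysql -u USERNAME -p DATABASE
```

Keeping the DDL in one place like this makes it easy to compare the column list against the Hive definition before exporting.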
There are three modes for exporting to MySQL.

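# Option notes (a comment cannot follow a trailing backslash, so they are listed here):
#   -Dorg.apache.sqoop.export.text.dump_data_on_error=true  log the offending record when a row fails to export
#   --input-null-string / --input-null-non-string           how NULL values are written in the input files
#   --input-fields-terminated-by                            field delimiter
#   --input-lines-terminated-by                             line delimiter
# SERVER_ADDRESS, DATABASE, USERNAME and PASSWORD are placeholders.
sqoop export \
-Dorg.apache.sqoop.export.text.dump_data_on_error=true \
--connect "jdbc:mysql://SERVER_ADDRESS:3306/DATABASE" \
--username USERNAME \
--password PASSWORD \
--table dw_amount_diff_mdx01 \
--export-dir /user/hive/warehouse/ods_mdx.db/dw_amount_diff \
--input-null-string "\\\\N" \
--input-null-non-string "\\\\N" \
--input-fields-terminated-by "\001" \
--input-lines-terminated-by "\\n" \
-m 1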

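Note: -Dorg.apache.sqoop.export.text.dump_data_on_error=true makes Sqoop write the offending record to the task log when a row fails to export; it does not control whether column names are exported.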
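
The text mentions three export modes but shows a single command. Presumably these are the three behaviours of the Sqoop export tool: plain inserts (the default), updates of existing rows via --update-key, and upserts via --update-mode allowinsert. The sketch below prints the extra options each mode adds. `export_mode_opts` is a hypothetical helper, and `create_date` as the key column is an assumption; use the key of your own target table.

```shell
#!/bin/sh
# Print the extra `sqoop export` options for a given mode.
# Usage: export_mode_opts insert|update|upsert [KEY_COLUMN]
export_mode_opts() {
    mode=$1
    key_col=${2:-create_date}   # hypothetical key column
    case "$mode" in
        insert) printf '' ;;    # default: every row becomes an INSERT
        update) echo "--update-key $key_col --update-mode updateonly" ;;
        upsert) echo "--update-key $key_col --update-mode allowinsert" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

export_mode_opts update
export_mode_opts upsert
# Example use (word splitting of the output is intentional):
#   sqoop export $(export_mode_opts upsert create_date) --connect ... --table ...
```

For MySQL, the upsert mode generally relies on a primary or unique key being defined on the target table, so it is worth checking the table definition before choosing it.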

Sqoop reference links:
https://my.oschina.net/u/3754001/blog/1863320
https://my.oschina.net/u/1765168/blog/1593343
https://www.cnblogs.com/yongestcat/archive/2019/09/12/11510507.html
Use Power BI to visualize the data
