sqoop overview

1/列出mysql数据库中的所有数据库sqoop list-databases -connect jdbc:mysql://localhost:3306/ -username root -password 123456报错：host is not allowed to connect mysql / access denied for ‘root’@‘hadoop1’一个是查看服务器防火墙然后查看 show grants 是否允许远程连接 grant all privileges on *.* to 'root'@'%' identified by 'password' grant option; flush privileges;注意格式 list-databases 是一个命令option 没有空格可以在connect参数中指出具体的dbname2/连接mysql并列出test数据库中的表sqoop list-tables -connect jdbc:mysql://localhost:3306/test -username root -password 1234563/将关系型数据的表结构复制到hive中，只是复制表结构内容不复制 -hive-table hiveTableName 默认指定default数据库的表名可采用如下方式指定具体的数据库内的数据表 hiveDbName.hiveTableName 如果目标表已经存在了，那么创建任务会失败sqoop create-hive-table -connect jdbc:mysql://localhost:3306/test -table student -username root -password 123456 -hive-table studentinhive4/从关系数据库导入文件到hive中 -m,--num-mappersUse n map tasks to import in parallel 导入数据时并行开启n Map任务 sqoop import -connect jdbc:mysql://localhost:3306/test -username root -password 123456 -table student -hive-import -hive-table studentinhive -m 2报错：error tool.importtool error during import no primary key could be found for table s_test .please speify one with --split-by or perform a sequential import with '-m 1'sqoop import -connect jdbc:mysql://localhost:3306/zxtest -username root -password 123456 -table sqoop_test -hive-import -hive-table s_test --split-by id -m 2报错：error tool.importtool encounted ioexception running import job java.io.exception failed to run job application rejected by queue placement policysqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username root -password 123456 -table sqoop_test -hive-import -hive-table s_test --split-by id -m 2假设id为1-6-6-7-8由于执行导入时指定两个map任务以id进行分区的，因此在存储时也是以part-m-00000 和 part-m-00001形式，对应的id为1-4 / 5-9当再次执行导入任务时，设定m为4，默认以追加形式写入数据。会以part-m-00000/part-m-00001/part-m-00002/part-m-00003 对应的id分区为 1-2 3-4 5-6-6 7-8 （默认根据hash进行分区所以第三个分区有三个记录）但是由于当前数据库对应数据表的存储位置下已经存在了part-m-00000和part-m-00001，则新追加导入的以part-m-00000_copy_1 和 part-m-00001_copy_1存在。因此在hive操作时，如果是select 操作，会首先读取part-m-00000 然后是 part-m-00000_copy_1，然后 part-m-00001，然后 part-m-00001_copy_1 然后 part-m-00002，part-m-00003。数据记录为1-4 1-2 5-8 3-4 5-6-6 7-8 hive没有指定字段分隔符和行分隔符时默认的字段分隔符为ascii码的控制符\001,建表的时候用 --fields-terminated-by '\001' --lines-terminated-by “\n”5/将hive中的表数据导入到mysql 中，在进行导入前 mysql中的表hive_test 必须提前创建好sqoop export -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username root -table studentfromhive --fields-terminated-by '\001' --lines-terminated-by “\n” -export-dir /user/hive/warehouse/test_shi.db/studentinhive/studentinhive;会读取studentinhive目录下的多个文件以指定分隔符形式读取字段解析记录，导入mysql数据库中。如果因为某些原因导致中断，任务失败，可能会造成部分数据导入成功。如果要求，保证导入数据的事务性，可以通过配置（类似与分布暂存，然后一次性写入） --staging-tableThe table in which data will be staged before being inserted into the destination table.--clear-staging-table Indicates that any data present in the staging table can be deleted. 由于sqoop是通过map完成数据的导入，各个map过程是独立的，没有事物的概念，可能会有部分map数据导入失败的情况。为了解决这一问题，sqoop中有一个折中的办法，即是指定中间 staging表，成功后再由中间表导入到结果表。这一功能是通过 --staging-table指定，同时staging表结构也是需要提前创建出来的:sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root -password passwd --table sds --export-dir /user/guojian/sds --staging-table sds_tmp需要说明的是，在使用 --direct， --update-key或者--call存储过程的选项时，staging中间表是不可用的。如果指定的--fields-terminated-by '\001'(^A ) --lines-terminated-by “\n” 不正确或者没有指定的话，在yarn（每一个sqoop 导入数据程序配置了队列会以一个jar（导出或者导入的表名）包形式使用mapreduce计算引擎运行在该队列上）的日志里可以看到failed task , cannot export data please check failed map task logs.caused by java.lang.numberformatexception for input string '****'所以在往hive中导入数据时一定要显示定义使用字段分隔符和行分隔符号6/从数据库导出表的数据到HDFS上的文件sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root -table student --fields-terminated-by '\t' --lines-terminated-by “\n” --split-by id -m 2 -target-dir /user/test/studentfile 备注：studentfile 必须是在hdfs文件系统上不存在的文件夹。由于指定 -m 2 因此，会在studentfile 目录下生成三个文件，一个是存储元数据信息，另外两个分别是part-m-00000/part-m-00001通过hdfs dfs -cat ..../studentfile/part-m-00000 浏览对应的数据库记录。7/数据库增量导入表数据到hdfs中sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root -table student --fields-terminated-by '\t' --lines-terminated-by “\n” --split-by id -m 2 -target-dir /user/test/studentfile -check-column id -incremental append -last-value 8hdfs文件里多个两个part-m-00002/part-m-00003While the Hadoop generic arguments must precede any import arguments, you can type the import arguments in any order with respect to one another. arguments are grouped into collections organized by function. Some collections are present in several tools (for example, the "common" arguments). hadoop的类参数需要在任何其他的import参数之前指定，余下所有参数的顺序可以随意。因为解析命令时，调用函数对参数以分组形式写入集合开始后续操作的。其类参数大概有-conf -D -fs -jt -files -libjars -archivessqoop读取数据库是一行一行的读取记录，导入hdfs。导入进程是并发执行的，输出会有多个文件。可能是序列化文件文本文件（分隔符）sqoop option-file$ sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root -table student --fields-terminated-by '\t' --lines-terminated-by “\n” --split-by id -m 2 -target-dir /user/test/studentfile$ sqoop --options-file /users/homer/work/import.txt## Options file for Sqoop import## Specifies the tool being invokedimport# Hadoop parameter and value-Dmapred.job.queue.name=develop# Connect parameter and value--connectjdbc:mysql://localhost:3306/test# Username parameter and value--usernameroot# --fields-terminated-by\t# --lines-terminated-by\n# --split-byid# -m2# --target-dir/user/test/studentfile## Remaining options should be specified in the command line.#注意事项：1/ option在option file里参数和值需要换行且每一类都需要换行且每一个参数和值后不能有空格等多余字符如果是 --split-by id 会报 -- error parsing arguments for import unrecognized argument 2/ --fields-terminated-by ‘\t’ 以及 --lines-terminated-by '\n' 不仅需要换行还需要将单引号去除否则会报 -- error parsing arguments for import unrecognized argument 3/ 可以在option-file 外指定相关配置$ sqoop import --connect jdbc:mysql://database.example.com/employees \ --username venkatesh --password-file ${user.home}/.password使用密码文件读取相关密码需要确保密码文件中没有多余字符可以采取 echo -n “yoursecret” > password.filesqoop默认支持多种数据库，包含mysql，以jdbc:mysql:// 开头的连接字符串会被sqoop自动处理，如果是其他类型的数据库，则需要指定驱动和保证sqoop的lib库中包含对应的jdbc jar包。--null-stringThe string to be written for a null value for string columns--null-non-stringThe string to be written for a null value for non-string columns

The --null-string and --null-non-string arguments are optional.\ If not specified, then the string "null" will be used.

$ sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root -table student --columns “id,name” --where "id>6" --fields-terminated-by '\t' --lines-terminated-by “\n” --split-by id -m 2 -target-dir /user/test/studentfileSelect

注意事项：指定列时不能重复指定

$ sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root --query ’select * from student where id > 7 and $CONDITIONS‘ --fields-terminated-by '\t' --lines-terminated-by “\n” --split-by id -m 2 -target-dir /user/test/studentfileQuery

error： tool.importtool:encountered ioexecption running import job java.io.exception query [select * from student where id > 7] must contains '$CONDITIONS' in where clause

注意事项：where条件中必须要有 $CONDITIONS

如果判断条件中包含的是双引号，则查询语句需要更改使用 \$CONDITIONS 如下：

$ sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root --query "select * from student where name = 'shidaiwu' and \$CONDITIONS" --fields-terminated-by '\t' --lines-terminated-by “\n” --split-by id -m 2 -target-dir /user/test/studentfileQuery

where中不能有or

默认开启4个task

如果是并行查询导入必须指定 -split-by 否则需要指定 -m 1

validate:验证源数据记录和目标数据记录数是否相同 -- 验证成功在语句执行完毕会显示

$ sqoop import -Dmapred.job.queue.name=develop -connect jdbc:mysql://localhost:3306/test -username=root -table student --fields-terminated-by '\t' --lines-terminated-by “\n” --validate -m 2 -target-dir /user/test/studentfile

注意：validation is not supported for free from query but single table only 不支持 query 选项

导出数据时未选择的列在数据库表中需要为可空类型否则数据库将拒绝接受数据导入

可以增加map数，但是当数据库遇到性能瓶颈时，增加map反而会降低性能。

导出数据时，由于非空列的null数据导出至表将会被打断任务

导出任务被分为多个部分多事务其中一部分错误将导致只有部分数据被导入至表如果使用--staging-table 将数据保存至临时表内最终作为单一事务将临时表数据导入至目标表

一次导出任务并行，每一个写操作都是一个独立的事务。注意事务缓存的增长不能超过警戒线，以及内存溢出的清空。因此，一次导出是一件原子事务。可能在数据尚未完全导入之前就可以在表中看见。

导出任务可能失败的原因:

失去连接

违反一致性约束插入

解析一个不完整的或残缺的记录

使用不正确的分割符解析记录

容量问题内存不足或磁盘空间不足

每一个export map任务都在独立的事务中进行作业，相互独立的提交他们各自的任务。如果任务失败，当前事务会回滚而先前已完成的事务将继续保留在数据库内，就会造成部分完成的导出任务。

插入操作保证表已存在违反一致性约束将失败

那些未被选择导出数据的列在数据库中要么具有默认值要么允许null值否则将会失败

导出数据时调用存储过程

Validation

比较源表和目标表的行数验证操作效果

使用sqoop job 定义多次import export

job 重复执行增量导入时允许新增数据持续性加入当前文件

增量导入 --incremental append --check-column id --last-value 100

合并数据集时指定新旧数据集且指定目标目录

合并时保证一致性约束保证主键的唯一性

自动生成java类文件

使用eval快速简单评估sql 查询仅仅用于评估目的验证数据库链接等目的在生产工作流中并不支持

sqoop job --create

sqoop job --delete

sqoop job --exec

sqoop job --show

sqoop job --list

$ sqoop job -Dmapred.job.queue.name=develop --create importjob -- import -connect jdbc:mysql://localhost:3306/test -username=root -table student --fields-terminated-by '\t' --lines-terminated-by “\n” --validate -m 2 -target-dir /user/test/studentfile

注意事项：-Dmapred.job.queue.name=develop --create importjob -- import -connect hadoop类配置在最前，其次指定import时，需要注意留一个空格，否则会报错无法识别的参数

Creating saved jobs is done with the --create action. This operation requires a -- followed by a tool name and its arguments. The tool and its arguments will form the basis of the saved job.

报警告： mapred.job.queue.name 过时了使用 mapreduce.job.queuename=develop

sqoop job --delete importjob

sqoop job --exec importjob

sqoop job --show importjob

查看和执行 importjob时需要指定Mysql的password

sqoop job --meta-connect jdbc:hsqldb:hsql://sqoopMetaserver.example.com:16000/sqoop

增量导入数据的：

sqoop job -Dmapred.job.queue.name=develop --create importjobIncrement -- import -connect jdbc:mysql://localhost:3306/test -username=root -table student --fields-terminated-by '\t' --lines-terminated-by “\n” -m 1 -target-dir /user/test/studentfileJobIncrement -incremental append --check-column id --last-value 0

此时通过show importjobIncrement 发现 last-value = 0

执行sqoop job --exec importjobIncrement 查看对应文件目录后发现数据开始导入

再次通过show importjobIncrement 发现 last-value = 10

insert 数据进入student 表中

执行sqoop job --exec importjobIncrement 查看对应文件目录后发现数据开始增加

再次通过show importjobIncrement 发现 last-value = 11

*********************************************************

Sqoop provides an incremental import mode which can be used to retrieve only rows newer than some previously-imported set of rows.

The following arguments control incremental imports:

Table 5. Incremental import arguments:

Argument Description

--check-column (col) Specifies the column to be examined when determining which rows to import. (the column should not be of type CHAR/NCHAR/VARCHAR/VARNCHAR/ LONGVARCHAR/LONGNVARCHAR)

--incremental (mode) Specifies how Sqoop determines which rows are new. Legal values for mode include append and lastmodified.

--last-value (value) Specifies the maximum value of the check column from the previous import.

Sqoop supports two types of incremental imports: append and lastmodified. You can use the --incremental argument to specify the type of incremental import to perform.

You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.

An alternate table update strategy supported by Sqoop is called lastmodified mode. You should use this when rows of the source table may be updated, and each such update will set the value of a last-modified column to the current timestamp. Rows where the check column holds a timestamp more recent than the timestamp specified with --last-value are imported.

At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data. This is handled automatically by creating an incremental import as a saved job, which is the preferred mechanism for performing a recurring incremental import. See the section on saved jobs later in this document for more information.

**************************************************************

sqoop overview

推荐阅读更多精彩内容