I. Preparation
1. Common shell commands
https://www.cnblogs.com/gsliuruigang/p/6487084.html
2. Installing Homebrew on macOS
https://blog.csdn.net/liaoningxinmin/article/details/85992752
3. Configuring passwordless SSH login
https://blog.csdn.net/liaoningxinmin/article/details/85992752
II. Installing the JDK
Install JDK 8 or earlier! With a newer JDK I hit errors for which I could not find any fix.
https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
III. Installing Hadoop and configuring a pseudo-distributed environment
https://blog.csdn.net/liaoningxinmin/article/details/85992752
https://blog.csdn.net/vbirdbest/article/details/88189753
1. Installing Hadoop with brew
Run brew list to see which packages brew has installed.
Install command:
$ brew install hadoop
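2. Configuring Hadoop (pseudo-distributed mode here; standalone and fully distributed modes also exist)
- a. Environment variables:
Find the Java installation path: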
/usr/libexec/java_home
vim ~/.bash_profile
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_211.jdk/Contents/Home # Java installation path
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
export HADOOP_ROOT_LOGGER=DEBUG,console
export PATH=$PATH:${HADOOP_HOME}/bin
# In vim: Esc then :q! quits without saving; Esc then :wq saves and quits
source ~/.bash_profile # apply the changes immediately
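The JAVA_HOME value above is hard-coded to one JDK build. As a minimal sketch, assuming the macOS java_home helper and its -v flag are available, the path could instead be looked up at login, with the hard-coded path kept as a fallback. The function name resolve_java_home and its two parameters are hypothetical, introduced only for this illustration.

```shell
# Print the JDK 8 home reported by the given helper, or the fallback path
# when the helper is missing or finds no matching JDK.
# resolve_java_home is a hypothetical helper name, not part of Hadoop or macOS.
resolve_java_home() {
  helper="$1"     # e.g. /usr/libexec/java_home
  fallback="$2"   # e.g. the hard-coded jdk1.8.0_211 path
  if [ -x "$helper" ] && found="$("$helper" -v 1.8 2>/dev/null)" && [ -n "$found" ]; then
    printf '%s\n' "$found"
  else
    printf '%s\n' "$fallback"
  fi
}

# Possible use in ~/.bash_profile:
# export JAVA_HOME="$(resolve_java_home /usr/libexec/java_home /Library/Java/JavaVirtualMachines/jdk1.8.0_211.jdk/Contents/Home)"
```

Keeping the fallback means the profile still works if the helper cannot find a JDK 8.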
- b. Configuring core-site.xml
cd /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop
open -e core-site.xml
Change the contents of core-site.xml to:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
- c. Configuring hadoop-env.sh
Find the Java installation path:
/usr/libexec/java_home
Add that Java path to hadoop-env.sh:
cd /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop
ls
open -e hadoop-env.sh
In the opened hadoop-env.sh, add the Java path:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_211.jdk/Contents/Home
- d. Configuring hdfs-site.xml
Change the contents of hdfs-site.xml to:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
- e. Configuring mapred-site.xml
Change the contents of mapred-site.xml to:
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
If the file has the suffix .xml.example, rename it to .xml.
- f. Configuring yarn-site.xml
Change the contents of yarn-site.xml to:
<configuration>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>ranmodeiMac.local</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
</configuration>
ranmodeiMac.local is the name of my own host; run the hostname command to find yours.
3. Running Hadoop
Go to /usr/local/Cellar/hadoop/3.1.2/libexec/bin and format the file system:
cd /usr/local/Cellar/hadoop/3.1.2/libexec/bin
hdfs namenode -format
Go to /usr/local/Cellar/hadoop/3.1.2/sbin and start the NameNode and DataNode:
cd /usr/local/Cellar/hadoop/3.1.2/sbin/
./start-all.sh
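Once the daemons are up, the NameNode Overview page is available in a browser:
NameNode - http://localhost:9870
The All Applications page is also available in a browser:
ResourceManager - http://localhost:8088
Checking the processes with jps shows that the DataNode is missing: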
(base) ranmodeiMac:~ ranmo$ jps
55057 NameNode
55665 Jps
2548
53429 SecondaryNameNode
55579 NodeManager
55484 ResourceManager
Fix: https://www.cnblogs.com/mtime2004/p/10008325.html
(The end of that article has it backwards: copy the DataNode's value into the NameNode's VERSION file.)
Check the log:
cd /usr/local/Cellar/hadoop/3.1.2/libexec/logs
open -e hadoop-ranmo-datanode-ranmodeiMac.local.log
In it, find the message saying that the namenode clusterID and the datanode clusterID differ, and copy the datanode clusterID.
cd /tmp/hadoop-ranmo/dfs/name/current
open -e VERSION
Change clusterID to the datanode clusterID you copied:
#Sun Jul 14 03:53:07 CST 2019
namespaceID=333721495
clusterID=CID-c69af5e0-abad-412f-bf0e-33711cfe47f1
cTime=1563047587165
storageType=NAME_NODE
blockpoolID=BP-471837932-192.168.1.4-1563047587165
layoutVersion=-64
Then stop everything with ./stop-all.sh and start it again with ./start-all.sh; this time everything shows up as expected:
(base) ranmodeiMac:sbin ranmo$ jps
59698 ResourceManager
59507 SecondaryNameNode
59795 NodeManager
2548
59270 NameNode
59863 Jps
59373 DataNode
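The clusterID mismatch described above can also be checked without reading the DataNode log. The following is a rough sketch that compares the clusterID lines of two VERSION files. The function names cluster_id and same_cluster_id are hypothetical, and the DataNode path in the usage comment assumes the default layout under /tmp/hadoop-ranmo, which may differ on other setups.

```shell
# Print the clusterID value stored in a Hadoop VERSION file.
# (cluster_id and same_cluster_id are hypothetical helper names.)
cluster_id() {
  sed -n 's/^clusterID=//p' "$1"
}

# Succeed only when both VERSION files carry the same, non-empty clusterID.
same_cluster_id() {
  name_id="$(cluster_id "$1")"
  data_id="$(cluster_id "$2")"
  [ -n "$name_id" ] && [ "$name_id" = "$data_id" ]
}

# Possible use (the DataNode path is an assumption based on the default layout):
# same_cluster_id /tmp/hadoop-ranmo/dfs/name/current/VERSION \
#                 /tmp/hadoop-ranmo/dfs/data/current/VERSION \
#   || echo "clusterID mismatch: the DataNode will not start"
```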
IV. Path summary
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_211.jdk/Contents/Home # Java installation path
HADOOP_HOME=/usr/local/Cellar/hadoop/3.1.2/libexec
Configuration files:
cd /usr/local/Cellar/hadoop/3.1.2/libexec/etc/hadoop
Formatting the file system:
cd /usr/local/Cellar/hadoop/3.1.2/libexec/bin
Start/stop scripts:
/usr/local/Cellar/hadoop/3.1.2/sbin/
Temporary files:
/tmp/hadoop-ranmo (the default location, since hadoop.tmp.dir is not set in the core-site.xml above)
Log files:
/usr/local/Cellar/hadoop/3.1.2/libexec/logs
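V. Common commands
hadoop fs -ls lists the current directory; -ls -R lists it recursively
hadoop fs -ls / lists the files and folders under the given directory
hadoop fs -mkdir creates a directory
hadoop fs -rm -r -skipTrash /path_to_file/file_name deletes a file
hadoop fs -rm -r -skipTrash /folder_name deletes a folder
hadoop fs -put [localsrc] [dst] uploads a local file to HDFS
hadoop fs -get [src] [localdst] downloads a file from HDFS to the local disk
hadoop fs -copyFromLocal [localsrc] [dst] uploads a local file to HDFS, same as -put
hadoop fs -copyToLocal [src] [localdst] downloads a file from HDFS to the local disk, same as -get
hadoop fs -test -e checks whether a file or directory exists; the exit status $? is 0 if it exists and 1 if not
hadoop fs -text prints a file's contents
hadoop fs -du reports the size in bytes of each file in a directory; -du -s prints a summary total, -du -h uses human-readable units
hadoop fs -tail prints the end of a file
hadoop fs -cp [src] [dst] copies files from source to destination
hadoop fs -mv [src] [dst] moves files from source to destination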
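The exit status of hadoop fs -test -e is mainly useful inside scripts. As a small sketch, the two helper functions below (hypothetical names, not part of Hadoop) use it to create an HDFS directory only when it does not exist yet. They assume the hadoop command is on PATH as configured earlier.

```shell
# Succeed when the given HDFS path exists (wraps `hadoop fs -test -e`).
# hdfs_path_exists and ensure_hdfs_dir are hypothetical helper names.
hdfs_path_exists() {
  hadoop fs -test -e "$1"
}

# Create an HDFS directory only if it is not there yet.
ensure_hdfs_dir() {
  if hdfs_path_exists "$1"; then
    echo "exists: $1"
  else
    hadoop fs -mkdir -p "$1" && echo "created: $1"
  fi
}

# Possible use:
# ensure_hdfs_dir /input
```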
VI. A simple test
Create the input folder:
hadoop fs -mkdir /input
Listing the root directory shows:
ranmodeiMac:~ ranmo$ hadoop fs -ls /
2019-07-14 19:21:30,639 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
drwxr-xr-x - ranmo supergroup 0 2019-07-14 18:15 /input
Create a test folder under /usr/local/Cellar/hadoop/3.1.2/:
cd /usr/local/Cellar/hadoop/3.1.2/
mkdir test
Create dream.txt in the test folder as the test file:
cd /usr/local/Cellar/hadoop/3.1.2/test
touch dream.txt
open -e dream.txt
hadoop fs -put dream.txt /input
# check that the file is now in /input
hadoop fs -ls /input
# it is there; print its contents with -cat
hadoop fs -cat /input/dream.txt
#2019-07-14 21:19:43,064 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
#Hello world
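If the contents do not display, check http://localhost:9870.
If the web UI shows that the file may be corrupted, upload it to HDFS again.
To delete all files in the input folder:
hadoop fs -rm -r '/input/*'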
VII. Installing Hive
1. Installing with brew
brew install hive
2. Setting environment variables
open -e ~/.bash_profile and add:
export HIVE_HOME=/usr/local/Cellar/hive/3.1.1/libexec
export PATH=$PATH:${HIVE_HOME}/bin
source ~/.bash_profile # apply the changes
3. Creating the configuration file
https://www.cnblogs.com/micrari/p/7067968.html
https://blog.csdn.net/u013185349/article/details/86691634
cd /usr/local/Cellar/hive/3.1.1/libexec/conf
cp hive-default.xml.template hive-site.xml # copy the template and then edit the properties; the main point is to reuse the template's XML header
hive-site.xml holds the MySQL location, user name, and related settings.
The final configuration file is:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>xxxxx</value>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive?useSSL=false</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
</configuration>
Here root and xxxxx are your own MySQL user name and password.
4. Linking MySQL and Hive
Copy a mysql-connector jar into Hive's lib directory:
curl -L 'http://www.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.11.tar.gz/from/http://mysql.he.net/' | tar xz
cp mysql-connector-java-8.0.11/mysql-connector-java-8.0.11.jar /usr/local/Cellar/hive/3.1.1/libexec/lib/
You can then check the lib folder:
cd /usr/local/Cellar/hive/3.1.1/libexec/lib
On my machine the whole mysql-connector-java-8.0.11 folder had ended up in lib, but only the connector jar inside it is needed, so I moved the jar out of that folder into lib by hand. Otherwise the schema initialization in the next step cannot load the driver.
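5. Initializing the schema (creating Hive's metastore tables in MySQL)
cd /usr/local/Cellar/hive/3.1.1/libexec/bin
schematool -initSchema -dbType mysql
The output includes "Initialization script hive-schema-3.1.0.mysql.sql"; 3.1.0 is the version of the Hive metastore schema, not of MySQL. The script creates Hive's metadata tables in the MySQL database (they are still empty at this point).
This step may fail. Possible causes:
a. the user name or password in hive-site.xml is wrong;
b. the MySQL Connector jar does not match the database because the jar version is too old
https://blog.csdn.net/qq_21870555/article/details/80711187
c. "Failed to load driver": the connector jar is missing from lib (it must be the jar itself, not a folder)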
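Cause c is easy to miss by eye. Below is a minimal sketch of a check that reports whether a lib directory contains the connector as a real jar file rather than as an unpacked folder. The function name check_connector_jar is hypothetical, and the mysql-connector-java-* naming pattern is an assumption based on the file names used in these notes.

```shell
# Report mysql-connector entries in a lib directory; succeed only if at least
# one of them is a regular .jar file. check_connector_jar is a hypothetical name.
check_connector_jar() {
  lib_dir="$1"
  status=1
  for entry in "$lib_dir"/mysql-connector-java-*; do
    if [ -f "$entry" ] && [ "${entry##*.}" = "jar" ]; then
      echo "ok: $entry"
      status=0
    elif [ -d "$entry" ]; then
      echo "directory, not a jar: $entry"
    fi
  done
  return "$status"
}

# Possible use:
# check_connector_jar /usr/local/Cellar/hive/3.1.1/libexec/lib || echo "connector jar missing"
```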
6. Running Hive
cd $HIVE_HOME
cd bin
hive
Appendix: using MySQL from the macOS terminal
Connect to (open) MySQL; after entering the password it works as usual (quit exits MySQL):
https://www.cnblogs.com/jamescr7/p/7842784.html
/usr/local/mysql/bin/mysql -u root -p
7. A simple test
a. Create the test file student.txt on the desktop
with the following contents:
1,zhangsan,12
2,lisi,13
3,wangwu,14
b. Upload it to Hadoop
hadoop fs -put student.txt /input
hadoop fs -ls /input
The file is listed, so the upload succeeded.
c. Start Hive and load the file into a table
hive
create table student (id int,username string,age int) row format delimited fields terminated by ',';
load data inpath '/input/student.txt' into table student;
select * from student;
The output is exactly as expected:
1 zhangsan 12
2 lisi 13
3 wangwu 14
desc formatted student;
This shows the table details:
OK
# col_name data_type comment
id int
username string
age int
# Detailed Table Information
Database: default
OwnerType: USER
Owner: ranmo
CreateTime: Mon Jul 15 02:43:11 CST 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: hdfs://localhost:9000/user/hive/warehouse/student
Table Type: MANAGED_TABLE
Table Parameters:
bucketing_version 2
numFiles 1
numRows 0
rawDataSize 0
totalSize 35
transient_lastDdlTime 1563129886
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
field.delim ,
serialization.format ,
Time taken: 0.22 seconds, Fetched: 33 row(s)
Use hadoop commands to check the Location shown above and confirm that it holds the Hive table's data:
hadoop fs -ls /user/hive/warehouse/student
hadoop fs -cat /user/hive/warehouse/student/student.txt
Everything displays as expected!
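Appendix: creating an external table in Hive
create external table student_w(id int,username string,age int) row format delimited fields terminated by ',';
The difference between an external table and a managed (internal) table shows up on drop table: dropping a managed table deletes its data, whereas dropping an external table removes only the metadata and leaves the data files in HDFS, where they have to be deleted with a terminal command.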
VIII. Installing sqoop
https://blog.csdn.net/maxmao1024/article/details/79478794
https://blog.csdn.net/scgh_fx/article/details/73522372
1. Installing sqoop with brew
brew install sqoop
2. Configuring environment variables
open -e ~/.bash_profile
export SQOOP_HOME=/usr/local/Cellar/sqoop/1.4.6_1/libexec
export PATH=$PATH:${SQOOP_HOME}/bin
source ~/.bash_profile # apply the changes
3. Creating the configuration file
cd $SQOOP_HOME
cd conf
open -e sqoop-env.sh
Set the Hadoop and Hive paths in it; the other paths can be left unset because those components are not installed yet.
export HADOOP_HOME="/usr/local/Cellar/hadoop/3.1.2/libexec"
export HIVE_HOME="/usr/local/Cellar/hive/3.1.1/libexec"
4. Linking MySQL and sqoop
sqoop essentially acts as the link between MySQL and Hive, so its lib directory also needs a mysql-connector jar; copying the one from Hive's lib is enough (run this from $SQOOP_HOME):
cp /usr/local/Cellar/hive/3.1.1/libexec/lib/mysql-connector-java-8.0.11.jar lib/
You can then check the lib folder:
cd $SQOOP_HOME/lib
5. Importing MySQL data into HDFS with sqoop
Run sqoop help to see the available commands:
Available commands:
codegen Generate code to interact with database records
create-hive-table Import a table definition into Hive
eval Evaluate a SQL statement and display the results
export Export an HDFS directory to a database table
help List available commands
import Import a table from a database to HDFS
import-all-tables Import tables from a database to HDFS
import-mainframe Import datasets from a mainframe server to HDFS
job Work with saved jobs
list-databases List available databases on a server
list-tables List available tables in a database
merge Merge results of incremental imports
metastore Run a standalone Sqoop metastore
version Display version information
Use import to bring the data in:
sqoop import --connect jdbc:mysql://localhost:3306/hive --username root --password lingying --table food --target-dir /input/food
The first attempt failed with:
2019-07-16 01:55:27,454 ERROR manager.SqlManager: Error executing statement: java.sql.SQLException: The connection property 'zeroDateTimeBehavior' acceptable values are: 'CONVERT_TO_NULL', 'EXCEPTION' or 'ROUND'. The value 'convertToNull' is not acceptable.
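This is a version compatibility issue with the connector jar (the 8.x connector accepts only the upper-case values listed in the message); see https://www.2cto.com/net/201806/757728.html
Adjust the command to:
sqoop import --connect 'jdbc:mysql://localhost:3306/hive?zeroDateTimeBehavior=EXCEPTION' --username root --password lingying --table hive_test --target-dir /input/food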
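Because the JDBC URL contains a ? character, it is safer to quote it or to build it in one place, since some shells try to expand an unquoted ? as a filename pattern. The helper below is only a sketch: mysql_jdbc_url is a hypothetical function name, and its default of EXCEPTION simply mirrors the value used in the command above.

```shell
# Build a MySQL JDBC URL with zeroDateTimeBehavior set explicitly.
# mysql_jdbc_url is a hypothetical helper; the fourth argument is optional.
mysql_jdbc_url() {
  host="$1"
  port="$2"
  db="$3"
  behavior="${4:-EXCEPTION}"
  printf 'jdbc:mysql://%s:%s/%s?zeroDateTimeBehavior=%s\n' "$host" "$port" "$db" "$behavior"
}

# Possible use; the double quotes keep the shell from touching the "?":
# sqoop import --connect "$(mysql_jdbc_url localhost 3306 hive)" \
#   --username root -P --table hive_test --target-dir /input/food
```

In the usage comment, -P makes sqoop prompt for the password instead of taking it on the command line.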
The second attempt failed with:
[2019-07-16 02:23:52.743]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
错误: 找不到或无法加载主类 org.apache.hadoop.mapreduce.v2.app.MRAppMaster
[2019-07-16 02:23:52.743]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
错误: 找不到或无法加载主类 org.apache.hadoop.mapreduce.v2.app.MRAppMaster
For more detailed output, check the application tracking page: http://ranmodeiMac.local:8088/cluster/app/application_1563210689881_0009 Then click on links to logs of each attempt.
. Failing the application.
2019-07-16 02:23:53,014 INFO mapreduce.Job: Counters: 0
2019-07-16 02:23:53,021 WARN mapreduce.Counters: Group FileSystemCounters is deprecated. Use org.apache.hadoop.mapreduce.FileSystemCounter instead
2019-07-16 02:23:53,023 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 5.8023 seconds (0 bytes/sec)
2019-07-16 02:23:53,028 WARN mapreduce.Counters: Group org.apache.hadoop.mapred.Task$Counter is deprecated. Use org.apache.hadoop.mapreduce.TaskCounter instead
2019-07-16 02:23:53,028 INFO mapreduce.ImportJobBase: Retrieved 0 records.
2019-07-16 02:23:53,029 ERROR tool.ImportTool: Error during import: Import job failed!
Following https://blog.csdn.net/hongxiao2016/article/details/88919176, I reconfigured yarn-site.xml, which resolved the error.
After re-running the import, it succeeded.
View the data on HDFS:
hadoop fs -ls /input/food
Output:
Found 4 items
-rw-r--r-- 1 ranmo supergroup 0 2019-07-16 02:28 /input/food/_SUCCESS
-rw-r--r-- 1 ranmo supergroup 8 2019-07-16 02:28 /input/food/part-m-00000
-rw-r--r-- 1 ranmo supergroup 9 2019-07-16 02:28 /input/food/part-m-00001
-rw-r--r-- 1 ranmo supergroup 9 2019-07-16 02:28 /input/food/part-m-00002
hadoop fs -cat /input/food/part-m-00000
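Output:
apple,1
Each map task writes its own part-m-* file. Without -m, sqoop splits the table across several parallel map tasks (4 by default), and with only three rows in food this produced three files of one row each.
6. Running the import with a single map task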
sqoop import --connect 'jdbc:mysql://localhost:3306/hive?zeroDateTimeBehavior=EXCEPTION' --username root --password lingying --table hive_test --target-dir /input/food1 -m 1
List the files:
hadoop fs -ls /input/food1
Output:
Found 2 items
-rw-r--r-- 1 ranmo supergroup 0 2019-07-16 02:34 /input/food1/_SUCCESS
-rw-r--r-- 1 ranmo supergroup 26 2019-07-16 02:34 /input/food1/part-m-00000
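7. Importing directly into Hive with sqoop
Without --hive-import, the workflow takes three steps:
a. import the MySQL data into HDFS
b. create the table in Hive
c. load the HDFS data into Hive
sqoop can do all three in a single command:
sqoop import --connect 'jdbc:mysql://localhost:3306/hive?zeroDateTimeBehavior=EXCEPTION' --username root --password lingying --table hive_test --hive-import --hive-table food -m 1 --delete-target-dir
--delete-target-dir is needed because the target directory already exists on HDFS from an earlier run, so it has to be deleted before importing again.
This reported an error: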
2019-07-16 23:33:42,872 INFO hive.HiveImport: FAILED: ParseException line 1:211 missing EOF at ';' near 'TEXTFILE'
2019-07-16 23:33:43,070 ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Hive exited with status 64
at org.apache.sqoop.hive.HiveImport.executeExternalHiveScript(HiveImport.java:389)
at org.apache.sqoop.hive.HiveImport.executeScript(HiveImport.java:339)
at org.apache.sqoop.hive.HiveImport.importTable(HiveImport.java:240)
at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:514)
at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
The data had already been uploaded to Hadoop; the failure happened while loading it into Hive. Running hadoop fs -ls shows:
Found 1 items
drwxr-xr-x - ranmo supergroup 0 2019-07-16 04:02 hive_test
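I had expected only hadoop fs -ls / to list folders. What the above shows is:
- hadoop fs -ls with no path lists the HDFS home directory, /user/ranmo. sqoop uses it as a staging area, so if hadoop fs -ls shows files there, some import did not complete properly;
- when no path is specified, data correctly imported into Hive ends up under /user/hive/warehouse.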
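The two observations above follow from how HDFS resolves paths: a missing or relative path is taken relative to the user's HDFS home directory, /user/<username>. The sketch below only mimics that rule locally to make it concrete. hdfs_resolve is a hypothetical function and does not call Hadoop at all.

```shell
# Mimic how HDFS resolves a path for a given user: absolute paths are kept,
# while an empty or relative path is placed under /user/<username>.
# hdfs_resolve is a hypothetical illustration, not a Hadoop command.
hdfs_resolve() {
  user="$1"
  path="${2:-}"
  case "$path" in
    /*) printf '%s\n' "$path" ;;
    "") printf '/user/%s\n' "$user" ;;
    *)  printf '/user/%s/%s\n' "$user" "$path" ;;
  esac
}

# hdfs_resolve ranmo            corresponds to: hadoop fs -ls
# hdfs_resolve ranmo hive_test  corresponds to: hadoop fs -ls hive_test
# hdfs_resolve ranmo /input     corresponds to: hadoop fs -ls /input
```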
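At first I thought the problem was "Hive exited with status 64", but none of the fixes I found for it worked; the real problem was "FAILED: ParseException line 1:211 missing EOF at ';' near 'TEXTFILE'". That message normally appears only when running a Hive statement, so why did it show up during a sqoop transfer? It turned out that one column in the table was named "na’me". A column name must not contain the ’ character, otherwise the statement that sqoop generates for Hive fails to parse (at least that is my understanding). After renaming the column, the import succeeded.
Appendix: common sqoop commands: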
https://www.cnblogs.com/cenyuhai/p/3306037.html
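IX. Summary
For a beginner, setting up this stack was really hard (and there is so much of the Hadoop ecosystem I have not installed yet). There were a great many pitfalls along the way, and hunting down fixes for each bug was quite a trial.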