After a full day of fiddling, plus a throttled broadband connection and other odds and ends, I'm exhausted.
1. First of all, pay attention to the version of every component!!!! Otherwise things really are incompatible:
Jupyter does not work with PySpark for Spark 2.1.x and earlier.
Spark does not support Scala 2.11.12, 2.12.*, or 2.10.x.
Zeppelin does not support Spark 2.4.0, does not support JDK 9/10, and may not support OpenJDK.
PySpark does not support Python 3.6.x.
The pyspark installed via pip must be the same version as the real Spark installation.
All of the above are hard-won truths from a full day of trial and error. The combination that finally worked is:
zeppelin 0.8
spark 2.3.0 (with jupyter)
jdk 1.8
scala 2.11.8
python 3.5.4
sbt 1.1.7
pyenv can be used to switch between Python versions.
If the installed Scala or JDK is not one of these versions, it is best to reinstall Scala and the JDK.
After trying it out, with these versions Zeppelin can use Spark and run Spark jobs correctly,
and Jupyter works with PySpark.
Note that the Python installed by default on CentOS 7 is 2.7, which does not work with PySpark 2.3; after starting a standalone PySpark cluster you can easily hit:
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
However, Python 2.7 must not be removed either, or the system falls apart in all kinds of ways; in particular yum stops working,
because yum itself runs on Python 2.7.
What you need to do is repoint the system's python symlink from Python 2.7 to Python 3:
[root@delpc conf]# ls -la /usr/bin/python
lrwxrwxrwx. 1 root root 7 Sep 16 07:02 /usr/bin/python -> python2
[root@delpc conf]# rm -rf /usr/bin/python
[root@delpc conf]# ln -s /usr/local/bin/python3 /usr/bin/python
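To confirm the driver and the workers now agree on the Python version, here is a quick sketch (it assumes an already-created SparkContext named sc, as in the snippets further down):

def worker_python_version(_):
    # Runs inside the worker processes and reports their Python version
    import sys
    return tuple(sys.version_info[:2])

import sys
print("driver :", tuple(sys.version_info[:2]))
print("workers:", sc.parallelize(list(range(4)), 2).map(worker_python_version).distinct().collect())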
https://community.hortonworks.com/questions/188038/version-of-python-of-pyspark-for-spark2-and-zeppel.html
https://stackoverflow.com/questions/47198678/zeppelin-python-conda-and-python-sql-interpreters-do-not-work-without-adding-a
[root@delpc conf]# python
Python 3.5.5 (default, Dec 10 2018, 10:28:01)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
https://blog.csdn.net/weixin_41827162/article/details/84537404
Re-link the yum command (symlinking python3 broke yum):
vi /usr/bin/yum
Change the first line #!/usr/bin/python to #!/usr/bin/python2.7.
Press Esc to leave insert mode, then type :wq to save and quit.
vi /usr/libexec/urlgrabber-ext-down
This shebang also needs to be changed to #!/usr/bin/python2.7,
otherwise yum install -y zlib-devel will fail with an error.
(Source: fyonecon on CSDN, original post: https://blog.csdn.net/weixin_41827162/article/details/84537404)
/etc/profile
export JAVA_HOME=/usr/java/jdk1.8.0_172-amd64
export REDIS_HOME=/usr/local/redis/bin
export S_H=/home/muller/Documents/seldon-server/kubernetes/bin
export NODE_HOME=/usr/local/node
export MONGODB_HOME=/usr/local/mongodb
export SBT_HOME=/usr/local/sbt
export SPARK_HOME=/usr/local/spark
export PYTHON_HOME=/usr/local/python3
export PATH=$PATH:$S_H:/usr/local/python3/bin:$MONGODB_HOME/bin:$NODE_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$REDIS_HOME
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH":$PATH:$SBT_HOME/bin
export ZEPPELIN_HOME=/data/zeppelin
export PATH=$PATH:$ZEPPELIN_HOME/bin:/opt/maven/bin:$JAVA_HOME/bin:$PYTHON_HOME/bin
eval "$(pyenv init -)"
# It's NOT a good idea
from pyspark import *
# Sanity check from inside Jupyter: create a local SparkContext
conf = SparkConf().setMaster('local[*]').setAppName("jupyter")
sc = SparkContext.getOrCreate(conf)
print(sc)
# Confirm which Python interpreter the notebook itself is running on
import sys
sys.executable
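Since the pip-installed pyspark has to match the real Spark version (see the notes at the top), a small sketch to compare the two from the same notebook:

import pyspark
# Version of the pip-installed pyspark package vs. the Spark runtime behind sc; the two should match
print("pip pyspark:", pyspark.__version__)
print("spark      :", sc.version)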
spark-env.sh configuration
/usr/local/spark/conf
cat spark-env.sh
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=python3
SPARK_EXECUTOR_CORES=2
# Number of cores for the executors (Default: 1).
SPARK_EXECUTOR_MEMORY=4G
export SPARK_MASTER_IP=192.168.199.102
export SPARK_LOCAL_IP=192.168.199.102
export SPARK_MASTER_PORT=7077
# Alternatively, bind to the hostname instead of the IP:
# SPARK_LOCAL_IP=cdhcode
zeppelin-env.sh
export SPARK_HOME=/usr/local/spark
export JAVA_HOME=/usr/java/jdk1.8.0_172-amd64
zeppelin spark interpreter settings:
name                    value
args
master                  local[*]
spark.app.name          Zeppe
spark.cores.max         2
spark.executor.memory   4G
The Zeppelin python interpreter settings below must be set; otherwise Zeppelin's built-in Python misbehaves, especially when using pyspark.
https://www.cloudera.com/documentation/enterprise/latest/topics/kudu_development.html#developing
/usr/local/zeppelin/interpreter/python/py4j-0.9.2/src/py4j/java_gateway.py in java_import(jvm_view, import_str)
zeppelin interpreter settings:
name                      value
PYSPARK_DRIVER_PYTHON     /usr/local/python3/bin/python3
PYSPARK_PYTHON            /usr/local/python3/bin/python3
zeppelin.pyspark.python   /usr/local/python3/bin/python3
zeppelin.python           /usr/local/python3/bin/python3
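To verify the interpreter actually picked these up, a quick check in a Zeppelin paragraph (a sketch; the expected path is the one from the table above):

%pyspark
import sys
# Should print /usr/local/python3/bin/python3 if the settings took effect
print(sys.executable)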
ipython kernelspec list
cd /usr/local/python3/share/jupyter/
Setting the Python version for jupyter notebook: install jupyter again inside the virtual environment.
import sys
sys.executable
PySpark 2.1 does not support Python 3.6; it supports Python 3.5 and needs py4j, and you have to tell PySpark which Python it depends on.
PySpark defaults to Python 2.7; it needs to be switched to Python 3.5 (or the corresponding virtual environment):
1. Edit spark-env.sh and append export PYSPARK_PYTHON=/usr/bin/python3 at the end.
2. Copy the modified spark-env.sh into the conf directory of the Spark installation on every worker node.
3. Edit pyspark under the Spark installation's bin directory and change PYSPARK_PYTHON=python to PYSPARK_PYTHON=python3 (the relevant excerpt is shown below); the same change is needed on every worker node.
if [[ $PYSPARK_DRIVER_PYTHON == *ipython* && ! $WORKS_WITH_IPYTHON ]]; then
  echo "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" 1>&2
  exit 1
else
  PYSPARK_PYTHON=python3
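As an alternative sketch (not from the original post), the same variables can be set from Python itself, before the SparkContext and its JVM gateway are created, assuming python3 is on the PATH of both the driver and the workers:

import os
# Must be set before the SparkContext / Java gateway starts
os.environ["PYSPARK_PYTHON"] = "python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local[*]").setAppName("env-check")
sc = SparkContext.getOrCreate(conf)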
https://blog.csdn.net/abc_321a/article/details/82589836
Errors encountered:
PYSPARK_DRIVER_CALLBACK_HOST
TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
(this TypeError is what an older PySpark throws on Python 3.6, which is another reason to stay on Python 3.5)
>>> import pyspark
>>> pyspark.__version__
4. Switch pyenv back to the real Python.
If all else fails, delete the shims: rm -rf /root/.pyenv/shims
pyspark + jupyter errors:
java.util.NoSuchElementException: key not found: _PYSPARK_DRIVER_CALLBACK_HOST
Exception: Java gateway process exited before sending its port number
Both of these point back to the rule above: the pip-installed pyspark has to be the same version as the Spark under SPARK_HOME.
// Scala:
sc.parallelize(Seq(10, 232, 23, 344, 6), 2).glom().collect()
# PySpark (a plain tuple or list instead of Seq):
sc.parallelize((10, 232, 23, 344, 6), 2).glom().collect()
from pyspark import *
conf = SparkConf().setMaster('local[*]').setAppName("jupyter")
sc = SparkContext.getOrCreate(conf)
# glom() turns each partition into a list, so with 5 elements in 5 partitions
# this prints one sub-list per partition, e.g. [[0], [2], [3], [4], [6]]
arr = sc.parallelize([0, 2, 3, 4, 6], 5).glom().collect()
print(arr)
import sys
sys.executable
Working combination, once more: zeppelin 0.8, spark 2.3, jupyter, jdk 1.8, scala 2.11.8, python 3.5.4
from pyspark import *
# Same test, but against the standalone cluster by IP instead of local mode
host = "spark://192.168.199.102:7077"
conf = SparkConf().setMaster(host).setAppName("jupyter")
scz = SparkContext.getOrCreate(conf)
arrz = scz.parallelize([0, 2, 23, 344, 6], 5).glom().collect()
print(arrz)
import sys
sys.executable
from pyspark import *
# And the same test connecting to the master by hostname
host = "spark://cdhcode:7077"
conf = SparkConf().setMaster(host).setAppName("jupyter")
sc = SparkContext.getOrCreate(conf)
arr = sc.parallelize([10, 232, 23, 344, 6], 5).glom().collect()
print(arr)
import sys
sys.executable
https://blog.csdn.net/lx1309244704/article/details/83863889
Fixing Zeppelin/Spark file reads that fail with a Hadoop connection error
https://www.jianshu.com/p/ffb4498c642c
Because Hadoop is installed on this machine in pseudo-distributed mode, Spark picks up Hadoop's configuration.
If Hadoop is left stopped and Spark runs in local mode, text files have to be read with an explicit file:/// prefix and an absolute path;
done that way, it works.
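A minimal sketch of what that looks like (the path is only an example, assuming the usual Spark layout under /usr/local/spark):

# Local mode: read a local file with an explicit file:/// prefix and an absolute path
rdd = sc.textFile("file:///usr/local/spark/README.md")
print(rdd.count())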
Zeppelin fails to start with a 503 error
On a Mac with Java 9 installed, Zeppelin 0.7.3 would start but then report errors; the log showed the Jetty context failing to start. I chased a lot of possible causes without finding anything,
until I suspected the Java version.
Switching to Java 8 solved it completely.
Fixing the Spark NullPointerException after Zeppelin starts
https://www.jianshu.com/p/8c6073017052
Zeppelin normally works fine for me. I saw Zeppelin 0.7.3 and wanted to try it; after downloading and unpacking it, Zeppelin started without problems, but running Spark threw a NullPointerException. Strange. It wasn't a write-permission problem, and it wasn't a network problem; even though Spark is not installed on this machine, Zeppelin can still use its embedded Spark.
I suspected the configuration. Stack Overflow suggested setting spark.driver.host = localhost, and in the Zeppelin logs I did see a remote interpreter process being launched, but nothing very detailed. I set the property and restarted, yet the problem remained. Reading the logs again I noticed something about Hive support, which was odd since I never installed Hive.
And that was exactly it: by default, Zeppelin starts Spark with Hive support, and if Hadoop, YARN and Hive are not running on the machine, Zeppelin fails. The fix is to change the Spark interpreter's Hive-support default to false:
zeppelin.spark.useHiveContext => false
centos7 install python3
yum install openssl-devel zlib-devel python3-devel -y
./configure --prefix=/usr/local/python3 --with-ssl --enable-optimizations
make && make install
ln -s /usr/local/python3/bin/python3 /usr/local/bin/python3
ln -s /usr/local/python3/bin/pip3 /usr/local/bin/pip3
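A quick sanity check (just a sketch) that the new build has ssl and zlib support, which pip needs; run it with the freshly linked /usr/local/bin/python3:

# If either import fails, install the matching -devel packages and re-run ./configure && make install
import ssl
import zlib
print(ssl.OPENSSL_VERSION)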
Uninstall scala:
rpm -qa |grep scala
yum remove scala
Zeppelin 0.7.3+ / 0.8.0 won't start after changing the port
https://www.jianshu.com/p/8e501a678c3b
Zeppelin's default port is 8080. Once Zeppelin is put on a server, 8080 is usually a port everyone fights over, so it generally gets changed;
I usually move Zeppelin to port 7980 (normally via ZEPPELIN_PORT in conf/zeppelin-env.sh or zeppelin.server.port in conf/zeppelin-site.xml). Before the change, Zeppelin 0.8.0 started fine; after the change it would not start, and the log complained that helium.json was missing under conf/. Sure enough it wasn't there, since the distribution doesn't ship one. To get Zeppelin going I created an empty helium.json, but it still wouldn't start and threw a Java NullPointerException, so the only option was to fill it with content that looks like a real helium.json:
{
"status": "OK",
"message": "",
"body": {
"zeppelin.clock": [
{
"registry": "local",
"pkg": {
"type": "APPLICATION",
"name": "zeppelin.clock",
"description": "Clock (example)",
"artifact": "zeppelin-examples\/zeppelin-example-clock\/target\/zeppelin-example-clock-0.7.0-SNAPSHOT.jar",
"className": "org.apache.zeppelin.example.app.clock.Clock",
"resources": [
[
":java.util.Date"
]
],
"icon": "icon"
},
"enabled": false
}
],
"zeppelin-bubblechart": [
{
"registry": "local",
"pkg": {
"type": "VISUALIZATION",
"name": "zeppelin-bubblechart",
"description": "Animated bubble chart",
"artifact": ".\/..\/helium\/zeppelin-bubble",
"icon": "icon"
},
"enabled": true
},
{
"registry": "local",
"pkg": {
"type": "VISUALIZATION",
"name": "zeppelin-bubblechart",
"description": "Animated bubble chart",
"artifact": "zeppelin-bubblechart@0.0.2",
"icon": "icon"
},
"enabled": false
}
],
"zeppelinhorizontalbar": [
{
"registry": "local",
"pkg": {
"type": "VISUALIZATION",
"name": "zeppelinhorizontalbar",
"description": "Horizontal Bar chart (example)",
"artifact": ".\/zeppelin-examples\/zeppelin-example-horizontalbar",
"icon": "icon"
},
"enabled": true
}
]
}
}
Some other configuration, for reference:
cat spark-env.sh
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export YARN_CONF_DIR=/usr/local/hadoop/etc/hadoop
export JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH:/usr/local/hadoop/lib/native
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/hadoop/lib/native
export SPARK_YARN_USER_ENV="JAVA_LIBRARY_PATH=$JAVA_LIBRARY_PATH,LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
export SPARK_LIBRARY_PATH=$SPARK_LIBRARY_PATH:/usr/local/hadoop/lib/native
#export SPARK_JAVA_OPTS="-XX:MetaspaceSize=4G -XX:MaxMetaspaceSize=4G -Duser=$(whoami) -Djava.awt.headless=true"
#-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
#export SPARK_CLASSPATH=$SPARK_CLASSPATH:/usr/local/hadoop/share/hadoop/common/lib/*:/usr/local/hadoop/lib/*
[nlpdev@h21 conf]$ cat spark-defaults.conf
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#
spark.yarn.jars hdfs://ns1/system/spark2/*
spark.eventLog.enabled true
spark.eventLog.dir hdfs://ns1/logs/spark-events/
spark.eventLog.compress true
spark.history.fs.logDirectory hdfs://ns1/logs/spark-events/
spark.yarn.historyServer.address h14.jwopt.cn:18088
spark.master yarn
#spark.submit.deployMode cluster
spark.executor.memory 5g
spark.yarn.executor.memoryOverhead 1024
spark.kryoserializer.buffer.max 521m
spark.executor.cores 4
spark.scheduler.mode FAIR
spark.sql.shuffle.partitions 256
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.yarn.submit.file.replication 3
spark.driver.memory 3072m
spark.yarn.driver.memoryOverhead 1024
spark.driver.cores 1
## Dynamic Resource Allocation
#spark.dynamicAllocation.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.executorIdleTimeout 60s
spark.dynamicAllocation.schedulerBacklogTimeout 1s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s
#spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 200
spark.dynamicAllocation.initialExecutors 2
#This service preserves the shuffle files written by executors so the executors can be safely removed. default:false
spark.shuffle.service.enabled true
###Shuffle
###Maximum size of map outputs to fetch simultaneously from each reduce task,default 48m
spark.reducer.maxSizeInFlight 384m
## # Size of the in-memory buffer for each shuffle file output stream. default 32k
spark.shuffle.file.buffer 256k
#spark.executor.extraClassPath
#spark.driver.extraClassPath
spark.executor.extraJavaOptions -XX:MetaspaceSize=4G -XX:MaxMetaspaceSize=4G -XX:+UseG1GC -XX:G1ReservePercent=15
spark.driver.extraJavaOptions -XX:MetaspaceSize=4G -XX:MaxMetaspaceSize=4G -XX:+UseG1GC -XX:G1ReservePercent=10
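To double-check which of these defaults a running application actually picked up, a small sketch from a PySpark session (sc being the live SparkContext):

# Dump the effective Spark configuration, including values merged in from spark-defaults.conf
for key, value in sorted(sc.getConf().getAll()):
    print(key, "=", value)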