Build the Spark Docker images
Download spark-2.4.8-bin-hadoop2.7.tgz
Note: do not use the "without hadoop" Spark package here. Images built from it fail at runtime with missing classes, such as log4j.
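For reference, the tarball can be fetched from the Apache archive (any mirror works equally well):
wget https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz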
tar -xvf spark-2.4.8-bin-hadoop2.7.tgz
cd spark-2.4.8-bin-hadoop2.7
Edit the Spark Dockerfile
vim kubernetes/dockerfiles/spark/Dockerfile
On line 18, replace FROM openjdk:8-jdk-slim with FROM openjdk:8-jdk-slim-buster.
The default openjdk base image is built on Debian 11, and the spark-py image later layers on top of it. Debian 11 ships Python 3.8 or newer, but Spark 2.4 does not support Python above 3.7, so pyspark fails with "TypeError: an integer is required (got type bytes)".
Switching the base image to Debian 10 (buster) gives you python3 at version 3.7.
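Instead of editing by hand, a one-line sed does the same swap (a sketch; verify the Dockerfile path matches your layout):
sed -i 's|FROM openjdk:8-jdk-slim$|FROM openjdk:8-jdk-slim-buster|' kubernetes/dockerfiles/spark/Dockerfile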
Build the images
bin/docker-image-tool.sh -t v2.4.8 build
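If you already know the target registry, docker-image-tool.sh also accepts a -r flag so the images are tagged with the repo prefix up front (the registry path here is the one used in the examples below):
bin/docker-image-tool.sh -r acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops -t v2.4.8 build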
The apt-get sources are hosted overseas, so the build can be slow; using a proxy is the best option.
Without a proxy you can switch to a China mirror instead (note that switching mirrors can introduce package dependency problems that make the spark-py image build fail).
vim kubernetes/dockerfiles/spark/Dockerfile
Between lines 29 and 31, add:
ADD sources.list /etc/apt/sources.list
Then place the sources.list file in the spark-2.4.8-bin-hadoop2.7 directory.
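A minimal sources.list for Debian 10 (buster) might look like the following; the Aliyun mirror is just one choice of domestic mirror:
deb https://mirrors.aliyun.com/debian/ buster main contrib non-free
deb https://mirrors.aliyun.com/debian/ buster-updates main contrib non-free
deb https://mirrors.aliyun.com/debian-security buster/updates main contrib non-free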
When the build finishes you will have three images:
spark, spark-py, and spark-r.
Push all three to your image registry.
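If the images were built with the -r flag shown earlier, the same tool can push all three in one go (a sketch; run docker login against the registry first):
bin/docker-image-tool.sh -r acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops -t v2.4.8 push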
OSS support for Spark
To let Spark read and write OSS, you can modify the Spark Dockerfile again:
vim kubernetes/dockerfiles/spark/Dockerfile
Add the following JAR lines right below COPY data /opt/spark/data:
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
Alternatively, write a separate Dockerfile and build a new image on top, keeping the original image unchanged:
FROM acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/spark-py:v2.4.8
RUN mkdir -p /opt/spark/jars
# If you need OSS (reading OSS data or writing event logs to OSS), add the following JARs to the image
ADD https://repo1.maven.org/maven2/com/aliyun/odps/hadoop-fs-oss/3.3.8-public/hadoop-fs-oss-3.3.8-public.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/com/aliyun/oss/aliyun-sdk-oss/3.8.1/aliyun-sdk-oss-3.8.1.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/aspectj/aspectjweaver/1.9.5/aspectjweaver-1.9.5.jar $SPARK_HOME/jars
ADD https://repo1.maven.org/maven2/org/jdom/jdom/1.1.3/jdom-1.1.3.jar $SPARK_HOME/jars
docker build -t ack-spark-oss:v2.4.8 .
docker tag ack-spark-oss:v2.4.8 acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-oss:v2.4.8
docker push acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-oss:v2.4.8
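To confirm the JARs actually landed in the image, you can list the jars directory; the Spark entrypoint passes unknown commands through, so this should work as a quick sanity check:
docker run --rm ack-spark-oss:v2.4.8 ls /opt/spark/jars | grep -i oss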
Spark on ACK YAML
This YAML is used to submit a Spark job to ACK.
The spec differs slightly between Scala/Java and Python applications; both variants are shown below.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: "acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-2.4.5:v9"
  imagePullPolicy: Always
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "oss://qa-oss/spark-examples_2.11-2.4.8.jar"
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "oss://qa-oss/spark-events"
    "spark.hadoop.fs.oss.impl": "org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem"
    "spark.hadoop.fs.oss.endpoint": "oss-cn-beijing-internal.aliyuncs.com"
    "spark.hadoop.fs.oss.accessKeySecret": "OSd0RVN"
    "spark.hadoop.fs.oss.accessKeyId": "LTADXrW"
  sparkVersion: "2.4.5"
  imagePullSecrets: [spark]
  restartPolicy:
    type: Never
  driver:
    cores: 2
    coreLimit: "2"
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    serviceAccount: spark
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
  executor:
    cores: 2
    instances: 5
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
If your image registry is public, you can drop the imagePullSecrets field.
If the registry requires authentication, keep imagePullSecrets; [spark] refers to a docker-registry Secret holding the registry username and password (see the command sketch below).
mainApplicationFile is the location of the job jar; it can live on oss:// or hdfs://, while local:// means the jar must already be inside the image.
The sparkConf section configures spark-history event logging; delete it if you don't need a history server.
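The spark pull secret referenced by imagePullSecrets can be created like this (credentials are placeholders; the namespace matches the example):
kubectl create secret docker-registry spark \
  --docker-server=acr-test01-registry.cn-beijing.cr.aliyuncs.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n default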
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Python
  mode: cluster
  image: "acr-test01-registry.cn-beijing.cr.aliyuncs.com/netops/ack-spark-2.4.5:v9"
  imagePullPolicy: Always
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  sparkVersion: "2.4.5"
  pythonVersion: "3"
  imagePullSecrets: [spark]
  restartPolicy:
    type: Never
  driver:
    cores: 2
    coreLimit: "2"
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    serviceAccount: spark
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
  executor:
    cores: 2
    instances: 5
    memory: "3g"
    memoryOverhead: "1g"
    labels:
      version: 2.4.5
    annotations:
      k8s.aliyun.com/eci-kube-proxy-enabled: 'true'
      k8s.aliyun.com/eci-image-cache: "true"
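To submit and watch either job, save the manifest and hand it to the operator (assuming the Spark Operator is installed and the file is named spark-pi.yaml):
kubectl apply -f spark-pi.yaml
kubectl get sparkapplication spark-pi -n default
kubectl logs -f spark-pi-driver -n default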