Spark 2.4 added support for an image data source.
import org.apache.spark.sql.SparkSession

object ImageDataSourceTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .master("local[2]")
      .appName("ImageDataSourceTest")
      .getOrCreate()
    // $example on$
    val df = spark.read.format("image")
      .option("dropInvalid", value = true) // drop invalid images from the result
      .load("D:\\data\\image")
    df.select("image.origin", "image.width", "image.height")
      .show(truncate = true)
    // $example off$
    spark.stop()
  }
}
Schema of df:
root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
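If the images sit in nested directories and you need the directory name as a column, name each directory in the form cls=<value> (for example, cls=cat). The schema then gains an extra column at the same level as image: "|-- cls: string (nullable = true)"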
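To make the key=value directory convention concrete, here is a minimal sketch in plain Scala that pulls such segments out of a file path. It only mimics the naming convention and is not Spark's own implementation; the names PartitionDirs and partitionValues, and the example paths, are hypothetical.

```scala
object PartitionDirs {
  // Collects key=value directory segments from a file path.
  // Hypothetical helper for illustration; Spark derives these columns itself.
  def partitionValues(path: String): Map[String, String] =
    path
      .split("[/\\\\]")          // split on either '/' or '\'
      .dropRight(1)              // the last segment is the file name, skip it
      .map(_.split("=", 2))      // "cls=cat" becomes Array("cls", "cat")
      .collect { case Array(k, v) if k.nonEmpty => k -> v }
      .toMap
}
```

For a path such as D:\data\image\cls=cat\a.jpg this yields a single entry with key cls and value cat, which corresponds to the extra cls column described above.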
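- origin: path of the image file
- height: image height
- width: image width
- nChannels: number of image channels; typically 1 for grayscale images, 3 for color images (e.g. RGB), and 4 for color images with an alpha channel
- mode: OpenCV-compatible type code: "CV_8UC1" -> 0, "CV_8UC3" -> 16, "CV_8UC4" -> 24, matching the channel counts 1, 3, and 4 respectively
- data: BinaryType; image bytes laid out in an OpenCV-compatible order, in most cases row by row with BGR channel order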
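The field descriptions above can be sketched as a small pixel lookup in plain Scala. This assumes the common case only: 8-bit channels, row-wise layout without row padding, and B, G, R byte order. The names ImageRowLayout, modeToType, pixelOffset, and rgbAt are hypothetical helpers, not part of Spark.

```scala
object ImageRowLayout {
  // Mode codes from the list above, mapped back to their OpenCV type names.
  val modeToType: Map[Int, String] = Map(
    0  -> "CV_8UC1",
    16 -> "CV_8UC3",
    24 -> "CV_8UC4"
  )

  // Index of the first byte of pixel (row, col),
  // assuming row-wise layout with no padding between rows.
  def pixelOffset(row: Int, col: Int, width: Int, nChannels: Int): Int =
    (row * width + col) * nChannels

  // Returns (red, green, blue) as unsigned values in 0..255,
  // assuming the bytes of each pixel are stored as B, G, R.
  def rgbAt(data: Array[Byte], row: Int, col: Int,
            width: Int, nChannels: Int): (Int, Int, Int) = {
    require(nChannels >= 3, "needs a color image with at least 3 channels")
    val i = pixelOffset(row, col, width, nChannels)
    val b = data(i) & 0xFF     // JVM bytes are signed, so mask to 0..255
    val g = data(i + 1) & 0xFF
    val r = data(i + 2) & 0xFF
    (r, g, b)
  }
}
```

With values collected from a row of df, a call would look like ImageRowLayout.rgbAt(data, 0, 0, width, nChannels). Masking with 0xFF matters because a stored byte of 255 would otherwise read as -1.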