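In real-world Core ML applications, many scenarios place strict requirements on model inference speed, so the cost of every layer has to be controlled carefully when the model is designed. Below I measure, on a real device, how long some of the more expensive Core ML layers take to run under different parameters, as a reference for model design.
Theoretical reference: How fast is my model?
Test device: iPhone X
Environment: Python 3.6, TensorFlow 1.5, Keras 2.1.6, NumPy 1.15.4, coremltools 2.0, Xcode 10.1
Setting up the measurement environment
First, build a simple Keras model: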
import keras

def create_model():
    inp = keras.layers.Input(shape=(128, 128, 3))
    # Layer under test; here a Conv2D layer
    x = keras.layers.Conv2D(64, (3, 3), strides=(1, 1), padding='valid', name='conv1', use_bias=True)(inp)
    return keras.models.Model(inp, x)

model = create_model()
model.compile(loss="categorical_crossentropy", optimizer="Adam",
              metrics=["accuracy"])
model.summary()
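Our only goal is to measure run time and the model has no practical meaning, so we skip training and fill the weights with random numbers:
import numpy as np

W = model.get_weights()
np.random.seed(12345)  # fixed seed so every run produces the same weights
for i in range(len(W)):
    W[i] = np.random.randn(*(W[i].shape)) * 2 - 1
model.set_weights(W)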
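Finally, convert the model to a Core ML model and save the mlmodel file. It helps to put the operation and its parameters in the file name so that results are easy to compare.
import coremltools

coreml_model = coremltools.converters.keras.convert(
    model,
    input_names="image",
    image_input_names="image",
    output_names="output")
# Save the mlmodel; the file name encodes the operation and its parameters.
coreml_model.save('Conv2D_3_3_3_128_128_64.mlmodel')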
Collect the code above into a file named task.py (the complete code is given later) and run it in a terminal:
localhost:CoreMLtest vyyv$ python3 task.py
The output is as follows:
Using TensorFlow backend.
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 128, 128, 3) 0
_________________________________________________________________
conv1 (Conv2D) (None, 126, 126, 64) 1792
=================================================================
Total params: 1,792
Trainable params: 1,792
Non-trainable params: 0
_________________________________________________________________
2019-01-07 14:50:26.523392: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.2 AVX AVX2 FMA
0 : input_1, <keras.engine.topology.InputLayer object at 0x1028f5630>
1 : conv1, <keras.layers.convolutional.Conv2D object at 0x11d9719b0>
/var/folders/2k/n498pw5d4qb41y0_wc4vmrcm0000gp/T/tmpj6ik2p3s.mlmodel
Input name(s) and shape(s):
image : (C,H,W) = (3, 128, 128)
Neural Network compiler 0: 100 , name = conv1, output shape : (C,H,W) = (64, 126, 126)
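An mlmodel file has now been generated; you can find it in the folder that contains task.py.
The next step is to create a simple Xcode project to test how fast the model runs.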
import UIKit
import CoreML
import os.signpost
class ViewController: UIViewController {
lazy var input_:MLFeatureProvider = {
var image = UIImage.init(named: "5.jpg")
image = image?.resize(to: CGSize(width: 128, height: 128))
let input_ = ModelInput(image: image!.buffer!)
return input_
}()
override func viewDidLoad() {
super.viewDidLoad()
}
@IBAction func test(){
let model = Conv2D_3_3_3_128_128_64().model
runModel(model: model)
}
func runModel(model:MLModel) {
let starttimeInterval: TimeInterval = Date().timeIntervalSince1970
let start = CLongLong(round(starttimeInterval*1000))
if let output = try? model.prediction(from:input_) {
print("finish")
} else {
print("error")
}
let endtimeInterval: TimeInterval = Date().timeIntervalSince1970
let end = CLongLong(round(endtimeInterval*1000))
print("time: \(end-start)")
}
}
/// Model Prediction Input Type
class ModelInput : MLFeatureProvider {
/// Input image as a color image buffer, 128 pixels wide by 128 pixels high (the `buffer` helper creates it as kCVPixelFormatType_32ARGB)
var image: CVPixelBuffer
var featureNames: Set<String> {
get {
return ["image"]
}
}
func featureValue(for featureName: String) -> MLFeatureValue? {
if (featureName == "image") {
return MLFeatureValue(pixelBuffer: image)
}
return nil
}
init(image: CVPixelBuffer) {
self.image = image
}
}
The image conversion helpers used above:
import UIKit
extension UIImage
{
var buffer: CVPixelBuffer? {
let attrs = [kCVPixelBufferCGImageCompatibilityKey: kCFBooleanTrue, kCVPixelBufferCGBitmapContextCompatibilityKey: kCFBooleanTrue] as CFDictionary
var pixelBuffer: CVPixelBuffer?
let status = CVPixelBufferCreate(kCFAllocatorDefault, Int(self.size.width), Int(self.size.height), kCVPixelFormatType_32ARGB, attrs, &pixelBuffer)
guard (status == kCVReturnSuccess) else {
return nil
}
CVPixelBufferLockBaseAddress(pixelBuffer!, CVPixelBufferLockFlags(rawValue: 0))
let pixelData = CVPixelBufferGetBaseAddress(pixelBuffer!)
let rgbColorSpace = CGColorSpaceCreateDeviceRGB()
let context = CGContext(data: pixelData, width: Int(self.size.width), height: Int(self.size.height), bitsPerComponent: 8, bytesPerRow: CVPixelBufferGetBytesPerRow(pixelBuffer!), space: rgbColorSpace, bitmapInfo: CGImageAlphaInfo.noneSkipFirst.rawValue)
context?.translateBy(x: 0, y: self.size.height)
context?.scaleBy(x: 1.0, y: -1.0)
UIGraphicsPushContext(context!)
self.draw(in: CGRect(x: 0, y: 0, width: self.size.width, height: self.size.height))
UIGraphicsPopContext()
CVPixelBufferUnlockBaseAddress(pixelBuffer!, CVPixelBufferLockFlags(rawValue: 0))
return pixelBuffer
}
func resize(to newSize: CGSize) -> UIImage {
UIGraphicsBeginImageContextWithOptions(CGSize(width: newSize.width, height: newSize.height), true, 1.0)
self.draw(in: CGRect(x: 0, y: 0, width: newSize.width, height: newSize.height))
let resizedImage = UIGraphicsGetImageFromCurrentImageContext()!
UIGraphicsEndImageContext()
return resizedImage
}
}
Connect the iPhone X and run the app; the Xcode debug console shows the following:
finish
time: 10
The `time: 10` above is the time the model took to run: 10 milliseconds.
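Note that in this simple model the input and output layers also take time, and that time depends on the size of the input and output data. To measure the time of one specific layer precisely, we introduce a Core ML custom layer: the start and end of a custom layer can be marked with os_signpost, so placing one custom layer before and one after the layer under test lets us compute that layer's run time.
Note: there is a flaw here. To use os_signpost markers, the layers we add run on the CPU, and exchanging data between the GPU and the CPU takes time. With consecutive GPU operations that time would be saved.
We add a Swish activation function to the original model. For the principle and method, see "Custom Layers in Core ML" or its Chinese translation, "Core ML中的自定义层(译)".
The complete task.py after adding the custom Swish activation function: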
import keras
import numpy as np
import coremltools

def convert_lambda(layer):
    # Map the Keras Lambda layer that wraps swish to a Core ML custom layer.
    if layer.function == swish:
        params = coremltools.proto.NeuralNetwork_pb2.CustomLayerParams()
        params.className = "Swish"
        params.description = "A fancy new activation function"
        return params
    else:
        return None

def swish(x):
    return keras.backend.sigmoid(x) * x

def create_model():
    inp = keras.layers.Input(shape=(128, 128, 3))
    x = keras.layers.Lambda(swish)(inp)  # custom layer before the layer under test
    x = keras.layers.Conv2D(64, (3, 3), strides=(1, 1), padding='valid', name='conv1', use_bias=True)(x)
    x = keras.layers.Lambda(swish)(x)    # custom layer after the layer under test
    return keras.models.Model(inp, x)

model = create_model()
model.compile(loss="categorical_crossentropy", optimizer="Adam",
              metrics=["accuracy"])
model.summary()

W = model.get_weights()
np.random.seed(12345)
for i in range(len(W)):
    W[i] = np.random.randn(*(W[i].shape)) * 2 - 1
model.set_weights(W)

coreml_model = coremltools.converters.keras.convert(
    model,
    input_names="image",
    image_input_names="image",
    output_names="output",
    add_custom_layers=True,
    custom_conversion_functions={ "Lambda": convert_lambda })
coreml_model.save('R_Conv2D_3_3_3_128_128_64.mlmodel')
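Add Swish.swift to the Xcode project. Note that the encode method, which would run the layer on the GPU, must be commented out here; we run on the CPU so that os_signpost is easy to use.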
import Foundation
import CoreML
import Accelerate
import os.signpost
@objc(Swish) class Swish: NSObject, MLCustomLayer {
let swishPipeline: MTLComputePipelineState
required init(parameters: [String : Any]) throws {
// print(#function, parameters)
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let swishFunction = library.makeFunction(name: "swish")!
swishPipeline = try! device.makeComputePipelineState(
function: swishFunction)
super.init()
}
func setWeightData(_ weights: [Data]) throws {
//print(#function, weights)
}
func outputShapes(forInputShapes inputShapes: [[NSNumber]]) throws
-> [[NSNumber]] {
// print(#function, inputShapes)
return inputShapes
}
func evaluate(inputs: [MLMultiArray], outputs: [MLMultiArray]) throws {
print("swish")
let log = OSLog(subsystem: "Swish", category: OSLog.Category.pointsOfInterest)
os_signpost(OSSignpostType.begin, log: log, name: "Swish cpu")
for i in 0..<inputs.count {
let input = inputs[i]
let output = outputs[i]
let count = input.count
let iptr = UnsafeMutablePointer<Float>(OpaquePointer(input.dataPointer))
let optr = UnsafeMutablePointer<Float>(OpaquePointer(output.dataPointer))
// output = -input
vDSP_vneg(iptr, 1, optr, 1, vDSP_Length(count))
// output = exp(-input)
var countAsInt32 = Int32(count)
vvexpf(optr, optr, &countAsInt32)
// output = 1 + exp(-input)
var one: Float = 1
vDSP_vsadd(optr, 1, &one, optr, 1, vDSP_Length(count))
// output = x / (1 + exp(-input))
vvdivf(optr, iptr, optr, &countAsInt32)
}
os_signpost(OSSignpostType.end, log: log, name: "Swish cpu")
}
/* func encode(commandBuffer: MTLCommandBuffer,
inputs: [MTLTexture], outputs: [MTLTexture]) throws {
if let encoder = commandBuffer.makeComputeCommandEncoder() {
for i in 0..<inputs.count {
encoder.setTexture(inputs[i], index: 0)
encoder.setTexture(outputs[i], index: 1)
encoder.dispatch(pipeline: swishPipeline, texture: inputs[i])
encoder.endEncoding()
}
}
}*/
}
extension MTLComputeCommandEncoder {
public func dispatch(pipeline: MTLComputePipelineState, texture: MTLTexture) {
let w = pipeline.threadExecutionWidth
let h = pipeline.maxTotalThreadsPerThreadgroup / w
let threadGroupSize = MTLSizeMake(w, h, 1)
let threadGroups = MTLSizeMake(
(texture.width + threadGroupSize.width - 1) / threadGroupSize.width,
(texture.height + threadGroupSize.height - 1) / threadGroupSize.height,
(texture.arrayLength + threadGroupSize.depth - 1) / threadGroupSize.depth)
setComputePipelineState(pipeline)
dispatchThreadgroups(threadGroups, threadsPerThreadgroup: threadGroupSize)
}
}
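The four Accelerate calls in `evaluate` (negate, exponentiate, add one, divide) can be cross-checked off-device with a plain Python reference. This is only a sketch for sanity-checking the arithmetic; the function names `swish_stepwise` and `swish_direct` are made up for this illustration and are not part of the project.

```python
import math


def swish_stepwise(xs):
    """Swish computed in the same four steps as the CPU custom layer."""
    out = [-x for x in xs]                    # like vDSP_vneg:  -x
    out = [math.exp(v) for v in out]          # like vvexpf:     exp(-x)
    out = [1.0 + v for v in out]              # like vDSP_vsadd: 1 + exp(-x)
    return [x / d for x, d in zip(xs, out)]   # like vvdivf:     x / (1 + exp(-x))


def swish_direct(xs):
    """Swish as defined in task.py: sigmoid(x) * x."""
    return [x * (1.0 / (1.0 + math.exp(-x))) for x in xs]


samples = [-3.0, -0.5, 0.0, 0.5, 3.0]
stepwise = swish_stepwise(samples)
direct = swish_direct(samples)
max_diff = max(abs(a - b) for a, b in zip(stepwise, direct))
print(max_diff)
```

If the two versions agree to within floating-point noise, the step order in the Swift code matches the Keras definition.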
Then modify the runModel function in ViewController:
func runModel(model:MLModel) {
let log = OSLog(subsystem: "main", category: OSLog.Category.pointsOfInterest)
os_signpost(OSSignpostType.begin, log: log, name: "main")
if let output = try? model.prediction(from:input_) {
print("finish")
} else {
print("error")
}
os_signpost(OSSignpostType.end, log: log, name: "main")
}
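This time we run the app in Instruments.
Choose Product -> Profile, select Blank, click the + button in the top-right corner, and pick os_signpost from the list. The window now looks like this:
Click the red record button and run the test model several times on the iPhone; the display looks like this:
Stretch the display area vertically to show the names of our os_signpost markers on the left. The timeline on the right can also be zoomed in to inspect run-time details:
In the figure above, the fourth row is the time of one model run and the second row shows the two Swish layer runs. The time between them is the run time of the layer we want to measure, and it can be measured by dragging with the mouse.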
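The same subtraction can be written down explicitly if the signpost intervals are read off as numbers instead of dragged out with the mouse. The sketch below assumes each Swish interval is available as a (begin, end) pair in milliseconds; the helper name `layer_time_ms` and the timestamps are invented for illustration.

```python
def layer_time_ms(first_swish, second_swish):
    """Time between the end of the first Swish interval and the start of the second.

    Each argument is a (begin_ms, end_ms) pair read off the Instruments timeline.
    """
    first_end = first_swish[1]
    second_begin = second_swish[0]
    if second_begin < first_end:
        raise ValueError("intervals overlap or are in the wrong order")
    return second_begin - first_end


# Made-up timestamps in milliseconds, for illustration only.
gap = layer_time_ms(first_swish=(2.0, 3.5), second_swish=(8.5, 10.0))
print(gap)  # 5.0
```

As noted earlier, this gap also contains any GPU/CPU transfer time around the measured layer.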
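os_signpost has many other useful features that are not covered here; feel free to explore them.
The preparation for measuring the run time of each neural-network layer is now complete; the actual measurements follow.
Note: all measurements in this article are averages over multiple runs. Given the limited test conditions, some deviation is unavoidable, so do not use the numbers directly. What matters is the order of magnitude of the time spent and how speeds compare under different parameters.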
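One way to turn several readings into a single reported number is sketched below. It is not the exact procedure used for the tables; the `warmup` parameter (discarding the first runs, which are often slower) is an assumption added here, and the sample values are made up.

```python
import statistics


def summarize_ms(samples, warmup=0):
    """Return (mean, sample standard deviation) of timings after dropping warm-up runs."""
    kept = samples[warmup:]
    if not kept:
        raise ValueError("no samples left after dropping warm-up runs")
    spread = statistics.stdev(kept) if len(kept) > 1 else 0.0
    return statistics.mean(kept), spread


# Made-up readings in milliseconds; the first run is treated as warm-up.
mean_ms, spread_ms = summarize_ms([9.0, 5.0, 5.5, 4.5], warmup=1)
print(mean_ms, spread_ms)
```

Reporting the spread next to the mean makes it easier to judge whether two configurations really differ.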
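Convolution layers
The neural-network models I have worked on are all image-related. Convolution layers are the most common and most time-consuming layers in them, so the measurements here focus mainly on convolution.
2D convolution layer
The compute cost of a 2D convolution layer is
MACC = K × K × Cin × Hout × Wout × Cout
- Hout × Wout: the number of pixels in the output feature map
- K × K: the width and height of the convolution kernel
- Cin: the number of input channels
- Cout: the number of convolution kernels, i.e. the number of output channels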
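The formula translates directly into a small helper, which is handy for filling in the MACC and params columns of the tables that follow. The function names are chosen for this sketch only; the parameter count assumes one bias per output channel, as with `use_bias=True`.

```python
def conv2d_macc(k, c_in, h_out, w_out, c_out):
    """Multiply-accumulate count of a K x K convolution: K*K*Cin*Hout*Wout*Cout."""
    return k * k * c_in * h_out * w_out * c_out


def conv2d_params(k, c_in, c_out, use_bias=True):
    """Weight count: one K x K x Cin kernel per output channel, plus optional biases."""
    return k * k * c_in * c_out + (c_out if use_bias else 0)


macc = conv2d_macc(k=3, c_in=3, h_out=256, w_out=256, c_out=64)
params = conv2d_params(k=3, c_in=3, c_out=64)
print(macc, params)  # 113246208 1792
```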
The Keras signature of the 2D convolution layer is:
keras.layers.convolutional.Conv2D(filters, kernel_size, strides=(1, 1), padding='valid', data_format=None, dilation_rate=(1, 1), activation=None, use_bias=True, kernel_initializer='glorot_uniform', bias_initializer='zeros', kernel_regularizer=None, bias_regularizer=None, activity_regularizer=None, kernel_constraint=None, bias_constraint=None)
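The main parameters that affect the compute cost:
- filters: the number of convolution kernels (the output depth), i.e. Cout in the formula
- kernel_size: the width and height of the kernel, i.e. K × K in the formula
- strides: the convolution stride; together with padding it determines Hout × Wout in the formula
- padding: the zero-padding strategy, either "valid" or "same"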
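How strides and padding determine Hout and Wout can be sketched as follows, using the usual Keras rules for a dilation rate of 1. The helper name `conv_output_size` is made up for this example.

```python
import math


def conv_output_size(n_in, kernel, stride=1, padding="valid"):
    """Output length along one spatial axis (dilation rate 1)."""
    if padding == "same":
        # Zero padding is added so that only the stride shrinks the output.
        return math.ceil(n_in / stride)
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (n_in - kernel) // stride + 1
    raise ValueError("padding must be 'valid' or 'same'")


valid_size = conv_output_size(128, kernel=3, stride=1, padding="valid")
same_size = conv_output_size(128, kernel=3, stride=1, padding="same")
print(valid_size, same_size)  # 126 128
```

The 126 matches the (None, 126, 126, 64) output shape printed by model.summary() for the first test model, which used padding='valid'.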
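First, measure the run time of the convolution layer for different values of kernel_size.
Settings: input image shape=(256, 256, 3), filters=64, strides=(1, 1), padding=same, so the output feature map Hout × Wout equals the input image size, 256×256.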
K | MACC | ms | MACC/ms | params |
---|---|---|---|---|
1 | 1x1x3x256x256x64=12582912 | 12 | 1.0m (million) | 256 |
3 | 3x3x3x256x256x64=113246208 | 14 | 8.1m | 1792 |
5 | 5x5x3x256x256x64=314572800 | 21 | 15m | 4864 |
7 | 7x7x3x256x256x64=616562688 | 22 | 28m | 9472 |
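Note: the "m" in the MACC/ms column means million; it is not a unit of time. Some readers have misread it.
Settings: input image shape=(128, 128, 3), filters=64, strides=(1, 1), padding=same, so the output feature map Hout × Wout equals the input image size, 128×128.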
K | MACC | ms | MACC/ms | params |
---|---|---|---|---|
1 | 1x1x3x128x128x64=3145728 | 4.5 | 0.7m | 256 |
3 | 3x3x3x128x128x64=28311552 | 5.2 | 5.4m | 1792 |
5 | 5x5x3x128x128x64=78643200 | 6.0 | 13.1m | 4864 |
7 | 7x7x3x128x128x64=154140672 | 6.2 | 24.8m | 9472 |
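As K grows, the model's throughput (MACC/ms) increases markedly; Core ML has presumably been optimized for speed with respect to kernel size.
Next, measure the run time of the convolution layer for different output feature map sizes Hout × Wout.
Settings: filters=64, kernel_size=(3, 3), strides=(1, 1), padding=same. The output feature map Hout × Wout equals the input image size, so changing the input width and height changes the output size accordingly.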
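The throughput column can be recomputed from the other two columns. The sketch below does that for the 128×128 measurements; the timings are copied from the table, while the helper name `throughput_m_per_ms` is invented for this example.

```python
def throughput_m_per_ms(macc, ms):
    """Throughput in millions of MACC per millisecond."""
    return macc / ms / 1e6


# K -> measured time in ms, copied from the 128x128 table.
measured_ms = {1: 4.5, 3: 5.2, 5: 6.0, 7: 6.2}

rows = []
for k, ms in measured_ms.items():
    macc = k * k * 3 * 128 * 128 * 64
    rows.append((k, macc, throughput_m_per_ms(macc, ms)))
    print(f"K={k}  MACC={macc}  {rows[-1][2]:.1f}m MACC/ms")

# From K=1 to K=7 the work grows 49x, while the measured time grows far less.
work_ratio = rows[-1][1] / rows[0][1]
time_ratio = measured_ms[7] / measured_ms[1]
print(work_ratio, round(time_ratio, 2))
```

Comparing `work_ratio` with `time_ratio` makes the point of the table explicit: with only 3 input channels, larger kernels add a lot of arithmetic but little time.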
Hout × Wout | MACC | ms | MACC/ms | params |
---|---|---|---|---|
64x64 | 3x3x3x64x64x64=7077888 | 1.9 | 3.7m | 1792 |
128x128 | 3x3x3x128x128x64=28311552 | 5.2 | 5.4m | 1792 |
256x256 | 3x3x3x256x256x64=113246208 | 14 | 8.1m | 1792 |
512x512 | 3x3x3x512x512x64=452984832 | 52 | 8.7m | 1792 |
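Next, measure the run time of the convolution layer for different values of filters.
Settings: input image shape=(512, 512, 3), kernel_size=(3, 3), strides=(1, 1), padding=same, so the output feature map Hout × Wout equals the input image size, 512×512.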
filters | MACC | ms | MACC/ms | params |
---|---|---|---|---|
16 | 3x3x3x512x512x16=113246208 | 14 | 8.1m | 448 |
32 | 3x3x3x512x512x32=226492416 | 29 | 7.8m | 896 |
64 | 3x3x3x512x512x64=452984832 | 52 | 8.7m | 1792 |
128 | 3x3x3x512x512x128=905969664 | 90 | 10.1m | 3584 |
Settings: input image shape=(128, 128, 3), kernel_size=(3, 3), strides=(1, 1), padding=same, so the output feature map Hout × Wout equals the input image size, 128×128.
filters | MACC | ms | MACC/ms | params |
---|---|---|---|---|
64 | 3x3x3x128x128x64=28311552 | 5 | 5.7m | 1792 |
128 | 3x3x3x128x128x128=56623104 | 8 | 7.1m | 3584 |
256 | 3x3x3x128x128x256=113246208 | 15 | 7.5m | 7168 |
512 | 3x3x3x128x128x512=226492416 | 28 | 8.1m | 14336 |
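Next, measure the run time of the convolution layer for different numbers of input channels.
If the input layer is an image, the number of input channels is usually 1 or 3. To change the number of input channels, a convolution layer has to be added in front.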
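def create_model():
    inp = keras.layers.Input(shape=(64, 64, 3))  # 64x64, the input size the measurements below use
    # conv1 only sets the number of input channels seen by conv2 (1024 in this example)
    x = keras.layers.Conv2D(1024, (3, 3), strides=(1, 1), padding='same', name='conv1', use_bias=True)(inp)
    x = keras.layers.Lambda(swish)(x)
    # conv2 is the layer under test, bracketed by the two Swish custom layers
    x = keras.layers.Conv2D(128, (3, 3), strides=(1, 1), padding='same', name='conv2', use_bias=True)(x)
    x = keras.layers.Lambda(swish)(x)
    return keras.models.Model(inp, x)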
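Settings: input image shape=(64, 64, 3). The first convolution changes the channel count to the Cin value under test (128, 256, 512, 1024 and so on; the table covers 16 to 1024). The second convolution uses kernel_size=(3, 3), strides=(1, 1), padding=same, so the output feature map Hout × Wout equals the input image size, 64×64.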
Cin | MACC | ms | MACC/ms | params |
---|---|---|---|---|
16 | 3x3x16x64x64x128=75497472 | 4.2 | 18.0m | 18560 |
32 | 3x3x32x64x64x128=150994944 | 6 | 25.2m | 36992 |
64 | 3x3x64x64x64x128=301989888 | 9.5 | 31.8m | 73856 |
128 | 3x3x128x64x64x128=603979776 | 17 | 35.5m | 147584 |
256 | 3x3x256x64x64x128=1.20796e9 | 26 | 46m | 295040 |
512 | 3x3x512x64x64x128=2.415919e9 | 45 | 54m | 589952 |
1024 | 3x3x1024x64x64x128=4.831838e9 | 70 | 69m | 1179776 |
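Depthwise convolution: DepthwiseConv2D
The Core ML convolution layer has a parameter nGroups; when nGroups is not 1, the layer is a grouped convolution. coremltools converts keras.applications.mobilenet.DepthwiseConv2D into a grouped convolution layer with kernelChannels=1 and nGroups equal to the number of output channels of the previous layer.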
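No depthwise timings were measured here, but the effect of nGroups on the amount of work can be estimated. The sketch below uses the standard grouped-convolution count, in which each output channel only sees Cin / nGroups input channels; the function name is made up, and the 128-channel, 64×64 example is chosen only to line up with the Cin table above.

```python
def grouped_conv2d_macc(k, c_in, h_out, w_out, c_out, groups=1):
    """MACC of a grouped convolution; groups=1 is an ordinary convolution."""
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel is connected to c_in // groups input channels.
    return k * k * (c_in // groups) * h_out * w_out * c_out


regular = grouped_conv2d_macc(3, 128, 64, 64, 128, groups=1)
depthwise = grouped_conv2d_macc(3, 128, 64, 64, 128, groups=128)
print(regular, depthwise, regular // depthwise)  # 603979776 4718592 128
```

With nGroups equal to the channel count, the MACC count drops by a factor of Cin. Whether the measured time drops by the same factor on the device is a separate question that this estimate does not answer.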
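Pooling layers
The compute cost of a pooling layer cannot be counted in MACCs; FLOPs are used instead. Its formula is
Hin × Win × Cin
Below are speed measurements for pooling layers (the 6 pooling layers of TinyYOLO):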
Hin × Win × Cin | ms | FLOPS (b = billion) |
---|---|---|
416x416x16=2768896 | 4.1 | 0.67b |
208x208x32=1384448 | 2.7 | 0.51b |
104x104x64=692224 | 2 | 0.35b |
52x52x128=346112 | 1.2 | 0.29b |
26x26x256=173056 | 1 | 0.17b |
13x13x512=86528 | 1 | 0.09b |
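The FLOPS column appears to be Hin × Win × Cin divided by the time in seconds and expressed in billions; that reading is an assumption, but it reproduces the listed values. A sketch, with a helper name chosen for this example:

```python
def pool_flops_per_second(h_in, w_in, c_in, ms):
    """Pooling work (Hin*Win*Cin) divided by the measured time in seconds."""
    return h_in * w_in * c_in / (ms / 1000.0)


# (Hin, Win, Cin, ms) copied from the table.
pool_layers = [
    (416, 416, 16, 4.1),
    (208, 208, 32, 2.7),
    (104, 104, 64, 2.0),
    (52, 52, 128, 1.2),
    (26, 26, 256, 1.0),
    (13, 13, 512, 1.0),
]

rates_b = [pool_flops_per_second(h, w, c, ms) / 1e9 for h, w, c, ms in pool_layers]
for (h, w, c, ms), rate in zip(pool_layers, rates_b):
    print(f"{h}x{w}x{c}  {ms} ms  {rate:.2f}b")
```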
TinyYOLO
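TinyYOLO is a very simple model structure with 9 convolution layers, 6 pooling layers and a few other layers.
The pooling-layer times were listed above; the convolution-layer times are listed below.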
K × K × Cin × Hout × Wout × Cout | ms | MACC/ms |
---|---|---|
3x3x3x416x416x16=74760192 | 9.1 | 8.2m |
3x3x16x208x208x32=199360512 | 8.1 | 24.6m |
3x3x32x104x104x64=199360512 | 6.7 | 29.8m |
3x3x64x52x52x128=199360512 | 5.9 | 33.8m |
3x3x128x26x26x256=199360512 | 6.5 | 30.7m |
3x3x256x13x13x512=199360512 | 8.6 | 23.2m |
3x3x512x13x13x1024=797442048 | 22 | 36.2m |
3x3x1024x13x13x1024=1.594884e9 | 32 | 49.8m |
1x1x1024x13x13x125=21632000 | 1.6 | 13.5m |
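Adding up the rows gives a rough budget for the whole network. The sketch below sums only the convolution and pooling measurements listed in this article; the other layers and any input/output overhead are not included, so the real end-to-end time will differ.

```python
# (K, Cin, Hout, Wout, Cout, ms) for the nine convolution layers in the table.
conv_layers = [
    (3, 3, 416, 416, 16, 9.1),
    (3, 16, 208, 208, 32, 8.1),
    (3, 32, 104, 104, 64, 6.7),
    (3, 64, 52, 52, 128, 5.9),
    (3, 128, 26, 26, 256, 6.5),
    (3, 256, 13, 13, 512, 8.6),
    (3, 512, 13, 13, 1024, 22.0),
    (3, 1024, 13, 13, 1024, 32.0),
    (1, 1024, 13, 13, 125, 1.6),
]
# Measured times of the six pooling layers, in ms.
pool_ms = [4.1, 2.7, 2.0, 1.2, 1.0, 1.0]

total_macc = sum(k * k * cin * h * w * cout for k, cin, h, w, cout, _ in conv_layers)
conv_total_ms = sum(ms for *_, ms in conv_layers)
pool_total_ms = sum(pool_ms)

print(f"total MACC: {total_macc}")
print(f"convolution: {conv_total_ms:.1f} ms, pooling: {pool_total_ms:.1f} ms")
```

On these numbers the convolutions account for most of the measured time, with the two 13×13 layers that have 1024 output channels being the largest single contributors.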