TensorFlow 2: How AutoGraph Works

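TensorFlow offers three ways to build a computation graph: static graphs, dynamic graphs (eager execution), and AutoGraph.

TensorFlow 2.0 mainly uses dynamic graphs and AutoGraph. Dynamic graphs are easy to debug and quick to write, but they execute more slowly. Static graphs execute efficiently but are harder to debug.

AutoGraph converts dynamic-graph code into a static graph, combining the execution efficiency of the latter with the coding efficiency of the former. Not all code can be converted, however: a few coding guidelines must be followed, otherwise the conversion may fail or behave unexpectedly.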

1 How AutoGraph works

What actually happens when we decorate a function with @tf.function?

import tensorflow as tf
import numpy as np 

@tf.function(autograph=True)
def myadd(a,b):
    for i in tf.range(3):
        tf.print(i)
    c = a+b
    print("tracing")
    return c

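Nothing much happens at this point. The decorator only wraps the Python function and records its signature; no graph is built yet.

What happens when we call the decorated function for the first time?

For example, suppose we write the following code.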

myadd(tf.constant("hello"),tf.constant("world"))

Output:

tracing
0
1
2

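Two things happen:

First, the graph is created. TensorFlow creates a static graph and traces the Python code in the function body once, determining the Tensor type of each variable and adding operators to the graph in execution order. During this step, if autograph=True (the default), Python control flow that depends on Tensors is converted into in-graph TensorFlow control flow: if statements become tf.cond, while and for loops become tf.while_loop, and tf.control_dependencies is added where needed to fix the execution order.
Second, the graph is executed.

That is why we first see the result of the first step, where Python prints "tracing" to standard output, and then the result of the second step, where TensorFlow prints 0, 1, 2 to standard output.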
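To make the graph-creation step more concrete, the sketch below traces a small function once and lists the operators that ended up in its graph. The function `add` is a hypothetical stand-in for `myadd` (without the loop), and the exact operator names depend on the TensorFlow version, so treat the printed list as illustrative only.

```python
import tensorflow as tf

@tf.function
def add(a, b):
    # Hypothetical stand-in for myadd, kept minimal on purpose.
    return a + b

# Trace the function for scalar float32 inputs without running it.
concrete = add.get_concrete_function(
    tf.TensorSpec(shape=[], dtype=tf.float32),
    tf.TensorSpec(shape=[], dtype=tf.float32),
)

# The traced graph is an ordinary object whose operators can be listed.
op_types = [op.type for op in concrete.graph.get_operations()]
print(op_types)

# Executing the already-built graph is a separate step.
result = concrete(tf.constant(1.0), tf.constant(2.0))
print(result.numpy())
```

Tracing and execution are deliberately separated here: `get_concrete_function` only builds the graph, and calling the returned object only runs it.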

What happens when we call the decorated function again with arguments of the same types?

For example, suppose we write the following code.

myadd(tf.constant("good"),tf.constant("morning"))

Output:

0
1
2

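Only one thing happens: the second step above, executing the graph. That is why "tracing" is not printed this time.

What happens when we call the decorated function again with arguments of different types?

For example, suppose we write the following code.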

myadd(tf.constant(1),tf.constant(2))

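Because the argument types have changed, the existing graph cannot be reused. Both steps run again: a new graph is created and then executed.

Note that if the arguments passed to the decorated function are not Tensors (for example, Python strings or numbers), they are treated as constants, and a new graph is created for every new argument value.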
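The three situations described so far (decorating, calling with the same types, calling with new types) can be checked with a small experiment. This is a minimal sketch: `trace_log` and `add` are hypothetical names, and the list append is a Python side effect used only to count traces.

```python
import tensorflow as tf

trace_log = []  # hypothetical helper: receives one entry per trace

@tf.function
def add(a, b):
    # Python side effect: runs only while the function is being traced.
    trace_log.append((a.dtype.name, b.dtype.name))
    return a + b

traces_after_decorating = len(trace_log)           # decorating alone traces nothing

add(tf.constant("hello"), tf.constant("world"))    # first call: trace, then run
add(tf.constant("good"), tf.constant("morning"))   # same types: run only
traces_after_strings = len(trace_log)

add(tf.constant(1), tf.constant(2))                # new dtype: trace again, then run
traces_after_ints = len(trace_log)

print(traces_after_decorating, traces_after_strings, traces_after_ints)
```

The counts grow only when a call arrives with a combination of argument types that has not been traced before.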

myadd("hello","world")
myadd("good","morning")

Output:

tracing
0
1
2
tracing
0
1
2
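The difference between Python arguments and Tensor arguments can be made visible by counting traces. The sketch below uses hypothetical names (`double`, `traced_values`); the list append is a Python side effect that runs only during tracing.

```python
import tensorflow as tf

traced_values = []  # hypothetical helper: one entry per trace

@tf.function
def double(x):
    traced_values.append(x)  # runs only while tracing
    return x * 2

# Python numbers are treated as constants: each new value needs its own graph.
double(1)
double(2)
double(1)  # value 1 was already traced, so its graph is reused
python_traces = len(traced_values)

# Tensors of the same dtype and shape share a single graph.
double(tf.constant(1))
double(tf.constant(2))
total_traces = len(traced_values)

print(python_traces, total_traces)
```

Passing Tensors instead of Python values is therefore one way to keep the number of traces small when the values change from call to call.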

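2 AutoGraph coding guidelines

With the mechanism above in mind, the three AutoGraph coding guidelines are easy to understand:

Inside a function decorated with @tf.function, use TensorFlow functions rather than other Python functions wherever possible. For example, use tf.print instead of print, tf.range instead of range, and tf.constant(True) instead of True.

Explanation: Python functions are used only while the function is being traced to create the static graph. Ordinary Python functions cannot be embedded in the static graph, so once the graph has been built they are not evaluated on later calls, whereas TensorFlow functions are embedded in the graph. Using ordinary Python functions therefore makes the output before decoration (eager execution) differ from the output after decoration (static graph execution).

Example: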

import numpy as np
import tensorflow as tf

@tf.function
def np_random():
    a = np.random.randn(3,3)
    tf.print(a)

@tf.function
def tf_random():
    a = tf.random.normal((3,3))
    tf.print(a)

np_random produces the same result on every call.

np_random()
np_random()

Output:

array([[ 1.41689869,  0.36213943, -0.20104379],
       [ 1.1282627 , -1.12600955, -0.44922577],
       [ 1.15880956, -0.53492144, -0.38729054]])
array([[ 1.41689869,  0.36213943, -0.20104379],
       [ 1.1282627 , -1.12600955, -0.44922577],
       [ 1.15880956, -0.53492144, -0.38729054]])

tf_random generates new random numbers on every call.

tf_random()
tf_random()

Output:

[[1.31113029 0.725376606 1.76679862]
 [0.216151401 -0.358448982 -1.20523798]
 [-0.389484823 -1.04814029 1.11230767]]
[[0.383962572 -0.334224582 0.0739119276]
 [0.240156487 -0.839055359 0.527555346]
 [-1.36884451 -0.604626417 1.27370262]]
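The same contrast can be shown without random numbers by counting how often each kind of statement runs. In this sketch, `py_log`, `tf_calls`, and `bump` are hypothetical names: the list append stands for any ordinary Python call, and the variable update stands for any TensorFlow op.

```python
import tensorflow as tf

py_log = []                # plain Python state
tf_calls = tf.Variable(0)  # TensorFlow state, created outside the function

@tf.function
def bump():
    py_log.append("ran")    # Python call: executed only while tracing
    tf_calls.assign_add(1)  # TensorFlow op: embedded in the graph, runs on every call

for _ in range(3):
    bump()

print(len(py_log), tf_calls.numpy())
```

After three calls the Python list has grown once while the TensorFlow variable has been incremented three times, which is the mismatch the guideline warns about.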

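Avoid defining tf.Variable inside a function decorated with @tf.function.

Explanation: if a tf.Variable is defined inside the function, then under eager execution the variable is created on every call. Under static graph execution, the variable is created only in the first step, when the Python code is traced to build the graph. This makes the output before decoration (eager execution) differ from the output after decoration (static graph execution). In practice, TensorFlow usually raises an error in this situation.

Example: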

# Avoid defining tf.Variable inside a function decorated with @tf.function.

x = tf.Variable(1.0,dtype=tf.float32)
@tf.function
def outer_var():
    x.assign_add(1.0)
    tf.print(x)
    return(x)

outer_var() 
outer_var()

@tf.function
def inner_var():
    x = tf.Variable(1.0,dtype = tf.float32)
    x.assign_add(1.0)
    tf.print(x)
    return(x)

# Calling it raises an error
#inner_var()
#inner_var()
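The sketch below shows the error in a form that can be run safely, followed by one commonly used workaround: creating the variable only the first time the function is traced. `Counter` is a hypothetical class, and the assumption that the failure surfaces as a ValueError reflects the TensorFlow 2.x releases this was written against.

```python
import tensorflow as tf

@tf.function
def inner_var():
    x = tf.Variable(1.0)  # tries to create a new variable on every trace
    return x.assign_add(1.0)

try:
    inner_var()
    error_message = None
except ValueError as err:  # assumption: TensorFlow 2.x raises ValueError here
    error_message = str(err)
print(error_message)

class Counter(tf.Module):
    """Hypothetical module that creates its variable lazily, exactly once."""

    def __init__(self):
        super().__init__()
        self.count = None

    @tf.function
    def __call__(self):
        if self.count is None:  # plain Python check, evaluated while tracing
            self.count = tf.Variable(0)
        return self.count.assign_add(1)

counter = Counter()
first = int(counter().numpy())
second = int(counter().numpy())
print(first, second)
```

Defining the variable outside the function, as outer_var does, remains the simplest option; the lazy pattern is only worth considering when the variable cannot be created up front.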

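A function decorated with @tf.function must not modify Python data structures such as lists or dicts that live outside the function.

Explanation: a static graph is executed by TensorFlow's C++ runtime, not by the Python interpreter. Python data structures such as lists and dicts cannot be embedded in the graph. They can only be accessed while the graph is being created, and cannot be modified when the graph is executed.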

tensor_list = []

#@tf.function  # Uncommenting this line switches to AutoGraph, and the result will no longer match expectations!
def append_tensor(x):
    tensor_list.append(x)
    return tensor_list

append_tensor(tf.constant(5.0))
append_tensor(tf.constant(6.0))
print(tensor_list)
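The following sketch shows what goes wrong when the decorator is enabled, together with one possible alternative: let the graph function return its result and do the list bookkeeping in ordinary eager Python. The function `scale` is a hypothetical example, not part of the original code.

```python
import tensorflow as tf

tensor_list = []

@tf.function
def append_tensor(x):
    tensor_list.append(x)  # Python side effect: happens once, during tracing
    return x

append_tensor(tf.constant(5.0))
append_tensor(tf.constant(6.0))
graph_list_length = len(tensor_list)  # one entry (a symbolic tensor), not two
print(graph_list_length)

@tf.function
def scale(x):
    # Hypothetical graph function that only computes and returns a value.
    return x * 2.0

# The Python list is modified outside the graph, in eager code.
results = [scale(tf.constant(v)) for v in (5.0, 6.0)]
values = [float(r.numpy()) for r in results]
print(values)
```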

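3 AutoGraph and tf.Module

The coding guidelines above say that a function decorated with @tf.function should avoid defining tf.Variable inside it.

But defining the tf.Variable outside the function makes the function depend on an external variable, so the encapsulation is less than ideal. A simple solution is to define a class, create the relevant tf.Variable objects in its initializer, and put the function logic in other methods.

TensorFlow provides a base class for this, tf.Module. By subclassing it we get this encapsulation naturally, and we can also conveniently manage variables and the other Modules it references. Most importantly, we can save the model with tf.saved_model and deploy it across platforms.

In fact, tf.keras.models.Model and tf.keras.layers.Layer both inherit from tf.Module, which gives them convenient management of variables and of the submodules they reference.

Therefore, with the encapsulation provided by tf.Module and TensorFlow's rich low-level API, we can build arbitrary machine learning models on TensorFlow (not only neural networks) and deploy them across platforms.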
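As a rough illustration of the variable and submodule management mentioned above, the sketch below nests two small modules inside a third one. `Linear` and `TwoLayer` are hypothetical classes written for this example only.

```python
import tensorflow as tf

class Linear(tf.Module):
    """Hypothetical layer: y = x @ w + b."""

    def __init__(self, in_dim, out_dim, name=None):
        super().__init__(name=name)
        self.w = tf.Variable(tf.zeros([in_dim, out_dim]), name="w")
        self.b = tf.Variable(tf.zeros([out_dim]), name="b")

    def __call__(self, x):
        return tf.matmul(x, self.w) + self.b

class TwoLayer(tf.Module):
    """Hypothetical model that owns two Linear submodules."""

    def __init__(self, name=None):
        super().__init__(name=name)
        self.first = Linear(3, 4)
        self.second = Linear(4, 1)

    @tf.function
    def __call__(self, x):
        return self.second(self.first(x))

model = TwoLayer()
output = model(tf.ones([2, 3]))

# tf.Module collects variables and submodules from its attributes recursively.
num_variables = len(model.variables)
num_submodules = len(model.submodules)
print(output.shape, num_variables, num_submodules)
```

Nothing is registered by hand: assigning a tf.Variable or a tf.Module to an attribute is enough for the parent module to find it.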

3.1 Wrapping AutoGraph with `tf.Module`

Define a simple function:

import tensorflow as tf 
x = tf.Variable(1.0,dtype=tf.float32)

# Use input_signature in tf.function to restrict the signature of the input tensors: shape and dtype
@tf.function(input_signature=[tf.TensorSpec(shape = [], dtype = tf.float32)])    
def add_print(a):
    x.assign_add(a)
    tf.print(x)
    return(x)

add_print(tf.constant(3.0))
#add_print(tf.constant(3))  # Passing an argument that does not match the tensor signature raises an error

Now let us wrap it by subclassing tf.Module.
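The effect of input_signature can be observed with a variant that accepts vectors of any length. This is a sketch under assumptions: `total` and `trace_log` are hypothetical names, the signature uses shape=[None] rather than the scalar shape above, and the exception type raised for a mismatched dtype differs between TensorFlow versions, so both likely types are caught.

```python
import tensorflow as tf

trace_log = []  # hypothetical helper: one entry per trace

@tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.float32)])
def total(v):
    trace_log.append(v.shape)  # runs only while tracing
    return tf.reduce_sum(v)

sum_of_two = total(tf.constant([1.0, 2.0]))
sum_of_three = total(tf.constant([1.0, 2.0, 3.0]))  # different length, same graph
num_traces = len(trace_log)

try:
    total(tf.constant([1, 2]))  # int32 tensor does not match the float32 signature
    rejected = False
except (TypeError, ValueError):  # assumption: the type depends on the TF version
    rejected = True

print(num_traces, rejected)
```

With a fixed signature the function is traced at most once, and inputs that do not fit are rejected instead of silently triggering a new trace.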

class DemoModule(tf.Module):
    def __init__(self, init_value=tf.constant(0.0), name=None):
        super().__init__(name=name)
        with self.name_scope:  # equivalent to: with tf.name_scope("demo_module")
            self.x = tf.Variable(init_value, dtype=tf.float32, trainable=True)

    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.float32)])
    def addprint(self, a):
        with self.name_scope:
            self.x.assign_add(a)
            tf.print(self.x)
            return self.x

Run it:

demo = DemoModule(init_value=tf.constant(1.0))
result = demo.addprint(tf.constant(5.0))
# Show all variables and all trainable variables in the module
print(demo.variables)
print(demo.trainable_variables)
# Show all submodules in the module
demo.submodules

Output:

6
(<tf.Variable 'demo_module/Variable:0' shape=() dtype=float32, numpy=6.0>,)
(<tf.Variable 'demo_module/Variable:0' shape=() dtype=float32, numpy=6.0>,)

()

Save the model with tf.saved_model, specifying the method that should be exposed for cross-platform deployment:

tf.saved_model.save(demo,"./data/demo/1",signatures = {"serving_default":demo.addprint})

Load the model:

demo2 = tf.saved_model.load("./data/demo/1")
demo2.addprint(tf.constant(5.0))
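The save and load round trip can also be exercised end to end in a temporary directory. The sketch below uses a hypothetical `Accumulator` module in place of DemoModule and additionally calls the exported "serving_default" signature, assuming that it takes its input by keyword and returns a dictionary of output tensors.

```python
import tempfile
import tensorflow as tf

class Accumulator(tf.Module):
    """Hypothetical module with one variable and one exported method."""

    def __init__(self, name=None):
        super().__init__(name=name)
        self.total = tf.Variable(0.0)

    @tf.function(input_signature=[tf.TensorSpec(shape=[], dtype=tf.float32)])
    def add(self, a):
        return self.total.assign_add(a)

module = Accumulator()
module.add(tf.constant(1.0))  # total is now 1.0

export_dir = tempfile.mkdtemp()  # temporary directory instead of ./data/...
tf.saved_model.save(module, export_dir,
                    signatures={"serving_default": module.add})

restored = tf.saved_model.load(export_dir)
after_restore = float(restored.add(tf.constant(2.0)).numpy())  # continues from 1.0

# The exported signature is called by keyword and returns a dict of tensors.
serving_fn = restored.signatures["serving_default"]
outputs = serving_fn(a=tf.constant(3.0))
served_value = float(list(outputs.values())[0].numpy())

print(after_restore, served_value)
```

The variable's value is stored with the model, so the restored module continues accumulating from where the original one stopped.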

Inspect the saved model files. The part of the output highlighted with a red box may be needed for model deployment and cross-platform use:

!saved_model_cli show --dir ./data/demo/1 --all

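When the graph is viewed in TensorBoard, the module's operators are grouped under the module name demo_module, which makes it easier to present the graph structure hierarchically.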

import datetime

# Create the log directory and writer
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
logdir = './data/demomodule/%s' % stamp
writer = tf.summary.create_file_writer(logdir)

# Turn on AutoGraph tracing
tf.summary.trace_on(graph=True, profiler=True) 

# Run the AutoGraph function
demo = DemoModule(init_value = tf.constant(0.0))
result = demo.addprint(tf.constant(5.0))

# Write the graph information to the log
with writer.as_default():
    tf.summary.trace_export(
        name="demomodule",
        step=0,
        profiler_outdir=logdir)

# Jupyter magic command that loads the TensorBoard extension
%reload_ext tensorboard

from tensorboard import notebook
notebook.list()

notebook.start("--logdir ./data/demomodule/")

Besides subclassing tf.Module, we can also achieve the encapsulation by adding attributes to a tf.Module instance.

mymodule = tf.Module()
mymodule.x = tf.Variable(0.0)

@tf.function(input_signature=[tf.TensorSpec(shape = [], dtype = tf.float32)])  
def addprint(a):
    mymodule.x.assign_add(a)
    tf.print(mymodule.x)
    return (mymodule.x)

mymodule.addprint = addprint
# Test
mymodule.addprint(tf.constant(1.0)).numpy()

# Save the model with tf.saved_model
tf.saved_model.save(mymodule,"./data/mymodule",
    signatures = {"serving_default":mymodule.addprint})

# Load the model
mymodule2 = tf.saved_model.load("./data/mymodule")
mymodule2.addprint(tf.constant(5.0))

3.2 tf.Module, tf.keras.Model, and tf.keras.layers.Layer

Models and layers in tf.keras are implemented as subclasses of tf.Module, so they also provide variable management and submodule management.

import tensorflow as tf
from tensorflow.keras import models,layers,losses,metrics
print(issubclass(tf.keras.Model,tf.Module))
print(issubclass(tf.keras.layers.Layer,tf.Module))
print(issubclass(tf.keras.Model,tf.keras.layers.Layer))

tf.keras.backend.clear_session() 

model = models.Sequential()

model.add(layers.Dense(4,input_shape = (10,)))
model.add(layers.Dense(2))
model.add(layers.Dense(1))
model.summary()
model.variables

model.layers[0].trainable = False  # Freeze the variables of layer 0 so they are not trained
model.trainable_variables

print(model.name)
print(model.name_scope())
model.submodules, model.layers