Samples and Data Types in TensorFlow

Example & Feature: message types for storing samples
tf.Example is a protobuf message type whose structure is effectively Dict[str, tf.Feature]:

message Example {
  Features features = 1;
};

message Features {
  // Map from feature name to feature.
  map<string, Feature> feature = 1;
};

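Each sample corresponds to one tf.Example.
The key is the feature name and the value is a tf.Feature, which holds exactly one of three list types:

List[bytes] (BytesList): convertible from string and byte
List[float] (FloatList): convertible from float (float32) and double (float64)
List[int64] (Int64List): convertible from bool, enum, int32, uint32, int64, and uint64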

// Containers for non-sequential data.
message Feature {
  // Each feature can be exactly one kind.
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    Int64List int64_list = 3;
  }
};

// The definition of Int64List is excerpted here; FloatList and BytesList are analogous.
message Int64List {
  // As the name suggests, this is a list of int64 values.
  repeated int64 value = 1 [packed = true];
}

https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/example.proto
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto
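
To make the message definitions above concrete, here is a minimal sketch of building one tf.Example from Python, assuming TensorFlow 2.x is installed. The feature names (user_id, tags, score) and the helper make_example are hypothetical and chosen only for illustration; each value is wrapped in the list type that matches it.

```python
import tensorflow as tf


def make_example(user_id: int, tags: list, score: float) -> tf.train.Example:
    """Wrap plain Python values into a tf.train.Example (hypothetical helper)."""
    feature = {
        # Integers go into an Int64List.
        "user_id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[user_id])),
        # Strings must be encoded to bytes before going into a BytesList.
        "tags": tf.train.Feature(
            bytes_list=tf.train.BytesList(
                value=[t.encode("utf-8") for t in tags])),
        # Floating-point values go into a FloatList (stored as float32).
        "score": tf.train.Feature(
            float_list=tf.train.FloatList(value=[score])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


example = make_example(42, ["a", "b"], 0.5)

# Serialize to bytes and parse back to check the round trip.
serialized = example.SerializeToString()
restored = tf.train.Example.FromString(serialized)
```

Because Feature uses a oneof, setting a second list type on the same Feature clears the first one, so each feature name maps to a single kind of list.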

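TFRecord: a data format for storing a sequence of binary records

A TFRecord file stores serialized tf.Example messages one after another and is read sequentially.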
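
A minimal sketch of writing and sequentially reading such a file, assuming TensorFlow 2.x with eager execution. The file name samples.tfrecord and the single feature id are hypothetical; the file is written to a temporary directory.

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical output path inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "samples.tfrecord")

# Write three serialized tf.Example records, one per sample.
with tf.io.TFRecordWriter(path) as writer:
    for i in range(3):
        example = tf.train.Example(features=tf.train.Features(feature={
            "id": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())

# Read the records back in the order they were written.
read_ids = []
for raw_record in tf.data.TFRecordDataset(path):
    parsed = tf.train.Example.FromString(raw_record.numpy())
    read_ids.append(parsed.features.feature["id"].int64_list.value[0])
```

The reader yields raw byte strings; decoding them into tf.Example (or directly into tensors) is a separate step.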

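Tensor: the data type used in neural networks
In computer science, a data type tells the compiler or interpreter how the programmer intends to use the data.
The inputs and outputs of a neural network, as well as the parameters inside it, are all represented as Tensors.
Therefore, before training or inference, a batch of batch_size Examples has to be converted into a Dict[str, tf.Tensor]:
List[Dict[str, tf.Feature], Dict[str, tf.Feature], ...] → Dict[str, tf.Tensor]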
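
One way to perform this conversion is tf.io.parse_example, sketched below under the assumption of TensorFlow 2.x. The feature names, the helper serialize, and the feature_spec are hypothetical; in practice the spec must match whatever was written into the Examples.

```python
import tensorflow as tf


def serialize(user_id: int, score: float) -> bytes:
    """Build and serialize one tf.Example (hypothetical helper)."""
    example = tf.train.Example(features=tf.train.Features(feature={
        "user_id": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[user_id])),
        "score": tf.train.Feature(
            float_list=tf.train.FloatList(value=[score])),
    }))
    return example.SerializeToString()


# A batch of three serialized Examples (batch_size = 3).
batch = [serialize(1, 0.5), serialize(2, 1.5), serialize(3, 2.5)]

# Describe the dtype and per-sample shape expected for each feature.
feature_spec = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "score": tf.io.FixedLenFeature([], tf.float32),
}

# Result: Dict[str, tf.Tensor]; every tensor has a leading batch dimension.
tensors = tf.io.parse_example(tf.constant(batch), feature_spec)
```

Each entry of tensors stacks the same feature across the whole batch, so the per-sample dictionaries become one dictionary of batched tensors.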
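How a Tensor instance is stored in memory
A Tensor can have multiple dimensions; a typical image feature, for example, is four-dimensional: [B, C, H, W].
To the computer, however, storage is always linear.
A Tensor instance is therefore backed by a single contiguous, one-dimensional block of memory.
There are many possible schemes for arranging the items of an N-dimensional Tensor in that one-dimensional block. By ordering, the two main styles are row-major and column-major.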
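
The difference between the two orderings can be sketched in plain Python by computing where each element of a small array lands in the flat block. The function names below are illustrative, not part of any library.

```python
def row_major_offset(index, shape):
    """Flat offset when the LAST dimension varies fastest (C order)."""
    offset = 0
    for i, dim in zip(index, shape):
        offset = offset * dim + i
    return offset


def col_major_offset(index, shape):
    """Flat offset when the FIRST dimension varies fastest (Fortran order)."""
    offset = 0
    for i, dim in zip(reversed(index), reversed(shape)):
        offset = offset * dim + i
    return offset


shape = (2, 3)  # 2 rows, 3 columns

# Visit elements row by row and record their position in memory.
row_order = [row_major_offset((r, c), shape)
             for r in range(2) for c in range(3)]
col_order = [col_major_offset((r, c), shape)
             for r in range(2) for c in range(3)]

print(row_order)  # [0, 1, 2, 3, 4, 5]
print(col_order)  # [0, 2, 4, 1, 3, 5]
```

In row-major order, neighbors within a row are adjacent in memory; in column-major order, neighbors within a column are.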

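Different data layouts (formats) can significantly affect compute performance. On GPUs, the available layouts include NCHW, NHWC, NCHW4, NCHW32, NCHW64, CHWN4, and others. The letters mean:
N: Batch. The number of images in the batch; 2 in this example.
H: Height. The height of the image; 3 in this example.
W: Width. The width of the image; 3 in this example.
C: Channel. The number of channels in the image; 64 in this example.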
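
As a rough illustration of what the layout names imply, the sketch below computes flat memory offsets for the two basic layouts, NCHW and NHWC, using the example sizes above (N=2, C=64, H=3, W=3). The function names are illustrative only, and the blocked variants such as NCHW4 are not covered.

```python
N, C, H, W = 2, 64, 3, 3  # sizes from the example above


def nchw_offset(n, c, h, w):
    """Flat offset in NCHW: W varies fastest, then H, then C, then N."""
    return ((n * C + c) * H + h) * W + w


def nhwc_offset(n, c, h, w):
    """Flat offset in NHWC: C varies fastest, then W, then H, then N."""
    return ((n * H + h) * W + w) * C + c


# Distance in memory between two adjacent channels of the same pixel.
channel_stride_nchw = nchw_offset(0, 1, 0, 0) - nchw_offset(0, 0, 0, 0)
channel_stride_nhwc = nhwc_offset(0, 1, 0, 0) - nhwc_offset(0, 0, 0, 0)

print(channel_stride_nchw)  # 9  (= H * W)
print(channel_stride_nhwc)  # 1
```

In NHWC all channels of one pixel sit next to each other, whereas in NCHW each channel forms its own contiguous H×W plane. Which one performs better depends on the kernel and the hardware.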
