Introduction
Keyword Spotting (KWS) aims to detect predefined keywords in an audio stream, and it is a promising technique for providing a hands-free interface [1].
A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM). Other recent work explores discriminative models for keyword spotting based on large-margin formulations or recurrent neural networks. These systems improve over the HMM approach, but they either require processing the entire utterance to find the optimal keyword region or draw on information from a long time span to predict the entire keyword, which increases detection latency.
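The Keyword/Filler idea can be reduced to a minimal sketch: score incoming frames under a keyword model and a filler (background) model, and trigger when the log-likelihood ratio exceeds a threshold. The per-frame probability tables below are hypothetical toy values, not part of any cited system; a real HMM would use Viterbi decoding over state sequences.

```python
import math

def loglik(frames, model):
    # total log-probability of a frame sequence under a per-frame model
    return sum(model[f] for f in frames)

# toy per-frame log-probability tables (hypothetical values for illustration)
keyword_model = {"a": math.log(0.7), "b": math.log(0.2), "c": math.log(0.1)}
filler_model  = {"a": math.log(0.2), "b": math.log(0.3), "c": math.log(0.5)}

def detect(frames, threshold=0.0):
    # fire when the keyword/filler log-likelihood ratio exceeds the threshold
    return loglik(frames, keyword_model) - loglik(frames, filler_model) > threshold

print(detect(["a", "a", "b"]))  # keyword-like frames -> True
print(detect(["c", "c", "b"]))  # filler-like frames -> False
```

The threshold trades off false alarms against misses, which is the same operating-point choice every KWS system above has to make.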
Deep Learning Method
- Deep KWS [1]
- Honk [2]
- CNN [3]
- ResNet with Dilated Convolutions [4]
- RNN with Attention [5]
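The appeal of dilated convolutions in [4] is that the receptive field grows quickly with depth while the parameter count stays fixed: each layer adds (kernel − 1) × dilation frames of context. A quick sketch, assuming a stack of 3-tap kernels with doubling dilations (an illustrative configuration, not necessarily the exact one in [4]):

```python
def receptive_field(kernel, dilations):
    # receptive field of a stack of 1-D dilated convolutions:
    # each layer contributes (kernel - 1) * dilation extra frames of context
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

# hypothetical stack: 3-tap kernels with dilations doubling per layer
print(receptive_field(3, [1, 2, 4, 8]))  # 31 frames
```

Four such layers already cover 31 frames, whereas the same four layers without dilation would cover only 9.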
The article [6] compares and analyzes several CNN-based KWS systems, reporting the simulated optimal parameters and complexity of each. The "state of the art" KWS result cited there comes from the RNN with Attention approach above [5].
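The attention mechanism in [5] can be sketched as soft attention over RNN hidden states: each state is scored against a learned query vector, the scores are softmax-normalized, and the weighted sum becomes a fixed-size utterance summary for classification. The hidden-state values and query below are made-up illustrations, not weights from the paper:

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(hidden, query):
    # score each hidden vector against the query, then return
    # the attention-weighted sum (a fixed-size utterance summary)
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden)) for i in range(dim)]

# toy RNN hidden states over 3 frames, dimension 2 (hypothetical values)
hidden = [[0.1, 0.0], [0.9, 0.5], [0.2, 0.1]]
query = [1.0, 1.0]
summary = attend(hidden, query)
```

Because the summary has a fixed size regardless of utterance length, a small dense classifier on top suffices, which keeps the footprint low.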
Applying Xception-style depthwise separable convolutions, [7] proposes a simplified, lower-computation variant suitable for mobile devices.
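The saving from a depthwise separable convolution is easy to quantify: a standard k×k convolution costs k·k·C_in·C_out parameters, while the depthwise (k×k per channel) plus pointwise (1×1) factorization costs k·k·C_in + C_in·C_out. A sketch with an assumed 3×3 kernel and 64 channels (illustrative sizes, not the exact configuration in [7]):

```python
def conv_params(k, c_in, c_out):
    # parameter count of a standard k x k convolution
    return k * k * c_in * c_out

def ds_conv_params(k, c_in, c_out):
    # depthwise (k x k filter per input channel) + pointwise (1 x 1) convolution
    return k * k * c_in + c_in * c_out

std = conv_params(3, 64, 64)     # 36864 parameters
ds = ds_conv_params(3, 64, 64)   # 576 + 4096 = 4672 parameters
print(std, ds, std / ds)
```

Here the separable version uses roughly 8× fewer parameters (and proportionally fewer multiply-adds), which is exactly the kind of reduction that makes real-time KWS on mobile devices practical.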
Reference
- Small-footprint keyword spotting using deep neural networks
- Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting
- Speech Command Recognition with Convolutional Neural Network
- Deep residual learning for small-footprint keyword spotting
- Comparison and Analysis of SampleCNN Architectures for Audio Classification
- Temporal convolution for real-time keyword spotting on mobile devices