Keyword Spotting

Introduction

Keyword Spotting (KWS) aims to detect predefined keywords in an audio stream, and it is a promising technique for providing a hands-free interface [1].
A commonly used technique for keyword spotting is the Keyword/Filler Hidden Markov Model (HMM). Other recent work explores discriminative models for keyword spotting based on large-margin formulations or recurrent neural networks. These systems improve on the HMM approach, but they either require processing the entire utterance to find the optimal keyword region or take information from a long time span to predict the entire keyword, which increases detection latency.

Deep Learning Method

Deep KWS[1]

Framework of the Deep KWS system. Components from left to right: (i) feature extraction, (ii) deep neural network, (iii) posterior handling.
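To make the third stage more concrete, here is a minimal sketch of posterior handling: the per-frame posteriors from the network are smoothed over a short window, and a confidence score is computed from the smoothed values. The function names, the window sizes (w_smooth, w_max) and the geometric-mean confidence rule are assumptions for illustration, based on my reading of [1]; they are not taken from the figure above.

```python
def smooth_posteriors(posteriors, w_smooth=30):
    """Average each label's posterior over the last w_smooth frames.

    posteriors: list of frames, each frame a list of per-label probabilities.
    w_smooth is an assumed window size, not a value confirmed by this post.
    """
    smoothed = []
    for j in range(len(posteriors)):
        start = max(0, j - w_smooth + 1)
        window = posteriors[start:j + 1]
        n_labels = len(posteriors[j])
        smoothed.append(
            [sum(frame[i] for frame in window) / len(window) for i in range(n_labels)]
        )
    return smoothed


def confidence(smoothed, keyword_labels, w_max=100):
    """Confidence at the last frame (assumed rule).

    For each keyword label, take the maximum smoothed posterior within the
    last w_max frames, then return the geometric mean of those maxima.
    """
    j = len(smoothed) - 1
    start = max(0, j - w_max + 1)
    window = smoothed[start:j + 1]
    product = 1.0
    for i in keyword_labels:
        product *= max(frame[i] for frame in window)
    return product ** (1.0 / len(keyword_labels))


# Toy example: label 0 is filler, labels 1 and 2 are the two keyword parts.
frames = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
score = confidence(smooth_posteriors(frames, w_smooth=1), [1, 2], w_max=3)
```

A keyword would then be reported when the score exceeds a threshold chosen on validation data. Because the score only needs the most recent frames, it can be updated frame by frame while the audio is streaming.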

Honk[2]

Convolutional neural network architecture for keyword spotting.

CNN[3]

Structure of Convolutional network architecture.

ResNet with Dilated Convolutions[4]

ResNet architecture.
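Dilation is used to enlarge the receptive field without adding layers or parameters. The sketch below computes the receptive field along one axis for a stack of stride-1 convolutions. The dilation schedule in the example is hypothetical and is not the configuration used in [4].

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (along one axis) of stacked stride-1 convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Six 3x3 layers without dilation vs. a hypothetical growing dilation schedule.
plain = receptive_field(3, [1, 1, 1, 1, 1, 1])
dilated = receptive_field(3, [1, 1, 2, 2, 4, 4])
```

With the same six layers, the dilated stack covers 29 input positions instead of 13, which is why dilation suits small-footprint models that still need long temporal context.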

RNN with Attention[5]

Recurrent neural network with attention mechanism. Numbers between [brackets] are tensor dimensions. raw_len is the WAV audio length (16000 in this case). spec_len is the sequence length of the generated mel-scale spectrogram. nMel is the number of mel bands. nClasses is the number of desired classes. The activation of the last Dense layer is softmax. The activation of the 64-unit and 32-unit Dense classification layers is the rectified linear unit (ReLU).
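The core of the attention step can be sketched in a few lines: each time step of the recurrent output is scored against a query vector, the scores are normalized with softmax, and the output sequence is collapsed into a single weighted average. This is a generic dot-product attention sketch in plain Python. How the query is derived in [5] is not covered here, so it is simply passed in as an argument, and all names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(sequence, query):
    """Collapse a sequence of vectors into one context vector.

    sequence: list of spec_len vectors (e.g. recurrent outputs), all the same size.
    query:    vector of the same size (assumed to be given).
    Returns (weights, context), where weights sum to 1.
    """
    scores = [sum(h * q for h, q in zip(frame, query)) for frame in sequence]
    weights = softmax(scores)
    dim = len(sequence[0])
    context = [
        sum(w * frame[d] for w, frame in zip(weights, sequence)) for d in range(dim)
    ]
    return weights, context


# Toy example: four identical frames get equal attention weights.
weights, context = attention_pool([[1.0, 2.0]] * 4, [0.5, 0.5])
```

The context vector has a fixed size regardless of spec_len, so it can be fed directly into the Dense classification layers. The weights can also be plotted to see which part of the audio the model attended to.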

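Article [6] compares and analyzes several CNN-based KWS models and reports the optimal parameters found in simulation, the model complexity, and so on. The "state of the art" KWS result it cites comes from the RNN with Attention [5] described above.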

Applying the depthwise separable convolution from Xception, [7] proposes a version with reduced computation that is suitable for mobile devices.
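The saving from a depthwise separable convolution is easy to check with a parameter count. The sketch below compares a standard convolution with its depthwise separable counterpart, ignoring biases. The layer sizes in the example are made up for illustration and are not taken from [7].

```python
def standard_conv_params(k_h, k_w, c_in, c_out):
    """Weights in a standard convolution: one k_h x k_w x c_in filter per output channel."""
    return k_h * k_w * c_in * c_out


def depthwise_separable_params(k_h, k_w, c_in, c_out):
    """Weights in a depthwise separable convolution.

    Depthwise step: one k_h x k_w filter per input channel.
    Pointwise step: a 1x1 convolution mixing c_in channels into c_out.
    """
    return k_h * k_w * c_in + c_in * c_out


# Hypothetical layer: 3x3 kernel, 64 input channels, 64 output channels.
standard = standard_conv_params(3, 3, 64, 64)
separable = depthwise_separable_params(3, 3, 64, 64)
```

For this layer the separable form needs 4672 weights instead of 36864, roughly an 8x reduction, and the number of multiply-accumulate operations shrinks by the same factor.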

References


  1. Small-footprint keyword spotting using deep neural networks

  2. Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting

  3. Speech Command Recognition with Convolutional Neural Network

  4. Deep residual learning for small-footprint keyword spotting

  5. A neural attention model for speech command recognition

  6. Comparison and Analysis of SampleCNN Architectures for Audio Classification

  7. Temporal convolution for real-time keyword spotting on mobile devices
