iOS-PocketSphinx: Building a Language Model

System environment

macOS 10.15.6

Installing the cmuclmtk toolkit

cmuclmtk-0.7 download page: https://sourceforge.net/projects/cmusphinx/files/cmuclmtk/0.7/
Or download it from the terminal:

$ svn checkout https://svn.code.sf.net/p/cmusphinx/code/trunk/cmuclmtk/

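After the download finishes, cd into the cmuclmtk directory. Run ./configure (or ./autogen.sh if the checkout does not contain a configure script yet), then make:

$ ./configure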
$ make

When I ran make, it failed with the following error:

$ make
(CDPATH="${ZSH_VERSION+.}:" && cd . && /bin/sh /Users/huangchusheng/Aaron/iOS/PocketSphinx/cmuclmtk/missing autoheader)
rm -f stamp-h1
touch config.h.in
cd . && /bin/sh ./config.status config.h
config.status: creating config.h
config.status: config.h is unchanged
/Applications/Xcode.app/Contents/Developer/usr/bin/make  all-recursive
Making all in src
Making all in liblmest
make[3]: Nothing to be done for `all'.
Making all in libs
/bin/sh ../../libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I. -I../..    -I../../src/liblmest -I../../src/win32 -g -O2 -MT rr_mkdtemp.lo -MD -MP -MF .deps/rr_mkdtemp.Tpo -c -o rr_mkdtemp.lo rr_mkdtemp.c
libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../.. -I../../src/liblmest -I../../src/win32 -g -O2 -MT rr_mkdtemp.lo -MD -MP -MF .deps/rr_mkdtemp.Tpo -c rr_mkdtemp.c  -fno-common -DPIC -o .libs/rr_mkdtemp.o
rr_mkdtemp.c:44:33: error: implicit declaration of function 'mkdir' is invalid
      in C99 [-Werror,-Wimplicit-function-declaration]
       if (!mktemp(template) || mkdir(template, 0700)) 
                                ^
1 error generated.
make[3]: *** [rr_mkdtemp.lo] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Open rr_mkdtemp.c and add the following headers (mkdir is declared in sys/stat.h):

#include <sys/stat.h>
#include <sys/types.h>

Then run make again. This time it fails with a different error:

$ make
/Applications/Xcode.app/Contents/Developer/usr/bin/make  all-recursive
Making all in src
Making all in liblmest
make[3]: Nothing to be done for `all'.
Making all in libs
make[3]: Nothing to be done for `all'.
Making all in .
make[3]: Nothing to be done for `all-am'.
Making all in programs
gcc -DHAVE_CONFIG_H -I. -I../..    -I../../src/libs -I../../src/liblmest -I../../src/win32 -g -O2 -MT text2wngram.o -MD -MP -MF .deps/text2wngram.Tpo -c -o text2wngram.o text2wngram.c
text2wngram.c:255:62: warning: format specifies type 'unsigned short' but the
      argument has type 'int' [-Wformat]
  ...%hu%s",temp_directory, current_file_number, temp_file_ext);
     ~~~                    ^~~~~~~~~~~~~~~~~~~
     %d
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/usr/include/secure/_stdio.h:47:56: note: 
      expanded from macro 'sprintf'
  __builtin___sprintf_chk (str, 0, __darwin_obsz(str), __VA_ARGS__)
                                                       ^~~~~~~~~~~
text2wngram.c:334:3: error: implicit declaration of function 'merge_tempfiles'
      is invalid in C99 [-Werror,-Wimplicit-function-declaration]
  merge_tempfiles(1,
  ^
1 warning and 1 error generated.
make[3]: *** [text2wngram.o] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

Open text2wngram.c and add the following header (the one that declares merge_tempfiles):

#include "ac_lmfunc_impl.h"

make now succeeds. Next, run:

$ sudo make install

After make install, the following files appear in /usr/local/bin:

binlm2arpa
evallm
idngram2lm
idngram2stats
mergeidngram
ngram2mgram
text2idngram
text2wfreq
text2wngram
wfreq2vocab
wngram2idngram

Preparing the text

First of all, you need to prepare a large collection of clean texts. Expand abbreviations, convert numbers to words, and remove non-word items (special characters, punctuation, and so on). For example, to clean Wikipedia XML dumps you can use special Python scripts like Wikiextractor. To clean HTML pages you can try BoilerPipe, a package created specifically to extract text from HTML.

For an example of how to create a language model from Wikipedia text, see the blog post linked from the official tutorial. Movie subtitles are also a good source for spoken language.

Once you have gone through the language modeling process, please submit your language model to the CMUSphinx project. We'll be happy to share it!

Language modeling for Mandarin and other similar languages is largely the same as for English, with one additional consideration: the input text must be word segmented. A segmentation tool and an associated word list are provided to accomplish this.
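To make the word-segmentation requirement concrete, here is a minimal sketch of greedy forward maximum matching against a word list. It is only an illustration: the word list is made up, and this is not the segmentation tool that CMUSphinx provides.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    vocabulary word; fall back to a single character if nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical word list; a real one would be far larger.
vocabulary = {"声音", "大", "一点", "音乐", "来点"}
print(" ".join(segment("声音大一点", vocabulary)))
```

The segmented output (words separated by spaces) is the form the language modeling tools expect, since they split the training text on whitespace.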

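Language models

A language model can be stored and loaded in three formats: text ARPA, binary BIN, and binary DMP. The ARPA format takes more space but can be edited; ARPA files use the .lm extension. The binary format takes much less space and loads faster; binary files use the .lm.bin extension. You can convert between these formats. The DMP format is obsolete and not recommended.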
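Because ARPA is plain text, it is easy to inspect. The sketch below reads the n-gram counts from the \data\ header of an ARPA file. The sample header is made up for illustration and is not output copied from a real model.

```python
import io


def read_arpa_counts(lines):
    """Return {order: count} from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
        elif in_header and line.startswith("ngram "):
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif in_header and line.startswith("\\"):
            break  # reached the first "\N-grams:" section
    return counts


# Made-up header; the counts are illustrative only.
sample = io.StringIO("\\data\\\nngram 1=13\nngram 2=24\nngram 3=35\n\n\\1-grams:\n")
print(read_arpa_counts(sample))
```

For a real model, pass an open file instead, for example `open("tianmao.lm", encoding="utf-8")`.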

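Downloading the Chinese (Mandarin) language pack

New Chinese language pack

Location: Home / Acoustic and Language Models / Mandarin
Download cmusphinx-zh-cn-5.2.tar.gz and extract it. The directory contains:

zh_cn.cd_cont_5000  // Chinese acoustic model
zh_cn.dic  // dictionary: Chinese word to phoneme mapping
zh_cn.lm.bin  // default Chinese language model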

Old Chinese language pack

Location: Home / Acoustic and Language Models / Archive / Mandarin
Download the following files:

zh_broadcastnews_16k_ptm256_8000.tar.bz2 (extracts to zh_broadcastnews_ptm256_8000)
zh_broadcastnews_64000_utf8.DMP
zh_broadcastnews_utf8.dic

I was not sure whether the .DMP file could be used directly, so I simply converted it first.
Convert to .lm.bin:

$ sphinx_lm_convert -i zh_broadcastnews_64000_utf8.DMP -o zh_broadcastnews_64000_utf8.lm.bin

Convert to .lm (the resulting ARPA file is about 1 GB, so in practice use the binary .lm.bin):

$ sphinx_lm_convert -i zh_broadcastnews_64000_utf8.DMP -ifmt bin -o zh_broadcastnews_64000_utf8.lm -ofmt arpa

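What is the difference between the two Chinese language packs?

I was puzzled by this too: how do the two differ in use? Judging from the .dic files, the new pack adds tones to the phonemes:
Old pack: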

你好 n i h ao

New pack:

你好 n i3 h ao3

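Of course, besides the dictionary, the language model (.lm) and the acoustic model also differ.

Personally I lean toward the new model. Most articles online use the old model as their example, probably because they were written before the new one was released.
With the new model, however, I could not get the results I needed.
So I went back to the old model to experiment further. If I find anything new, I will update this article.

Preliminary result: in my tests the old model appears to give better results than the new one. Still investigating.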

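Renaming

This is personal preference: to keep the naming consistent with the English model and easier to use, I renamed the files. The zh-cn files are now: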

zh-cn
zh-cn.dict
zh-cn.lm.bin

Load the Chinese dictionary and language model to recognize Mandarin:

$ pocketsphinx_continuous -inmic yes -hmm zh-cn -lm zh-cn.lm.bin -dict zh-cn.dict

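Preparing the dictionary (Chinese word to phoneme mapping)

Create tianmao.dict, mapping each keyword to its phonemes. Look up the phonemes in zh-cn.dict (the old phoneme set is used here). The original dictionary will not necessarily contain the same word. If it does, copy its entry; if it does not, write out the pronunciation character by character. The final dictionary is: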

天猫精灵 t ian m ao j ing l ing
你好天猫 n i h ao t ian m ao
来一首歌 l ai y i sh ou g e
来点音乐 l ai d ian y in uxs uxe
音量大 y in l iang d a
音量小 y in l iang x iao
声音大一点 sh eng y in d a y i d ian
声音小一点 sh eng y in x iao y i d ian
下一首 x ia y i sh ou
上一首 sh ang y i sh ou
切歌 q ie g e
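The lookup rule described above (use the whole-word entry if it exists, otherwise concatenate the per-character entries) can be sketched as follows. The function names are my own, and the small reference list stands in for the full zh-cn.dict; its entries are taken from the dictionary shown above.

```python
def load_dictionary(lines):
    """Parse 'word phone phone ...' lines into {word: [phones]}."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            entries[parts[0]] = parts[1:]
    return entries


def pronounce(word, entries):
    """Use the whole-word entry if present, else join per-character entries."""
    if word in entries:
        return entries[word]
    phones = []
    for char in word:
        phones.extend(entries[char])  # KeyError means the character is missing
    return phones


# Tiny stand-in for zh-cn.dict (old phoneme set).
reference = load_dictionary([
    "下 x ia",
    "一 y i",
    "首 sh ou",
    "切歌 q ie g e",
])
for keyword in ["下一首", "切歌"]:
    print(keyword, " ".join(pronounce(keyword, reference)))
```

A KeyError here is deliberate: a character with no entry has to be resolved by hand rather than silently skipped.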

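Training an ARPA model with CMUCLMTK

The process of creating a language model is as follows:
1) Prepare the training text tianmao.txt, delimiting each utterance with <s> and </s> markers: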

<s> 天猫精灵 </s>
<s> 你好天猫 </s>
<s> 来一首歌 </s>
<s> 来点音乐 </s>
<s> 音量大 </s>
<s> 音量小 </s>
<s> 声音大一点 </s>
<s> 声音小一点 </s>
<s> 下一首 </s>
<s> 上一首 </s>
<s> 切歌 </s>

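2) Generate the vocabulary file. This is a list of all the words in the training text.

You can use two separate commands, first generating the .wfreq file and then the .vocab file:

$ text2wfreq < tianmao.txt > tianmao.wfreq
$ wfreq2vocab < tianmao.wfreq > tianmao.vocab

Or generate the .vocab file with a single command: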

$ text2wfreq < tianmao.txt | wfreq2vocab > tianmao.vocab

On success, the output is:

text2wfreq : Reading text from standard input...
text2wfreq : Done.
wfreq2vocab : Will generate a vocabulary containing the most
              frequent 20000 words. Reading wfreq stream from stdin...
wfreq2vocab : Done.

The generated tianmao.vocab contains:

## Vocab generated by v2 of the CMU-Cambridge Statistcal
## Language Modeling toolkit.
##
## Includes 13 words ##
</s>
<s>
上一首
下一首
你好天猫
切歌
声音大一点
声音小一点
天猫精灵
来一首歌
来点音乐
音量大
音量小
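To see roughly what steps 1 and 2 compute, the sketch below marks utterances, counts tokens, and builds a sorted vocabulary in plain Python. It is an approximation for understanding only, not a reimplementation of text2wfreq or wfreq2vocab; the function names are hypothetical, and the sorted ordering is an assumption based on the listing above.

```python
from collections import Counter


def mark_utterances(sentences):
    """Step 1: wrap each non-empty sentence in <s> ... </s> markers."""
    return [f"<s> {s.strip()} </s>" for s in sentences if s.strip()]


def word_frequencies(lines):
    """Count whitespace-separated tokens, as a .wfreq file would list them."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocabulary(counts, top=20000):
    """Keep the `top` most frequent tokens and return them sorted."""
    return sorted(word for word, _ in counts.most_common(top))


corpus = mark_utterances(["天猫精灵", "切歌", "下一首"])
counts = word_frequencies(corpus)
print(counts["<s>"])        # each utterance contributes one <s>
print(vocabulary(counts))
```

Note that the markers <s> and </s> are counted like any other token, which is why they show up in the vocabulary file.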

3) Generate the ARPA-format language model.
First run the following command to produce tianmao.idngram:

$ text2idngram -vocab tianmao.vocab -idngram tianmao.idngram < tianmao.txt

Then run the following command to produce tianmao.lm:

$ idngram2lm -vocab_type 0 -idngram tianmao.idngram -vocab tianmao.vocab -arpa tianmao.lm

4) Generate the CMU binary format (BIN):

$ sphinx_lm_convert -i tianmao.lm -o tianmao.lm.bin
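When the keyword list changes often, the commands from steps 2 to 4 can be chained in a small script. This is a sketch under assumptions: the helper names are my own, the tools must be on PATH, and the flags are exactly the ones used in the commands above.

```python
import subprocess
from contextlib import ExitStack


def build_steps(name):
    """Return (argv, stdin_path, stdout_path) for each pipeline step."""
    return [
        (["text2wfreq"], f"{name}.txt", f"{name}.wfreq"),
        (["wfreq2vocab"], f"{name}.wfreq", f"{name}.vocab"),
        (["text2idngram", "-vocab", f"{name}.vocab",
          "-idngram", f"{name}.idngram"], f"{name}.txt", None),
        (["idngram2lm", "-vocab_type", "0", "-idngram", f"{name}.idngram",
          "-vocab", f"{name}.vocab", "-arpa", f"{name}.lm"], None, None),
        (["sphinx_lm_convert", "-i", f"{name}.lm",
          "-o", f"{name}.lm.bin"], None, None),
    ]


def run_steps(steps):
    """Run each step in order; check=True stops at the first failure."""
    for argv, stdin_path, stdout_path in steps:
        with ExitStack() as stack:
            stdin = stack.enter_context(open(stdin_path, "rb")) if stdin_path else None
            stdout = stack.enter_context(open(stdout_path, "wb")) if stdout_path else None
            subprocess.run(argv, stdin=stdin, stdout=stdout, check=True)


steps = build_steps("tianmao")
for argv, _, _ in steps:
    print(" ".join(argv))
```

The block only prints the commands. To actually build the model, call `run_steps(steps)` in the directory that contains tianmao.txt.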

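Building a simple language model with the web service

If your language is English and the text is small, it is sometimes more convenient to build the model with a web service. Language models built this way are quite functional for simple command and control tasks. First, you need to create a corpus.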

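The "corpus" is just a list of sentences used to train the language model. As an example, take a hypothetical voice control task for a mobile Internet device. We would like to tell it things like "open browser", "new e-mail", "forward", "backward", "next window", "last window", "open music player", and so forth. So we start by creating a file called corpus.txt:

open browser
new e-mail
forward
backward
next window
last window
open music player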

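Then go to the LMTool page. Click the "Browse..." button, select the corpus.txt file you created, and click "COMPILE KNOWLEDGE BASE".

You should see a page with some status messages, followed by a page entitled "Sphinx knowledge base". This page contains links entitled "Dictionary" and "Language Model". Download these files and note their names (they should consist of a 4-digit number followed by the extensions .dic and .lm). You can now test your newly created language model with PocketSphinx.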

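Converting a model to the binary format

To load large models quickly, you probably want to convert them to the binary format, which saves decoder initialization time. This is not necessary for small models. Pocketsphinx and sphinx3 can handle both formats with the -lm option. Sphinx4 detects the format automatically from the extension of the lm file.

The ARPA and binary formats are mutually convertible. You can produce the other file with the sphinx_lm_convert command from sphinxbase: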

$ sphinx_lm_convert -i model.lm -o model.lm.bin
$ sphinx_lm_convert -i model.lm.bin -ifmt bin -o model.lm -ofmt arpa

You can also convert old DMP models to the binary format this way.

Using the language model in PocketSphinx

If you have installed PocketSphinx, you will have a program called pocketsphinx_continuous, which can be run from the command line to recognize speech. Assuming it is installed under /usr/local, and your language model and dictionary are called 8521.lm and 8521.dic and are placed in the current folder, try running the following command (for a Chinese model you must also specify -hmm):

$ pocketsphinx_continuous -inmic yes -lm 8521.lm -dict 8521.dic

You will see a lot of diagnostic messages, followed by a pause, then the output "READY…". Now you can try speaking some of the commands. It should be able to recognize them with full accuracy. If not, you may have problems with your microphone or sound card.

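Using a self-trained Chinese language model

In my own tests, running the command above with the Chinese language model produced the following loading errors, and nothing could be recognized:

... (many loading log lines omitted) ...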
INFO: dict.c(333): Reading main dictionary: tianmao.dict
ERROR: "dict.c", line 195: Line 1: Phone 't' is mising in the acoustic model; word '天猫精灵' ignored
ERROR: "dict.c", line 195: Line 2: Phone 'n' is mising in the acoustic model; word '你好天猫' ignored
ERROR: "dict.c", line 195: Line 3: Phone 'l' is mising in the acoustic model; word '来一首歌' ignored
ERROR: "dict.c", line 195: Line 4: Phone 'l' is mising in the acoustic model; word '来点音乐' ignored
ERROR: "dict.c", line 195: Line 5: Phone 'ii' is mising in the acoustic model; word '音量大' ignored
ERROR: "dict.c", line 195: Line 6: Phone 'ii' is mising in the acoustic model; word '音量小' ignored
ERROR: "dict.c", line 195: Line 7: Phone 'sh' is mising in the acoustic model; word '声音大一点' ignored
ERROR: "dict.c", line 195: Line 8: Phone 'sh' is mising in the acoustic model; word '声音小一点' ignored
ERROR: "dict.c", line 195: Line 9: Phone 'x' is mising in the acoustic model; word '下一首' ignored
ERROR: "dict.c", line 195: Line 10: Phone 'sh' is mising in the acoustic model; word '上一首' ignored
ERROR: "dict.c", line 195: Line 11: Phone 'q' is mising in the acoustic model; word '切歌' ignored
INFO: dict.c(213): Dictionary size 0, allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(336): 0 words read
... (many loading log lines omitted) ...

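Note the message Phone 't' is mising in the acoustic model: the acoustic model has no phoneme 't'. My guess is that when -hmm is not specified, the default English acoustic model is used.

To recognize with a Chinese language model, you must also specify the acoustic model folder with the -hmm option. For now we use the official Chinese acoustic model folder zh-cn.

Live recognition: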

$ pocketsphinx_continuous -inmic yes -hmm zh-cn -lm tianmao.lm.bin -dict tianmao.dict

Recognizing a file:

$ pocketsphinx_continuous -hmm zh-cn -lm tianmao.lm.bin -dict tianmao.dict -infile arctic_0001.wav
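The "Phone ... is mising in the acoustic model" errors can be caught before launching the decoder by comparing the phones used in the dictionary with the phone set of the acoustic model. The sketch below assumes you already have that phone set as a Python set; how to extract it from the model folder is not shown, and the set used here is hypothetical.

```python
def missing_phones(dict_lines, model_phones):
    """Map each word to the phones its entry uses that the model lacks."""
    problems = {}
    for line in dict_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        word, phones = parts[0], parts[1:]
        unknown = [p for p in phones if p not in model_phones]
        if unknown:
            problems[word] = unknown
    return problems


# Hypothetical phone inventory that lacks the lowercase Mandarin phones.
english_like = {"T", "N", "L", "SH", "AY", "IY"}
entries = ["天猫精灵 t ian m ao j ing l ing", "切歌 q ie g e"]
for word, phones in missing_phones(entries, english_like).items():
    print(word, "->", " ".join(phones))
```

An empty result means every dictionary entry uses only phones the model knows, so no word should be ignored for that reason at load time.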

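Low recognition accuracy and false triggers

With the official acoustic model, the recognition rate is very low, and almost any sound triggers recognition and returns some word. For example, with a self-built language model plus the official acoustic model, the result is always one of the words from your own training text. It is not truly random: presumably the decoder returns the word closest to what it heard, even when the two sound nothing alike.

I am building voice wake-up, so my language model may contain only one or two words. Any stray sound then easily triggers one of them, which is not what I want.

After I later adapted the official Chinese acoustic model, this improved considerably: a sound that deviates enough from the keyword no longer triggers it.

How do you adapt the acoustic model? See the next chapter:
"iOS-PocketSphinx: Adapting the Default Acoustic Model"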

References:

Building a language model (official tutorial): https://cmusphinx.github.io/wiki/tutoriallm/#using-keyword-lists-with-pocketsphinx

Cmuclmtk Development (official documentation): https://cmusphinx.github.io/wiki/cmuclmtkdevelopment/

cmuclmtk command reference: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html

Training the language model and improving the acoustic model for the PocketSphinx speech recognition system: https://www.cnblogs.com/bhlsheji/p/4514475.html