Initial installation

Install tesseract-ocr
Download from https://digi.bib.uni-mannheim.de/tesseract/
Install jTessBoxEditorFX
Download from https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/

Training steps

Prepare the sample images

Merge the images into a TIFF

Merge the sample images into a single TIFF file.
File name format: [lang].[fontname].exp[num].tif
For example: yjc.font.exp0.tif
How to do it: jTessBoxEditorFX -> Tools -> Merge TIFF...
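
The naming format above can be checked mechanically before running any Tesseract command. The following is a minimal Python sketch; the helper names training_tiff_name and parse_training_tiff_name are invented for illustration and are not part of Tesseract or jTessBoxEditorFX.

```python
import re

# Pattern for the [lang].[fontname].exp[num].tif format described above.
_NAME_RE = re.compile(r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.tif$")


def training_tiff_name(lang: str, fontname: str, num: int) -> str:
    """Build a file name such as yjc.font.exp0.tif (hypothetical helper)."""
    return f"{lang}.{fontname}.exp{num}.tif"


def parse_training_tiff_name(name: str):
    """Split a file name into (lang, fontname, num); raise if it does not match."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not in [lang].[fontname].exp[num].tif format: {name!r}")
    return match.group("lang"), match.group("font"), int(match.group("num"))
```

Calling parse_training_tiff_name on a misnamed file raises ValueError, which catches a typo in the name before the later steps run.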

[Screenshot: merging the TIFF images]

Generate the box file for the TIFF image

tesseract yjc.font.exp0.tif yjc.font.exp0 batch.nochop makebox

Adjust the boxes

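Open the TIFF file in jTessBoxEditorFX

[Screenshot: opening the TIFF file]
Correct the recognized content

[Screenshot: adjusting the content]
Save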
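
After editing, it can help to scan the saved box file for obviously broken entries. The sketch below assumes the usual box file layout of one symbol per line followed by left, bottom, right, top and page number, separated by spaces, with coordinates measured from the bottom-left corner of the image. Check this against your own file before relying on it. The names Box, parse_box_line and is_degenerate are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one line of a box file (assumed layout: symbol l b r t page)."""
    # Split from the right so the symbol field is kept intact.
    symbol, *numbers = line.rstrip("\r\n").rsplit(" ", 5)
    if len(numbers) != 5:
        raise ValueError(f"unexpected box line: {line!r}")
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(symbol, left, bottom, right, top, page)


def is_degenerate(box: Box) -> bool:
    """True when the box has zero or negative width or height."""
    return box.right <= box.left or box.top <= box.bottom
```

Running parse_box_line over every line of yjc.font.exp0.box and printing the entries for which is_degenerate returns True gives a quick list of boxes worth rechecking in the editor.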

Generate the lstmf file required for LSTM training from the TIF image and the box file

The eng language is used here.

tesseract yjc.font.exp0.tif yjc.font.exp0 -l eng --psm 6 lstm.train
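
If there are several merged TIFF files, the command above has to be repeated for each one. The following sketch only builds the argument list for a given TIFF path, using the same options as the command above; the function name lstmf_command is hypothetical, and the sketch assumes the box file sits next to the TIFF with the same base name.

```python
from pathlib import Path


def lstmf_command(tif_path, lang="eng", psm=6):
    """Return the tesseract argument list that writes <base>.lstmf for one TIFF."""
    tif = Path(tif_path)
    if tif.suffix.lower() not in (".tif", ".tiff"):
        raise ValueError(f"expected a TIFF file, got {tif_path!r}")
    # Output base is the TIFF path without its extension, e.g. yjc.font.exp0
    output_base = tif.with_suffix("")
    return [
        "tesseract", str(tif), str(output_base),
        "-l", lang,
        "--psm", str(psm),
        "lstm.train",
    ]
```

The returned list can be passed to subprocess.run(cmd, check=True) once tesseract is on the PATH.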

Extract the LSTM file for the language

Download the traineddata file for the language from https://github.com/tesseract-ocr/tessdata_best
The language used above is eng.

combine_tessdata -e eng.traineddata eng.lstm

Running this command produces the file eng.lstm.

Create a file containing the absolute paths of the training files

The file created here is eng.training_files.txt
File contents:

C:\Users\xxx\Desktop\test1\yjc.font.exp0.lstmf
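
With more than one lstmf file, the list can be generated instead of typed by hand. This is a minimal sketch, assuming all lstmf files sit in one directory; the function name write_training_list is hypothetical.

```python
from pathlib import Path


def write_training_list(lstmf_dir, list_file):
    """Write the absolute path of every .lstmf file in lstmf_dir, one per line."""
    paths = sorted(p.resolve() for p in Path(lstmf_dir).glob("*.lstmf"))
    if not paths:
        raise FileNotFoundError(f"no .lstmf files found in {lstmf_dir!r}")
    with open(list_file, "w", encoding="utf-8", newline="\n") as handle:
        for path in paths:
            handle.write(f"{path}\n")
    return paths
```

For example, write_training_list(".", "eng.training_files.txt") lists every lstmf file in the current directory.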

Start training

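The command specifies where the checkpoint files are written (model_output), the model to continue training from (continue_from), the file listing the training data (train_listfile), the starting traineddata containing the unicharset, recoder and optional language model (traineddata), debug output on every training iteration (debug_interval = -1), and the number of iterations (max_iterations).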

lstmtraining --model_output="C:\Users\xxx\Desktop\test1\train\out" --continue_from="C:\Users\xxx\Desktop\test1\train\eng.lstm" --train_listfile="C:\Users\xxx\Desktop\test1\train\eng.training_files.txt" --traineddata="C:\Users\xxx\Desktop\test1\train\eng.traineddata" --debug_interval -1 --max_iterations 4000
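
Because the command is long and every path has to be kept consistent, it can be convenient to assemble it from a single working directory. The sketch below uses only the options shown in the command above; the function name finetune_command is hypothetical, and the file names inside the working directory follow this article's layout, which may differ from your own setup.

```python
from pathlib import Path


def finetune_command(work_dir, base_lang="eng", max_iterations=4000):
    """Build the lstmtraining argument list for fine-tuning from <base_lang>.lstm."""
    work = Path(work_dir)
    return [
        "lstmtraining",
        f"--model_output={work / 'out'}",
        f"--continue_from={work / (base_lang + '.lstm')}",
        f"--train_listfile={work / (base_lang + '.training_files.txt')}",
        f"--traineddata={work / (base_lang + '.traineddata')}",
        "--debug_interval", "-1",  # debug output on every iteration, as above
        "--max_iterations", str(max_iterations),
    ]
```

Passing the list to subprocess.run avoids the manual quoting needed on the command line for paths that contain spaces.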

This step produces the checkpoint files.

Build the new traineddata file

lstmtraining --stop_training --continue_from="C:\Users\xxx\Desktop\test1\train\out_checkpoint" --traineddata="C:\Users\xxx\Desktop\test1\train\eng.traineddata" --model_output="C:\Users\xxx\Desktop\test1\train\yjc.traineddata"

The yjc.traineddata file generated here is the final result.

References:

https://tesseract-ocr.github.io/tessdoc/tess4/TrainingTesseract-4.00.html

Tesseract OCR image recognition training