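Text recognition acts up from time to time.
Previously it could not tell 6 and 5 apart,
and today I found a comma and a 7 getting merged together, so the result came out as one garbled blob.
I wanted to restrict recognition to digits. After asking around on Baidu and Google, the options look roughly like this:
A Stack Overflow answer mentions that tesseract can be configured through its API, but it seems to be talking about a different tesseract module. I had trouble installing that one, so I gave up on it.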
https://stackoverflow.com/questions/9794029/python-tesseract-ocr-get-digits-only
Then I figured pytesseract could probably do the same thing. I looked through its source, searched a bit more, and ended up with this:
pytesseract.image_to_string(im,lang='eng',config='-psm 7 digits')
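The language is set to English, and config is set to -psm 7 digits. (Tesseract 4.0 and later spell the flag --psm, with two dashes.)
With this, only digits are recognized.
The digit whitelist can be edited in Tesseract-OCR\tessdata\configs\digits, through the tessedit_char_whitelist setting in that file.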
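To make the call above easier to reuse, here is a minimal sketch that builds the config string and wraps the OCR call. The helper names build_config and read_digits and the double_dash parameter are my own, not part of pytesseract. The sketch assumes pytesseract and Pillow are installed and that the tesseract binary is on the PATH.

```python
def build_config(psm=7, digits_only=True, double_dash=False):
    """Build the config string passed to pytesseract.image_to_string.

    double_dash is a hypothetical switch for this sketch: Tesseract 3.x
    takes "-psm", while Tesseract 4.0 and later take "--psm".
    """
    flag = "--psm" if double_dash else "-psm"
    parts = [flag, str(psm)]
    if digits_only:
        # "digits" names the config file under tessdata/configs
        parts.append("digits")
    return " ".join(parts)


def read_digits(image_path, double_dash=False):
    """Run OCR on a single-line image and return the stripped text."""
    # Imported here so build_config works without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as im:
        text = pytesseract.image_to_string(
            im, lang="eng", config=build_config(7, True, double_dash)
        )
    return text.strip()
```

With the defaults, build_config() returns the same string used in this post, "-psm 7 digits". On a newer Tesseract install, pass double_dash=True instead.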
Appendix:
-psm N
Set Tesseract to only run a subset of layout analysis and assume a certain form of image. The options for N are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR.
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
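For cropped number images, the modes from the list above that matter most are probably 7, 8 and 10. A small sketch for picking one by what the crop contains follows. The shape names and the psm_for helper are made up for illustration, and only the three modes I would actually try are covered.

```python
# Hypothetical mapping from the kind of crop to a page segmentation mode,
# using the mode numbers from the list above.
PSM_BY_SHAPE = {
    "line": 7,   # Treat the image as a single text line.
    "word": 8,   # Treat the image as a single word.
    "char": 10,  # Treat the image as a single character.
}


def psm_for(shape):
    """Return the page segmentation mode for a given crop shape."""
    try:
        return PSM_BY_SHAPE[shape]
    except KeyError:
        raise ValueError("unknown shape: %r" % (shape,)) from None
```

For example, psm_for("char") gives 10, which can then be placed in the config string in place of 7 when each image holds a single digit.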