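Background
I have been reading papers for the past couple of days: dense English full of technical terms, which gave me a headache. Google Translate helps with comprehension (its translations are actually quite good), but I hit a small snag when using it, shown in the figure below.
When text is selected in the PDF version of the paper, it looks like this:
Copied into Notepad, it looks like this: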
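Two things stand out:
- Many words are cut in two at the end of a line, with the halves joined by a '-'.
- What should be continuous text is full of line breaks. In other words, the copied text contains many '\n' characters, and if it is pasted straight into Google Translate, the text before each '\n' is treated as a separate sentence.
As a result the translation is poor, and the broken lines have to be joined by hand one at a time. The ideal result looks like this: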
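As a rough illustration, the two problems can be reproduced on a made-up snippet. The sample sentence below is invented for this sketch and is not taken from the paper:

```python
# Made-up sample that mimics text copied out of a PDF column.
copied = (
    "We propose a novel frame-\n"
    "work for image caption-\n"
    "ing that generalizes.\n"
)
# What we would like to paste into the translator instead.
wanted = "We propose a novel framework for image captioning that generalizes."

# One fragment per hard line break; each would be translated on its own.
fragments = copied.splitlines()
# Lines whose last word was split in two by the line break.
broken_words = [f for f in fragments if f.endswith("-")]

print(len(fragments))     # pieces separated by a hard line break
print(len(broken_words))  # lines ending in a split word
```

A single sentence is split into several fragments, and some of those fragments end in half a word, which is why the translator struggles with the raw copy.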
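A paper probably has several hundred to nearly a thousand lines, and pressing Backspace line by line is both tedious and dumb, which no programmer should put up with. So I decided to write a Python script to handle it.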
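Getting started
The idea is simple:
- Find a PDF-processing library and extract the text.
- Apply string insertions and deletions to join the broken lines back into sentences.
- Translate the resulting text by calling the Google API, or by simulating a browser visit to http://translate.google.cn and fetching the result.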
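I started by searching GitHub for "python pdf process", with the following results:
The first result had too many dependencies, and nothing else in the list looked good either. Suspecting the keywords were off, I switched to "python pdf extract" and found the slate library. I then discovered that slate is built entirely on top of pdfminer, so I simply used pdfminer directly.
I checked Douban's PyPI mirror in China and pdfminer was indeed there, so I installed it with pip. Because of a certificate problem, I ended up using the Tsinghua mirror instead. The install command is: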
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pdfminer
In the pdfminer repo I found a script, pdf2txt.py, that extracts text from a PDF, so I happily used it as is. Step one done.
Post-processing the text
Running pdf2txt.py extracts the text, still full of line breaks, into a txt file, which looks like this:
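After some thought, I settled on the following processing logic:
- Keep only the text from the 'Abstract' heading onward.
- Delete any line that contains only digits and spaces (such lines are page numbers or footers).
- Strip the trailing '\n' from every line.
- For a line that consists only of '\n', add one extra '\n' to the new text so the paragraph break survives.
- After the newline is stripped, if the line ends with '-', delete that '-'; otherwise append a space so that words on adjacent lines do not run together.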
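The rules above can be sketched as a few small functions that work on a list of lines. This is a minimal sketch rather than the script listed at the end of the post: the names `is_page_number`, `join_line` and `reflow`, and the `start_marker` parameter, are made up for illustration, and the sample input is invented.

```python
def is_page_number(line):
    """True for lines made only of digits and spaces, e.g. '12 '."""
    body = line.rstrip('\n')
    return body != '' and all(c in '0123456789 ' for c in body)


def join_line(line):
    """Turn one raw line into the piece that goes into the output."""
    body = line.rstrip('\n')
    if body == '':
        return '\n\n'        # blank line: keep a paragraph break
    if body.endswith('-'):
        return body[:-1]     # split word: glue it to the next line
    return body + ' '        # ordinary line: separate from the next with a space


def reflow(lines, start_marker='Abstract'):
    """Apply the rules to every line from the start marker onward."""
    pieces = []
    recording = False
    for line in lines:
        if line.rstrip('\n') == start_marker:
            recording = True
        if recording and not is_page_number(line):
            pieces.append(join_line(line))
    return ''.join(pieces)


sample = ['Title\n', 'Abstract\n', 'Deep learn-\n', 'ing works\n',
          '12 \n', '\n', 'well.\n']
result = reflow(sample)
print(repr(result))
```

One consequence of the last rule: a genuinely hyphenated compound that happens to break at the end of a line loses its hyphen too, so the output may need a quick manual check.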
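Writing and testing as I went, I got step two basically working. The last step, calling Google Translate, gave me some trouble: when I parsed the returned HTML, the text had not been translated, and I do not know why. But the last step no longer mattered much, and time is precious, so I stopped here.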
Source code and results
The result after running the script:
Finally, the source code:
import sys
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice, TagExtractor
from pdfminer.pdfpage import PDFPage
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.cmapdb import CMapDB
from pdfminer.layout import LAParams
from pdfminer.image import ImageWriter
"""
print ('usage: %s [-d] [-p pagenos] [-m maxpages] [-P password] [-o output]'
' [-C] [-n] [-A] [-V] [-M char_margin] [-L line_margin] [-W word_margin]'
' [-F boxes_flow] [-Y layout_mode] [-O output_dir] [-R rotation] [-S]'
' [-t text|html|xml|tag] [-c codec] [-s scale]'
' file ...' % argv[0])
"""
# main
def extract_text(filename, password_param, output_file):
    # debug option
    debug = 0
    # input option
    password = ''
    pagenos = set()
    maxpages = 0
    # output option
    outfile = None
    outtype = None
    imagewriter = None
    rotation = 0
    stripcontrol = False
    layoutmode = 'normal'
    codec = 'utf-8'
    pageno = 1
    scale = 1
    caching = True
    showpageno = True
    laparams = LAParams()
    if filename.strip()[-4:] != '.pdf':
        print 'file type is not pdf!'
        return
    elif output_file is not None:
        outfile = output_file
    else:
        outfile = filename.strip()[:-4] + '.txt'
    print 'output file path: %s' % outfile
    if password_param is not None:
        password = password_param
    PDFDocument.debug = debug
    PDFParser.debug = debug
    CMapDB.debug = debug
    PDFPageInterpreter.debug = debug
    #
    rsrcmgr = PDFResourceManager(caching=caching)
    if not outtype:
        outtype = 'text'
        if outfile:
            if outfile.endswith('.htm') or outfile.endswith('.html'):
                outtype = 'html'
            elif outfile.endswith('.xml'):
                outtype = 'xml'
            elif outfile.endswith('.tag'):
                outtype = 'tag'
    if outfile:
        outfp = file(outfile, 'w')
    else:
        outfp = sys.stdout
    if outtype == 'text':
        device = TextConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                               imagewriter=imagewriter)
    elif outtype == 'xml':
        device = XMLConverter(rsrcmgr, outfp, codec=codec, laparams=laparams,
                              imagewriter=imagewriter,
                              stripcontrol=stripcontrol)
    elif outtype == 'html':
        device = HTMLConverter(rsrcmgr, outfp, codec=codec, scale=scale,
                               layoutmode=layoutmode, laparams=laparams,
                               imagewriter=imagewriter, debug=debug)
    elif outtype == 'tag':
        device = TagExtractor(rsrcmgr, outfp, codec=codec)
    else:
        return
    fname = filename
    fp = file(fname, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    print 'extracting text in pdf ... ...'
    page_cnt = 1
    for page in PDFPage.get_pages(fp, pagenos,
                                  maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        page.rotate = (page.rotate+rotation) % 360
        print 'processing page %d ...' % page_cnt
        interpreter.process_page(page)
        page_cnt += 1
    fp.close()
    device.close()
    outfp.close()
    print 'text has been written into %s ' % outfile
    return outfile
def check_line_valid(line):  # only a line of digits and spaces, like '1 ', is invalid
    line = line[:-1]  # drop the trailing '\n'
    if line == '':
        return True
    digits = '0123456789'
    for c in line:
        if c != ' ' and c not in digits:
            return True
    return False
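def process_line(line):
    if line != '\n':  # non-blank line: strip the '\n' so it joins the next line
        line = line[:-1]
        if line[-1:] == '-':  # word split across lines: drop the hyphen
            line = line[:-1]
        else:
            line += ' '
    else:  # blank line: add a second '\n' to keep the paragraph break
        line += '\n'
    return line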
"""
"""
def reformat_output_file(outfile):
    text_reformated = ''
    file_handler = open(outfile)
    line = file_handler.readline()
    recording = False  # becomes True once the 'Abstract' line is reached
    while line:
        if line == 'Abstract\n':
            recording = True
        if recording is True:
            if check_line_valid(line):
                line = process_line(line)
                text_reformated += line
        line = file_handler.readline()
    file_handler.close()
    print '%s has been reformated.' % outfile
    file_reformated_name = outfile[:-4] + '.reformated.txt'
    file_handler = open(file_reformated_name, 'w')
    file_handler.write(text_reformated)
    file_handler.close()  # close the output file so the text is flushed to disk
    return text_reformated
of = extract_text('H://Hendricks_Deep_Compositional_Captioning_CVPR_2016_paper.pdf', '', None)
tr = reformat_output_file(of)
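Notes on the code:
- Dependency: pdfminer
- Environment: Windows 10, Python 2.7
- The reformatted text is written to a txt file (ending in .reformated.txt) in the same directory as the source PDF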
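The script above targets Python 2.7: it uses the `print` statement and the `file()` built-in, neither of which exists in Python 3. For readers on Python 3, the reformatting step alone could be sketched roughly as follows. `reformat_file` and its parameters are hypothetical names, the UTF-8 encoding is an assumption, and the PDF extraction step is not covered.

```python
import io


def reformat_file(src_path, dst_path, start_marker='Abstract'):
    """Join broken lines from src_path and write the result to dst_path."""
    pieces = []
    recording = False
    with io.open(src_path, encoding='utf-8') as src:
        for line in src:
            body = line.rstrip('\n')
            if body == start_marker:
                recording = True
            if not recording:
                continue
            # A non-empty line made only of digits and spaces is a page number.
            if body and not body.strip('0123456789 '):
                continue
            if body == '':
                pieces.append('\n\n')      # keep the paragraph break
            elif body.endswith('-'):
                pieces.append(body[:-1])   # glue a split word together
            else:
                pieces.append(body + ' ')
    text = ''.join(pieces)
    # The with-blocks close both files even if an error occurs.
    with io.open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write(text)
    return text
```

It would be called with the txt file produced by the extraction step as `src_path` and any writable path as `dst_path`.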