Exporting PPT slides as images for fast search by slide text

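I. Requirements

As a part-time pre-sales engineer, I have accumulated a large stock of slide material, all stored as ppt files. My favorite activity is being a ppt porter: picking suitable slides from that material and assembling them into the deck I am currently writing. Finding the right material in the huge pile of old decks is the painful part, because I have to open every candidate ppt one by one to check whether it contains what I need.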

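Approach

Export every slide of every ppt file as a separate image, extract the text on the slide at the same time, and write that text into the comment field of the image file.
Spotlight, the search engine built into the Mac, can then quickly find the images related to a topic. With the Mac's built-in preview I can see 9 slides on one screen and pick out the suitable ones quickly, then use the image's file name to trace it back to the original ppt file.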
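Tracing an image back to its ppt file can be sketched as below. This is only an illustration, assuming the naming scheme used by the export script in step 2 (PIC_ + ppt file name + _ + slide number + .jpg); the function name locate_slide is hypothetical.

```python
import os
import re

# Assumed naming scheme from the export script: PIC_<ppt name>_<slide number>.jpg
_PIC_NAME = re.compile(r"^PIC_(?P<deck>.+)_(?P<slide>\d+)\.jpg$", re.IGNORECASE)


def locate_slide(image_path):
    """Return (ppt name without extension, slide number) for an exported image."""
    match = _PIC_NAME.match(os.path.basename(image_path))
    if match is None:
        raise ValueError("not an exported slide image: " + image_path)
    # The greedy ".+" keeps underscores that belong to the ppt name itself;
    # only the last "_<digits>" is treated as the slide number.
    return match.group("deck"), int(match.group("slide"))
```

The ppt extension (.ppt or .pptx) is not stored in the image name, so both have to be tried when opening the original file.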

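II. Steps

1. Gathering the PPT files

Use the search engine to collect all ppt files into the PPT_ALL folder, then remove duplicates with a dedicated duplicate-file removal tool. After deduplication I had a little over 3000 ppt files. Having built up that much material over years of work, it matters a lot to put it to use and bring these resources back to life.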
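I did this step by hand with a search tool and a deduplication tool. For readers who would rather script it, here is a minimal sketch that copies presentations into one folder and skips files whose content is byte-for-byte identical. The function names file_digest and collect_presentations are hypothetical, and SHA-256 hashing is a stand-in for whatever a dedicated tool does.

```python
import hashlib
import os
import shutil


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def collect_presentations(source_dirs, target_dir):
    """Copy unique .ppt/.pptx files into target_dir; return the copied paths."""
    os.makedirs(target_dir, exist_ok=True)
    seen = set()      # digests of files already copied in this run
    copied = []
    for source in source_dirs:
        for root, _dirs, files in os.walk(source):
            for name in sorted(files):
                if not name.lower().endswith((".ppt", ".pptx")):
                    continue
                path = os.path.join(root, name)
                digest = file_digest(path)
                if digest in seen:
                    continue  # same content was already copied
                seen.add(digest)
                # Different content under the same name gets a numeric suffix
                stem, ext = os.path.splitext(name)
                dest = os.path.join(target_dir, name)
                n = 1
                while os.path.exists(dest):
                    dest = os.path.join(target_dir, "%s_%d%s" % (stem, n, ext))
                    n += 1
                shutil.copy2(path, dest)
                copied.append(dest)
    return copied
```

The target folder should live outside the source folders, otherwise the walk would visit the copies as well.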

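2. Exporting each slide as an image and extracting key information

Export each slide of each ppt file as a separate image, and write the slide's text content to an intermediate text file.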

# This program must run in a Windows virtual machine on the Mac. Install pywin32 (which provides
# win32com.client) beforehand so that Python can drive PowerPoint through COM. Written for Python 3.
import os
import re
import win32com.client

def ppt_pic(fileName, export_dir, tag_fileName):
    # Export every slide of one ppt as a JPG and append each slide's text to the tag file.
    # Returns True if the whole file was processed without an error.
    Real_name = os.path.splitext(os.path.basename(fileName))[0]
    powerpoint = win32com.client.Dispatch("PowerPoint.Application")
    powerpoint.Visible       = True    # Visible so the work can be watched (for safety reasons PowerPoint shows up even when this is set to False)
    powerpoint.DisplayAlerts = False   # Ignore pop-up alerts so the run is not interrupted
    ok = False
    try:
        ppt = powerpoint.Presentations.Open(fileName)

        # Option 1: export the whole PPT as images in one call (image names cannot be chosen; the default is "幻灯片X")
        # ppt.Export(export_dir, "png")

        # Option 2: export slide by slide (so that each image can be given a name)
        slide_count = ppt.Slides.Count
        # The tag file is stored as gb18030
        with open(tag_fileName, 'a', encoding='gb18030') as tag_f:
            for i in range(1, slide_count + 1):
                fullpath = os.path.join(export_dir, "PIC_" + Real_name + "_%d.jpg" % i)
                ppt.Slides(i).Export(fullpath, "JPG")
                shape_count = ppt.Slides(i).Shapes.Count
                s = ""
                for j in range(1, shape_count + 1):
                    shape = ppt.Slides(i).Shapes(j)
                    if shape.HasTextFrame:
                        # "!!" separates the text of different shapes
                        s = s + shape.TextFrame.TextRange.Text + "!!"
                # One line per slide: image path, a tab, then the text with every whitespace character replaced by a space
                tag_f.write(fullpath)
                tag_f.write('\t')
                tag_f.write(re.sub(r'\s', ' ', s))
                tag_f.write('\n')
        ppt.Close()
        ok = True
    except Exception as e:
        print("error " + fileName + ": " + str(e))
    powerpoint.Quit()
    return ok

# Export the ppt images to the PPT_OUT folder
export_dir = "Y:\\Documents\\PPT_OUT"
# Extract the text in each ppt and write it to tag.txt
tag_fileName = "Y:\\Documents\\tag.txt"
# Walk through all ppt files in the PPT_ALL folder
domain = os.path.abspath("Y:\\Documents\\PPT_ALL")
for pptfile in os.listdir(domain):
    fileName = os.path.join(domain, pptfile)
    print(fileName)
    if fileName.lower().endswith((".ppt", ".pptx")):
        # Note: a processed file is deleted from PPT_ALL; files that failed to export are kept
        if ppt_pic(fileName, export_dir, tag_fileName):
            try:
                os.remove(fileName)
            except OSError:
                print("error delete " + fileName)
                continue

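Running this program exports every ppt in PPT_ALL as images stored in PPT_OUT. Each image is named PIC_ plus the ppt file name plus the slide number, and the text extracted from each slide is written to tag.txt.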

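3. Writing the slide text into the image comments

The Mac's file search is very powerful: besides file names, it can also search a file's comment information.
So the text of each slide needs to be written into the comment of its image, which is the comment section shown in the figure below.
This requires some familiarity with the Mac's file metadata structure; the text is written into the comment of the matching file according to the file name.

[Figure: image.png]

The code is as follows: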

# Runs on the Mac side. Written for Python 3.
import re
import os
import subprocess
from xml.sax.saxutils import escape

def writexattrs(F, TagList):
    """ writexattrs(F,TagList):
    Writes the list of tags to xattr field of file named F
    """
    plistFront = '<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"><plist version="1.0"><array>'
    plistEnd = '</array></plist>'
    plistTagString = ''
    for Tag in TagList:
        # Escape &, < and > so the plist stays well-formed XML
        plistTagString = plistTagString + '<string>{}</string>'.format(escape(Tag))
    TagText = plistFront + plistTagString + plistEnd

    WhichAttribute = "com.apple.metadata:kMDItemFinderComment"
    # Other attributes you might want to try: ["kOMUserTags","kMDItemOMUserTags","_kMDItemUserTags","kMDItemkeywords"]
    # Pass the arguments as a list (no shell) so quotes in the text or file name cannot break the command
    XattrCommand = ['xattr', '-w', WhichAttribute, TagText, F]
    # optional, print command format to check:
    # print(XattrCommand)
    ProcString = subprocess.check_output(XattrCommand, stderr=subprocess.STDOUT)
    return ProcString

# Read the image paths and slide text (tag.txt is stored as gb18030)
with open('/Users/XXX/Documents/tag.txt', 'r', encoding='gb18030') as f:
    lines = f.readlines()
I = 0
for line in lines:
    try:
        line_sp = line.split('.jpg', 1)
        if len(line_sp) > 1:
            # The exported text contains many unsuitable characters that need bulk replacement
            filename = line_sp[0] + '.jpg'
            TagStr = line_sp[1].replace("!!", " ")
            TagStr = TagStr.replace("-", " ")
            TagStr = TagStr.replace(".", " ")
            TagStr = re.sub(r'([\d]+)', '', TagStr)
            TagList = TagStr.split()
            # Map the Windows path recorded in tag.txt to the same image on the Mac
            file_back = filename.split('PPT_OUT\\')
            file_real = os.path.join("/Users/XXX/Documents/PPT_OUT", file_back[1])
            I = I + 1   # running count of processed images
            print(I)
            print(file_real)
            if len(TagList) > 1:
                writexattrs(file_real, TagList)
    except Exception as e:
        print("error " + line.strip() + ": " + str(e))
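The plist string that writexattrs assembles by hand can also be produced with Python 3's standard plistlib module, which takes care of XML escaping. The sketch below is an alternative under that assumption, not the code I actually ran; the names build_comment_plist and write_comment are hypothetical, and I have not checked whether Spotlight treats the extra header and line breaks that plistlib emits the same way as the hand-built string.

```python
import plistlib
import subprocess


def build_comment_plist(tag_list):
    """Serialize a list of strings as an XML property list (UTF-8 bytes)."""
    return plistlib.dumps(list(tag_list), fmt=plistlib.FMT_XML)


def write_comment(path, tag_list):
    """macOS only: store the plist in the Finder-comment extended attribute."""
    payload = build_comment_plist(tag_list).decode("utf-8")
    # Same attribute name as in writexattrs above; arguments passed without a shell
    subprocess.check_call(
        ["xattr", "-w", "com.apple.metadata:kMDItemFinderComment", payload, path]
    )
```

Only build_comment_plist is portable; write_comment depends on the xattr command that ships with macOS.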

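III. Enjoying fast search

Open the PPT_OUT folder, which now contains nothing but images, and use the Mac's built-in search. Searching for the keyword “区块链” (blockchain), for example, returns every slide that contains it, and image preview then makes it quick to find the slide I need.
The result looks like this:

[Figure: search results]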
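The same search can also be run from a script through mdfind, the command-line front end to Spotlight. This is a minimal sketch, assuming the slide text is indexed under the kMDItemFinderComment attribute; the function names build_mdfind_command and search_slides are hypothetical.

```python
import subprocess


def build_mdfind_command(keyword, folder):
    """Build an mdfind call that looks for keyword in Finder comments under folder."""
    # Escape backslashes and double quotes so the keyword stays inside the query string
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    # The trailing "cd" makes the match case- and diacritic-insensitive
    query = 'kMDItemFinderComment == "*%s*"cd' % escaped
    return ["mdfind", "-onlyin", folder, query]


def search_slides(keyword, folder):
    """macOS only: return the paths of the matching images."""
    output = subprocess.check_output(build_mdfind_command(keyword, folder))
    return output.decode("utf-8").splitlines()
```

For example, search_slides("区块链", "/Users/XXX/Documents/PPT_OUT") would return the image paths that Spotlight matches, which can then be opened in the preview.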

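I ran into quite a few problems along the way, such as how to write into the comment through Apple's file system metadata and how to export the images. The process was fun, and it greatly improved how efficiently I can look things up: writing slides now feels like working at double speed.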
