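Knowledge-Base RAG Application Techniques -- Knowledge Processing
Document parsing -> chunking/segmentation -> vectorization -> metadata, knowledge graph
Document parsing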
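In a knowledge-base application, the first step is always to load and parse the source documents (DOC, PDF, PPT, and so on); chunking, vectorization, and every later step build on that result. The quality of the data processing at this stage therefore has a large effect on everything downstream.
Below are several commonly used Python libraries:
unstructured, Tika (extracts content and metadata from many document formats), pdfminer3k (reads text from PDFs), pdfplumber (parses text and table content from PDFs)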
- pdfplumber demo
import pdfplumber

with pdfplumber.open('pdfs/The Part-Time Parliament.pdf') as pdf:
    print(f'metadata : {pdf.metadata}\n')  ## PDF metadata
    for i, page_pdf in enumerate(pdf.pages):
        text = page_pdf.extract_text()  ## extract text
        table = page_pdf.extract_table()  ## extract the largest table on the page (None if there is none)
        print(f'page-{i} text : {text}\n')
        print(f'page-{i} table : {table}\n')
        ## extract images
        imgs = page_pdf.images
        for j, img in enumerate(imgs):
            if img:
                ## convert PDF coordinates (origin at the bottom-left) to pdfplumber's top-left system
                bbox = [img['x0'], page_pdf.cropbox[3] - img['y1'], img['x1'], page_pdf.cropbox[3] - img['y0']]
                img_page = page_pdf.crop(bbox=bbox)
                ## render the cropped region as a bitmap and save it
                img_obj = img_page.to_image(resolution=500)
                page_number = img['page_number']
                image_name = f'page_{page_number}_image_{j + 1}.png'
                image_path = f'pdfs/{image_name}'
                img_obj.save(image_path)
#### Math formulas are poorly supported, and images need special handling
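The `extract_table()` call in the demo returns a table as a list of rows, with `None` for empty cells. Before chunking and vectorization it can help to flatten that structure into text. The following is a minimal sketch, assuming the first row is the header; `table_to_markdown` is a hypothetical helper name and not part of pdfplumber.

```python
def table_to_markdown(table):
    """Render a pdfplumber-style table (list of rows) as a Markdown table.

    Assumes the first row is the header; None cells become empty strings.
    """
    if not table:
        return ""

    def clean(cell):
        # Cells may be None, or contain line breaks and pipes that would break a row.
        text = "" if cell is None else str(cell)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in table)
    # Pad short rows so every row has the same number of columns.
    rows = [[clean(c) for c in row] + [""] * (width - len(row)) for row in table]

    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


# Hand-written sample data in the shape extract_table() returns.
sample = [["Name", "Role"], ["Paxos", None], ["Multi\nPaxos", "leader"]]
print(table_to_markdown(sample))
```

In the demo this could be called as `table_to_markdown(table)` whenever `table` is not `None`.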
- pdfminer3k demo
import io
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

## Note: pdfminer.pdfpage and PDFPage.get_pages follow the pdfminer.six API;
## older pdfminer3k releases may not provide this module.
pdf_path = 'pdfs/The Part-Time Parliament.pdf'

## converter that accumulates the text of the whole document
rsrcmgr = PDFResourceManager()
retstr = io.StringIO()
codec = 'utf-8'
laparams = LAParams()
device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)

password = ""
maxpages = 0     ## 0 means no page limit
caching = True
pagenos = set()  ## an empty set means all pages
page_texts = []
pre_rsrcmgr = PDFResourceManager()
pre_laparams = LAParams()

with open(pdf_path, 'rb') as fp:
    for page_num, page in enumerate(PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                                      password=password,
                                                      caching=caching,
                                                      check_extractable=True)):
        interpreter.process_page(page)
        ## per-page processing: a fresh buffer and converter for each page
        pre_retstr = io.StringIO()
        pre_device = TextConverter(pre_rsrcmgr, pre_retstr, codec='utf-8', laparams=pre_laparams)
        pre_interpreter = PDFPageInterpreter(pre_rsrcmgr, pre_device)
        pre_interpreter.process_page(page)
        pre_text = pre_retstr.getvalue()
        page_texts.append((page_num, pre_text))
        pre_device.close()
        pre_retstr.close()

all_text = retstr.getvalue()
device.close()
retstr.close()

for page, text in page_texts:
    print(f'page-{page} Text : {text}\n')
    print('==============================')
print(f'all_text : {all_text}\n')
#### Extracts text only; tables, images, and similar content have to be handled with other packages
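The demo runs every page through two interpreters, once for the whole-document text and once for the per-page text. A lighter alternative may be to split the whole-document text afterwards. The sketch below rests on an assumption that should be checked against the installed version: that `TextConverter` writes a form feed character after each page, as pdfminer.six appears to do. `split_pages` is a hypothetical helper name.

```python
def split_pages(all_text):
    """Split whole-document text into (page_index, page_text) pairs.

    Assumption: the text of each page is followed by a form feed character.
    """
    pages = all_text.split("\f")
    # A trailing form feed leaves one empty piece at the end; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return list(enumerate(pages))


# Hand-written sample standing in for the all_text value from the demo.
sample = "first page\n\fsecond page\n\f"
for page_num, text in split_pages(sample):
    print(f"page-{page_num} Text : {text!r}")
```

With the demo's variables this would be `split_pages(all_text)`, which would make the second `pre_*` converter unnecessary if the assumption holds.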
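- Tika demo
## tika-python starts a local Tika server in the background, so a Java runtime must be installed
from tika import parser

pdf_path = 'pdfs/The Part-Time Parliament.pdf'
parsed = parser.from_file(pdf_path)
metadata = parsed['metadata']  ## document metadata as a dict
print(f'metadata : {metadata}\n')
text = parsed['content']  ## plain text of the whole document
print(f'text : {text}\n')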
#### Extracts text and metadata, but cannot handle images or tables, and does not separate the text by page
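One possible workaround for the missing page boundaries: `parser.from_file` accepts `xmlContent=True`, which returns XHTML instead of plain text, and for PDFs Tika generally wraps each page in `<div class="page">`. Treat that markup as an assumption to verify against your Tika version. The sketch below uses only the standard library and runs on a hand-written sample string; `PageSplitter` and `split_tika_pages` are hypothetical names.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text inside each <div class="page"> of an XHTML string."""

    def __init__(self):
        super().__init__()
        self.pages = []    # finished page texts, in document order
        self._depth = 0    # div nesting depth inside the current page; 0 = outside
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth > 0:
            self._depth += 1  # nested div inside a page
        elif "page" in (dict(attrs).get("class") or "").split():
            self._depth = 1   # a new page starts here
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:  # the page div just closed
                self.pages.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def split_tika_pages(xhtml):
    """Return one text string per page div found in the XHTML."""
    splitter = PageSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return splitter.pages


# Hand-written sample in the assumed shape of Tika's XHTML output.
sample = (
    '<html><body>'
    '<div class="page"><p>first page</p></div>'
    '<div class="page"><p>second</p><div class="annotation">note</div></div>'
    '</body></html>'
)
print(split_tika_pages(sample))
```

With Tika itself, the input would come from `parser.from_file(pdf_path, xmlContent=True)['content']`.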
- unstructured demo
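import io
import pandas as pd
from unstructured.partition.auto import partition
from unstructured.documents.elements import Table, Image

pdf_path = 'pdfs/The Part-Time Parliament.pdf'
elements = partition(filename=pdf_path)  ## detects the file type and returns a list of elements
dataframes = []
for index, el in enumerate(elements):
    table_data = None
    image_data = None
    text_data = None
    page = el.metadata.page_number
    ## text_as_html may be None when the table structure was not inferred
    if isinstance(el, Table) and el.metadata.text_as_html:
        ## wrap the HTML in StringIO; newer pandas versions deprecate passing literal HTML strings
        table_data = pd.read_html(io.StringIO(el.metadata.text_as_html))
        for table in table_data:
            dataframes.append(table)
    elif isinstance(el, Image):
        image_data = el.metadata.coordinates  ## position of the image only, not its pixels
    else:
        text_data = str(el)
    print(f'!!! page-{page} - {index}\n Table : {table_data}\n Image : {image_data}\n Text : {text_data}\n')
print('=========== END =================')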
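#### The unstructured library can handle many file formats, such as PDF, Word, and CSV
#### Some file formats may need extra dependencies; processing PDF files, for example, may require installing pytesseract and similar packages
#### The partition function is imported from unstructured.partition.auto; it detects the file type automatically and processes the file accordingly
#### Text and tables can be handled directly by element type; recognizing and extracting images requires other packages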
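Because each element carries its type and page number, the output of `partition` can be regrouped by page before chunking. The sketch below assumes the elements expose `category`, `text`, and `metadata.page_number`, as unstructured elements generally do. It runs on stand-in objects rather than real `partition` output, and `group_by_page` is a hypothetical helper name.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_page(elements):
    """Group elements into {page_number: [(category, text), ...]}."""
    pages = defaultdict(list)
    for el in elements:
        # page_number can be missing for formats without pages; file those under None.
        page = getattr(el.metadata, "page_number", None)
        pages[page].append((el.category, el.text))
    return dict(pages)


# Stand-in objects that only mimic the attributes used above.
fake_elements = [
    SimpleNamespace(category="Title", text="The Part-Time Parliament",
                    metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(category="NarrativeText", text="Some body text.",
                    metadata=SimpleNamespace(page_number=1)),
    SimpleNamespace(category="Table", text="a b c",
                    metadata=SimpleNamespace(page_number=2)),
]

for page, items in group_by_page(fake_elements).items():
    print(page, items)
```

With the demo's variables this would be `group_by_page(elements)`.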