Keywords: Triton, Marker, Python
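Preface
Knowledge-base applications often need to parse PDF documents so that their content can be retrieved through RAG. This article introduces marker, an open-source PDF-to-Markdown tool, and shows how to serve it with Triton Inference Server.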
Contents
- A brief look at PDF parsing for knowledge bases
- Marker introduction and installation
- Marker quick start
- Serving with Triton
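A brief look at PDF parsing for knowledge bases
PDF documents usually mix varied formatting, images, tables, and other elements. Because RAG depends heavily on standardized, accurate data, converting a PDF directly to plain text tends to lose or scramble the way the content is organized. A better approach is to convert the PDF to Markdown, which preserves much more of the content's structural information.
Take the PDF of the paper "Attention Is All You Need" as an example. The original PDF looks like this
Converted to plain text, the result is as follows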
Attention Is All You Need
AshishVaswani∗ NoamShazeer∗ NikiParmar∗ JakobUszkoreit∗
GoogleBrain GoogleBrain GoogleResearch GoogleResearch
avaswani@google.com noam@google.com nikip@google.com usz@google.com
7102
LlionJones∗ AidanN.Gomez∗ † ŁukaszKaiser∗
GoogleResearch UniversityofToronto GoogleBrain
llion@google.com aidan@cs.toronto.edu lukaszkaiser@google.com
ceD
IlliaPolosukhin∗ ‡
illia.polosukhin@gmail.com 6
]LC.sc[
Abstract
Thedominantsequencetransductionmodelsarebasedoncomplexrecurrentor
convolutionalneuralnetworksthatincludeanencoderandadecoder. Thebest
performing models also connect the encoder and decoder through an attention
5v26730.6071:viXra
Converted to Markdown, the result is as follows
# Attention Is All You Need
| Ashish Vaswani∗ Google Brain | |
|--------------------------------|-------------------------------------------------|
| avaswani@google.com | Noam Shazeer∗ Google Brain |
| noam@google.com | Niki Parmar∗ |
| Google Research | |
| nikip@google.com | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
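Plain text cannot recover paragraphs that continue across line breaks, so each line is cut off from the next, whereas Markdown parses them into one complete paragraph. When the PDF layout is even slightly complex, plain-text extraction also merges completely unrelated text from different positions on the page: the "7102" in the example comes from the date printed in the left margin of the paper and should read 2017. Finally, unlike plain text, Markdown can capture hierarchical structure such as tables and headings. Overall, the Markdown output is of higher quality.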
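The stray fragments in the plain-text output become readable once their characters are reversed. The sketch below is only an illustration, and it assumes (an inference from the strings themselves, not something the extraction tool reports) that they all belong to one margin stamp that was extracted back to front. The helper name `unreverse` is made up for this example.

```python
# Stray fragments copied from the plain-text output above,
# listed from the last one that appears to the first.
fragments = ["5v26730.6071:viXra", "]LC.sc[", "6", "ceD", "7102"]

def unreverse(parts):
    """Reverse the characters of each fragment and join them with spaces."""
    return " ".join(p[::-1] for p in parts)

stamp = unreverse(fragments)
print(stamp)  # arXiv:1706.03762v5 [cs.CL] 6 Dec 2017
```

A layout-aware converter avoids this kind of cleanup by keeping margin text out of the body in the first place.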
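Marker introduction and installation
marker is an open-source project on GitHub, implemented in Python. It converts PDF to Markdown with a pipeline that combines several OCR-related models. The pipeline stages are
- OCR text extraction
- Page layout and reading-order detection
- Cleaning and formatting each block
- Merging the blocks and post-processing
marker can be installed with pip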
pip install marker-pdf
After installation, the corresponding conversion tool marker_single is available on the environment's PATH
$ which marker_single
/home/anaconda3/envs/my_env/bin/marker_single
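In addition, marker runs a number of models to recognize and infer on the PDF, so some model files must be downloaded. By default marker downloads them on first use. Here we instead download all the required models from HuggingFace ahead of time and place them in a vikp directory. The required model files are as follows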
[root@xxxx vikp]# ls -lt
总用量 0
drwxr-xr-x 2 root root 132 5月 14 16:37 surya_order
drwxr-xr-x 2 root root 10 5月 14 15:35 order_bench
drwxr-xr-x 2 root root 10 5月 14 15:32 publaynet_bench
drwxr-xr-x 2 root root 10 5月 14 15:32 rec_bench
drwxr-xr-x 2 root root 229 5月 14 15:31 surya_rec
drwxr-xr-x 2 root root 10 5月 14 15:28 doclaynet_bench
drwxr-xr-x 2 root root 98 5月 14 15:27 surya_det_math
drwxr-xr-x 2 root root 98 5月 14 15:26 surya_det2
drwxr-xr-x 2 root root 159 5月 14 15:20 pdf_postprocessor_t5
drwxr-xr-x 2 root root 98 5月 14 15:18 surya_layout2
drwxr-xr-x 2 root root 319 5月 14 15:17 texify
Each of these, surya_order for example, can be found on HuggingFace
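The offline download could be scripted roughly as follows. This is a sketch under assumptions: the Hub repository ids are taken to be `vikp/<directory name>` (inferred from paths such as `vikp/surya_det2` in marker's log output), only the model directories are covered (not the `*_bench` directories), and the helper names `download_plan` and `download_all` are invented for this example. `snapshot_download` comes from the `huggingface_hub` package.

```python
from pathlib import Path

# Model directories from the listing above; the *_bench directories are left out.
MODEL_NAMES = [
    "surya_det2", "surya_layout2", "surya_order", "surya_rec",
    "surya_det_math", "texify", "pdf_postprocessor_t5",
]

def download_plan(root="vikp", org="vikp"):
    """Map each assumed Hub repo id to the local directory it should land in."""
    return {f"{org}/{name}": Path(root) / name for name in MODEL_NAMES}

def download_all(root="vikp"):
    """Download every repo in the plan (needs network access)."""
    # Imported lazily so the plan can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    for repo_id, target in download_plan(root).items():
        snapshot_download(repo_id=repo_id, local_dir=str(target))

if __name__ == "__main__":
    for repo_id, target in download_plan().items():
        print(repo_id, "->", target)
    # download_all()  # uncomment to actually fetch the files
```

Printing the plan first makes it easy to confirm the target layout matches the directory listing before any large download starts.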
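Marker quick start
Run marker with the marker_single command on the PATH. The input is the path of a single PDF document and the output is a result directory. First change to the parent directory so that the vikp folder is located in the current working directory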
root@1fc83e178b80:/home/marker-pdf/1# marker_single 606addeff4c0070ce300ff0adc88eceb.pdf ./ --batch_multiplier 2 --max_pages 1 --langs Chinese
Loading detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loading detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loading reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.22s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.16it/s]
Finding reading order: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.44s/it]
Saved markdown to the ./606addeff4c0070ce300ff0adc88eceb folder
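The log shows marker loading the detection, layout, reading-order, and texify models onto the GPU and then writing the inference results to the 606addeff4c0070ce300ff0adc88eceb directory, which contains the extracted images, the Markdown file, and a configuration file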
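The same command-line run can also be driven from a Python script. The sketch below only reuses the flags shown in the command above. The helper names `build_marker_cmd` and `convert` are hypothetical, and it assumes the result folder is named after the PDF (as the log suggests) and holds at least one `.md` file.

```python
import subprocess
from pathlib import Path

def build_marker_cmd(pdf, out_dir, max_pages=1, langs="Chinese", batch_multiplier=2):
    """Assemble the marker_single invocation as an argument list."""
    return ["marker_single", str(pdf), str(out_dir),
            "--batch_multiplier", str(batch_multiplier),
            "--max_pages", str(max_pages),
            "--langs", langs]

def convert(pdf, out_dir="./"):
    """Run marker_single and return the Markdown files found in the result folder."""
    subprocess.run(build_marker_cmd(pdf, out_dir), check=True)
    result_dir = Path(out_dir) / Path(pdf).stem  # folder named after the PDF
    return sorted(result_dir.glob("*.md"))

if __name__ == "__main__":
    # Only prints the command; call convert(...) to actually run marker.
    print(" ".join(build_marker_cmd("606addeff4c0070ce300ff0adc88eceb.pdf", "./")))
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.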
Opening the Markdown file shows the recognized content
## 创业板投资风险提示
本次股票发行后拟在创业板市场上市,该市场具有较高的投资风险。创业板公司 具有创新投入大、新旧产业融合成功与否存在不确定性、尚处于成长期、经营风险高、 业绩不稳定、退市风险高等特点,投资者面临较大的市场风险。投资者应充分了解创 业板市场的投资风险及本公司所披露的风险因素,审慎作出投资决定。
湖北亨迪药业股份有限公司
![0_image_0.png](0_image_0.png)
![0_image_1.png](0_image_1.png)
Hubei Biocause Heilen Pharmaceutical Co., Ltd.
(荆门高新区·掇刀区杨湾路 122 号)
![0_image_2.png](0_image_2.png)
![0_image_3.png](0_image_3.png)
首次公开发行股票并在创业板上市 招股说明书 保荐人(主承销商)
(中国(上海)自由贸易试验区商城路 618 号)
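Serving with Triton
The command-line approach does not work across machines or programming languages, so marker needs to be exposed as a service. marker is essentially a pipeline of torch models, which makes it a good fit for deployment on Triton Inference Server, ultimately exposing an HTTP API for callers.
First pull the Triton base image nvcr.io/nvidia/tritonserver:23.10-py3 with Docker and pip install the marker-pdf package in a container started from it. Once installation finishes, commit the container as a new image, named for example tritonserver:marker-pdf-env.
Next set up the model directory structure. Triton models all live under the model_repository directory. Create a marker-pdf directory under model_repository with the following structure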
[root@zx-61 marker-pdf]# tree
.
├── 1
│ ├── model.py
│ ├── vikp
│ │ ├── doclaynet_bench
│ │ ├── order_bench
│ │ ├── pdf_postprocessor_t5
│ │ ├── publaynet_bench
│ │ ├── rec_bench
│ │ ├── surya_det2
│ │ ├── surya_det_math
│ │ ├── surya_layout2
│ │ ├── surya_order
│ │ ├── surya_rec
│ │ └── texify
└── config.pbtxt
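The 1 directory is the model version. It holds the backend logic model.py and the vikp directory with the required OCR models. config.pbtxt is the model's serving configuration, which defines the inputs and outputs, the device resources, and so on. Its content is as follows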
[root@zx-61 marker-pdf]# cat config.pbtxt
name: "marker-pdf"
backend: "python"
max_batch_size: 0
input [
  {
    name: "text"
    dims: [ -1 ]
    data_type: TYPE_STRING
  },
  {
    name: "max_pages"
    dims: [ 1 ]
    data_type: TYPE_INT64
  }
]
output [
  {
    name: "output"
    dims: [ -1 ]
    data_type: TYPE_STRING
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
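The input parameters are text and max_pages. text is the PDF's binary content encoded as a base64 string, and max_pages is the maximum number of pages marker converts (for example, max_pages set to 5 converts the first 5 pages of the PDF). The output field is output, which returns the Markdown string directly, without images or other information.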
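The base64 step is what lets binary PDF content travel inside a JSON request as an ordinary string. A minimal round-trip sketch follows. The function names `encode_pdf` and `decode_pdf` are illustrative only, and the sample bytes are a stand-in rather than a real document.

```python
import base64

def encode_pdf(pdf_bytes: bytes) -> str:
    """Client side: raw PDF bytes -> ASCII base64 string for the "text" input."""
    return base64.b64encode(pdf_bytes).decode("utf-8")

def decode_pdf(text: str) -> bytes:
    """Server side: base64 string -> the original PDF bytes."""
    return base64.b64decode(text)

sample = b"%PDF-1.7\n% stand-in bytes, not a real document\n"
encoded = encode_pdf(sample)

# The round trip must return exactly the bytes that went in.
assert decode_pdf(encoded) == sample
print(len(sample), "bytes ->", len(encoded), "base64 characters")
```

Base64 output is roughly a third larger than its input, which is worth keeping in mind when sending large PDFs.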
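This essentially turns the marker_single command into a service. marker_single calls convert_single_pdf in marker.convert. With a slight modification, the input becomes the base64-encoded string of the PDF's binary content, and only the Markdown content is kept from the output. The content of model.py is as follows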
import os

# Do not split cached CUDA memory blocks larger than 32 MB (limits fragmentation)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
# Set the working (cache) directory
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
# os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2'

import gc
import json
import base64

import torch
import numpy as np
from marker.convert import convert_single_pdf
from marker.logger import configure_logging
from marker.models import load_all_models
import triton_python_backend_utils as pb_utils

gc.collect()


class TritonPythonModel:
    def initialize(self, args):
        # Choose the device from the instance_group settings in config.pbtxt
        device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        device_id = args["model_instance_device_id"]
        self.device = f"{device}:{device_id}"
        # output config
        self.model_config = json.loads(args['model_config'])
        output_config = pb_utils.get_output_config_by_name(self.model_config, "output")
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])
        # load model
        configure_logging()
        self.model_lst = load_all_models(device=self.device, dtype=torch.float16)

    def execute(self, requests):
        responses = []
        for request in requests:
            # "text" carries the base64-encoded PDF as a string tensor
            text = pb_utils.get_input_tensor_by_name(request, "text").as_numpy().astype("S")
            text = np.char.decode(text, "utf-8").tolist()[0]
            max_pages = pb_utils.get_input_tensor_by_name(request, "max_pages").as_numpy()[0]
            # Decode back to the raw PDF bytes
            fname = base64.b64decode(text)
            full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
                                                             langs=['Chinese'], batch_multiplier=2)
            # Return only the Markdown string
            response = np.char.encode(np.array(full_text))
            response_output_tensor = pb_utils.Tensor("output", response.astype(self.output_response_dtype))
            final_inference_response = pb_utils.InferenceResponse(output_tensors=[response_output_tensor])
            responses.append(final_inference_response)
        return responses

    def finalize(self):
        print('Cleaning up...')
The core inference call is
full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
langs=['Chinese'], batch_multiplier=2)
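Here full_text is the Markdown result.
Next, start the Triton service. Note the -w option, which sets the working directory inside the container to the directory that contains vikp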
docker run --rm --gpus=all --shm-size=1g -p18999:8000 -p18998:8001 -p18997:8002 \
-e PYTHONIOENCODING=utf-8 -w /models/marker-pdf/1 \
-v /home/model_repository/:/models \
tritonserver:marker-pdf-env \
tritonserver --model-repository=/models --model-control-mode explicit --load-model marker-pdf --log-format ISO8601
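Model loading can take a while, so it may help to poll the server's readiness before sending inference requests. The sketch below uses the standard v2 HTTP health endpoints (`/v2/health/ready` and `/v2/models/<name>/ready`). The helper names `ready_urls` and `is_ready` are hypothetical, and the host and port are simply the ones used elsewhere in this article.

```python
import urllib.request

def ready_urls(base_url, model="marker-pdf"):
    """Return the server-level and model-level readiness URLs."""
    base = base_url.rstrip("/")
    return [f"{base}/v2/health/ready", f"{base}/v2/models/{model}/ready"]

def is_ready(base_url, model="marker-pdf", timeout=5):
    """True only if every readiness endpoint answers with HTTP 200."""
    try:
        for url in ready_urls(base_url, model):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.status != 200:
                    return False
        return True
    except OSError:  # connection refused, timeout, HTTP error status, ...
        return False

if __name__ == "__main__":
    for url in ready_urls("http://10.2.13.31:18999"):
        print(url)
    # print(is_ready("http://10.2.13.31:18999"))  # needs a running server
```

Calling `is_ready` in a short retry loop is usually enough to wait out the model load.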
Once the server has started, test the service with a Python request
import base64
import json
import time

import requests

# Read the PDF and encode its bytes as a base64 string
with open("/home/桌面/论文/1706.03762.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("utf-8")

url = "http://10.2.13.31:18999/v2/models/marker-pdf/infer"
raw_data = {
    "inputs": [{"name": "text", "datatype": "BYTES", "shape": [1], "data": [data]},
               {"name": "max_pages", "datatype": "INT64", "shape": [1], "data": [1]}],
    "outputs": [{"name": "output", "shape": [1]}]
}

t1 = time.time()
res = requests.post(url, data=json.dumps(raw_data, ensure_ascii=True),
                    headers={"Content-Type": "application/json"}, timeout=2000)
t2 = time.time()
print(t2 - t1)  # elapsed time in seconds
print(res.json()["outputs"][0]["data"][0])
The client first reads the PDF file and converts its binary content to a base64 string. The request prints the following result
# Attention Is All You Need
| Ashish Vaswani∗ Google Brain | |
|--------------------------------|-------------------------------------------------|
| avaswani@google.com | Noam Shazeer∗ Google Brain |
| noam@google.com | Niki Parmar∗ |
| Google Research | |
| nikip@google.com | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
## 1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
| Llion Jones∗ |
|------------------|
| Google Research |
| llion@google.com |
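When a request fails, the response body does not contain an `outputs` field, so indexing into it directly raises an unhelpful exception. The sketch below shows a more defensive way to read the response. It assumes failures are reported as a JSON object with an `error` field, as in the v2 inference protocol, and the helper name `extract_markdown` is made up for this example.

```python
def extract_markdown(payload: dict, name: str = "output") -> str:
    """Pull the Markdown string out of a v2 inference response body."""
    if "error" in payload:
        raise RuntimeError(f"inference failed: {payload['error']}")
    for out in payload.get("outputs", []):
        if out.get("name") == name:
            return out["data"][0]
    raise KeyError(f"no output named {name!r} in response")

# A hand-written response body in the same shape the client above reads.
ok = {"outputs": [{"name": "output", "datatype": "BYTES", "shape": [1], "data": ["# Title"]}]}
print(extract_markdown(ok))  # # Title
```

In the client above, this would replace the direct `res.json()["outputs"][0]["data"][0]` lookup.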
End of article.