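In knowledge-base scenarios, PDF documents usually have to be parsed before RAG can retrieve knowledge from them. This article introduces marker, an open-source tool that converts PDF to Markdown, and shows how to turn it into a service with Triton Inference Server.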
- A brief look at PDF parsing in knowledge-base scenarios
- Introducing and installing Marker
- Marker quick start
- Serving Marker with Triton
Take the paper "Attention Is All You Need" as an example; marker converts the first page of the PDF into the following Markdown:
# Attention Is All You Need
| Ashish Vaswani∗ Google Brain | |
| avaswani@google.com | Noam Shazeer∗ Google Brain |
| noam@google.com | Niki Parmar∗ |
| Google Research | |
| nikip@google.com | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
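- Text extraction, with OCR
- Page layout and reading-order detection
- Cleaning and formatting each block
- Merging the blocks and post-processing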
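The four stages above can be pictured as a small pipeline. The sketch below is a toy illustration only, not marker's actual code: the `Block` type, the top-to-bottom sort standing in for reading-order detection, and the cleaning rules are all hypothetical substitutes for the models and heuristics that marker really uses.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical text block: its kind, position on the page, and raw text."""
    kind: str      # e.g. "title" or "text"
    top: float
    left: float
    text: str


def order_blocks(blocks):
    # Stand-in for reading-order detection: top-to-bottom, then left-to-right.
    return sorted(blocks, key=lambda b: (b.top, b.left))


def clean_block(block):
    # Stand-in for per-block cleaning: collapse whitespace, then format by kind.
    text = " ".join(block.text.split())
    if not text:
        return ""
    return f"# {text}" if block.kind == "title" else text


def to_markdown(blocks):
    # Merge the cleaned blocks in reading order and drop the empty ones.
    parts = [clean_block(b) for b in order_blocks(blocks)]
    return "\n\n".join(p for p in parts if p)


page = [
    Block("text", top=120.0, left=50.0, text="The dominant  sequence\ntransduction models"),
    Block("title", top=40.0, left=50.0, text="Attention Is All You Need"),
]
print(to_markdown(page))
```

Keeping each stage a separate function mirrors the list: any one of them could be swapped for a model-based version without touching the others.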
pip install marker-pdf
$ which marker_single
[root@xxxx vikp]# ls -lt
total 0
drwxr-xr-x 2 root root 132 May 14 16:37 surya_order
drwxr-xr-x 2 root root 10 May 14 15:35 order_bench
drwxr-xr-x 2 root root 10 May 14 15:32 publaynet_bench
drwxr-xr-x 2 root root 10 May 14 15:32 rec_bench
drwxr-xr-x 2 root root 229 May 14 15:31 surya_rec
drwxr-xr-x 2 root root 10 May 14 15:28 doclaynet_bench
drwxr-xr-x 2 root root 98 May 14 15:27 surya_det_math
drwxr-xr-x 2 root root 98 May 14 15:26 surya_det2
drwxr-xr-x 2 root root 159 May 14 15:20 pdf_postprocessor_t5
drwxr-xr-x 2 root root 98 May 14 15:18 surya_layout2
drwxr-xr-x 2 root root 319 May 14 15:17 texify
root@1fc83e178b80:/home/marker-pdf/1# marker_single 606addeff4c0070ce300ff0adc88eceb.pdf ./ --batch_multiplier 2 --max_pages 1 --langs Chinese
Loading detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loading detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loading reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.22s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.16it/s]
Finding reading order: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.44s/it]
Saved markdown to the ./606addeff4c0070ce300ff0adc88eceb folder
## 创业板投资风险提示
本次股票发行后拟在创业板市场上市,该市场具有较高的投资风险。创业板公司 具有创新投入大、新旧产业融合成功与否存在不确定性、尚处于成长期、经营风险高、 业绩不稳定、退市风险高等特点,投资者面临较大的市场风险。投资者应充分了解创 业板市场的投资风险及本公司所披露的风险因素,审慎作出投资决定。
Hubei Biocause Heilen Pharmaceutical Co., Ltd.
(荆门高新区·掇刀区杨湾路 122 号)
首次公开发行股票并在创业板上市 招股说明书 保荐人(主承销商)
(中国(上海)自由贸易试验区商城路 618 号)
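When the conversion has to be launched from a script rather than typed by hand, the same command can be assembled in Python. This is a minimal sketch: `build_marker_cmd` is a hypothetical helper, it only uses the flags that appear in the command above, and it assumes that several languages are passed to `--langs` as a comma-separated list.

```python
import shlex


def build_marker_cmd(pdf_path, out_dir, batch_multiplier=2, max_pages=None, langs=None):
    """Assemble the argument list for marker_single (flags as used above)."""
    cmd = ["marker_single", pdf_path, out_dir, "--batch_multiplier", str(batch_multiplier)]
    if max_pages is not None:
        cmd += ["--max_pages", str(max_pages)]
    if langs:
        # Assumption: multiple languages are joined with commas.
        cmd += ["--langs", ",".join(langs)]
    return cmd


cmd = build_marker_cmd("606addeff4c0070ce300ff0adc88eceb.pdf", "./", max_pages=1, langs=["Chinese"])
print(shlex.join(cmd))
# To actually run the conversion (requires marker-pdf to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.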
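The command line cannot be called across machines or from other languages, so marker needs to be turned into a service. Marker is essentially a pipeline of torch models, which makes it a good fit for deployment with Triton Inference Server; the end result is an HTTP API that clients can call.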
[root@zx-61 marker-pdf]# tree
├── 1
│ ├── model.py
│ ├── vikp
│ │ ├── doclaynet_bench
│ │ ├── order_bench
│ │ ├── pdf_postprocessor_t5
│ │ ├── publaynet_bench
│ │ ├── rec_bench
│ │ ├── surya_det2
│ │ ├── surya_det_math
│ │ ├── surya_layout2
│ │ ├── surya_order
│ │ ├── surya_rec
│ │ └── texify
└── config.pbtxt
[root@zx-61 marker-pdf]# cat config.pbtxt
name: "marker-pdf"
backend: "python"
max_batch_size: 0
input [
  {
    name: "text"
    dims: [ -1 ]
    data_type: TYPE_STRING
  },
  {
    name: "max_pages"
    dims: [ 1 ]
    data_type: TYPE_INT64
  }
]
output [
  {
    name: "output"
    dims: [ -1 ]
    data_type: TYPE_STRING
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
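The input and output names in this config are exactly what a client has to use in its request. As a rough sketch of that mapping, the hypothetical helper `build_infer_payload` below builds a JSON body in the KServe v2 style that Triton's HTTP endpoint accepts; note that a `TYPE_STRING` tensor is sent with datatype `"BYTES"`. The sample bytes are made up for illustration.

```python
import base64
import json


def build_infer_payload(pdf_bytes, max_pages):
    """Build a JSON request body whose tensors match the config's inputs."""
    encoded = base64.b64encode(pdf_bytes).decode("utf-8")
    return {
        "inputs": [
            # TYPE_STRING in config.pbtxt is sent as datatype "BYTES" over HTTP.
            {"name": "text", "datatype": "BYTES", "shape": [1], "data": [encoded]},
            # TYPE_INT64 with dims [ 1 ] is a single integer.
            {"name": "max_pages", "datatype": "INT64", "shape": [1], "data": [int(max_pages)]},
        ],
        "outputs": [{"name": "output"}],
    }


payload = build_infer_payload(b"%PDF-1.4 demo bytes", max_pages=1)
body = json.dumps(payload)  # this string is what gets POSTed to the server
```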
import os

# Keep the CUDA caching allocator from splitting free blocks larger than 32 MB (reduces fragmentation)
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
# Point the Hugging Face caches at a work directory next to this file
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
# os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2'

import gc
import json
import base64

import torch
import numpy as np
from marker.convert import convert_single_pdf
from marker.logger import configure_logging
from marker.models import load_all_models
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        device_id = args["model_instance_device_id"]
        self.device = f"{device}:{device_id}"
        # output config
        self.model_config = json.loads(args['model_config'])
        output_config = pb_utils.get_output_config_by_name(self.model_config, "output")
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])
        # load all marker models once, when the instance starts
        self.model_lst = load_all_models(device=self.device, dtype=torch.float16)

    def execute(self, requests):
        responses = []
        for request in requests:
            # The "text" input carries the PDF file as a base64-encoded string
            text = pb_utils.get_input_tensor_by_name(request, "text").as_numpy().astype("S")
            text = np.char.decode(text, "utf-8").tolist()[0]
            max_pages = int(pb_utils.get_input_tensor_by_name(request, "max_pages").as_numpy()[0])
            # Decode to raw PDF bytes and run the marker pipeline
            fname = base64.b64decode(text)
            full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
                                                             langs=['Chinese'], batch_multiplier=2)
            # Wrap the Markdown in a one-element string tensor
            response = np.char.encode(np.array([full_text]), "utf-8")
            response_output_tensor = pb_utils.Tensor("output", response.astype(self.output_response_dtype))
            final_inference_response = pb_utils.InferenceResponse(output_tensors=[response_output_tensor])
            # One response per request, in the same order
            responses.append(final_inference_response)
        return responses

    def finalize(self):
        print('Cleaning up...')
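The model receives the PDF as a base64 string and decodes it before conversion. A service like this may want to reject bad input before it reaches the GPU; the sketch below shows one way to do that with the standard library only. `decode_pdf_payload` is a hypothetical helper that is not part of marker or Triton, and the `%PDF-` prefix check is a simple heuristic rather than full validation.

```python
import base64
import binascii


def decode_pdf_payload(text):
    """Decode the base64 "text" input into raw PDF bytes, with basic checks."""
    try:
        # validate=True rejects characters outside the base64 alphabet.
        pdf_bytes = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("input is not valid base64") from exc
    # Heuristic: PDF files normally begin with the "%PDF-" header.
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("decoded payload does not look like a PDF")
    return pdf_bytes


sample = b"%PDF-1.7\nminimal demo bytes"
roundtrip = decode_pdf_payload(base64.b64encode(sample).decode("utf-8"))
```

Inside `execute`, a failed check could be reported to the caller through an error response instead of letting the conversion raise.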
docker run --rm --gpus=all --shm-size=1g -p18999:8000 -p18998:8001 -p18997:8002 \
-e PYTHONIOENCODING=utf-8 -w /models/marker-pdf/1 \
-v /home/model_repository/:/models \
tritonserver:marker-pdf-env \
tritonserver --model-repository=/models --model-control-mode explicit --load-model marker-pdf --log-format ISO8601
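The `docker run` command above maps host port 18999 to Triton's default HTTP port 8000 (8001 is gRPC and 8002 is metrics). As a small sketch, the hypothetical helper `triton_urls` builds the standard KServe v2 HTTP endpoints for the model; the host name `localhost` is an assumption and should be replaced with the machine that runs the container.

```python
def triton_urls(host, http_port, model_name):
    """Return the standard Triton (KServe v2) HTTP endpoints for one model."""
    base = f"http://{host}:{http_port}"
    return {
        "server_ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model_name}/ready",
        "infer": f"{base}/v2/models/{model_name}/infer",
    }


# "localhost" is an assumed host; 18999 and "marker-pdf" come from the command above.
urls = triton_urls("localhost", 18999, "marker-pdf")
for name, url in urls.items():
    print(name, url)
```

Polling the `model_ready` endpoint until it returns HTTP 200 is a simple way to wait for the models to finish loading before sending the first request.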
import time
import base64
import requests
import json

# Read the PDF and base64-encode it for the "text" input
with open("/home/桌面/论文/1706.03762.pdf", "rb") as f:
    data = base64.b64encode(f.read()).decode("utf-8")
url = ""  # inference endpoint of the Triton HTTP service (left blank in the source)
raw_data = {
    "inputs": [{"name": "text", "datatype": "BYTES", "shape": [1], "data": [data]},
               {"name": "max_pages", "datatype": "INT64", "shape": [1], "data": [1]}],
    "outputs": [{"name": "output", "shape": [1]}]
}
t1 = time.time()
res = requests.post(url, json.dumps(raw_data, ensure_ascii=True), headers={"Content-Type": "application/json"})
t2 = time.time()
print(t2 - t1)  # elapsed time of the request, in seconds
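The script above only prints the elapsed time. To get at the converted Markdown, the JSON body of the response has to be unpacked. The sketch below assumes the KServe v2 response layout that Triton returns over HTTP; `extract_markdown` is a hypothetical helper, and the sample response is hand-written for illustration, not real server output.

```python
def extract_markdown(response_json):
    """Pull the Markdown string out of a KServe v2 style inference response."""
    # On failure, Triton returns a JSON object with a single "error" field.
    if "error" in response_json:
        raise RuntimeError(response_json["error"])
    for output in response_json.get("outputs", []):
        if output.get("name") == "output":
            return output["data"][0]
    raise KeyError("no tensor named 'output' in response")


# Hand-written sample response, not real server output.
sample_response = {
    "model_name": "marker-pdf",
    "outputs": [
        {"name": "output", "datatype": "BYTES", "shape": [1],
         "data": ["# Attention Is All You Need"]}
    ],
}
print(extract_markdown(sample_response))
```

With the client script above, the call would be `extract_markdown(res.json())`.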
