AI Model Deployment: Serving PDF-to-Markdown Conversion with Triton + Marker

Keywords: Triton, Marker, Python

Preface

In knowledge-base scenarios, PDF documents often need to be parsed so that knowledge retrieval can be done via RAG. This article introduces marker, an open-source PDF-to-Markdown tool, and shows how to turn it into a service with Triton Inference Server.


Overview

  • A brief look at PDF parsing in knowledge-base scenarios
  • Introduction to Marker and installation
  • Marker quick start
  • Serving with Triton

A Brief Look at PDF Parsing in Knowledge-Base Scenarios

PDF documents usually contain a mix of formats, images, tables, and other elements. Because RAG depends heavily on standardized and accurate data, converting a PDF directly to plain text tends to lose or scramble the organization of its content. A better approach is to convert the PDF to Markdown, which preserves the structural information of the content far better.
Take the paper "Attention Is All You Need" as an example. The original PDF looks like this:

The PDF of "Attention Is All You Need"

Converting it to plain text gives:

Attention Is All You Need
AshishVaswani∗ NoamShazeer∗ NikiParmar∗ JakobUszkoreit∗
GoogleBrain GoogleBrain GoogleResearch GoogleResearch
avaswani@google.com noam@google.com nikip@google.com usz@google.com
7102
LlionJones∗ AidanN.Gomez∗ † ŁukaszKaiser∗
GoogleResearch UniversityofToronto GoogleBrain
llion@google.com aidan@cs.toronto.edu lukaszkaiser@google.com
ceD
IlliaPolosukhin∗ ‡
illia.polosukhin@gmail.com 6
]LC.sc[
Abstract
Thedominantsequencetransductionmodelsarebasedoncomplexrecurrentor
convolutionalneuralnetworksthatincludeanencoderandadecoder. Thebest
performing models also connect the encoder and decoder through an attention
5v26730.6071:viXra

Converting it to Markdown gives:

# Attention Is All You Need
| Ashish Vaswani∗ Google Brain   |                                                 |
|--------------------------------|-------------------------------------------------|
| avaswani@google.com            | Noam Shazeer∗ Google Brain                      |
| noam@google.com                | Niki Parmar∗                                    |
| Google Research                |                                                 |
| nikip@google.com               | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Plain text cannot recover the continuity broken by line wrapping: consecutive lines end up split apart, whereas Markdown parses them back into a complete paragraph. Once the PDF structure gets even slightly complex, plain text merges completely unrelated pieces of text from different positions; the "7102" in the example is the publication date printed vertically on the paper's left margin, actually 2017. Finally, Markdown, unlike plain text, recognizes hierarchical structure such as tables and headings. Overall, the Markdown conversion is of much higher quality.


Introduction to marker and Installation

marker is an open-source project on GitHub implemented in Python. It converts PDF to Markdown through a pipeline that combines multiple models, covering:

  • OCR text extraction
  • Page layout and reading-order detection
  • Per-block cleaning and formatting
  • Merging model outputs and postprocessing

marker can be installed with pip:

pip install marker-pdf

After installation, the conversion tool marker_single is available on the PATH:

$ which marker_single 
/home/anaconda3/envs/my_env/bin/marker_single

Additionally, marker works by invoking a number of models to recognize the PDF and run inference, so some model files need to be downloaded. By default, marker downloads them on first use; here we download all the required models from HuggingFace ahead of time and place them in a vikp directory. The required model files are:

[root@xxxx vikp]# ls -lt
total 0
drwxr-xr-x 2 root root 132 May 14 16:37 surya_order
drwxr-xr-x 2 root root  10 May 14 15:35 order_bench
drwxr-xr-x 2 root root  10 May 14 15:32 publaynet_bench
drwxr-xr-x 2 root root  10 May 14 15:32 rec_bench
drwxr-xr-x 2 root root 229 May 14 15:31 surya_rec
drwxr-xr-x 2 root root  10 May 14 15:28 doclaynet_bench
drwxr-xr-x 2 root root  98 May 14 15:27 surya_det_math
drwxr-xr-x 2 root root  98 May 14 15:26 surya_det2
drwxr-xr-x 2 root root 159 May 14 15:20 pdf_postprocessor_t5
drwxr-xr-x 2 root root  98 May 14 15:18 surya_layout2
drwxr-xr-x 2 root root 319 May 14 15:17 texify

Taking surya_order as an example, each of these models can be found on HuggingFace:

Preparing the marker model downloads
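
The repositories can be fetched ahead of time with huggingface_hub. A minimal sketch, assuming each directory above corresponds to a repo under the vikp namespace on HuggingFace (as the surya_order example suggests):

from huggingface_hub import snapshot_download

# Model directories observed above; each is assumed to be a repo under vikp/.
models = [
    "surya_order", "order_bench", "publaynet_bench", "rec_bench",
    "surya_rec", "doclaynet_bench", "surya_det_math", "surya_det2",
    "pdf_postprocessor_t5", "surya_layout2", "texify",
]

for name in models:
    # Download each snapshot into ./vikp/<name> so marker can find it offline.
    snapshot_download(repo_id=f"vikp/{name}", local_dir=f"vikp/{name}")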

marker Quick Start

Run marker via the marker_single command on the PATH. The input is the path of a single PDF document and the output is a results directory. First switch to the parent directory so that the vikp folder sits at the same level as the current working directory:

root@1fc83e178b80:/home/marker-pdf/1# marker_single 606addeff4c0070ce300ff0adc88eceb.pdf ./ --batch_multiplier 2 --max_pages 1 --langs Chinese
Loading detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loading detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loading reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.22s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.16it/s]
Finding reading order: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it]
Saved markdown to the ./606addeff4c0070ce300ff0adc88eceb folder

The log shows that marker loaded its models onto the GPU and wrote the inference results to the 606addeff4c0070ce300ff0adc88eceb directory, which contains the recognized images, the Markdown file, and a configuration file.

Contents of the output directory

Opening the Markdown file, the recognized content is:

## 创业板投资风险提示

本次股票发行后拟在创业板市场上市,该市场具有较高的投资风险。创业板公司 具有创新投入大、新旧产业融合成功与否存在不确定性、尚处于成长期、经营风险高、 业绩不稳定、退市风险高等特点,投资者面临较大的市场风险。投资者应充分了解创 业板市场的投资风险及本公司所披露的风险因素,审慎作出投资决定。

湖北亨迪药业股份有限公司

![0_image_0.png](0_image_0.png)

![0_image_1.png](0_image_1.png)

Hubei Biocause Heilen Pharmaceutical Co., Ltd.
(荆门高新区·掇刀区杨湾路 122 号)

![0_image_2.png](0_image_2.png)

![0_image_3.png](0_image_3.png)

首次公开发行股票并在创业板上市 招股说明书 保荐人(主承销商)
(中国(上海)自由贸易试验区商城路 618 号)
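
marker_single is only a thin CLI wrapper around marker's Python API, so the same conversion can be run programmatically. A minimal sketch using the same load_all_models / convert_single_pdf calls that the Triton backend below relies on (exact signatures may vary across marker-pdf versions):

import torch
from marker.convert import convert_single_pdf
from marker.models import load_all_models

# Load the whole model pipeline once and reuse it across documents.
model_lst = load_all_models(device="cuda:0", dtype=torch.float16)

# Convert only the first page, mirroring the CLI flags used above.
full_text, images, out_meta = convert_single_pdf(
    "606addeff4c0070ce300ff0adc88eceb.pdf", model_lst,
    max_pages=1, langs=["Chinese"], batch_multiplier=2,
)

with open("output.md", "w", encoding="utf-8") as f:
    f.write(full_text)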

Serving with Triton

The command-line approach does not cover cross-machine or cross-language scenarios, so marker needs to be turned into a service. Since marker is essentially a pipeline of torch models, it is a natural fit for deployment with Triton Inference Server, ultimately exposing an HTTP API for callers.
First pull the Triton base image nvcr.io/nvidia/tritonserver:23.10-py3 with Docker, pip-install the marker-pdf package inside it, then commit the container as a new image, named for example tritonserver:marker-pdf-env.
Next, set up the model directory structure. Triton keeps all models under a model_repository directory; create a marker-pdf directory under model_repository with the following structure:

[root@zx-61 marker-pdf]# tree
.
├── 1
│   ├── model.py
│   ├── vikp
│   │   ├── doclaynet_bench
│   │   ├── order_bench
│   │   ├── pdf_postprocessor_t5
│   │   ├── publaynet_bench
│   │   ├── rec_bench
│   │   ├── surya_det2
│   │   ├── surya_det_math
│   │   ├── surya_layout2
│   │   ├── surya_order
│   │   ├── surya_rec
│   │   └── texify
└── config.pbtxt

The 1 directory denotes the model version. It holds the backend logic model.py and the required OCR model directory vikp. config.pbtxt is the model service configuration file, defining the inputs, outputs, and device resource settings:

[root@zx-61 marker-pdf]# cat config.pbtxt
name: "marker-pdf"
backend: "python"

max_batch_size: 0
input [
    {
        name: "text"
        dims: [ -1 ]
        data_type: TYPE_STRING
    },
    {
        name: "max_pages"
        dims: [ 1 ]
        data_type: TYPE_INT64
    }
]
output [
    {
        name: "output"
        dims: [ -1 ]
        data_type: TYPE_STRING
    }
]

instance_group [
{
  count: 1
  kind: KIND_GPU
  gpus: [ 0 ]
}
]

The input parameters are text and max_pages: text is the PDF binary encoded as a base64 string, and max_pages is the maximum number of pages marker converts; for example, max_pages=5 converts the first 5 pages of the PDF. The output field is output, which returns the Markdown string directly, with no images or other artifacts.
This is essentially marker_single turned into a service. marker_single ultimately calls convert_single_pdf from marker.convert, so we adapt it slightly: the input becomes the base64-encoded string of the PDF binary, and only the Markdown content is taken from the output. model.py is as follows:

import os

# Cap the maximum split size of free CUDA memory blocks to limit fragmentation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
# Point the HuggingFace cache directories at a local work/ directory

os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
# os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2'

import gc
import json
import base64

import torch
import numpy as np
from marker.convert import convert_single_pdf
from marker.logger import configure_logging
from marker.models import load_all_models

import triton_python_backend_utils as pb_utils

gc.collect()


class TritonPythonModel:
    def initialize(self, args):
        device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        device_id = args["model_instance_device_id"]
        self.device = f"{device}:{device_id}"
        # output config
        self.model_config = json.loads(args['model_config'])
        output_config = pb_utils.get_output_config_by_name(self.model_config, "output")
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])
        # load model
        configure_logging()
        self.model_lst = load_all_models(device=self.device, dtype=torch.float16)

    def execute(self, requests):
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "text").as_numpy().astype("S")
            text = np.char.decode(text, "utf-8").tolist()[0]
            max_pages = pb_utils.get_input_tensor_by_name(request, "max_pages").as_numpy()[0]
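            # Decode the base64 payload back into the raw PDF bytes for convert_single_pdf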
            fname = base64.b64decode(text)
            full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
                                                             langs=['Chinese'], batch_multiplier=2)
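            # Keep only the Markdown string; images and metadata are discarded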
            response = np.char.encode(np.array(full_text))
            response_output_tensor = pb_utils.Tensor("output", response.astype(self.output_response_dtype))

            final_inference_response = pb_utils.InferenceResponse(output_tensors=[response_output_tensor])
            responses.append(final_inference_response)

        return responses

    def finalize(self):
        print('Cleaning up...')

The core inference step is:

full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
                                                             langs=['Chinese'], batch_multiplier=2)

where full_text is the Markdown result.
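
convert_single_pdf also returns the extracted images and document metadata, which the service above discards. If you need them, a minimal sketch of persisting all three outputs, assuming (as in marker's own CLI output) that images maps filenames to PIL image objects and out_meta is JSON-serializable:

import json

with open("output.md", "w", encoding="utf-8") as f:
    f.write(full_text)
for filename, image in images.items():
    image.save(filename)  # each value is assumed to be a PIL Image
with open("output_meta.json", "w", encoding="utf-8") as f:
    json.dump(out_meta, f, ensure_ascii=False)
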
Next, start the Triton service. Note the -w flag, which sets the working directory inside the container to the directory where vikp resides:

docker run --rm --gpus=all --shm-size=1g -p18999:8000 -p18998:8001 -p18997:8002 \
-e PYTHONIOENCODING=utf-8 -w /models/marker-pdf/1 \
-v /home/model_repository/:/models \
tritonserver:marker-pdf-env \
tritonserver --model-repository=/models --model-control-mode explicit --load-model marker-pdf --log-format ISO8601
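
Once it is up, the standard Triton v2 health endpoints can confirm that the server and the marker-pdf model are ready (host port 18999 maps to the container's HTTP port 8000); a quick check from Python:

import requests

# 200 means the server itself is ready
print(requests.get("http://localhost:18999/v2/health/ready").status_code)
# 200 means the marker-pdf model finished loading
print(requests.get("http://localhost:18999/v2/models/marker-pdf/ready").status_code)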

After startup, test the service with a Python request:

import time
import base64
import requests
import json

data = base64.b64encode(open("/home/桌面/论文/1706.03762.pdf", "rb").read()).decode("utf-8")
url = "http://10.2.13.31:18999/v2/models/marker-pdf/infer"
raw_data = {
    "inputs": [{"name": "text", "datatype": "BYTES", "shape": [1], "data": [data]},
               {"name": "max_pages", "datatype": "INT64", "shape": [1], "data": [1]}],
    "outputs": [{"name": "output", "shape": [1]}]
}
t1 = time.time()
res = requests.post(url, json.dumps(raw_data, ensure_ascii=True), headers={"Content-Type": "application/json"},
                    timeout=2000)
t2 = time.time()
print(t2 - t1)

print(res.json()["outputs"][0]["data"][0])
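
Alternatively, NVIDIA's tritonclient package (pip install tritonclient[http]) wraps the same HTTP protocol; a sketch of the equivalent call, assuming the same server address:

import base64
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="10.2.13.31:18999")

# Build the two inputs declared in config.pbtxt.
data = base64.b64encode(open("/home/桌面/论文/1706.03762.pdf", "rb").read())
text = httpclient.InferInput("text", [1], "BYTES")
text.set_data_from_numpy(np.array([data], dtype=np.object_))
max_pages = httpclient.InferInput("max_pages", [1], "INT64")
max_pages.set_data_from_numpy(np.array([1], dtype=np.int64))

result = client.infer("marker-pdf", inputs=[text, max_pages],
                      outputs=[httpclient.InferRequestedOutput("output")])
print(result.as_numpy("output").flatten()[0].decode("utf-8"))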

The client reads the PDF file and converts it to a base64-encoded string of the binary; the printed response is:

# Attention Is All You Need
| Ashish Vaswani∗ Google Brain   |                                                 |
|--------------------------------|-------------------------------------------------|
| avaswani@google.com            | Noam Shazeer∗ Google Brain                      |
| noam@google.com                | Niki Parmar∗                                    |
| Google Research                |                                                 |
| nikip@google.com               | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
## 1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
| Llion Jones∗     |
|------------------|
| Google Research  |
| llion@google.com |

End of article.

