Eino Components: Loader & Parser

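Eino's Loader component is the entry point of the data pipeline. It takes raw content from various sources (local files, URLs, S3 object storage, ...) and converts it into a uniform []*schema.Document list that downstream Transformer / Embedding / Indexer components can process.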

type Loader interface {
    Load(ctx context.Context, src Source, opts ...LoaderOption) ([]*schema.Document, error)
}

type Source struct {
    URI string // local path / http(s) URL / s3://bucket/key ...
}

Input: a URI (local path / http(s) URL / s3://bucket/key ...)
Output: standard *schema.Document values (Content + MetaData)
Built-in callback hooks let you observe loading progress and errors

Three official Loader implementations

Loader	Package path	Features	Typical config fields
File Loader	`eino-ext/components/document/loader/file`	Local files / recursive directories / allow-list and deny-list filtering	`Path`, `Recursive`, `IncludePatterns`, `MaxFileSize`
Web Loader	`eino-ext/components/document/loader/web`	Fetches web pages, extracts the main content automatically, supports a custom HTTP client	`Client`, `RequestBuilder`, `Parser`
S3 Loader	`eino-ext/components/document/loader/s3`	Pulls objects in bulk from S3-compatible object storage	`Bucket`, `Prefix`, `Region`, `AccessKey`, etc.

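The File Loader loads a local file and converts it into the Document type that a DocumentTransformer can process directly. Loader implementations are built mainly on a Parser, which reads text data from a reader and converts it into Documents. In the interface's own words: Parse parses the given reader and returns a list of documents.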

type FileLoaderConfig struct {
    UseNameAsID bool
    Parser      parser.Parser
}

// FileLoader loads a local file and uses its content directly as the Document's content.
type FileLoader struct {
    FileLoaderConfig
}

// Parser is a document parser, can be used to parse a document from a reader.
type Parser interface {
    Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error)
}

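// TextParser reads all text from a reader and returns it as a single document, e.g.:
//
//  docs, _ := TextParser{}.Parse(ctx, strings.NewReader("hello world"))
//  fmt.Println(docs[0].Content) // "hello world"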
type TextParser struct{}

// Parse reads the text from a reader and returns a single document.
func (dp TextParser) Parse(ctx context.Context, reader io.Reader, opts ...Option) ([]*schema.Document, error) {
    data, err := io.ReadAll(reader)
    if err != nil {
        return nil, err
    }

    opt := GetCommonOptions(&Options{}, opts...)

    meta := make(map[string]any)
    meta[MetaKeySource] = opt.URI

    for k, v := range opt.ExtraMeta {
        meta[k] = v
    }

    doc := &schema.Document{
        Content:  string(data),
        MetaData: meta,
    }

    return []*schema.Document{doc}, nil
}

type ExtParser struct {
    parsers        map[string]Parser
    fallbackParser Parser
}

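// ExtParser chooses which concrete parser to use based on the file extension.
// You can register your own parsers by calling the RegisterParser method.
// fallbackParser is the parser used when no other parser is found.
// It defaults to TextParser if not set.

// Parse looks up the matching parser via filepath.Ext(uri), so callers must:
// ① pass the URI with parser.WithURI on every request
// ② use a URI from which filepath.Ext yields the expected extension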
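The second requirement is easy to violate with web URLs, because filepath.Ext simply returns everything from the last dot of the final path element. The sketch below reproduces only the lookup logic described above; pickParser and the registry of parser names are hypothetical stand-ins for ExtParser's internal map, not its real code.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// pickParser (hypothetical) returns the name of the parser registered for
// the URI's extension, or the fallback name when nothing matches.
func pickParser(registry map[string]string, fallback, uri string) string {
	if name, ok := registry[filepath.Ext(uri)]; ok {
		return name
	}
	return fallback
}

func main() {
	registry := map[string]string{".md": "markdown", ".pdf": "pdf"}
	for _, uri := range []string{
		"./testdata/test.md",            // ext is ".md"
		"https://example.com/a.pdf",     // ext is ".pdf"
		"https://example.com/a.pdf?v=2", // ext is ".pdf?v=2": the query string breaks the match
		"./README",                      // ext is "": no extension at all
	} {
		fmt.Printf("%-32s ext=%-10q parser=%s\n", uri, filepath.Ext(uri), pickParser(registry, "text", uri))
	}
}
```

The last two URIs silently fall through to the fallback parser, so it is worth stripping query strings (or passing a cleaned path to parser.WithURI) before parsing.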

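The open-source Eino core repository only ships implementations of TextParser and MarkdownHeaderSplitter (which splits by heading). Parsers for other formats such as PDF, DOC, and HTML live in the eino-ext project. PDF parsing can be built on unipdf / unidoc, but those libraries require a paid license; you can instead write your own parser on top of a free open-source library and plug it into Eino seamlessly.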

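Library	License	Feature coverage	Status
rsc.io/pdf	BSD-3	Read-only: text, metadata, page size; zero dependencies	Lightly maintained, minimal feature set
pdfcpu	Apache-2.0	Read, write, merge, split, forms, watermarks, compression	Actively maintained; its `pdfcpu extract` command supports the modes image, font, content, page, and meta (raw content streams rather than clean plain text)
go-fitz (MuPDF CGo)	AGPL-3.0	Text / image extraction and rendering; the most complete feature set	Requires CGo, which makes cross-compilation somewhat awkward
gopdf	MIT	Writes PDFs only (report generation)	Not suitable for parsing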
Complete example (based on the default TextParser):

package main

import (
    "context"
    "log"
    "os"

    "github.com/cloudwego/eino/components/document/parser"
)

func main() {
    ctx := context.Background()
    // An empty config registers no parsers, so ExtParser falls back to TextParser.
    conf := &parser.ExtParserConfig{}

    p, err := parser.NewExtParser(ctx, conf)
    if err != nil {
        panic(err)
    }

    f, err := os.Open("./app/eino/component_loader/testdata/test.md")
    if err != nil {
        panic(err)
    }
    defer f.Close()

    docs, err := p.Parse(ctx, f, parser.WithURI("./app/eino/component_loader/testdata/test.md"))
    if err != nil {
        panic(err)
    }
    for idx, doc := range docs {
        log.Printf("doc segment %v: %v", idx, doc.Content)
    }
}
