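Paper and Repository
Core Contributions
Main contribution: experiments on the Alpaca dataset select a sampled subset for training; this saves training resources and time while still achieving better experimental results.
Advantage: scoring uses a small model (IQS), which trains very quickly and is targeted at this task.
Disadvantage: training IQS requires annotated data. Scoring with a larger model such as GPT-4 would need no annotation, but if your SFT data was itself generated by a sufficiently strong LLM (e.g. GPT-4), scoring with that same model may not be very accurate.
Limitation: the cluster and rank steps each introduce a small model (the embedding model and IQS, respectively), so the ceiling of these small models is the ceiling of the whole approach.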
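Main Steps
- Score the SFT dataset with a scoring model (score), then sort the examples by that score (rank)
- Cluster the SFT data into N clusters and select num examples from each cluster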
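Code Walkthrough
I. Scoring and Ranking
Fine-tune with the COMET project on the annotated dataset ([original annotated data source]) to obtain an IQS model (download link)
Run inference with the IQS model to obtain a score for each example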
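- Download link for the original dataset used to train IQS: Expert Revision Dataset. The annotated records have the format shown below. `Distance` is the edit distance between the example before and after revision (Instruction + Input + Response); it is used to drop pairs whose distance is too small. The threshold is roughly 200, which leaves 1211 examples.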
```json
{
  "Raw Instruction": "Select two of the given topics and write a paragraph discussing the similarities between them.",
  "Raw Input": "The Internet of Things, Artificial Intelligence, and Blockchain Technology",
  "Raw Response": "The Internet of Things (IoT) and Artificial Intelligence (AI) both involve the use of machines to collect and process data. Both technologies rely heavily on automation and machine learning in order to operate efficiently and accurately. Similarly, both technologies have the potential to improve existing processes, automate tedious tasks, and create new opportunities. Furthermore, both technologies require a large amount of data to be collected and analyzed to become truly intelligent and efficient.\\n\\nSimilarly, blockchain technology and AI both involve the use of computers to store and process data. This data can be used to create new applications and services, or to improve existing ones. Additionally, both technologies are designed to be secure, decentralized, and resilient, making them suitable for applications across a variety of fields.",
  "Revised Instruction": "Select two of the given topics and write a paragraph discussing the similarities between them.",
  "Revised Input": "The Internet of Things, Artificial Intelligence, and Blockchain Technology",
  "Revised Response": "Here is a paragraph discussing the similarities between Artificial Intelligence and Blockchain Technology:\\n\\nArtificial Intelligence (AI) and Blockchain Technology are two emerging fields that have the potential to transform various industries and sectors. Both of them rely on data and algorithms to perform complex tasks and solve problems. AI uses data to train machines to learn from experience and perform human-like functions, such as speech recognition, image analysis, and natural language processing. Blockchain Technology uses data to create a distributed ledger that records transactions and ensures their security, transparency, and immutability. Both of them can enhance efficiency, accuracy, and innovation in various domains, such as healthcare, finance, education, and governance.",
  "Distance": 640
}
```
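The distance filter described above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes a character-level Levenshtein distance, plain concatenation of the three fields, and an inclusive threshold; `edit_distance`, `concat`, and `filter_by_distance` are hypothetical helper names.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two rolling rows."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def concat(record: dict, prefix: str) -> str:
    """Join Instruction + Input + Response for the 'Raw' or 'Revised' side."""
    fields = ("Instruction", "Input", "Response")
    return "".join(record.get(f"{prefix} {name}", "") for name in fields)


def filter_by_distance(records: list, threshold: int = 200) -> list:
    """Keep records whose raw/revised distance is at least `threshold` (assumed inclusive)."""
    kept = []
    for rec in records:
        dist = edit_distance(concat(rec, "Raw"), concat(rec, "Revised"))
        if dist >= threshold:
            kept.append({**rec, "Distance": dist})
    return kept
```

The pure-Python distance is quadratic in the text length, so for the full dataset a compiled edit-distance library would likely be preferable.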
- The dataset used for COMET fine-tuning has the following format (data path: Ranking/data/expert-revised-comet):
{ "src": "Instruction: Given natural language sentences, identify the relations between entities. Input: Sharad and Sayansh are brothers. ", "pos": "Response: The relation between Sharad and Sayansh is brothers.", "neg": "Response: Sharad and Sayansh: Relation - Brothers" }
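- The dataset used for IQS fine-tuning has the following format, with 0/1 labels (0 for the example before revision, 1 for the example after revision); data path: Ranking/data/expert-revised: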
{"src": "Instruction: List four countries in Africa Response: Egypt, South Africa, Nigeria, Morocco.", "score": 0.0}
{"src": "Instruction: List four countries in Africa Response: Here are four countries in Africa: Egypt, South Africa, Nigeria, Morocco.", "score": 1.0}
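As a rough sketch of how one expert-revision record could be turned into the two scored rows above: the field names come from the annotated format, while `format_src` and `to_iqs_rows` are hypothetical helpers, and the rule that an empty Input is simply omitted is an assumption inferred from the samples.

```python
def format_src(instruction: str, inp: str, response: str) -> str:
    """Render one example as a flat 'src' string."""
    parts = [f"Instruction: {instruction}"]
    if inp:  # assumption: the Input segment is dropped when empty
        parts.append(f"Input: {inp}")
    parts.append(f"Response: {response}")
    return " ".join(parts)


def to_iqs_rows(record: dict) -> list:
    """Return [raw row scored 0.0, revised row scored 1.0] for one record."""
    raw = format_src(record["Raw Instruction"],
                     record.get("Raw Input", ""),
                     record["Raw Response"])
    revised = format_src(record["Revised Instruction"],
                         record.get("Revised Input", ""),
                         record["Revised Response"])
    return [{"src": raw, "score": 0.0}, {"src": revised, "score": 1.0}]
```

Writing each returned dict with `json.dumps` on its own line would give a JSON Lines file in the same shape as the sample.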
- To train the IQS model, change into the Ranking directory and run the following in a terminal:
python -m comet.cli.train --cfg configs/models/instruction_score.yaml --early_stopping configs/early_stopping.yaml --model_checkpoint configs/model_checkpoint.yaml --trainer configs/trainer.yaml --instruction_metric configs/models/instruction_score.yaml
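Note: you may sometimes hit an error such as “No action for key "amp_backend" to check its value”. This happens because the arguments accepted by Trainer changed between pytorch_lightning versions; installing a pytorch_lightning version that still accepts the key, or removing the unsupported key from the trainer config, is the likely fix.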
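II. Clustering
- Vectorize the data with an embedding model
- Reduce the dimensionality of the sentence vectors with PCA
- Cluster the reduced vectors; for Alpaca this gives 161 clusters (sqrt(52k/2))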
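The cluster count quoted above follows the rule of thumb k = sqrt(n/2). A small sketch of that arithmetic is below; `num_clusters` is a hypothetical helper, and rounding down is an assumption (rounding to nearest gives the same value for Alpaca).

```python
import math


def num_clusters(n_samples: int) -> int:
    """Rule-of-thumb cluster count k = sqrt(n / 2), rounded down, at least 1."""
    return max(1, int(math.sqrt(n_samples / 2)))


# Alpaca has roughly 52k examples: sqrt(26000) is about 161.2
print(num_clusters(52_000))  # 161
```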
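III. Data Selection
- First, take the top 1000 examples by score after ranking
- Then, using the clustering result, take the n highest-scoring examples from each cluster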
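The two selection rules above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's implementation: it assumes the two picks are merged as a union with duplicates removed, and `select_subset`, `top_k`, and `per_cluster` are hypothetical names.

```python
from collections import defaultdict


def select_subset(scores, labels, top_k=1000, per_cluster=1):
    """Return sorted indices chosen by global rank plus per-cluster rank."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")

    # Indices sorted by score, best first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Step 1: global top-k by score.
    selected = set(order[:top_k])

    # Step 2: walk in score order and fill each cluster's quota.
    taken = defaultdict(int)
    for i in order:
        if taken[labels[i]] < per_cluster:
            taken[labels[i]] += 1
            selected.add(i)  # set membership removes overlap with step 1

    return sorted(selected)
```

For example, with `scores=[0.9, 0.8, 0.1, 0.7, 0.2]`, `labels=[0, 0, 1, 0, 1]`, `top_k=2`, and `per_cluster=1`, the global step picks indices 0 and 1, and the per-cluster step adds index 4 as the best example of cluster 1.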