Quantizing Qwen-14B to int8

Converting the model to int8.

Quantizing with AutoGPTQ:
https://github.com/QwenLM/Qwen/issues/464
https://github.com/AutoGPTQ/AutoGPTQ/issues/133
After a lot of fiddling, here is the code that finally works:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig


model_path = "Qwen-14B-datayes"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True, trust_remote_code=True)
quantize_config = BaseQuantizeConfig(
    bits=8,  # quantize model to 8-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False significantly speeds up inference, at a small cost in perplexity
)

model = AutoGPTQForCausalLM.from_pretrained(model_path, quantize_config, trust_remote_code=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantized_model_dir = "Qwen-14B-datayes-int8"
model.quantize(examples)  # run GPTQ calibration using the examples above


# save quantized model
model.save_quantized(quantized_model_dir, use_safetensors=True)
tokenizer.save_pretrained(quantized_model_dir)
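
A single example is enough to make the script run, but GPTQ calibration usually benefits from more (and more representative) samples. A minimal sketch, where calibration_texts is a hypothetical list of texts from your own domain:

calibration_texts = [
    "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm.",
    # ... add more domain-relevant texts here
]
examples = [tokenizer(text) for text in calibration_texts]
model.quantize(examples)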

After quantization, the model still has to serve inference:
You need to cp all the .py files from the original Qwen checkpoint into the quantized directory, then iterate on whatever errors show up at load time.
At the moment the quantized model can only be loaded with AutoGPTQForCausalLM.from_quantized.
Also note that model_basename is the filename of the saved quantized weights without the extension; otherwise the file won't be found, which is a nasty pitfall. Both chores can be scripted, as sketched below.
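
A minimal sketch of both steps, reusing model_path and quantized_model_dir from the scripts above (the printed basename is illustrative; check your own output directory):

import glob
import os
import shutil

model_path = "Qwen-14B-datayes"
quantized_model_dir = "Qwen-14B-datayes-int8"

# copy Qwen's custom code files (modeling_qwen.py, tokenization_qwen.py, ...)
# into the quantized checkpoint so that trust_remote_code loading keeps working
for f in glob.glob(os.path.join(model_path, "*.py")):
    shutil.copy(f, quantized_model_dir)

# read model_basename off the saved weights instead of hard-coding it
weights = glob.glob(os.path.join(quantized_model_dir, "*.safetensors"))
model_basename = os.path.splitext(os.path.basename(weights[0]))[0]
print(model_basename)  # e.g. gptq_model-8bit-128g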

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

torch.cuda.set_device(1)  # run on the second GPU

quantized_model_dir="Qwen-14B-datayes-int8"
model_basename="gptq_model-8bit-128g"

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, use_fast=True, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    model_basename=model_basename,
    device_map="auto",
    use_safetensors=True,
    trust_remote_code=True,
)

# prepare the input (the prompt asks for an analysis of CATL's investment value)
input_text = "分析一下宁德时代的投资价值"
input_ids = tokenizer(input_text, return_tensors='pt').input_ids.cuda()
# do_sample=True is needed for temperature to actually take effect
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=256, min_new_tokens=100)
print(tokenizer.decode(output[0]))
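
Once generation works, a quick way to confirm the int8 checkpoint actually saves memory is to look at peak GPU allocation (a rough sanity check, not a benchmark; compare the number against loading the fp16 model the same way):

# peak GPU memory allocated so far on the current device
print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.1f} GiB")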
