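Model format conversion, quantization, and inference with llama.cpp

1. Introduction to llama.cpp

llama.cpp is an open-source project for running large language models locally, with a strong focus on quantized models on CPUs. It provides a simple and efficient way to convert a trained model into a quantized, lower-resource version that can run inference on a CPU.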

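1.1 How it works

At the core of llama.cpp is an optimized inference engine for quantized models, which runs inference efficiently on the CPU. It relies on optimizations such as low-bit integer (fixed-point) arithmetic in place of floating point, batch processing, and cache-friendly memory access to increase inference speed and reduce power consumption.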
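To make the idea of replacing floating-point weights with low-bit integers concrete, here is a minimal sketch of block-wise 8-bit quantization. It is an illustration only, not llama.cpp's actual kernels: the block length of 32 and the function names are assumptions chosen for this example.

```python
# Illustrative block-wise 8-bit quantization (simplified sketch, not llama.cpp code).
BLOCK_SIZE = 32  # assumed block length for this illustration


def quantize_block(values):
    """Map one block of floats to int8 codes plus a single float scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127.0 if amax > 0 else 1.0
    # Clamp to the symmetric int8 range after rounding.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the int8 codes."""
    return [scale * c for c in codes]


def quantize(values):
    """Quantize a flat list of floats block by block."""
    return [quantize_block(values[i:i + BLOCK_SIZE])
            for i in range(0, len(values), BLOCK_SIZE)]


if __name__ == "__main__":
    weights = [((i * 37) % 101 - 50) / 50.0 for i in range(64)]
    blocks = quantize(weights)
    restored = [x for scale, codes in blocks
                for x in dequantize_block(scale, codes)]
    max_err = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"blocks={len(blocks)} max abs error={max_err:.5f}")
```

Each block stores one scale and one small integer per weight, which is where the storage saving comes from. The rounding error per weight is bounded by half the block's scale.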

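1.2 Advantages

High performance: llama.cpp is optimized for CPUs and delivers efficient inference while keeping the accuracy loss small.
Low resource usage: thanks to quantization, llama.cpp significantly reduces the storage and compute a model requires.
Easy integration: llama.cpp offers a concise API and interfaces, so developers can integrate it into their own projects.
Cross-platform support: llama.cpp runs on many operating systems and CPU architectures and is highly portable.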

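1.3 Use cases

llama.cpp suits scenarios that need to deploy quantized models, such as smart home devices, IoT devices, and edge computing. In these settings it helps developers achieve real-time inference and energy-efficient computation in resource-constrained environments.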

2. Download and build

2.1 Download

git clone https://github.com/ggerganov/llama.cpp

2.2 Build

cd llama.cpp
make

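Directory contents before make:

[Figure: directory contents before make]

Directory contents after make:

[Figure: directory contents after make]

After make, the directory contains a number of new llama-xx executables, which are used to perform the model operations described below.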

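3. Working with an LLM

This post uses the MiniCPM-2B-sft-bf16 model from ModelBest (面壁) for the experiments. llama.cpp maintains a list of supported models, and the model formats it can convert include PyTorch .bin and Hugging Face .safetensors. Pick a model from the supported list and download it.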

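3.1 Format conversion

Format conversion turns the downloaded model into the GGUF format. The convert_hf_to_gguf.py script reads the model configuration, the tokenizer, and the tensor names and data, and converts them into GGUF metadata and tensors, so that inference can run quickly on a CPU without requiring a GPU.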

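GGUF stands for GPT-Generated Unified Format, a model file format defined and published by Georgi Gerganov.
It is designed for fast loading and saving of models, supports a wide range of models, and allows new features to be added while staying compatible.
The GGUF file format is designed specifically for storing models for inference and is particularly suited to language models such as GPT.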
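As a rough illustration of what "metadata and tensors" means at the file level, the sketch below parses the fixed-size header at the start of a GGUF file. It assumes the little-endian GGUF v2/v3 layout (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key-value count); the function names are made up for this example, and the synthetic sample reuses the counts that the llama.cpp logs later in this post report for the MiniCPM file.

```python
import struct

GGUF_MAGIC = b"GGUF"
# Assumed header layout (GGUF v2/v3, little endian):
# magic (4 bytes), version (uint32), tensor count (uint64), metadata KV count (uint64)
HEADER = struct.Struct("<4sIQQ")


def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header from the first bytes of a GGUF file."""
    if len(data) < HEADER.size:
        raise ValueError("not enough bytes for a GGUF header")
    magic, version, n_tensors, n_kv = HEADER.unpack_from(data)
    if magic != GGUF_MAGIC:
        raise ValueError(f"unexpected magic {magic!r}")
    return {"version": version, "n_tensors": n_tensors, "n_kv": n_kv}


def read_gguf_header(path: str) -> dict:
    """Read and parse the header of a GGUF file on disk."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER.size))


if __name__ == "__main__":
    # Synthetic header: version 3, 362 tensors, 30 key-value pairs.
    sample = HEADER.pack(GGUF_MAGIC, 3, 362, 30)
    print(parse_gguf_header(sample))
```

On a real file, `read_gguf_header("path/to/model.gguf")` should return the same counts that llama_model_loader prints when it loads the model.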

Conversion command:

python3 convert_hf_to_gguf.py ./models/MiniCPM-2B-sft-bf16/

INFO:hf-to-gguf:Loading model: MiniCPM-2B-sft-bf16
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'pytorch_model.bin'
INFO:hf-to-gguf:token_embd.weight,           torch.bfloat16 --> F16, shape = {2304, 122753}
INFO:hf-to-gguf:output_norm.weight,          torch.bfloat16 --> F32, shape = {2304}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.bfloat16 --> F32, shape = {2304}
........
INFO:hf-to-gguf:Set meta model
INFO:hf-to-gguf:Set model parameters
INFO:hf-to-gguf:Set model tokenizer
INFO:gguf.vocab:Setting special token type bos to 1
INFO:gguf.vocab:Setting special token type eos to 2
INFO:gguf.vocab:Setting special token type unk to 0
INFO:gguf.vocab:Setting add_bos_token to True
INFO:gguf.vocab:Setting add_eos_token to False
INFO:gguf.vocab:Setting chat_template to {% for message in messages %}{% if message['role'] == 'user' %}{{'<用户>' + message['content'].strip() + '<AI>'}}{% else %}{{message['content'].strip()}}{% endif %}{% endfor %}
INFO:hf-to-gguf:Set model quantization version
INFO:gguf.gguf_writer:Writing the following files:
INFO:gguf.gguf_writer:models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf: n_tensors = 362, total_size = 5.5G
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.45G/5.45G [00:11<00:00, 456Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf

As the log shows, the conversion writes an F16 GGUF file of about 5.45G into the model directory.

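3.2 Quantization

Quantization reduces the hardware resources that inference needs and makes inference faster, at the cost of model accuracy: it trades the precision of the model parameters for inference speed.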

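Use llama-quantize to quantize the model.
Quantized model names follow the pattern Q + number of bits + variant. The fewer the bits, the lower the hardware requirements and the faster the inference, but the lower the model accuracy.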
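The naming pattern can be expressed as a small parser. This is only a sketch of the "Q + bits + variant" convention described above; the helper name and the regular expression are assumptions for illustration, and names that do not follow the pattern are simply rejected.

```python
import re

# "Q", then the bit width, then an optional "_variant" suffix.
_QUANT_RE = re.compile(r"^Q(\d+)(?:_(.+))?$")


def parse_quant_name(name: str):
    """Split a name such as 'Q4_K_M' into (bits, variant)."""
    match = _QUANT_RE.match(name)
    if match is None:
        raise ValueError(f"not a Q<bits>[_variant] name: {name!r}")
    return int(match.group(1)), match.group(2) or ""


if __name__ == "__main__":
    for name in ("Q4_K_M", "Q6_K", "Q8_0"):
        print(name, parse_quant_name(name))
```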

Quantization command:

./llama-quantize ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf Q4_K_M

main: build = 0 (unknown)
main: built with cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 for x86_64-linux-gnu
main: quantizing './models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf' to './models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 30 key-value pairs and 362 tensors from ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-F16.gguf (version GGUF V3 (latest))
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CPM 2B
llama_model_loader: - kv   3:                       general.organization str              = Openbmb
........
llama_tensor_get_type : tensor cols 5760 x 2304 are not divisible by 256, required for q6_K - using fallback quantization q8_0
converting to q8_0 .. size =    25.31 MiB ->    13.45 MiB
[ 354/ 362]              blk.39.attn_norm.weight - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 355/ 362]                 blk.39.attn_q.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 356/ 362]                 blk.39.attn_k.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 357/ 362]                 blk.39.attn_v.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q6_K .. size =    10.12 MiB ->     4.15 MiB
[ 358/ 362]            blk.39.attn_output.weight - [ 2304,  2304,     1,     1], type =    f16, converting to q4_K .. size =    10.12 MiB ->     2.85 MiB
[ 359/ 362]               blk.39.ffn_norm.weight - [ 2304,     1,     1,     1], type =    f32, size =    0.009 MB
[ 360/ 362]               blk.39.ffn_gate.weight - [ 2304,  5760,     1,     1], type =    f16, converting to q4_K .. size =    25.31 MiB ->     7.12 MiB
[ 361/ 362]                 blk.39.ffn_up.weight - [ 2304,  5760,     1,     1], type =    f16, converting to q4_K .. size =    25.31 MiB ->     7.12 MiB
[ 362/ 362]               blk.39.ffn_down.weight - [ 5760,  2304,     1,     1], type =    f16, 

llama_tensor_get_type : tensor cols 5760 x 2304 are not divisible by 256, required for q6_K - using fallback quantization q8_0
converting to q8_0 .. size =    25.31 MiB ->    13.45 MiB
llama_model_quantize_internal: model size  =  5197.65 MB
llama_model_quantize_internal: quant size  =  1716.20 MB
llama_model_quantize_internal: WARNING: 40 of 281 tensor(s) required fallback quantization

main: quantize time = 29242.62 ms
main:    total time = 29242.62 ms
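The fallback warnings in the log above can be checked with simple arithmetic. The sketch below mirrors the rule stated in the log message (row length must be divisible by 256 for q6_K); the q8_0 block layout of 32 weights stored in 34 bytes is an assumption of this example, as are the function names.

```python
SUPER_BLOCK = 256      # divisor named in the log message for q6_K
Q8_0_BLOCK = 32        # assumed q8_0 block length
Q8_0_BLOCK_BYTES = 34  # assumed: 32 int8 values plus one 16-bit scale


def pick_type(row_len, wanted="q6_K", fallback="q8_0"):
    """Return the wanted type if the row length allows it, else the fallback."""
    if row_len % SUPER_BLOCK == 0:
        return wanted
    if row_len % Q8_0_BLOCK == 0:
        return fallback
    raise ValueError(f"row length {row_len} fits neither block size")


def q8_0_size_mib(n_weights):
    """Size in MiB of n_weights stored as q8_0 under the assumed layout."""
    return n_weights // Q8_0_BLOCK * Q8_0_BLOCK_BYTES / 2**20


if __name__ == "__main__":
    print(pick_type(2304))  # attention tensors: 2304 = 9 * 256
    print(pick_type(5760))  # ffn_down rows: 5760 = 22 * 256 + 128
    print(f"{q8_0_size_mib(5760 * 2304):.2f} MiB")
```

Under these assumptions the ffn_down tensor of each of the 40 layers needs the fallback, which is consistent with the "40 of 281 tensor(s) required fallback quantization" warning, and the computed size agrees with the 13.45 MiB reported in the log.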

The quantized GGUF file is CPM-2B-sft-Q4_K_M.gguf, with a size of 1.8G.
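The model size and quant size lines in the log give a quick estimate of how much was saved. The helper below is a back-of-the-envelope sketch with a made-up name; it treats every weight in the source file as 16-bit, which is only approximately true because the conversion log shows some tensors stored as F32.

```python
def compression_stats(orig_mb, quant_mb, orig_bits=16.0):
    """Return (compression ratio, approximate average bits per weight)."""
    ratio = orig_mb / quant_mb
    bits_per_weight = orig_bits * quant_mb / orig_mb
    return ratio, bits_per_weight


if __name__ == "__main__":
    # Values taken from the llama_model_quantize_internal lines above.
    ratio, bpw = compression_stats(5197.65, 1716.20)
    print(f"ratio={ratio:.2f}x, about {bpw:.2f} bits per weight")
```

The average comes out above 4 bits because Q4_K_M keeps some tensors at higher precision (q6_K, and q8_0 where the fallback applies).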

3.3 Inference

Inference command:

./llama-cli -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf -n 128 --prompt "<用户>你知道openmbmb么<AI>"
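The prompt passed with --prompt follows the chat template that the conversion log recorded ("Setting chat_template to ..."). As a sketch, the same template can be written as a plain Python function; the function name is hypothetical, and the logic is a direct transcription of that Jinja template.

```python
def build_prompt(messages):
    """Render messages the way the MiniCPM chat template in the log does."""
    parts = []
    for message in messages:
        content = message["content"].strip()
        if message["role"] == "user":
            # User turns are wrapped in the <用户> ... <AI> markers.
            parts.append("<用户>" + content + "<AI>")
        else:
            # Any other role is emitted as plain text.
            parts.append(content)
    return "".join(parts)


if __name__ == "__main__":
    print(build_prompt([{"role": "user", "content": "你知道openmbmb么"}]))
```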

The inference process and output are shown below:

Log start
main: build = 0 (unknown)
main: built with cc (Ubuntu 11.2.0-19ubuntu1) 11.2.0 for x86_64-linux-gnu
main: seed = 1725847164
llama_model_loader: loaded meta data with 30 key-value pairs and 362 tensors from ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = minicpm
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = CPM 2B
llama_model_loader: - kv   3:                       general.organization str              = Openbmb
......
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
sampling params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
 sampler constr: 
    logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 128, n_keep = 1


 <用户>你知道openmbmb么<AI> OpenMBMB是一个开源的、面向对象的多语言模型框架，可以轻松地实现自然语言处理任务。 [end of text]

llama_perf_print:    sampling time =       2.35 ms /    36 runs   (    0.07 ms per token, 15286.62 tokens per second)
llama_perf_print:        load time =     513.93 ms
llama_perf_print: prompt eval time =     150.72 ms /    12 tokens (   12.56 ms per token,    79.62 tokens per second)
llama_perf_print:        eval time =    1178.25 ms /    23 runs   (   51.23 ms per token,    19.52 tokens per second)
llama_perf_print:       total time =    1334.43 ms /    35 tokens
Log end
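The per-token figures in the llama_perf_print lines follow directly from the elapsed time and the token count. A small sketch (the helper name is an assumption) that reproduces them:

```python
def throughput(total_ms, n_tokens):
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = total_ms / n_tokens
    tokens_per_second = 1000.0 * n_tokens / total_ms
    return ms_per_token, tokens_per_second


if __name__ == "__main__":
    # Numbers copied from the prompt eval and eval lines above.
    for label, ms, n in (("prompt eval", 150.72, 12), ("eval", 1178.25, 23)):
        per_token, per_second = throughput(ms, n)
        print(f"{label}: {per_token:.2f} ms/token, {per_second:.2f} tokens/s")
```

The eval figure is the one that matters for generation speed; prompt evaluation is faster per token because the prompt tokens are processed as a batch.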

The supported parameters can be listed with:

./llama-cli -h

-s,    --seed SEED                      RNG seed (default: -1, use random seed for < 0)
-t,    --threads N                      number of threads to use during generation (default: -1)
                                        (env: LLAMA_ARG_THREADS)
-tb,   --threads-batch N                number of threads to use during batch and prompt processing (default:
                                        same as --threads)
-C,    --cpu-mask M                     CPU affinity mask: arbitrarily long hex. Complements cpu-range
                                        (default: "")
........
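When llama-cli is driven from a script, it can help to build the argument list in one place. The sketch below uses only flags that appear in this post (-m, -n, --prompt, -t, -s); the function name and the default binary path are assumptions for illustration.

```python
import shlex


def llama_cli_argv(model, prompt, n_predict=128, threads=None, seed=None,
                   binary="./llama-cli"):
    """Build an argument list for llama-cli from the flags used in this post."""
    argv = [binary, "-m", model, "-n", str(n_predict), "--prompt", prompt]
    if threads is not None:
        argv += ["-t", str(threads)]  # threads used during generation
    if seed is not None:
        argv += ["-s", str(seed)]     # RNG seed
    return argv


if __name__ == "__main__":
    argv = llama_cli_argv(
        "./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf",
        "<用户>你知道openmbmb么<AI>",
        threads=8,
    )
    print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting problems with the non-ASCII prompt.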

Conversation Mode:

./llama-cli -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf -cnv

.....
.....
== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.


> 你好
 你好！有什么我可以帮助您的吗？

> 你是谁
 作为一个AI语言模型，我没有个人身份或情感。我被设计为帮助回答问题和提供信息。我通过接受来自各种来源的数据来工作，这些数据来自互联网、书籍、论文、数据库和其他资源。我的目标是根据输入提供有用和相关的答案。如果您有任何问题，请随时问我！

> /bye
 再见！

3.4 API service

llama.cpp provides an API that is compatible with the OpenAI API. The llama-server binary produced by make starts the API service:

./llama-server -m ./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf --host 0.0.0.0 --port 1234

INFO [                    init] initializing slots | tid="140241878706112" timestamp=1725859292 n_slots=1
INFO [                    init] new slot | tid="140241878706112" timestamp=1725859292 id_slot=0 n_ctx_slot=4096
INFO [                    main] model loaded | tid="140241878706112" timestamp=1725859292
INFO [                    main] chat template | tid="140241878706112" timestamp=1725859292 chat_example="You are a helpful assistant<用户>Hello<AI>Hi there<用户>How are you?<AI>" built_in=true
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859292
INFO [   launch_slot_with_task] slot is processing task | tid="140241878706112" timestamp=1725859313 id_slot=0 id_task=0
INFO [            update_slots] kv cache rm [p0, end) | tid="140241878706112" timestamp=1725859313 id_slot=0 id_task=0 p0=0
INFO [                 release] slot released | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 n_past=687 truncated=false
INFO [           print_timings] prompt eval time     =      59.60 ms /     2 tokens (   29.80 ms per token,    33.56 tokens per second) | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_prompt_processing=59.6 n_prompt_tokens_processed=2 t_token=29.8 n_tokens_second=33.557046979865774
INFO [           print_timings] generation eval time =   37964.14 ms /   686 runs   (   55.34 ms per token,    18.07 tokens per second) | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_token_generation=37964.139 n_decoded=686 t_token=55.34131049562683 n_tokens_second=18.069684130068115
INFO [           print_timings]           total time =   38023.74 ms | tid="140241878706112" timestamp=1725859351 id_slot=0 id_task=0 t_prompt_processing=59.6 t_token_generation=37964.139 t_total=38023.739
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859351
INFO [      log_server_request] request | tid="140241853523520" timestamp=1725859385 remote_addr="127.0.0.1" remote_port=49130 status=200 method="POST" path="/completion" params={}

You can send a request locally with curl:

curl --request POST \
     --url http://localhost:1234/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "介绍一下MiniCpm"}'
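The same request can be sent from Python with the standard library. This is a minimal sketch that assumes the server started above is listening on localhost:1234; the function names are made up, and the response fields it reads (content, tokens_predicted, stopped_eos, timings.predicted_per_second) are the ones visible in the client output reproduced in this post.

```python
import json
import urllib.request


def build_completion_request(prompt, host="localhost", port=1234):
    """Build a POST request for the /completion endpoint used by the curl call."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{host}:{port}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_completion(raw):
    """Pick a few fields out of the JSON text returned by the server."""
    data = json.loads(raw)
    timings = data.get("timings", {})
    return {
        "content": data["content"],
        "tokens_predicted": data.get("tokens_predicted"),
        "stopped_eos": data.get("stopped_eos"),
        "tokens_per_second": timings.get("predicted_per_second"),
    }


def complete(prompt, **kwargs):
    """Send the prompt to a running llama-server and summarize the reply."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        return summarize_completion(response.read().decode("utf-8"))
```

With the server running, `complete("介绍一下MiniCpm")` returns the generated text together with the token count and generation speed.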

The server prints the following:

INFO [   launch_slot_with_task] slot is processing task | tid="140241878706112" timestamp=1725859435 id_slot=0 id_task=1016
INFO [            update_slots] kv cache rm [p0, end) | tid="140241878706112" timestamp=1725859435 id_slot=0 id_task=1016 p0=0
INFO [                 release] slot released | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 n_past=581 truncated=false
INFO [           print_timings] prompt eval time     =      92.93 ms /     6 tokens (   15.49 ms per token,    64.56 tokens per second) | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_prompt_processing=92.932 n_prompt_tokens_processed=6 t_token=15.488666666666667 n_tokens_second=64.5633366332372
INFO [           print_timings] generation eval time =   31077.38 ms /   576 runs   (   53.95 ms per token,    18.53 tokens per second) | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_token_generation=31077.377 n_decoded=576 t_token=53.95377951388889 n_tokens_second=18.53438274407779
INFO [           print_timings]           total time =   31170.31 ms | tid="140241878706112" timestamp=1725859466 id_slot=0 id_task=1016 t_prompt_processing=92.932 t_token_generation=31077.377 t_total=31170.309
INFO [            update_slots] all slots are idle | tid="140241878706112" timestamp=1725859466

The client prints the following:

{"content":"\nMiniCpm是一种基于深度学习的超参数优化方法，其核心思想是通过学习数据的统计特性，利用贝叶斯优化算法进行超参数的搜索和优化。
在MiniCpm中，超参数通常表示为一个概率分布的函数，即P(参数|数据)。通过学习数据的统计特性，MiniCpm可以找到最优的P(参数|数据)，
从而得到最佳的超参数。\n\n在MiniCpm中，首先需要定义一个贝叶斯优化算法。常见的贝叶斯优化算法有NUTS、SAM、Nelder-Mead等。
在MiniCpm中，我们使用NUTS算法作为贝叶斯优化算法。NUTS算法通过从参数空间中随机选择一些候选参数，计算出它们的期望值，
然后根据期望值计算出一个概率分布P(参数|数据)。接着，根据P(参数|数据)计算得到的新参数集合，再次计算出它们的期望值，
以此类推。重复这个过程，直到得到一个接近最优的P(参数|数据)，从而得到最佳的超参数。
\n\nMiniCpm的步骤如下：\n\n1. 定义一个贝叶斯优化算法。在MiniCpm中，我们使用NUTS算法作为贝叶斯优化算法。
\n\n2. 选择一个合适的超参数搜索空间。超参数的搜索空间应该足够大，以覆盖数据的统计特性。
\n\n3. 初始化一个超参数搜索空间，通常是一个连续的参数空间。
\n\n4. 定义一个概率分布函数，即P(参数|数据)。在MiniCpm中，P(参数|数据)通常表示为一个概率分布的函数，即P(参数|数据) = P(参数|数据)。
\n\n5. 选择一个搜索策略，用于在超参数搜索空间中搜索最优的超参数。常见的搜索策略有NUTS、SAM、Nelder-Mead等。在MiniCpm中，我们使用NUTS算法作为搜索策略。
\n\n6. 搜索超参数的过程。在搜索过程中，通过计算期望值得到新的参数集合，并重复计算期望值直到得到一个接近最优的超参数。
\n\n7. 评估超参数的性能。通过计算目标函数的梯度，来评估超参数的性能。
\n\n8. 调整超参数的搜索策略。根据超参数的性能，调整搜索策略的参数，以获得更好的搜索效果。
\n\n9. 停止搜索。当超参数搜索空间变得非常小，或超参数的性能不再提高时，停止搜索。
\n\n10. 输出最优的超参数。根据搜索的结果，输出最优的超参数。
\n\n以上就是关于MiniCpm的基本知识点。在实际使用中，还需要根据具体的问题和数据，选择合适的超参数搜索空间和搜索策略，以获得更好的效果。
","id_slot":0,"stop":true,"model":"./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf","tokens_predicted":576,
"tokens_evaluated":6,"generation_settings":{"n_ctx":4096,"n_predict":-1,"model":"./models/MiniCPM-2B-sft-bf16/CPM-2B-sft-Q4_K_M.gguf","seed":1725859291,
"temperature":0.800000011920929,"dynatemp_range":0.0,"dynatemp_exponent":1.0,"top_k":40,"top_p":0.949999988079071,
"min_p":0.05000000074505806,"tfs_z":1.0,"typical_p":1.0,"repeat_last_n":64,"repeat_penalty":1.0,"presence_penalty":0.0,"frequency_penalty":0.0,
"mirostat":0,"mirostat_tau":5.0,"mirostat_eta":0.10000000149011612,"penalize_nl":false,"stop":[],"max_tokens":-1,"n_keep":0,"n_discard":0,"ignore_eos":false,
"stream":false,"n_probs":0,"min_keep":0,"grammar":"","samplers":["top_k","tfs_z","typ_p","top_p","min_p","temperature"]},
"prompt":"介绍一下MiniCpm","truncated":false,"stopped_eos":true,"stopped_word":false,"stopped_limit":false,"stopping_word":"","tokens_cached":581,"timings":{"prompt_n":6,"prompt_ms":92.932,"prompt_per_token_ms":15.488666666666667,"prompt_per_second":64.5633366332372,"predicted_n":576,"predicted_ms":31077.377,"predicted_per_token_ms":53.95377951388889,"predicted_per_second":18.53438274407779},"index":0}

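This post gave a brief walkthrough of model format conversion, quantization, and inference with llama.cpp, as a record of running the steps locally. The following two articles were used as references along the way, with many thanks: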
https://blog.csdn.net/abcd51685168/article/details/140806221
https://developer.baidu.com/article/details/3185708