
Chimera: Communication Fusion for Hybrid Parallelism in Large Language Models
Distributed On-Device LLM Inference With Over-the-Air Computation. This paper addresses...
Less is More: Optimizing Function Calling for LLM Execution on Edge Devices
Jupiter: Fast and Resource-Efficient Collaborative Inference of Generative LLMs on Edge Devices
What data makes up the KV cache during decoding, and is each new token computed and then merged into the KV cache? In the decoding stage of a Transformer model (especially...
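The question largely answers itself in code. Below is a minimal single-head decode-step sketch (function name, shapes, and variables are my own illustration, not from the post): the new token's key and value are projected exactly once, appended to the cache, and attention then runs over the full cached sequence, so no past K/V is ever recomputed.

```python
import torch

def decode_step(x_t, W_q, W_k, W_v, k_cache, v_cache):
    """One single-head decode step (hypothetical helper for illustration).
    x_t: (1, d_model) embedding of the newest token.
    k_cache, v_cache: (t, d_head) keys/values of the t previous tokens.
    """
    q_t = x_t @ W_q                      # query for the new token only
    k_t = x_t @ W_k                      # its key and value are computed
    v_t = x_t @ W_v                      # once, here, and never again
    k_cache = torch.cat([k_cache, k_t])  # "merged": appended as one new row
    v_cache = torch.cat([v_cache, v_t])
    scale = k_cache.shape[-1] ** 0.5
    attn = torch.softmax(q_t @ k_cache.T / scale, dim=-1)  # over all t+1 tokens
    return attn @ v_cache, k_cache, v_cache
```

So the cache is simply the stacked K and V projections of every token seen so far (prefill plus generated), and each decode step contributes one new row per layer and head.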
Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning
I recently read a paper on CoT fine-tuning, so I took a closer look at CoT itself. Going in, I had two questions: 1) is the chain of thought generated as extra output, or does each inference of the model...
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training
MoBA: Mixture of Block Attention for Long-Context LLMs. As shown in the figure, it mimics the MoE design to...
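A hedged sketch of that MoE analogy (my own simplification, not the paper's exact algorithm; it omits MoBA's causal masking and its rule of always attending the query's own block): keys and values are chunked into blocks, a router scores each block by its mean-pooled key against the query, and the query attends only to its top-k blocks, just as an MoE router picks top-k experts.

```python
import torch

def moba_attention(q, K, V, block_size=4, top_k=2):
    """q: (d,); K, V: (n, d) with n divisible by block_size (illustrative only)."""
    n, d = K.shape
    K_blocks = K.view(n // block_size, block_size, d)
    V_blocks = V.view(n // block_size, block_size, d)
    # MoE-style gating: score each block by q . mean(keys in block)
    gate = K_blocks.mean(dim=1) @ q              # (num_blocks,)
    idx = gate.topk(top_k).indices               # the chosen "experts"
    K_sel = K_blocks[idx].reshape(-1, d)         # (top_k * block_size, d)
    V_sel = V_blocks[idx].reshape(-1, d)
    attn = torch.softmax(K_sel @ q / d ** 0.5, dim=-1)
    return attn @ V_sel                          # (d,)
```

The payoff is the same as in MoE: compute scales with top_k * block_size selected keys rather than the full context length n.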