Introduction
- CausalConv3D VAE with 4x temporal and 8x8 spatial downsampling
- Diffusion model: FLUX-style architecture (Dual-Stream DiT Block + Single-Stream DiT Block) with full attention; a single model generates both video and images; 13 billion parameters
- Rotary Position Embedding
- Text Encoder: CLIP-large and an MLLM (xtuner/llava-llama-3-8b-v1_1-transformers)
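The VAE downsampling factors above determine the size of the latent that the diffusion model works on. A minimal sketch of that arithmetic follows. The function name `latent_shape` is hypothetical, and the "first frame is encoded on its own" rule (frame count of the form 4k + 1) is an assumption commonly made for causal video VAEs, not something these notes state. The 129-frame 720x1280 input is only an illustrative value.

```python
def latent_shape(frames: int, height: int, width: int,
                 t_factor: int = 4, s_factor: int = 8) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width).

    Assumption (not stated in the notes): the causal VAE encodes the first
    frame on its own, so `frames` must be of the form t_factor * k + 1.
    """
    if (frames - 1) % t_factor != 0:
        raise ValueError("frames must equal t_factor * k + 1")
    if height % s_factor or width % s_factor:
        raise ValueError("height and width must be divisible by s_factor")
    return (frames - 1) // t_factor + 1, height // s_factor, width // s_factor


# Illustrative input: 129 frames at 720x1280.
print(latent_shape(129, 720, 1280))  # (33, 90, 160)
```

Because full attention runs over every latent position, the sequence length grows with the product of these three numbers, which is why the compression ratio matters so much for cost.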
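- Data cleaning
Training data: several billion images, plus video sets at 256P, 360P, 540P, and 720P, plus an SFT set (about one million manually filtered and manually annotated samples; annotations include short description, dense description, background, style, shot type, lighting, atmosphere, and so on)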
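The annotation fields listed above can be pictured as one structured record per sample. This is a sketch only: the class name `VideoAnnotation`, the field names, and the `to_caption` helper are hypothetical and do not reflect the actual data schema, which the notes do not describe.

```python
from dataclasses import dataclass


@dataclass
class VideoAnnotation:
    """One annotated sample; fields mirror the list in the notes."""
    short_description: str
    dense_description: str
    background: str
    style: str
    shot_type: str
    lighting: str
    atmosphere: str

    def to_caption(self) -> str:
        # Hypothetical helper: join the non-empty fields into one prompt string.
        parts = [self.dense_description, self.background, self.style,
                 self.shot_type, self.lighting, self.atmosphere]
        return " ".join(p.strip() for p in parts if p.strip())


example = VideoAnnotation(
    short_description="A cat on a sofa.",
    dense_description="A grey cat stretches on a red sofa.",
    background="Living room.",
    style="Realistic.",
    shot_type="Close-up.",
    lighting="Soft daylight.",
    atmosphere="Calm.",
)
print(example.to_caption())
```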
- Training
Image pretraining:
- 256p training
- mix-scale training
Joint image-video training:
1. Low resolution, short videos
2. Low resolution, long videos
3. High resolution, long videos
High-performance Model Fine-tuning
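The training schedule above is a progressive curriculum: each stage starts from the previous stage's weights and raises resolution or clip length. A minimal sketch of that ordering as a config follows. The `Stage` type and `run_curriculum` are hypothetical names, the labels are qualitative because the notes give no step counts or exact resolutions for the joint stages, and the data used for the final stage is left unspecified.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Stage:
    name: str
    data: str                  # "image", "image+video", or "unspecified"
    resolution: Optional[str]  # qualitative label taken from the notes
    duration: Optional[str]    # "short" / "long" for video stages, else None


CURRICULUM = [
    Stage("256p image pretraining", "image", "256p", None),
    Stage("mix-scale image pretraining", "image", "mixed", None),
    Stage("joint: low resolution, short video", "image+video", "low", "short"),
    Stage("joint: low resolution, long video", "image+video", "low", "long"),
    Stage("joint: high resolution, long video", "image+video", "high", "long"),
    # The notes do not say which data this stage uses; left unspecified.
    Stage("high-performance fine-tuning", "unspecified", None, None),
]


def run_curriculum(stages, train_stage: Callable[[Stage], None]) -> None:
    """Run the stages in order; `train_stage` is a caller-supplied placeholder."""
    for stage in stages:
        train_stage(stage)


visited = []
run_curriculum(CURRICULUM, lambda s: visited.append(s.name))
print(visited)
```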
- Inference
On an A100-SXM4-80GB, 50 steps take ~50 min
GPU memory: 74GB/80GB with cpu-offload off
GPU memory: 56GB/80GB with cpu-offload on
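The figures above imply a per-step cost and a memory saving that are easy to derive. The short calculation below uses only the numbers recorded in these notes; the variable names are illustrative, and the timing is approximate because the notes give it as "~50 min".

```python
# Numbers copied from the notes (A100-SXM4-80GB).
total_minutes = 50
steps = 50
mem_offload_off_gb = 74
mem_offload_on_gb = 56

# Average wall-clock time per denoising step.
seconds_per_step = total_minutes * 60 / steps

# Memory freed by enabling CPU offload, absolute and relative.
saved_gb = mem_offload_off_gb - mem_offload_on_gb
saved_fraction = saved_gb / mem_offload_off_gb

print(f"{seconds_per_step:.0f} s/step")                   # 60 s/step
print(f"offload saves {saved_gb} GB ({saved_fraction:.0%})")  # offload saves 18 GB (24%)
```

The notes record only one timing, so they do not show whether enabling offload changes the time per step.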