Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis

Yikang Ding*, Jiwen Liu*, Wenyuan Zhang, Zekun Wang, Wentao Hu, Liyuan Cui, Mingming Lao, Yingchao Shao, Hui Liu, Xiaohan Li, Ming Chen, Xiaoqiang Liu, Yu-Shen Liu, Pengfei Wan
*Equal contribution
Kling Team, Kuaishou Technology
High-Quality Videos with Accurate Lip–Audio Alignment
Multimodal Instruction Control

(Dynamics) The woman slowly turns her body, placing her hands on her waist.

(Dynamics) The girl turned her face left and right, occasionally touching her cheek with her hand.

(Dynamics) The girl raised her arms and turned half a circle to the left and right, showing her clothes.

(Emotion) The man is extremely angry and emotional.

(Emotion) The man looks very surprised, his eyes widened involuntarily and his eyebrows raised high.

(Emotion) The boy looked quite happy, with his eyes curved and smiling.

(Emotion) The woman looks stern and unapproachable, projecting a sense of dominance.

(Emotion) The girl is talking excitedly with a smile on her face.

(Emotion) The boy was excited, his eyebrows kept twitching and his eyes were shining with excitement.

(Camera) The camera gradually moves upward.

(Camera) The camera moves to the left around the subject.

Long-Duration Video Generation
Generalization to Open Scenarios
Pipeline
An MLLM Director first interprets the multimodal instructions into high-level semantics and plans a storyline. Guided by this global plan, the first stage generates a blueprint video. In the second stage, keyframes extracted from the blueprint serve as first–last frame conditions for parallel sub-clip generation, which refines local details and dynamics to synthesize a long-duration video.
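The second-stage data flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_keyframes`, `pair_conditions`, `generate_subclip`, `synthesize`), the fixed keyframe stride, and the use of plain numbers as stand-ins for frames are all assumptions made for illustration, and a linear interpolation takes the place of the actual video generation model.

```python
from concurrent.futures import ThreadPoolExecutor


def select_keyframes(num_frames, stride):
    """Pick every `stride`-th blueprint frame index, always keeping the last frame.

    A fixed stride is an assumption; the page does not say how keyframes are chosen.
    """
    indices = list(range(0, num_frames, stride))
    if indices[-1] != num_frames - 1:
        indices.append(num_frames - 1)
    return indices


def pair_conditions(keyframes):
    """Turn consecutive keyframes into (first, last) conditions, one pair per sub-clip."""
    return list(zip(keyframes[:-1], keyframes[1:]))


def generate_subclip(first, last, frames_per_clip):
    """Hypothetical placeholder for the video model: interpolate from first to last.

    Requires frames_per_clip >= 2 so both boundary frames are present.
    """
    step = (last - first) / (frames_per_clip - 1)
    return [first + step * i for i in range(frames_per_clip)]


def synthesize(blueprint, stride, frames_per_clip):
    """Expand a blueprint (at least two frames) into a longer video."""
    keyframes = [blueprint[i] for i in select_keyframes(len(blueprint), stride)]
    pairs = pair_conditions(keyframes)
    # Sub-clips depend only on their own (first, last) pair, so they can run in parallel.
    with ThreadPoolExecutor() as pool:
        clips = list(
            pool.map(lambda p: generate_subclip(p[0], p[1], frames_per_clip), pairs)
        )
    video = list(clips[0])
    for clip in clips[1:]:
        video.extend(clip[1:])  # skip the boundary frame shared with the previous clip
    return video


if __name__ == "__main__":
    # Five blueprint "frames" expand into nine frames via two parallel sub-clips.
    print(synthesize([0.0, 4.0, 8.0, 12.0, 16.0], stride=2, frames_per_clip=5))
```

Because each sub-clip is conditioned only on its own pair of keyframes, the clips are independent of one another, which is what makes parallel generation possible; adjacent clips share a boundary keyframe, so the sketch drops the duplicate when stitching.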
BibTeX

@article{ding2025kling-avatar,
  title={Kling-Avatar: Grounding Multimodal Instructions for Cascaded Long-Duration Avatar Animation Synthesis},
  author={Ding, Yikang and Liu, Jiwen and Zhang, Wenyuan and Wang, Zekun and Hu, Wentao and Cui, Liyuan and Lao, Mingming and Shao, Yingchao and Liu, Hui and Li, Xiaohan and Chen, Ming and Liu, Xiaoqiang and Liu, Yu-Shen and Wan, Pengfei},
  journal={arXiv preprint arXiv:2509.09595},
  year={2025}
}