Tencent's HY-Motion Generates 3D Animation from Text

Tencent launches HY-Motion 1.0, a billion-parameter text-to-3D animation model that converts natural language descriptions into skeletal character motion

What It Is

HY-Motion 1.0 is Tencent’s new text-to-3D animation model that generates character motion sequences from natural language descriptions. The system translates prompts like “character running and jumping over an obstacle” into skeletal animation data that loads directly into 3D software such as Blender and game engines such as Unity.

The model has 1 billion parameters and uses a flow-matching architecture rather than a traditional diffusion approach. This technical choice produces smoother transitions between poses and more physically plausible movement patterns. The training dataset spans over 200 motion categories, covering everything from basic locomotion to complex actions like combat sequences and dance choreography.
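The core flow-matching idea can be sketched in a few lines of plain Python. This is an illustrative toy, not Tencent's implementation: flow matching trains a velocity field v(x, t) along straight-line paths from noise to data, and generation integrates the resulting ODE from t = 0 (noise) to t = 1 (a sample), which is what yields the smooth, deterministic trajectories the article describes.

```python
# Toy flow-matching sampler (illustrative only; not HY-Motion's code).
# Training pairs noise x0 with data x1 along the linear path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0.
# Here, think of x as a flattened pose vector for one animation frame.

def velocity(x, t, x1):
    # For the linear path, the ideal velocity toward data point x1 is
    # (x1 - x) / (1 - t): it carries x exactly onto x1 at t = 1.
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]

def sample(x0, x1, steps=100):
    # Euler integration of the probability-flow ODE dx/dt = v(x, t).
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, x1)
        x = [a + vi * dt for a, vi in zip(x, v)]
    return x

noise = [0.3, -1.2, 0.8]    # stand-in for a noisy pose vector
target = [1.0, -2.0, 0.5]   # stand-in for a "data" pose
result = sample(noise, target)
```

In a real model the velocity field is a learned network conditioned on the text prompt rather than a closed-form expression, but the sampling loop has the same shape.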

What sets HY-Motion apart from earlier text-to-motion systems is its reinforcement learning training phase. After initial supervised fine-tuning, the model underwent additional optimization to improve instruction following. This means it generates animations that more accurately match specific prompt details rather than defaulting to generic interpretations of movement types.
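The article does not specify which RL algorithm Tencent used, but the general idea of nudging a policy toward higher-reward outputs can be illustrated with a toy best-of-N update. Everything below is a hypothetical stand-in: a 1-D "policy" (a Gaussian mean) is pulled toward samples that score well under a reward function standing in for "how closely the motion matches the prompt".

```python
import random

def reward(x, target=3.0):
    # Stand-in reward: higher when the output is closer to the target.
    return -(x - target) ** 2

def rl_step(mean, sigma=0.5, n=32, lr=0.2, rng=random.Random(0)):
    # Sample n candidates from the current policy, keep the highest-reward
    # one, and move the policy a small step toward it.
    samples = [rng.gauss(mean, sigma) for _ in range(n)]
    best = max(samples, key=reward)
    return mean + lr * (best - mean)

mean = 0.0
for _ in range(100):
    mean = rl_step(mean)
# mean has drifted from 0.0 toward the reward-maximizing value 3.0
```

Real systems use far more sophisticated objectives, but the feedback loop is the same: supervised training gets the model close, and reward signals sharpen instruction following.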

Why It Matters

Animation production remains one of the most time-intensive bottlenecks in game development and 3D content creation. Traditional keyframe animation requires specialized skills and hours of manual work for even simple motion sequences. Motion capture offers realism but demands expensive equipment, studio space, and cleanup time.

HY-Motion addresses this production gap for indie developers, small studios, and prototyping workflows. A solo developer can now generate placeholder animations or even production-ready sequences without hiring animators or booking mocap sessions. The model’s export compatibility with standard 3D formats means animations integrate directly into existing pipelines without conversion tools or middleware.

The reinforcement learning component has broader implications for generative AI development. Most text-to-motion models struggle with precise instruction following because they’re trained purely on paired text-motion datasets. Adding RL optimization creates a feedback loop that rewards prompt fidelity, an approach other motion models may adopt.

For the AI research community, HY-Motion demonstrates that flow matching can outperform diffusion models in temporal domains. While diffusion has dominated image and video generation, flow-based approaches may prove superior for sequential data like animation where temporal coherence matters more than frame-by-frame quality.

Getting Started

The model is available through multiple channels. Developers can access the code repository at https://github.com/Tencent-Hunyuan/HY-Motion-1.0 or download model weights from https://huggingface.co/tencent/HY-Motion-1.0. The official documentation and demo interface live at https://hunyuan.tencent.com/motion.

For local deployment, the basic workflow involves installing dependencies and loading the model:


# Import path is assumed; check the repository README for the actual package name.
from hy_motion import HYMotion

model = HYMotion.from_pretrained("tencent/HY-Motion-1.0")
animation = model.generate("warrior swinging sword in combat stance")
animation.export("output.fbx")  # FBX export for downstream 3D tools

The exported FBX files contain skeletal animation data that imports cleanly into Blender, Maya, Unity, and Unreal Engine. No retargeting or cleanup should be necessary for standard humanoid rigs, though custom character proportions may require minor adjustments.

Hardware requirements are reasonable for a billion-parameter model. Generation runs on consumer GPUs with 8GB+ VRAM, though inference speed scales with available compute. CPU-only operation is possible but significantly slower.
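A back-of-envelope calculation shows why 8 GB is a comfortable floor. The figures below are rules of thumb, not official Tencent numbers: weight memory is parameter count times bytes per parameter, and activations plus intermediate buffers consume the remaining headroom.

```python
# Rough VRAM estimate for a 1B-parameter model (rule-of-thumb arithmetic,
# not official figures).

def weight_memory_gb(params, bytes_per_param):
    # Memory occupied by the weights alone, in GiB.
    return params * bytes_per_param / 1024**3

PARAMS = 1_000_000_000
fp32_gb = weight_memory_gb(PARAMS, 4)  # full precision: ~3.7 GiB
fp16_gb = weight_memory_gb(PARAMS, 2)  # half precision: ~1.9 GiB
```

Even at full precision, weights fit in under 4 GiB, leaving several gigabytes of an 8 GB card for activations and framework overhead, which is consistent with the consumer-GPU claim above.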

Context

HY-Motion enters a growing field of text-to-motion models. OpenAI’s work on motion generation and academic projects like MotionGPT have explored similar territory, but few offer production-ready output with this level of category coverage.

The main limitation is creative control. Text prompts provide high-level direction but lack the precision of keyframe animation. Animators can’t specify exact timing, pose holds, or subtle weight shifts through natural language alone. The model works best for generating base animations that artists can refine, not as a complete replacement for manual animation.

Quality varies across motion categories. Common actions like walking and running benefit from abundant training data, while niche movements may produce less polished results. The 200+ category coverage is impressive but still represents a fraction of possible human motion.

Flow matching’s computational efficiency compared to diffusion models makes real-time or near-real-time generation feasible. This could enable interactive workflows where developers iterate on prompts and immediately preview results, fundamentally changing how animation prototyping works.