
Tencent Releases Tencent HY-Motion 1.0: A Billion-Parameter Text-to-Motion Model Built on a Diffusion Transformer (DiT) Architecture with Flow Matching

Tencent Hunyuan's 3D Digital Human team has released HY-Motion 1.0, a next-generation family of text-to-motion models that scales Diffusion Transformer based Flow Matching to 1B parameters in the motion domain. The models turn a natural language prompt and a target duration into a 3D animation clip on the unified SMPL-H skeleton, and they are available on GitHub and Hugging Face with code, checkpoints and a Gradio interface for local use.

What does HY-Motion 1.0 offer developers?

HY-Motion 1.0 is a series of text-to-3D human motion generation models built on the Diffusion Transformer (DiT) and trained with a Flow Matching objective. The series has 2 variants: HY-Motion-1.0 with 1.0B parameters as the standard model, and HY-Motion-1.0-Lite with 0.46B parameters as a lightweight option.

Both models generate skeleton-based 3D character animations from plain text prompts. The output is a motion sequence on the SMPL-H skeleton that can be plugged into 3D animation or game pipelines, for example for digital humans, cinematics and interactive characters. The release includes detailed documentation, a batch-oriented CLI and a Gradio web application, and supports macOS, Windows and Linux.

Data engine and taxonomy

The training data comes from 3 sources: human motion videos, motion capture data, and 3D animation assets from game production. The research team starts from 12M high-quality video clips from HunyuanVideo, runs shot boundary detection to segment scenes and a human detector to keep clips containing people, and then uses the GVHMR algorithm to reconstruct SMPL-X motion tracks. Motion capture sessions and 3D animation libraries contribute about 500 hours of additional motion sequences.

All data is retargeted to a unified SMPL-H skeleton using mesh fitting and retargeting tools. A multi-stage filter removes duplicate clips, abnormal poses, implausible velocities, jerky transitions, long static segments and artifacts such as foot sliding. The motion is then canonicalized, resampled to 30 fps and split into clips shorter than 12 seconds in a fixed world frame, with the Y axis up and the character facing the positive Z axis. The final corpus contains more than 3,000 hours of motion, of which 400 hours are high-quality 3D motion with captions.
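The canonicalization step above can be sketched in a few lines. The function names and the nearest-frame resampling strategy below are illustrative assumptions, not Tencent's actual pipeline; only the 30 fps target and the 12-second clip bound come from the article.

```python
# Sketch of the clip preprocessing: resample a motion sequence to 30 fps
# and split it into clips of at most 12 seconds (assumed chunking scheme).

TARGET_FPS = 30
MAX_CLIP_SECONDS = 12

def resample_to_30fps(frames, src_fps):
    """Nearest-frame resampling from src_fps to TARGET_FPS (assumption:
    the real pipeline may interpolate rotations instead)."""
    n_src = len(frames)
    duration = n_src / src_fps
    n_dst = int(duration * TARGET_FPS)
    return [frames[min(round(i * src_fps / TARGET_FPS), n_src - 1)]
            for i in range(n_dst)]

def split_into_clips(frames, fps=TARGET_FPS, max_seconds=MAX_CLIP_SECONDS):
    """Split a frame list into clips no longer than max_seconds."""
    max_len = fps * max_seconds          # 360 frames at 30 fps
    return [frames[i:i + max_len] for i in range(0, len(frames), max_len)]

# Example: a 25-second capture recorded at 120 fps
raw = list(range(25 * 120))              # 3000 source frames
resampled = resample_to_30fps(raw, 120)  # 750 frames at 30 fps
clips = split_into_clips(resampled)      # clips of 360, 360 and 30 frames
```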

On top of this, the research team defines a 3-level taxonomy. At the top level there are 6 classes: Locomotion, Sports and Athletics, Fitness and Outdoor Active, Daily Activities, Social Interaction and Recreation, and Game Character Actions. These expand into more than 200 fine-grained motion categories at the leaf level, covering both simple atomic actions and combinations of concurrent or sequential movements.

Motion representation and HY-Motion DiT

HY-Motion 1.0 uses the SMPL-H skeleton with 22 body joints, excluding the hands. Each frame is a 201-dimensional vector consisting of the global root translation in 3D space, the global body orientation in a 6D rotation representation, 21 local joint rotations in 6D form and 22 local joint positions in 3D coordinates. Velocity and foot contact labels are dropped because they slow down training and do not help final quality. This representation is friendly to animation workflows and is close to the representation used by DART.
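The per-frame dimensionality can be verified from the components listed above. The field ordering in this sketch is an assumption for illustration; the article only specifies the components and their sizes.

```python
# Sanity check of the 201-dimensional per-frame motion vector.

frame_layout = {
    "root_translation": 3,            # global root position (x, y, z)
    "global_orientation": 6,          # body orientation, 6D rotation repr.
    "local_joint_rotations": 21 * 6,  # 21 joints x 6D rotation = 126
    "local_joint_positions": 22 * 3,  # 22 joints x 3D position = 66
}

frame_dim = sum(frame_layout.values())
print(frame_dim)  # 3 + 6 + 126 + 66 = 201
```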

The core network is a hybrid HY-Motion DiT. The first two dual-stream blocks process noised motion tokens and text tokens separately. In these blocks, each modality has its own QKV projections and MLP, and a joint attention module lets motion tokens query semantic features from text tokens while preserving modality-specific structure. The network then switches to single-stream blocks that concatenate motion and text tokens into a single sequence and process them with shared modules and full attention for deep multimodal fusion.

For text conditioning, the system uses a dual-encoder scheme. Qwen3-8B provides token-level embeddings, while a CLIP-L model provides a global text feature. A Bidirectional Token Refiner corrects the causal bias of the LLM embeddings. These signals condition the DiT through adaptive layer normalization. Attention is asymmetric: motion tokens can attend to all text tokens, but text tokens cannot attend to motion tokens, which prevents noisy motion from corrupting the language representation. Temporal attention within the motion branch uses a sliding window of 121 frames, which focuses capacity on local kinematics while keeping cost manageable for long clips. Full Rotary Position Embedding is applied after the text and motion tokens are concatenated, encoding relative positions across the whole sequence.
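The asymmetric attention pattern described above can be made concrete as a boolean mask. The 121-frame window comes from the article; the mask-building code itself and its token layout are illustrative assumptions.

```python
# Build an attention mask where motion tokens attend to all text tokens,
# text tokens attend only to text, and motion-to-motion attention is
# restricted to a sliding temporal window.

def build_mask(n_text, n_motion, window=121):
    """mask[i][j] is True if query token i may attend to key token j.
    Tokens 0..n_text-1 are text; the rest are motion frames in order."""
    n = n_text + n_motion
    half = window // 2
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_text:
                mask[i][j] = j < n_text      # text attends only to text
            elif j < n_text:
                mask[i][j] = True            # motion attends to all text
            else:                            # motion-to-motion: windowed
                mask[i][j] = abs((i - n_text) - (j - n_text)) <= half
    return mask

# Tiny example with 4 text tokens, 10 motion frames and a 5-frame window
m = build_mask(n_text=4, n_motion=10, window=5)
```

With a window of 5 (half-width 2), motion frame 0 can see motion frames 0..2 but not frame 3, while every motion frame still sees all 4 text tokens.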

Flow Matching, prompt rewriting and training

HY-Motion 1.0 uses Flow Matching instead of conventional diffusion denoising. The model learns a velocity field along a continuous path that interpolates between Gaussian noise and real motion data. During training, the objective is the mean squared error between the predicted velocity and the ground-truth velocity along this path. At inference time, the learned ODE is integrated from noise to a clean trajectory, which gives stable training for long sequences and fits the DiT structure naturally.
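The training objective above can be sketched in a few lines: sample noise, interpolate toward the clean data, and regress the predicted velocity onto the target velocity. The toy "model" below is a stand-in assumption, not the HY-Motion DiT.

```python
# Minimal sketch of one Flow Matching training loss evaluation.
import random

def flow_matching_loss(x1, model, t):
    """x1: clean motion frame (list of floats); t in (0, 1)."""
    x0 = [random.gauss(0.0, 1.0) for _ in x1]           # Gaussian noise
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]  # linear interpolant
    v_target = [b - a for a, b in zip(x0, x1)]          # target velocity
    v_pred = model(xt, t)
    # mean squared error between predicted and target velocity
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)

# An oracle that predicts the velocity exactly drives the loss to ~0
# (possible only in this toy, where the clean sample is known):
oracle_x1 = [0.5, -1.0, 2.0]
def oracle(xt, t):
    x0 = [(a - t * b) / (1 - t) for a, b in zip(xt, oracle_x1)]
    return [b - a for a, b in zip(x0, oracle_x1)]

loss = flow_matching_loss(oracle_x1, oracle, t=0.3)
print(round(loss, 10))  # 0.0
```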

A separate duration prediction module and a prompt rewrite module improve instruction following. The module uses Qwen3-30B-A3B as a base model and is trained on user-style prompts derived from motion captions with a VLM and LLM pipeline, for example Gemini 2.5 Pro. It predicts an appropriate motion duration and rewrites informal prompts into standardized text that is easy for the DiT to follow. It is first trained with supervised fine-tuning and then refined with Group Relative Policy Optimization, using Qwen3-235B-A22B as a reward model that scores semantic similarity and duration consistency.

Training follows a 3-stage curriculum. Stage 1 performs large-scale pre-training on the full 3,000-hour dataset to learn broad motion priors and basic text-motion alignment. Stage 2 fine-tunes on the 400-hour high-quality subset at a lower learning rate to sharpen motion detail and improve semantic accuracy. Stage 3 uses reinforcement learning, first Direct Preference Optimization on 9,228 human preference pairs drawn from about 40,000 generated pairs, and then Flow-GRPO with a composite reward. The reward combines a semantic score from a Text-Motion Retrieval model and a physics score that penalizes artifacts such as foot sliding and root drift, under a KL regularization term that keeps the policy close to the supervised model.
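The composite reward used in the RL stage has the shape sketched below. The weights, helper scores and their combination are illustrative assumptions; the article only states that the reward mixes a semantic score, physics-based artifact penalties and a KL term toward the supervised model.

```python
# Sketch of a composite RL reward: semantic score plus physics score,
# regularized by a KL penalty toward the supervised (SFT) policy.

def composite_reward(semantic_score, foot_slide, root_drift,
                     kl_to_sft, w_sem=1.0, w_phys=1.0, beta=0.1):
    physics_score = -(foot_slide + root_drift)   # penalize artifacts
    return (w_sem * semantic_score
            + w_phys * physics_score
            - beta * kl_to_sft)                  # stay close to SFT model

# Example: a semantically good motion with mild foot sliding
r = composite_reward(semantic_score=0.8, foot_slide=0.1,
                     root_drift=0.05, kl_to_sft=0.2)
print(round(r, 3))  # 0.8 - 0.15 - 0.02 = 0.63
```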

Benchmarks, scaling behavior and limitations

For evaluation, the team built a test set of over 2,000 prompts covering the 6 taxonomy categories and including single, concurrent and sequential actions. Human raters score instruction following and motion quality on a scale from 1 to 5. HY-Motion 1.0 achieves an average instruction-following score of 3.24 and an SSAE score of 78.6 percent. Baseline text-to-motion systems such as DART, LoM, GoToZero and MoMask score between 2.17 and 2.31, with SSAE between 42.7 percent and 58.0 percent. For motion quality, HY-Motion 1.0 scores 3.43 on average, compared to 3.11 for the best baseline.

An ablation study compares DiT models at 0.05B, 0.46B, 0.46B trained only on the 400-hour subset, and 1B parameters. Instruction following improves steadily with model size, with the 1B model reaching a mean of 3.34. Motion quality saturates around the 0.46B scale, where the 0.46B and 1B models land in the same range between 3.26 and 3.34. Comparing the 0.46B model trained on 3,000 hours with the 0.46B model trained on only 400 hours shows that data volume is the key to instruction alignment, while high-quality fine-tuning data mainly improves realism.

Key Takeaways

  • Billion-parameter DiT with Flow Matching for motion: HY-Motion 1.0 is the first Diffusion Transformer based Flow Matching model scaled to the 1B parameter level specifically for text-to-3D human motion, targeting high-fidelity following of diverse actions.
  • Large-scale, curated motion corpus: The model is pre-trained on over 3,000 hours of video-reconstructed, mocap and animation data and fine-tuned on a 400-hour high-quality subset, all retargeted to the SMPL-H skeleton and organized into over 200 motion classes.
  • Hybrid DiT architecture with strong text conditioning: HY-Motion 1.0 combines dual-stream and single-stream DiT blocks with asymmetric attention, windowed temporal attention and dual text encoders, Qwen3-8B and CLIP-L, to inject token-level and global semantics into motion trajectories.
  • Prompt rewriting and RL alignment: A dedicated Qwen3-30B-based module predicts motion duration and rewrites user prompts, and the DiT is further aligned with Direct Preference Optimization and Flow-GRPO using semantic and physics rewards, improving realism and instruction following beyond supervised training.



Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the power of Artificial Intelligence for the benefit of society. His latest endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its extensive coverage of machine learning and deep learning stories that are technically sound yet easily understood by a wide audience. The platform draws more than 2 million monthly views, reflecting its popularity among readers.
