The team at Tencent’s Hunyuan Lab has created a new AI model called “Hunyuan Video-Foley.” It is designed to analyze videos and produce high-quality soundtracks that are precisely synchronized with on-screen action.
Have you ever watched an AI-generated video and felt that something was missing? The visuals may be great, but there is often an eerie silence that breaks the spell. In the film industry, the sounds that fill that silence, the rustle of leaves, the clap of thunder, the clink of glasses, are known as Foley art, a laborious craft performed by specialist artists.
Matching that level of detail is a major challenge for AI. For years, automated systems have struggled to produce convincing sound for video.
How did Tencent solve the problem of generating audio from video?
One of the biggest reasons video-to-audio (V2A) models have fallen short is what the researchers call “modality imbalance.” Essentially, the AI was listening to the text prompt far more than it was watching the actual video.
For example, you might give a model a video of a busy beach with people walking and gulls flying overhead, but if the text prompt simply says “sound of sea waves,” you get only the sound of waves. The AI completely ignores the footsteps in the sand and the cries of the birds, the details that bring the scene to life.
On top of that, generated audio quality was often poor, because there simply was not enough high-quality video with clean audio to train the models effectively.
Tencent’s Hunyuan team addressed these issues from three different angles.
- Tencent realized the AI needed better training material, so the team built a massive dataset of 100,000 hours of video, audio, and text descriptions for it to learn from. They created an automated pipeline that filters low-quality content from the internet, discarding clips with long silences or compressed, fuzzy audio, to ensure the AI learned from the best possible material (a rough filtering sketch follows this list).
- They designed a smarter architecture for the AI; think of it as teaching the model to multitask properly. The system first pays very close attention to the link between visuals and audio to get the timing right, for example matching a footstep sound to the exact moment a shoe hits the pavement. Once that timing is locked down, the text prompt is brought in to capture the overall mood and context of the scene. This two-stage approach ensures that specific details in the video are not overlooked (a simplified attention block is sketched below).
- To ensure the sound itself was high quality, they used a training strategy called Representation Alignment (REPA). This is like having a professional audio engineer constantly looking over the AI’s shoulder during training: the technique guides the model’s internal features toward those of a pre-trained, professional-grade audio model, producing cleaner, richer, and more stable sound (a rough illustration of the idea closes this list).
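The paper’s exact filtering rules are not spelled out here, but the idea behind the data pipeline can be illustrated with a short Python sketch. The thresholds and the helper names (`silence_ratio`, `keep_clip`) are hypothetical; the point is simply that clips with too much silence or heavily compressed audio are discarded before training.

```python
import numpy as np
import librosa

# Hypothetical quality thresholds -- the values actually used by the
# Hunyuan team are not public.
MAX_SILENCE_RATIO = 0.5   # discard clips that are mostly silent
MIN_SAMPLE_RATE = 32000   # discard heavily downsampled / compressed audio

def silence_ratio(audio: np.ndarray, top_db: float = 30.0) -> float:
    """Fraction of the clip that falls below an (assumed) silence threshold."""
    intervals = librosa.effects.split(audio, top_db=top_db)  # non-silent spans
    voiced = sum(end - start for start, end in intervals)
    return 1.0 - voiced / len(audio)

def keep_clip(path: str) -> bool:
    """Return True if a clip passes these illustrative quality filters."""
    audio, sr = librosa.load(path, sr=None, mono=True)
    if sr < MIN_SAMPLE_RATE:
        return False          # audio too compressed / low bandwidth
    if silence_ratio(audio) > MAX_SILENCE_RATIO:
        return False          # too much silence to be useful for training
    return True
```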
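For the architecture, here is a minimal PyTorch sketch of what such a two-stage conditioning block might look like: audio latents attend to the video frames first (for timing), then to the text prompt (for mood and context). The layer sizes, names, and exact ordering are assumptions for illustration, not the published design.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Illustrative block: sync with the visuals first, then fold in the text."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, audio, video, text):
        # 1) timing: each audio frame queries the video frames
        audio = audio + self.video_attn(self.norm1(audio), video, video)[0]
        # 2) context: the audio then queries the text prompt embedding
        audio = audio + self.text_attn(self.norm2(audio), text, text)[0]
        return audio + self.ff(audio)
```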
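And a rough illustration of the REPA idea, assuming a simple projection plus cosine-similarity loss that pulls the generator’s hidden states toward features from a frozen, pre-trained audio encoder. The dimensions and weighting here are placeholders rather than the paper’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepaLoss(nn.Module):
    """Align the generator's hidden states with a frozen audio encoder's features."""

    def __init__(self, hidden_dim: int = 512, teacher_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, teacher_dim)

    def forward(self, hidden_states: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, frames, hidden_dim) from the generator
        # teacher_feats: (batch, frames, teacher_dim) from the frozen audio encoder
        student = F.normalize(self.proj(hidden_states), dim=-1)
        teacher = F.normalize(teacher_feats, dim=-1)
        # Maximize cosine similarity, i.e. minimize (1 - cos)
        return (1.0 - (student * teacher).sum(dim=-1)).mean()
```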
The results speak for themselves
When Tencent tested Hunyuan Video-Foley against other leading AI models, the results were clear. It was not just superior on automated metrics: human listeners consistently rated its output as higher quality, better matched to the video, and more accurately timed.
Overall, the model produces sound that matches on-screen action in both content and timing, and results across multiple evaluation datasets support this.
Tencent’s work helps bridge the gap between silent AI video and an immersive viewing experience with high-quality audio, bringing the magic of Foley art into the world of automated content creation. It could be a powerful tool for filmmakers, animators, and creators everywhere.