ByteDance Introduces Magic-Me: A New AI Framework for Creating Videos with Custom Identity

Written By Adarsh Shankar Jha

Text-to-Image (T2I) and Text-to-Video (T2V) generation have made significant strides in production models. While T2I models can control subject identity well, extending this capability to T2V remains a challenge. Existing T2V methods lack precise control over the generated content, especially over the specific identities of subjects in human-centric scenarios. Efforts to leverage T2I advances for video creation struggle to keep identities and backgrounds stable across frames. These challenges stem from the variation among reference images, which contaminates the identity tokens, and from motion modules that struggle to maintain temporal consistency across different identity inputs.

Researchers from ByteDance Inc. and UC Berkeley developed Video Custom Diffusion (VCD), a simple yet powerful framework for generating videos with a specified subject identity. VCD uses three main components: an ID module for accurate identity extraction, a 3D Gaussian Noise Prior for frame-to-frame consistency, and V2V modules to improve video quality. By decoupling ID information from background noise, VCD aligns identities precisely, ensuring stable video outputs. The framework's flexibility allows it to work seamlessly with various AI-generated content models. Contributions include significant advances in ID-specific video generation, robust denoising techniques, resolution improvement, and a training approach that mitigates noise in the decoupled IDs.
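
To illustrate the intuition behind the 3D Gaussian Noise Prior, here is a minimal, hypothetical sketch in PyTorch: instead of sampling independent noise for every frame, a noise tensor shared across all frames is mixed with per-frame noise so that neighbouring frames start denoising from correlated latents. The mixing weight `alpha` and the function name are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a frame-correlated noise prior (assumed form, not the paper's code).
import torch

def correlated_video_noise(num_frames: int, channels: int, height: int, width: int,
                           alpha: float = 0.5, device: str = "cpu") -> torch.Tensor:
    """Return noise of shape (num_frames, channels, height, width) whose
    cross-frame correlation is controlled by alpha in [0, 1]."""
    shared = torch.randn(1, channels, height, width, device=device)              # component shared by every frame
    per_frame = torch.randn(num_frames, channels, height, width, device=device)  # independent per-frame component
    # Square-root weights keep the per-frame marginal variance at 1.
    return (alpha ** 0.5) * shared + ((1.0 - alpha) ** 0.5) * per_frame

if __name__ == "__main__":
    eps = correlated_video_noise(num_frames=16, channels=4, height=64, width=64, alpha=0.5)
    print(eps.shape)  # torch.Size([16, 4, 64, 64])
```

Larger values of `alpha` make the initial latents of neighbouring frames more similar, which is the property that helps the denoised frames stay temporally consistent.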

Developments in T2I generation have led to customizable models capable of creating realistic portraits and imaginative compositions. Techniques such as Textual Inversion and DreamBooth fine-tune pre-trained models on subject-specific images, learning unique identifiers associated with the desired subjects. This progress extends to multi-subject generation, where models learn to composite multiple subjects into a single image. The transition to T2V generation presents new challenges because of the need for spatial and temporal consistency across frames. While early methods used GANs and VAEs for low-resolution video, recent approaches rely on diffusion models for higher-quality output.
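
As a rough illustration of the customization recipe behind Textual Inversion and DreamBooth, the sketch below optimizes a single new token embedding on a few subject images while everything else stays frozen. The `frozen_diffusion_loss` function is a hypothetical placeholder for the real noise-prediction loss of a frozen diffusion model; it is only here to show the shape of the training loop, not either method's actual code.

```python
import torch

# One learnable embedding for a new placeholder token (e.g. "<subject>");
# the text encoder, U-Net, and VAE of the pre-trained model would stay frozen.
embed_dim = 768
subject_embedding = torch.nn.Parameter(torch.zeros(embed_dim))
optimizer = torch.optim.AdamW([subject_embedding], lr=5e-4)

def frozen_diffusion_loss(token_embedding: torch.Tensor, images: torch.Tensor) -> torch.Tensor:
    # Hypothetical stand-in for the frozen model's denoising loss, conditioned
    # on a prompt that contains the placeholder token.
    return (token_embedding.sum() - images.mean()) ** 2

subject_images = torch.rand(4, 3, 512, 512)  # a handful of reference photos of the subject
for step in range(200):
    optimizer.zero_grad()
    loss = frozen_diffusion_loss(subject_embedding, subject_images)
    loss.backward()
    optimizer.step()
```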

The VCD framework comprises a preprocessing module, an ID module, and a motion module, with an optional ControlNet Tile module that upscales the video to higher resolution. VCD augments an off-the-shelf motion module with the 3D Gaussian Noise Prior to mitigate exposure bias during inference. The ID module incorporates extended ID tokens with a masked loss and prompt-to-segmentation, effectively removing background noise. The study also introduces two V2V VCD pipelines: Face VCD, which enhances facial features and resolution, and Tiled VCD, which further upscales the video while preserving identity. Together, these modules ensure high-quality video output while maintaining identity.
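
Put together, the pipeline can be pictured roughly as the composition below. Every function is a hypothetical placeholder standing in for a full model stage (the names mirror the modules described above, not an actual API), but the data flow, preprocessing, ID-conditioned T2V generation, then Face VCD and Tiled VCD refinement, follows the description in this section.

```python
from typing import List

def preprocess(reference_images: List[str]) -> dict:
    # Segment the subject in each reference image and build extended ID tokens
    # plus the masks used by the masked loss (placeholder).
    return {"id_tokens": ["<id>"], "masks": [f"mask:{p}" for p in reference_images]}

def t2v_vcd(prompt: str, id_context: dict, num_frames: int = 16) -> List[str]:
    # ID module + motion module sampled with the 3D Gaussian Noise Prior (placeholder).
    return [f"frame {i}: '{prompt}' conditioned on {id_context['id_tokens']}" for i in range(num_frames)]

def face_vcd(frames: List[str]) -> List[str]:
    # Re-denoise face regions to sharpen identity-relevant detail (placeholder).
    return [f + " | face-refined" for f in frames]

def tiled_vcd(frames: List[str]) -> List[str]:
    # ControlNet Tile style upscaling that preserves identity (placeholder).
    return [f + " | upscaled" for f in frames]

id_context = preprocess(["ref_photo_1.jpg", "ref_photo_2.jpg"])
video = tiled_vcd(face_vcd(t2v_vcd("a portrait of <id> reading in a cafe", id_context)))
print(len(video), video[0])
```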

The VCD model preserves the character's identity across various realistic and stylized base models. The researchers carefully selected subjects from different datasets and evaluated the method against multiple baselines on identity alignment (CLIP-I and DINO), text alignment, and temporal smoothness. Training details included using Stable Diffusion 1.5 for the ID module, with learning rates and batch sizes adjusted accordingly. The study drew data from the DreamBooth and CustomConcept101 datasets and evaluated the model's performance on these metrics. It highlighted the critical role of the 3D Gaussian Noise Prior and the prompt-to-segmentation module in improving video smoothness and image alignment, and Realistic Vision generally outperformed Stable Diffusion as the base model, underscoring the importance of model selection.
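
For readers unfamiliar with these metrics, the sketch below shows one common way CLIP-I style identity alignment and CLIP-based text alignment are computed: cosine similarity between CLIP embeddings of the generated frames and either a reference image or the prompt. The checkpoint and exact protocol (frame sampling, the DINO variant, the smoothness measure) are assumptions here and may differ from the paper's setup.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# A widely used public CLIP checkpoint; the paper's evaluation backbone may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i(reference: Image.Image, frames: list) -> float:
    """Mean cosine similarity between the reference image and each generated frame."""
    inputs = processor(images=[reference, *frames], return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    ref, gen = feats[0:1], feats[1:]
    return (gen @ ref.T).mean().item()

def clip_t(prompt: str, frames: list) -> float:
    """Mean cosine similarity between the prompt embedding and each frame embedding."""
    img_inputs = processor(images=frames, return_tensors="pt")
    txt_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(**img_inputs)
        txt = model.get_text_features(**txt_inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()
```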

In conclusion, VCD advances subject-identity-controlled video generation by seamlessly integrating identity information and frame-to-frame correlation. Through components such as the segmentation-prompted ID module for accurate ID decoupling and the T2V VCD module for improved frame consistency, VCD sets a new benchmark for identity preservation in video. Its compatibility with existing text-to-image models adds to its practicality. With the 3D Gaussian Noise Prior and the Face/Tiled VCD modules, VCD delivers stability, sharpness, and higher resolution. Extensive experiments confirm its advantages over existing methods, making it a strong tool for creating stable, high-quality, identity-preserving videos.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.
