Google's Gemini Omni Can Convert Images, Audio, and Text into Video and More

gemini google multimodal

2026-05-21 | Source: TechCrunch on MSN

Google's Gemini Omni model generates videos from images, audio, and text, and lets users edit those videos through conversation.

Google's Gemini Omni can now generate videos from images, audio, and text through simple conversation. As we reported on May 20, Gemini Omni is a unified AI model that handles text, images, audio, and video; the new capability shows that it can both create and edit video from those inputs.

The development matters because it could make content creation more accessible and efficient. With Gemini Omni, users can create high-quality videos up to 30 minutes long, with native sound, using just text and images. That could prove significant for industries such as marketing, education, and entertainment.

With Google I/O 2026 on the horizon, we can expect more updates and demonstrations of Gemini Omni's capabilities. As Google continues to develop and refine the model, it will be worth watching how it is integrated into applications and which new features and use cases follow.
