When Google launched Gemini three years ago, the goal was to build a multimodal large-scale language model: a single neural network that can be trained on text, images, audio, and video and generate content in any of these formats.
Today, at the Google I/O developer conference, the company took a concrete step toward that goal with Gemini Omni, a new family of multimodal models that, according to Google CEO Sundar Pichai, “can create anything from any input.”
Omni is starting with video. Users can now combine images, audio, video, and text, and instead of simply splicing these inputs together, Omni integrates them all to produce a coherent output. The result is high-quality video that reflects an understanding of physics, culture, history, and science.
Omni also lets you edit photos using plain-text commands rather than complex editing software, much as Google’s Nano Banana does.
Google already has a dedicated video model, Veo, which allows users to convert text and images into videos, as well as direct and customize avatars. But Nicole Brichtova, director of product management at Google DeepMind, says today’s release is more than an update to Veo: “It’s the next step in our advancement of combining the intelligence of Gemini with the rendering capabilities of our media models.”
In one example that Koray Kavukcuoglu, chief engineer at DeepMind, shared with reporters at Monday’s media briefing, Omni was given a simple prompt, “claymation explainer of protein folding,” and rendered a stop-motion explainer video with a voiceover saying, “Proteins start as chains of amino acids. They fold into alpha-helix-like patterns and flat sections called beta sheets, forming a perfect three-dimensional structure.”
The long-term vision for Omni is broader and includes models that generate images from audio and audio from video.
“When we first announced Gemini, it was the first AI model that was natively multimodal,” Pichai said during the briefing. “We knew that by training a combination of text, code, audio, images, and video, we could gain a deeper understanding of the world. World models are moving AI from predicting text to simulating reality. Gemini Omni is the next step in that direction.”
As part of the release, users will also be able to create videos using their own digital avatars, a feature OpenAI popularized with Cameos in its now-defunct Sora app. According to Brichtova, users must go through a dedicated onboarding process to prevent deepfakes: they record themselves reading out a series of numbers, and the avatar is then saved for future use.
Additionally, all videos created with Omni include Google’s SynthID watermark, so users can tell if a video was generated via a Gemini product.
The first model in this family is Gemini Omni Flash, which is rolling out today to Gemini apps, YouTube Shorts, and the AI creative studio Flow. Flash will render videos up to 10 seconds long. Brichtova said the cap is not a model limitation but a decision based both on the desire to reach more users and on the expectation that most users don’t yet want to create longer videos. Support for longer videos is planned for the near future.
Google seems to be marketing Omni Flash more as a consumer tool. The examples of uses for digital avatars that Brichtova and Gabe Barth-Maron, a research engineer at DeepMind, cited in a conference call with TechCrunch were all personal, from creating videos of winning awards and going to the moon to removing passersby from the background of videos taken while on vacation.
Put more simply, Barth-Maron says, “They’re like personalized memes.”
“We definitely focused on making this easy for consumers to use,” Brichtova said. “There aren’t many video models that have been able to break through the gap with consumers, so this is what we’re doing.”
There are some caveats regarding ease of use. Brichtova and Barth-Maron noted that editing prompts need to be very specific. Otherwise, you run the risk of Omni over-editing or unintentionally changing elements that you want to keep. Nano Banana users may have run into the same problem.

Despite the near-term focus on consumers, it’s clear that Omni will have an impact on enterprise and creative work, and Google plans to make Omni available via API in the coming weeks. The avatar generator, currently available in Shorts, is a feature that Google hopes content creators will adopt. More broadly, end-to-end multimodal workflows could be transformative for advertisers and filmmakers.
Startup Luma AI is building something similar: an agentic tool that leverages its own “integration” model and can generate entire advertising campaigns from short summaries and product images.
“We are actually very proud of the text rendering capabilities of this model, which is very useful for things like advertising,” Brichtova says. “If you want it to be a product somewhere, or even just a slogan, it needs to be accurate. We definitely expect filmmakers and other types of creators to use this model as well.”
More professional use cases may be better served by the Omni Pro model, which should provide better performance across all Omni tasks. Google hasn’t yet said when it will release a Pro version, but Brichtova said it will arrive when “we feel like we’re at a point where it’s going to be even more gradual than Flash.”
