Show HN: Gemini can now natively embed video, so I built sub-second video search
Google’s Gemini API has taken a decisive step toward truly multimodal AI with the public preview of Gemini‑Embedding‑2, a model that can embed text, images, audio, PDFs and, for the first time, raw video into a single vector space. The announcement sparked a “Show HN” post on Hacker News where developer Mikael Svensson demonstrated a prototype that indexes a 30‑minute YouTube clip and returns relevant moments in under a second.
The breakthrough lies in Gemini’s native video encoder, which processes frames and audio jointly rather than treating video as a sequence of separate image embeddings. By collapsing a clip into a single 768‑dimensional vector, the model enables similarity search over video content without costly frame‑by‑frame indexing; returning results at second‑level granularity implies the clip is embedded in short segments, each with its own vector. Svensson’s demo calls the Gemini‑Embedding‑2‑preview endpoint, stores the vectors in a Pinecone index, and runs a cosine‑similarity query that surfaces the second where a spoken phrase or visual cue appears.
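The retrieval step itself is plain cosine similarity over per‑segment vectors. Below is a minimal, self‑contained sketch in NumPy: the 768‑dimension size comes from the article, while the 5‑second segment length, the synthetic vectors, and the `cosine_top_k` helper are illustrative stand‑ins for the real Gemini embedding call and the Pinecone query.

```python
import numpy as np

DIM = 768  # vector size cited for Gemini-Embedding-2 (preview)

def cosine_top_k(query: np.ndarray, index: np.ndarray, k: int = 3):
    """Return (indices, scores) of the k index rows most similar to the query."""
    q = query / np.linalg.norm(query)
    m = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = m @ q                      # cosine similarity against every segment
    top = np.argsort(scores)[::-1][:k]  # highest scores first
    return top, scores[top]

# Stand-in index: one vector per 5-second segment of a 30-minute clip
# (in the demo these would come from the Gemini embedding endpoint).
rng = np.random.default_rng(0)
segment_vectors = rng.normal(size=(360, DIM))
timestamps = [i * 5 for i in range(360)]  # start second of each segment

# Stand-in query: a lightly perturbed copy of segment 42, mimicking a
# query embedding that closely matches one moment in the video.
query_vec = segment_vectors[42] + 0.01 * rng.normal(size=DIM)

hits, scores = cosine_top_k(query_vec, segment_vectors)
print(f"best match starts at {timestamps[hits[0]]}s (score {scores[0]:.3f})")
```

A production version would replace the random matrix with vectors fetched from Pinecone and the brute‑force dot product with the index’s approximate nearest‑neighbour query, but the ranking logic is the same.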
The significance is twofold. First, it lowers the barrier for developers building searchable video archives, a capability long limited to large tech firms with bespoke pipelines. Second, it sharpens Google’s competitive edge over OpenAI’s multimodal embeddings and Anthropic’s Claude, both of which still rely on separate image or audio models. For Nordic media firms, e‑learning platforms, and surveillance providers, sub‑second video retrieval could translate into faster content moderation, richer recommendation engines, and new revenue from searchable video libraries.
What to watch next: Google’s rollout schedule for the full‑scale Gemini‑Embedding‑2 service, pricing details, and integration with Vertex AI pipelines. Industry observers will also be watching how quickly third‑party tools adopt the model for real‑time video analytics, and whether competitors ship comparable native video embeddings before the end of the year.