From Prototype to Production: Building a Multimodal Video Search Engine

In my last post, I wrote about the unreasonable effectiveness of model stacking for media search—combining CLIP, Whisper, and ArcFace to find video content through visual descriptions, dialog, and faces. Over the holidays I expanded that afternoon hack into something more production-like.

Live demo: fennec.jasongpeterson.com
Starter code: github.com/JasonMakes801/fennec-search
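
Roughly, indexing means running each clip through the three models and storing the resulting embeddings; the sketch below shows one way that could look. The specific model variants (ViT-B-32 CLIP, Whisper "base", InsightFace's "buffalo_l" ArcFace models) and the one-keyframe-per-shot simplification are my stand-ins, not necessarily what fennec-search actually does.

```python
# Rough indexing sketch: one record per shot, combining all three signals.
# Model choices and the single-keyframe simplification are assumptions for
# illustration, not the repo's actual pipeline.
import cv2
import torch
import whisper
import open_clip
from PIL import Image
from insightface.app import FaceAnalysis

clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
whisper_model = whisper.load_model("base")
face_app = FaceAnalysis(name="buffalo_l")
face_app.prepare(ctx_id=0)

def index_shot(video_path: str) -> dict:
    # Visual: CLIP image embedding of a representative frame
    cap = cv2.VideoCapture(video_path)
    ok, frame_bgr = cap.read()
    cap.release()
    frame_rgb = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    with torch.no_grad():
        visual_vec = clip_model.encode_image(preprocess(frame_rgb).unsqueeze(0))[0]

    # Dialog: Whisper transcript of the shot's audio track
    transcript = whisper_model.transcribe(video_path)["text"]

    # Faces: one ArcFace embedding per detected face in the keyframe
    face_vecs = [f.normed_embedding for f in face_app.get(frame_bgr)]

    return {
        "path": video_path,
        "clip_embedding": visual_vec.tolist(),
        "transcript": transcript,
        "face_embeddings": [v.tolist() for v in face_vecs],
    }
```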

Try This

  1. Go to fennec.jasongpeterson.com (desktop browser)
  2. Enter “older man on phone, harbor background” in Visual Content → click +
  3. Click the face of the older guy with glasses sitting with the harbor at his back
  4. Enter “the Americans had launched their missiles” in Dialog (Semantic mode) → click +
  5. Play the clip

You’ve drilled down to an exact shot without metadata, timecodes, or remembering exact words. The semantic search is fuzzy—he actually says “What it was telling him was that the US had launched their ICBMs,” but that’s close enough.
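
That fuzziness is just embedding similarity: the query and each transcript line map to vectors, and nearby vectors count as a match. Here's a minimal sketch of the idea using an off-the-shelf sentence embedder as a stand-in; I don't know which embedder the demo actually uses.

```python
# Illustrate why the fuzzy dialog match works: embed the query and the actual
# line, then compare with cosine similarity. The model choice (all-MiniLM-L6-v2)
# is a stand-in, not necessarily what fennec-search uses.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "the Americans had launched their missiles"
spoken = "What it was telling him was that the US had launched their ICBMs"
unrelated = "She ordered a coffee and sat by the window"

q, s, u = model.encode([query, spoken, unrelated])
print(util.cos_sim(q, s).item())  # high similarity despite different wording
print(util.cos_sim(q, u).item())  # much lower for an unrelated line
```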

[Screenshot: search result showing the scene]

What’s Under the Hood

The Postgres + pgvector setup turned out cleaner than expected—vector similarity combined with metadata filtering in a single query just works.
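
For a sense of what that single query can look like, here's a sketch using psycopg and pgvector's cosine-distance operator; the schema (a shots table with clip_embedding and show columns) and connection string are placeholders, not the demo's actual setup.

```python
# Sketch of one query that mixes pgvector similarity with an ordinary
# metadata filter. Table and column names are placeholders for illustration.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def search_shots(query_embedding, show: str, limit: int = 10):
    with psycopg.connect("dbname=fennec") as conn:  # placeholder connection string
        register_vector(conn)  # lets psycopg pass numpy arrays as pgvector values
        rows = conn.execute(
            """
            SELECT id, video_path, start_time,
                   clip_embedding <=> %s AS distance  -- cosine distance
            FROM shots
            WHERE show = %s                           -- plain metadata filter
            ORDER BY distance
            LIMIT %s
            """,
            (np.asarray(query_embedding, dtype=np.float32), show, limit),
        ).fetchall()
    return rows
```

Because everything lives in Postgres, there's no separate vector store to keep in sync with the metadata.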


Demo footage from Pioneer One, a Creative Commons-licensed Canadian drama. Built with significant help from Claude Code.


