Google just shipped music generation in Gemini via Lyria 3. You can prompt it with text, images, or video and it composes a 30-second track. I wanted to test how well it scores moody, cinematic scenes.
The Videos
Generated with Kling 3.0 (Kuaishou) via fal.ai. Three scenes: a slow push through an empty 70s house, a Severance-style corporate corridor with a lone figure, and a woman crying in blue TV light. Kling is pricey — about $3.36 for 10 seconds with audio — but the cinematic quality is genuinely impressive. It also generates its own ambient audio.
The Music
Fed each scene to Gemini on the web to generate scores. Key finding: video input didn’t seem to influence the output much. What worked was image + text prompt: extracting a center frame from each video and pairing it with a detailed mood/style prompt. Thinking mode produced noticeably better results.
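For reference, pulling a center frame is a one-line ffmpeg job. The sketch below is not the exact command from the session; it is a minimal reconstruction that assumes the clip length is already known (10 seconds here), and the function name and file names are hypothetical.

```python
import shlex


def center_frame_cmd(video: str, duration_s: float, out_image: str) -> list[str]:
    """Build an ffmpeg command that saves the frame at the clip's midpoint.

    Hypothetical helper: duration_s is passed in rather than probed.
    """
    midpoint = duration_s / 2
    return [
        "ffmpeg", "-y",
        "-ss", f"{midpoint:.3f}",  # seek to the middle before decoding
        "-i", video,
        "-frames:v", "1",          # write exactly one video frame
        out_image,
    ]


# Print the command for a 10-second clip (placeholder file names).
print(shlex.join(center_frame_cmd("house.mp4", 10.0, "house_center.png")))
```

Passing the returned list to `subprocess.run(..., check=True)` would run it, assuming ffmpeg is on the PATH.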
Getting Gemini to go sparse took iteration — it defaults to overcomposing. Prompts that explicitly say “no melody, no rhythm, no percussion” and describe silence and space worked best.
The 70s house is my favorite: the score landed exactly on what I had in my head. Eerie, spare, perfectly matched to the slow drift through that wood-paneled hallway. First try.
The Mix
Claude Code handled all the ffmpeg work — extracting frames, mixing Kling’s ambient audio (ducked to 25%) with Gemini’s score, dialing in offsets to find the right section of each 30-second track, and adding fade in/out. The whole mixing workflow was conversational: “offset by 14s,” “too busy, try again,” “2s offset for this one.”
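A mix like that boils down to a single ffmpeg filter graph. The sketch below is a guess at its shape, not the commands Claude Code actually ran. It assumes a 10-second clip, a 1-second fade applied to the score only, and an ffmpeg build recent enough for the `normalize` option on `amix` (without it, `amix` scales the inputs down). The function name, defaults, and file names are hypothetical.

```python
import shlex


def build_mix_cmd(
    video: str,
    score: str,
    out: str,
    clip_s: float = 10.0,        # assumed clip length
    offset_s: float = 14.0,      # where to start inside the 30-second score
    ambient_gain: float = 0.25,  # Kling's ambient audio ducked to 25%
    fade_s: float = 1.0,         # assumed fade length
) -> list[str]:
    """Build an ffmpeg command that lays a faded score over ducked ambient audio."""
    fade_out_start = clip_s - fade_s
    filter_graph = (
        # Duck the video's own ambient track.
        f"[0:a]volume={ambient_gain:g}[amb];"
        # Fade the score in at the start and out at the end of the clip.
        f"[1:a]afade=t=in:st=0:d={fade_s:g},"
        f"afade=t=out:st={fade_out_start:g}:d={fade_s:g}[score];"
        # Sum both tracks; stop when the ambient track ends.
        "[amb][score]amix=inputs=2:duration=first:normalize=0[mix]"
    )
    return [
        "ffmpeg", "-y",
        "-i", video,
        "-ss", f"{offset_s:g}", "-i", score,  # input seek: skip into the score
        "-filter_complex", filter_graph,
        "-map", "0:v", "-map", "[mix]",
        "-c:v", "copy", "-c:a", "aac",        # keep the video stream untouched
        "-shortest",
        out,
    ]


# "offset by 14s" becomes a one-argument change (placeholder file names).
print(shlex.join(build_mix_cmd("house.mp4", "house_score.mp3", "house_final.mp4")))
```

Each conversational tweak maps to one parameter: "2s offset for this one" is `offset_s=2`, and a quieter ambient bed is a smaller `ambient_gain`.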
The Result
Three short AI films, zero manual audio editing, made with a video model, a music model, and an AI coding assistant as the mixing board.
Jason Peterson