Three films, three styles, one source document. Built end-to-end by Claude Code calling Gemini 2.5 Pro, Veo 3, Imagen 4, and Chirp 3 HD on Vertex AI — with no human in the rendering loop after "go".
Source: a 4 KB ankle-fracture discharge note for Jane Public, a doctor photo, and a patient photo. Claude planned the storyboard, then had Veo 3 render each scene image-to-video from the matching stylized character reference.
Claude Code didn't generate the videos. It wrote the code that calls Gemini, discovered which models you had access to, babysat 25 minutes of background Veo jobs, and recovered from two safety-filter rejections automatically.
orchestrate.py · scene_planner.py · veo_client.py · tts_client.py · ffmpeg_stitch.py
Uses the google-genai SDK. Loops over scenes, retries on safety rejection, and skips already-rendered scenes on resume.
qr33-vertex-247397
The same pipeline runs all three style variants. The only things that change between variants are a one-sentence style suffix appended to every Veo prompt and the reference image fed in as the first frame.
Discharge note in, structured 10-scene JSON out. The system instruction asks for a 2:1 patient-demonstration-to-doctor ratio and bans invented dosages.
"""You are a medical patient-education film director. Given a hospital
discharge note, produce a short film (8-12 scenes, each one continuous
8-second shot) that VISUALLY DEMONSTRATES every actionable instruction.
CORE PRINCIPLE: This is a demonstration film, not a lecture. Aim for
2 PATIENT demonstration scenes for every 1 DOCTOR scene.
Structure:
- Scene 1 (doctor): warm intro naming the diagnosis.
- Middle scenes: every concrete instruction gets its OWN dedicated
patient demonstration scene. Break multi-step actions into multiple
scenes (crutch walking vs. crutch stairs are separate).
- Final scene (doctor): gently remind the patient about red flags.
Keep this scene calm and reassuring, NOT dramatic.
Visual prompt rules (these become Veo prompts — make them VIVID):
- One continuous shot, no cuts, no on-screen text.
- Patient scenes: describe the EXACT physical motion — body position,
hand placement, the object being manipulated, the camera angle.
- 8 seconds is short — pick ONE clear motion per scene.
Never invent dosages, drug names, or actions not in the discharge note.
Return strict JSON matching the schema."""
{
  "title": string,
  "scenes": [{
    "character": "doctor" | "patient",
    "visual_prompt": string,
    "narration": string,
    "caption": string
  }]
}
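Because the planner's output feeds every later stage, it may help to check the JSON against the schema above before spending Veo time on it. The following is a minimal sketch, not the project's actual scene_planner.py: the `Scene` dataclass, `parse_storyboard`, and `patient_to_doctor_ratio` are hypothetical names introduced here for illustration.

```python
import json
from dataclasses import dataclass

ALLOWED_CHARACTERS = {"doctor", "patient"}
REQUIRED_FIELDS = ("character", "visual_prompt", "narration", "caption")


@dataclass(frozen=True)
class Scene:
    # Mirrors one entry of the "scenes" array in the schema above.
    character: str
    visual_prompt: str
    narration: str
    caption: str


def parse_storyboard(raw_json: str) -> tuple[str, list[Scene]]:
    """Parse the planner's JSON and reject anything outside the schema."""
    data = json.loads(raw_json)
    title = data["title"]
    if not isinstance(title, str):
        raise ValueError("title must be a string")
    scenes = []
    for index, item in enumerate(data["scenes"], start=1):
        for field in REQUIRED_FIELDS:
            if not isinstance(item.get(field), str):
                raise ValueError(f"scene {index}: {field!r} must be a string")
        if item["character"] not in ALLOWED_CHARACTERS:
            raise ValueError(
                f"scene {index}: unknown character {item['character']!r}"
            )
        scenes.append(Scene(**{f: item[f] for f in REQUIRED_FIELDS}))
    return title, scenes


def patient_to_doctor_ratio(scenes: list[Scene]) -> float:
    """Patient scenes per doctor scene; the system prompt aims for 2."""
    doctors = sum(1 for s in scenes if s.character == "doctor")
    patients = len(scenes) - doctors
    return patients / doctors if doctors else float("inf")
```

A caller could log the ratio rather than enforce it, since the system prompt only asks the model to aim for 2:1.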
Each scene's visual_prompt + the style suffix is sent to Veo with the matching character image as the first frame. The image anchors face / style consistency across all 10 scenes.
# scripts/veo_client.py
op = client.models.generate_videos(
    model="veo-3.0-generate-001",
    prompt=scene.visual_prompt + STYLE_SUFFIX,      # style chosen per run
    image=types.Image.from_file(reference_image),   # doctor.jpg OR patient.jpg
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=8,
        number_of_videos=1,
        person_generation="allow_adult",
    ),
)
# poll the long-running op every 15s up to 600s
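The closing comment in the snippet above (poll every 15 s, give up after 600 s) can be sketched as a small helper. This is an illustration under stated assumptions rather than the code in veo_client.py: `wait_for_operation` and `VeoTimeout` are hypothetical names, and the `refresh` callable is injected so the loop can be exercised without a network call.

```python
import time
from typing import Any, Callable


class VeoTimeout(RuntimeError):
    """Hypothetical error raised when a job outlives the polling budget."""


def wait_for_operation(
    op: Any,
    refresh: Callable[[Any], Any],
    interval_s: float = 15.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll until op.done is true or the time budget is spent."""
    waited = 0.0
    while not op.done:
        if waited >= timeout_s:
            raise VeoTimeout(f"operation still running after {timeout_s:.0f}s")
        sleep(interval_s)
        waited += interval_s
        # refresh returns an updated operation object; with the google-genai
        # SDK this would likely be client.operations.get (an assumption here).
        op = refresh(op)
    return op
```

With the defaults this allows at most 40 refreshes per scene. Passing `sleep` as a parameter keeps the helper testable with a fake clock.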
realistic: "photorealistic cinematic shot, soft natural lighting, warm and calm tone, single continuous 8-second take"
animation: "Pixar-style 3D animated film, smooth cel-shaded rendering, expressive friendly character animation, vibrant but gentle color palette"
anime: "hand-drawn 2D Japanese anime illustration, soft watercolor backgrounds, Studio Ghibli short film tone"
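Since a variant is fully described by its suffix and its two reference images, the per-style configuration could live in one lookup table. The sketch below is one possible arrangement, not the project's own code: `StyleVariant` and `build_veo_request` are hypothetical, the pairing of reference files with styles is inferred from the file names in the project tree, and the space inserted between prompt and suffix is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StyleVariant:
    suffix: str
    doctor_ref: str
    patient_ref: str


# Suffixes are quoted from the text above; file pairings are an assumption.
STYLES = {
    "realistic": StyleVariant(
        suffix=("photorealistic cinematic shot, soft natural lighting, "
                "warm and calm tone, single continuous 8-second take"),
        doctor_ref="samples/doctor.jpg",
        patient_ref="samples/patient.jpg",
    ),
    "animation": StyleVariant(
        suffix=("Pixar-style 3D animated film, smooth cel-shaded rendering, "
                "expressive friendly character animation, "
                "vibrant but gentle color palette"),
        doctor_ref="samples/doctor_avatar.png",
        patient_ref="samples/patient_avatar.png",
    ),
    "anime": StyleVariant(
        suffix=("hand-drawn 2D Japanese anime illustration, "
                "soft watercolor backgrounds, Studio Ghibli short film tone"),
        doctor_ref="samples/doctor_anime.jpg",
        patient_ref="samples/patient_anime.png",
    ),
}


def build_veo_request(style: str, character: str, visual_prompt: str) -> tuple[str, str]:
    """Return (full prompt, reference image path) for one scene."""
    variant = STYLES[style]
    reference = variant.doctor_ref if character == "doctor" else variant.patient_ref
    # Joining with a single space is an assumption made for readability.
    return f"{visual_prompt.rstrip()} {variant.suffix}", reference
```

Keeping both the suffix and the image paths in one record means a new style is a single dictionary entry rather than a code change.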
The narration field from each scene gets routed to one of two voices based on who is speaking. Each ≈20-word line lands in roughly 8 seconds at speaking_rate=0.95, matching the clip duration.
# scripts/tts_client.py
DOCTOR_VOICE = "en-US-Chirp3-HD-Charon"   # warm male
PATIENT_VOICE = "en-US-Chirp3-HD-Aoede"   # calm female

response = client.synthesize_speech(
    input=tts.SynthesisInput(text=scene.narration),
    voice=tts.VoiceSelectionParams(
        language_code="en-US",
        name=DOCTOR_VOICE if scene.character == "doctor" else PATIENT_VOICE,
    ),
    audio_config=tts.AudioConfig(
        audio_encoding=tts.AudioEncoding.MP3,
        speaking_rate=0.95,
    ),
)
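The timing claim above (about 20 words in about 8 seconds at speaking_rate=0.95) implies a rough words-per-second figure that can be used to flag overlong narration before synthesis. This is a back-of-the-envelope sketch: the base rate is derived from that one claim, not measured from the Chirp 3 HD voices, and `estimated_seconds` and `fits_clip` are hypothetical helpers.

```python
CLIP_SECONDS = 8.0

# Back-computed from "20 words in ~8 s at rate 0.95": an assumption,
# not a measured property of any voice.
BASE_WORDS_PER_SECOND = 20 / (CLIP_SECONDS * 0.95)


def estimated_seconds(narration: str, speaking_rate: float = 0.95) -> float:
    """Estimate spoken duration from a whitespace word count."""
    words = len(narration.split())
    return words / (BASE_WORDS_PER_SECOND * speaking_rate)


def fits_clip(
    narration: str,
    speaking_rate: float = 0.95,
    clip_seconds: float = CLIP_SECONDS,
    tolerance_s: float = 0.5,
) -> bool:
    """True when the estimate stays within the clip length plus a margin."""
    return estimated_seconds(narration, speaking_rate) <= clip_seconds + tolerance_s
```

A check like this matters because the stitch step uses ffmpeg's -shortest flag, so narration that runs past the 8-second clip would be cut off rather than extending the scene.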
Pillow renders each caption as a transparent PNG strip (Homebrew's ffmpeg ships without freetype, so drawtext isn't available). ffmpeg overlays the strip, muxes the MP3, then concatenates all 10 scenes into the final MP4.
# scripts/ffmpeg_stitch.py
caption_png = _render_caption_png(scene.caption)   # Pillow → 1280×96 PNG

# per scene: overlay the caption strip and mux the narration
ffmpeg -i raw_video.mp4 -i narration.mp3 -i caption.png \
  -filter_complex "[0:v]scale=1280:720,pad=...[v0]; [v0][2:v]overlay=0:574[outv]" \
  -map "[outv]" -map 1:a \
  -c:v libx264 -c:a aac -shortest scene_NN.mp4

# once: concatenate every scene into the final film
ffmpeg -f concat -i scenes.txt -c:v libx264 -c:a aac final.mp4
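The final command reads scenes.txt, which ffmpeg's concat demuxer expects to contain one `file '...'` directive per line. A minimal sketch of writing that list follows; `write_concat_list` is a hypothetical helper, not necessarily how ffmpeg_stitch.py does it.

```python
from pathlib import Path


def write_concat_list(scene_files: list[str], list_path: str) -> str:
    """Write an ffmpeg concat-demuxer list and return its text."""
    lines = []
    for name in scene_files:
        # Inside single quotes, the concat format escapes a quote as '\''.
        escaped = name.replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    text = "\n".join(lines) + "\n"
    Path(list_path).write_text(text, encoding="utf-8")
    return text
```

Relative paths in the list are resolved against the list file's own directory. Absolute paths generally require adding `-safe 0` to the ffmpeg command.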
You said "go". Claude said yes once, and handled everything underneath.
Probed your project's Vertex catalog with curl to find the strongest model available — gemini-3-pro-preview returned 404, fell back to gemini-2.5-pro. Did the same for Veo (3.1 not enabled → used 3.0-generate-001).
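The probe-and-fall-back behaviour described above amounts to walking a preference list and taking the first model the project can reach. The sketch below shows only that selection logic; `pick_first_available` and the `is_available` callback are hypothetical, and the real run used curl against the Vertex catalog rather than a Python function.

```python
from typing import Callable, Iterable, Optional


def pick_first_available(
    candidates: Iterable[str],
    is_available: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate the probe accepts, or None if none do.

    candidates should be ordered strongest-first; is_available stands in
    for whatever check the caller uses (for example, an HTTP probe that
    treats a 404 as "not enabled").
    """
    for model_id in candidates:
        if is_available(model_id):
            return model_id
    return None


if __name__ == "__main__":
    # Stubbed availability set standing in for the project's catalog.
    enabled = {"gemini-2.5-pro", "veo-3.0-generate-001"}
    chosen = pick_first_available(
        ["gemini-3-pro-preview", "gemini-2.5-pro"], enabled.__contains__
    )
    print(chosen)
```

Returning None instead of raising lets the caller decide whether a missing model is fatal or should trigger a request to enable it.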
Caught a PERMISSION_DENIED on Cloud TTS, called Service Usage API with your ADC token to enable it, retried. Recorded the action in memory for next session.
5 Python modules + FastAPI service + Dockerfile + dry-run mode + Imagen avatar generator, all from a one-line user request. Pillow caption overlay because Homebrew ffmpeg ships without freetype.
Two long-running Veo jobs running concurrently. Claude responded to other user messages while each batch ran, then auto-resumed when the system fired the completion notification.
Recovered from Veo safety-filter rejections twice: first by editing the storyboard to soften the closer ("call 911" → "call your care team"), then by adding retry-with-softer-prompt logic to veo_client.py so the run continues past intermittent rejections.
Saved your Vertex project ID, model availability matrix, and the team rename to CareClarity into ~/.claude/.../memory/ so the next session starts pre-loaded with this context.
Veo's safety filter is conservative on medical-emergency framing. Both films would have failed without on-the-fly fixes.
Veo's long-running operation completed with generated_videos = None. Claude detected the empty response, traced the cause to the emergency-medicine framing in the narration, rewrote scene 10 to a calmer "If anything feels off, call your care team", and concatenated the 9 successful scenes into a deliverable instead of crashing.
Lesson encoded into the system: the Gemini system prompt now explicitly tells the director to keep the closer "calm and reassuring, NOT dramatic — avoid graphic medical language."
A normally-benign patient-demonstration scene tripped Veo's filter on the anime run only. Claude added VeoSafetyRejection detection to veo_client.py with one automatic retry using the prefix "A calm, friendly, family-safe educational scene." — and a --skip-existing resume mode so re-runs don't redo successful scenes.
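The detect-and-retry behaviour described above can be sketched in a few lines. The exception name and the prefix text come from the description; everything else is an assumption made for illustration, including treating an empty `generated_videos` as a safety rejection and the hypothetical `render` callable that submits a prompt and returns the finished result.

```python
from typing import Any, Callable

# Prefix quoted from the text above; the trailing space is added here.
SOFT_PREFIX = "A calm, friendly, family-safe educational scene. "


class VeoSafetyRejection(RuntimeError):
    """Raised when a finished job carries no video (assumed to mean a filter hit)."""


def extract_video(result: Any) -> Any:
    """Return the first generated video or raise VeoSafetyRejection."""
    videos = getattr(result, "generated_videos", None)
    if not videos:
        raise VeoSafetyRejection("operation finished without any generated video")
    return videos[0]


def render_with_retry(prompt: str, render: Callable[[str], Any]) -> Any:
    """Try the prompt as written, then once more with the softer prefix."""
    try:
        return extract_video(render(prompt))
    except VeoSafetyRejection:
        # One automatic retry only; a second rejection propagates to the caller.
        return extract_video(render(SOFT_PREFIX + prompt))
```

Letting the second rejection propagate leaves the orchestrator free to skip the scene and still stitch the ones that succeeded.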
Outcome: The anime film completed all 10 scenes on the resume run, in 7.8 minutes, costing only the 7 scenes that still needed rendering.
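A resume mode like the one behind that outcome only needs to know which scene files already exist. The following sketch assumes the `scene_NN.mp4` naming used in the stitch step and treats any non-empty file as a finished scene; `scenes_to_render` is a hypothetical helper.

```python
from pathlib import Path


def scenes_to_render(scene_count: int, out_dir: str) -> list[int]:
    """Return the 1-based scene numbers that still lack a usable clip."""
    pending = []
    for number in range(1, scene_count + 1):
        clip = Path(out_dir) / f"scene_{number:02d}.mp4"
        # A zero-byte file is treated as a failed render, not a finished one.
        if not (clip.is_file() and clip.stat().st_size > 0):
            pending.append(number)
    return pending
```

For a 10-scene film with three clips already on disk, this returns the seven remaining scene numbers, so only those are sent back to Veo.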
discharge_film/
├── samples/
│   ├── discharge_note_anklefracture.md         # 4 KB source note
│   ├── doctor.jpg, patient.jpg                 # real photos
│   ├── doctor_avatar.png, patient_avatar.png   # Imagen Pixar refs
│   └── doctor_anime.jpg, patient_anime.png     # anime refs
├── scripts/
│   ├── orchestrate.py                # CLI entry, --style + --storyboard resume
│   ├── scene_planner.py              # Gemini 2.5 Pro → JSON storyboard
│   ├── veo_client.py                 # Veo 3 + safety-retry
│   ├── tts_client.py                 # Chirp 3 HD
│   ├── ffmpeg_stitch.py              # Pillow captions + concat
│   ├── generate_reference_images.py  # Imagen 4 realistic refs
│   └── generate_avatar_refs.py       # Imagen 4 Pixar avatars
├── app/
│   ├── main.py                       # FastAPI POST /generate (multipart)
│   └── Dockerfile                    # Cloud Run image (ffmpeg pre-installed)
├── output/
│   ├── home_care_for_your_ankle_fracture/film.mp4   # Realistic
│   ├── home_care_animated/film.mp4                  # Pixar
│   └── home_care_anime/film.mp4                     # Anime
└── upload_bundle/                    # Renamed assets ready for Gemini App Builder