CareClarity — Claude Code orchestrating the Gemini stack

ArchitectureClaude is the conductor — not the orchestra.

Claude Code didn't generate the videos. It wrote the code that calls Gemini, discovered which models you had access to, babysat 25 minutes of background Veo jobs, and recovered from two safety-filter rejections automatically.

Claude Code Opus 4.7 · 1M context

Conductor. Decided which models to call, designed every prompt, wrote 5 Python modules + FastAPI service, probed Vertex availability via curl, opened APIs via Service Usage, monitored long-running renders via background Bash, recovered from failures by editing code on the fly.

↓ writes code · runs Bash · monitors background jobs

Python orchestrator 5 modules, ~400 LoC

orchestrate.py · scene_planner.py · veo_client.py · tts_client.py · ffmpeg_stitch.py
Calls Vertex AI through google-genai SDK. Loops scenes, retries on safety rejection, skips already-rendered scenes for resume.

↓ REST / gRPC on Vertex AI

Google Vertex AI 4 models on qr33-vertex-247397

Gemini 2.5 Pro → storyboard JSON · Veo 3 → image-to-video per scene · Imagen 4 → avatar reference frames · Chirp 3 HD → distinct doctor/patient voiceover

PipelineFour steps. Every prompt Claude designed.

Same pipeline runs all three style variants. The only thing that changes between variants is a one-sentence style suffix appended to every Veo prompt — plus the reference image fed in as the first frame.

Plan the storyboard

gemini-2.5-pro · response_mime_type=application/json

Discharge note in. Structured 10-scene JSON out. The system instruction forces a 2:1 patient-demonstration-to-doctor ratio and bans invented dosages.

System instruction (excerpt)

"""You are a medical patient-education film director. Given a hospital
discharge note, produce a short film (8-12 scenes, each one continuous
8-second shot) that VISUALLY DEMONSTRATES every actionable instruction.

CORE PRINCIPLE: This is a demonstration film, not a lecture. Aim for
2 PATIENT demonstration scenes for every 1 DOCTOR scene.

Structure:
- Scene 1 (doctor): warm intro naming the diagnosis.
- Middle scenes: every concrete instruction gets its OWN dedicated
  patient demonstration scene. Break multi-step actions into multiple
  scenes (crutch walking vs. crutch stairs are separate).
- Final scene (doctor): gently remind the patient about red flags.
  Keep this scene calm and reassuring, NOT dramatic.

Visual prompt rules (these become Veo prompts — make them VIVID):
- One continuous shot, no cuts, no on-screen text.
- Patient scenes: describe the EXACT physical motion — body position,
  hand placement, the object being manipulated, the camera angle.
- 8 seconds is short — pick ONE clear motion per scene.

Never invent dosages, drug names, or actions not in the discharge note.
Return strict JSON matching the schema."""

Output schema (forced structured JSON)

{
  "title": string,
  "scenes": [{
    "character": "doctor" | "patient",
    "visual_prompt": string,
    "narration":    string,
    "caption":      string
  }]
}

Render each scene as 8-second video

veo-3.0-generate-001 · image-to-video · us-central1

Each scene's visual_prompt + the style suffix is sent to Veo with the matching character image as the first frame. The image anchors face / style consistency across all 10 scenes.

Veo call (Python)

# scripts/veo_client.py
op = client.models.generate_videos(
    model="veo-3.0-generate-001",
    prompt=scene.visual_prompt + STYLE_SUFFIX,    # style chosen per run
    image=types.Image.from_file(reference_image), # doctor.jpg OR patient.jpg
    config=types.GenerateVideosConfig(
        aspect_ratio="16:9",
        duration_seconds=8,
        number_of_videos=1,
        person_generation="allow_adult",
    ),
)
# poll the long-running op every 15s up to 600s

The three style suffixes (the only difference between variants)

realistic: "photorealistic cinematic shot, soft natural lighting,
            warm and calm tone, single continuous 8-second take"

animation: "Pixar-style 3D animated film, smooth cel-shaded
            rendering, expressive friendly character animation,
            vibrant but gentle color palette"

anime:     "hand-drawn 2D Japanese anime illustration, soft
            watercolor backgrounds, Studio Ghibli short film tone"

Synthesize narration

Cloud TTS · Chirp 3 HD voices

The narration field from each scene gets routed to one of two voices based on who is speaking. Each ≈20-word line lands in roughly 8 seconds at speaking_rate=0.95, matching the clip duration.

# scripts/tts_client.py
DOCTOR_VOICE  = "en-US-Chirp3-HD-Charon"   # warm male
PATIENT_VOICE = "en-US-Chirp3-HD-Aoede"    # calm female

response = client.synthesize_speech(
    input=tts.SynthesisInput(text=scene.narration),
    voice=tts.VoiceSelectionParams(
        language_code="en-US",
        name=DOCTOR_VOICE if scene.character == "doctor" else PATIENT_VOICE,
    ),
    audio_config=tts.AudioConfig(
        audio_encoding=tts.AudioEncoding.MP3,
        speaking_rate=0.95,
    ),
)

Composite + caption + concatenate

Pillow · ffmpeg (no drawtext / freetype dependency)

Pillow renders each caption as a transparent PNG strip (Homebrew's ffmpeg ships without freetype, so drawtext isn't available). ffmpeg overlays the strip, muxes the MP3, then concatenates all 10 scenes into the final MP4.

# scripts/ffmpeg_stitch.py
caption_png = _render_caption_png(scene.caption)  # Pillow → 1280×96 PNG

ffmpeg -i raw_video.mp4 -i narration.mp3 -i caption.png \
  -filter_complex "[0:v]scale=1280:720,pad=...[v0];
                   [v0][2:v]overlay=0:574[outv]" \
  -map [outv] -map 1:a \
  -c:v libx264 -c:a aac -shortest scene_NN.mp4

ffmpeg -f concat -i scenes.txt -c:v libx264 -c:a aac final.mp4

What Claude Code actually didThe orchestration moves you didn't see.

You said "go". Claude said yes once, and handled everything underneath.

🧭

Decided the model lineup

Probed your project's Vertex catalog with curl to find the strongest model available — gemini-3-pro-preview returned 404, fell back to gemini-2.5-pro. Did the same for Veo (3.1 not enabled → used 3.0-generate-001).

🔌

Enabled APIs you didn't have on

Caught a PERMISSION_DENIED on Cloud TTS, called Service Usage API with your ADC token to enable it, retried. Recorded the action in memory for next session.

🛠️

Wrote the entire pipeline

5 Python modules + FastAPI service + Dockerfile + dry-run mode + Imagen avatar generator, all from a one-line user request. Pillow caption overlay because Homebrew ffmpeg ships without freetype.

🌗

Babysat 25 min of background renders

Two long-running Veo jobs running concurrently. Claude responded to other user messages while each batch ran, then auto-resumed when the system fired the completion notification.

🛡️

Recovered from safety filter trips

Twice. First by editing the storyboard to soften the closer ("call 911" → "call your care team"). Second by adding retry-with-softer-prompt logic to veo_client.py so the run continues past intermittent rejections.

💾

Built durable memory

Saved your Vertex project ID, model availability matrix, and the team rename to CareClarity into ~/.claude/.../memory/ so the next session starts pre-loaded with this context.

Failure & recoveryThe two times Claude saved a render.

Veo's safety filter is conservative on medical-emergency framing. Both films would have failed without on-the-fly fixes.

🟠 Realistic run · Scene 10 — "Call 911 for chest pain"

Veo's long-running operation completed with generated_videos = None. Claude detected the empty response, traced the cause to the emergency-medicine framing in the narration, rewrote scene 10 to a calmer "If anything feels off, call your care team", and concatenated the 9 successful scenes into a deliverable instead of crashing.

Lesson encoded into the system: the Gemini system prompt now explicitly tells the director to keep the closer "calm and reassuring, NOT dramatic — avoid graphic medical language."

🟠 Anime run · Scene 4 — Patient crutch demo

A normally-benign patient-demonstration scene tripped Veo's filter on the anime run only. Claude added VeoSafetyRejection detection to veo_client.py with one automatic retry using the prefix "A calm, friendly, family-safe educational scene." — and a --skip-existing resume mode so re-runs don't redo successful scenes.

Outcome: The anime film completed all 10 scenes on the resume run, in 7.8 minutes, costing only the 7 scenes that still needed rendering.

Repo layoutEverything Claude built lives here.

discharge_film/
├── samples/
│   ├── discharge_note_anklefracture.md   # 4 KB source note
│   ├── doctor.jpg, patient.jpg            # real photos
│   ├── doctor_avatar.png, patient_avatar.png    # Imagen Pixar refs
│   └── doctor_anime.jpg, patient_anime.png      # anime refs
├── scripts/
│   ├── orchestrate.py                     # CLI entry, --style + --storyboard resume
│   ├── scene_planner.py                   # Gemini 2.5 Pro → JSON storyboard
│   ├── veo_client.py                      # Veo 3 + safety-retry
│   ├── tts_client.py                      # Chirp 3 HD
│   ├── ffmpeg_stitch.py                   # Pillow captions + concat
│   ├── generate_reference_images.py       # Imagen 4 realistic refs
│   └── generate_avatar_refs.py            # Imagen 4 Pixar avatars
├── app/
│   ├── main.py                            # FastAPI POST /generate (multipart)
│   └── Dockerfile                         # Cloud Run image (ffmpeg pre-installed)
├── output/
│   ├── home_care_for_your_ankle_fracture/film.mp4   # Realistic
│   ├── home_care_animated/film.mp4                  # Pixar
│   └── home_care_anime/film.mp4                     # Anime
└── upload_bundle/                         # Renamed assets ready for Gemini App Builder

How Claude Code orchestrated the Gemini stack to turn a discharge note into a film.

The deliverableTwo animated films from one discharge note + two photos.

3D animated avatars

Studio Ghibli watercolor