Visual phase optimization — post-mortem

What we built, what it actually did.

Three optimizations landed: parallelize OCR + VLM, downscale VLM input to 768 px, and drive VLM sampling from scene cuts instead of fps=1. Here's what the numbers say on a real 7:18 video, and where the next minute of wall time is still hiding.

Visual phase wall time: ~10–15 min → 65 seconds (~12–14× faster)
Update 2026-05-26 · Round 2: Optimizations (D) and (E) have shipped. Combined with the original A+B+C, the full pipeline now runs in ~2–2.5 min end-to-end on the same 7:18 video, down from ~25–30 min before any of this. See the worker concurrency deep-dive for how (D) actually works.
How the Visual phase runs now

Two independent sample passes, in parallel.

analyze_visual no longer walks the video once at fps=1. It samples twice — densely for OCR, sparsely for the VLM — and runs both Python subprocesses inside one Promise.all. The phase finishes when the slower of the two (the VLM) returns.

Source: original.mp4 (438 s · 1080p)
OCR pass (dense): ffmpeg @ 0.5 fps · full res · ~219 frames → frames_ocr/ · ml/ocr.py (Apple Vision, accurate) · ≈59 s wall
VLM pass (sparse): scene cuts · downscale 768 px · 18 keyframes → frames_vlm/ · ml/describe_frames.py (mlx-vlm, Qwen2.5-VL-7B) · ≈65 s wall ← long pole
Promise.all → merge → frame_index + provenance

The two passes never block each other — distinct ffmpeg sample dirs, distinct output JSONs. Wall time = max(OCR, VLM), not their sum. After the merge writes frame_index to Postgres, Option A cleanup deletes frames_ocr/, frames_vlm/, ocr.json and frame_descriptions.json (see the storage-lifecycle note below).
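A minimal sketch of the max(OCR, VLM) behaviour, assuming each pass is wrapped in an async function. The names timedPass and analyzeVisualSketch are hypothetical, and sleep stands in for the real ml/ocr.py and ml/describe_frames.py subprocesses.

```typescript
type PassResult = { name: string; wallMs: number };

// Stand-in for subprocess work; the real passes would spawn Python.
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs one pass and records how long it took.
async function timedPass(
  name: string,
  work: () => Promise<void>,
): Promise<PassResult> {
  const start = Date.now();
  await work();
  return { name, wallMs: Date.now() - start };
}

async function analyzeVisualSketch(
  ocrWork: () => Promise<void>,
  vlmWork: () => Promise<void>,
): Promise<{ passes: PassResult[]; phaseWallMs: number }> {
  const start = Date.now();
  // Both passes start immediately; Promise.all resolves when the slower one ends.
  const passes = await Promise.all([
    timedPass("ocr", ocrWork),
    timedPass("vlm", vlmWork),
  ]);
  return { passes, phaseWallMs: Date.now() - start };
}
```

Calling analyzeVisualSketch(() => sleep(60), () => sleep(100)) should report a phase wall close to 100 ms rather than 160 ms, which is the same shape as 59 s and 65 s giving ~65 s.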

Dense vs sparse: where each pass samples

OCR ticks every 2 s (0.5 fps) because on-screen text can change anywhere. The VLM only fires at editorial beats: the scene cuts that pickVlmTimestamps() reads off packages.intelligence.scene_cuts, plus an intro frame, an outro frame, and a gap-filler wherever a static stretch would otherwise run longer than 30 s.
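The real pickVlmTimestamps() is not reproduced in this write-up, so the following is only a sketch of the selection rule under stated assumptions: the outro frame is taken 1 s before the end, and gap-fillers are spaced evenly so that no stretch between keyframes exceeds 30 s. The function name and parameters are hypothetical.

```typescript
// Sketch only: not the actual pickVlmTimestamps() implementation.
function pickVlmTimestampsSketch(
  sceneCutsSec: number[],
  durationSec: number,
  maxGapSec = 30,
): number[] {
  // Assumption: sample the outro 1 s before the end of the video.
  const outro = Math.max(0, durationSec - 1);

  // Intro, every in-range scene cut, and outro, deduplicated and sorted.
  const anchors = Array.from(
    new Set([0, ...sceneCutsSec.filter((t) => t > 0 && t < outro), outro]),
  ).sort((a, b) => a - b);

  const picked: number[] = [];
  for (let i = 0; i < anchors.length; i++) {
    picked.push(anchors[i]);
    if (i === anchors.length - 1) break;
    const gap = anchors[i + 1] - anchors[i];
    // Evenly spaced fillers so that no sub-gap is longer than maxGapSec.
    const fillers = Math.max(0, Math.ceil(gap / maxGapSec) - 1);
    for (let k = 1; k <= fillers; k++) {
      picked.push(anchors[i] + (gap * k) / (fillers + 1));
    }
  }
  return picked;
}
```

For example, cuts at 10 s and 50 s in a 100 s clip give keyframes at 0, 10, 30, 50, 74.5 and 99 s: two cuts, intro, outro, and one filler in each stretch longer than 30 s.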

[Timeline figure, 0 s to 438 s: dense OCR ticks on one track; sparse VLM keyframes on the other, marked intro, cut, cut, +30 s gap, cut, cut, outro.]

~219 OCR ticks vs 18 VLM keyframes on the same 438 s clip — a ~12× gap in frame count. Because each VLM frame costs ~3.6 s and each OCR frame is cheap, collapsing the VLM frame count is what produced the 10–15× solo gain from change (C).

Frame downscale: what the VLM actually sees

OCR input (full res): 1920 × 1080 · ~2.07 MP · Apple Vision needs every pixel
VLM input (768 px): 768 × 432 · ~0.33 MP · long axis capped at 768
Pixels fed to the VLM: ~16% of full-res area

~6× fewer pixels → fewer vision tokens → ~2–4× faster per VLM frame (change B). OCR keeps full resolution on its own pass; only the VLM frames are shrunk.

Two ffmpeg sample passes write to two dirs at two resolutions. The VLM doesn't need 1080p to describe a scene; OCR does need it to read small lower-thirds. Splitting the resolution per-consumer is change (B).
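The 768 × 432 and ~16% figures follow from capping the long axis, which a short sketch can check. capLongAxis is a hypothetical helper, and rounding to even dimensions is an assumption, not something the pipeline is confirmed to do.

```typescript
// Scale so the longer side is at most `cap`, preserving aspect ratio.
function capLongAxis(
  width: number,
  height: number,
  cap = 768,
): { width: number; height: number } {
  const scale = Math.min(1, cap / Math.max(width, height)); // never upscale
  // Assumption: round to even pixel counts, which encoders generally prefer.
  const even = (n: number): number => Math.max(2, Math.round((n * scale) / 2) * 2);
  return { width: even(width), height: even(height) };
}

const fullRes = { width: 1920, height: 1080 };
const vlmRes = capLongAxis(fullRes.width, fullRes.height);
// Fraction of the full-resolution pixel area that the VLM receives.
const pixelRatio =
  (vlmRes.width * vlmRes.height) / (fullRes.width * fullRes.height);
```

For the 1080p source this yields 768 × 432 and a pixel ratio of 0.16, i.e. about 6× fewer pixels.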

Measured · package pkg_01KSJ9...

Real numbers from a real run.

Source: 438-second video, standard_audio_visual profile, 15 detected scene cuts. Numbers pulled from frame_index.provenance.

VLM keyframes: 18 (was ~438 at fps=1)
VLM wall time: 65.4 s (was ~10–15 min)
OCR wall time: 59.0 s (unchanged; runs in parallel)
Fewer VLM frames: ~24× (every keyframe is meaningful)
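The derived figures can be recomputed from the four measured values. The field names below are illustrative only and are not the actual frame_index.provenance schema.

```typescript
// Values copied from the measured run; field names are hypothetical.
const measured = {
  durationSec: 438,
  vlmKeyframes: 18,
  vlmWallSec: 65.4,
  ocrWallSec: 59.0,
};

// Frames the VLM would have seen at fps=1.
const framesAtOneFps = measured.durationSec * 1;
// Reduction in VLM frame count (about 24×).
const frameReduction = framesAtOneFps / measured.vlmKeyframes;
// Average VLM cost per keyframe (about 3.6 s).
const secPerVlmFrame = measured.vlmWallSec / measured.vlmKeyframes;
// Phase wall is set by the slower of the two parallel passes.
const visualWallSec = Math.max(measured.ocrWallSec, measured.vlmWallSec);
```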
Visualised

Before vs after, at scale.

Bars are proportional to actual wall time. The "Before" column is the previous run on a similar-sized video (~8 min, fps=1 sampling).

VLM wall time (the dominant cost)

Before: ~750 s (~12 min) · 438 frames @ 1080p
After: 65.4 s

Whole pipeline (end-to-end)

Before: ~25–30 min (~1,650 s)
After: ~3.5–4 min (~230 s)
Net: The Visual phase went from being 80–90% of pipeline wall time to ~25%. The whole pipeline is now ~7× faster end-to-end. A video drop-to-ready cycle that used to be a coffee break is now under 4 minutes.
Prediction vs reality

How close was the forecast?

Metric | Predicted | Measured | Verdict
VLM frames processed | ~30 (target) | 18 | ✓ better than expected
VLM wall time | ~25–40 s | 65.4 s | ⚠ slightly slower per frame (3.6 s, vs ~3 s predicted)
Visual phase wall | ~30–45 s | ~65 s | ⚠ bounded by VLM, predicted bounded by OCR
Visual phase speedup | 15–20× | ~12–14× | ✓ within the right order of magnitude

Why slightly below the upper bound

We estimated ~3 s per VLM frame at 768 px; the actual figure is ~3.6 s. Likely cause: Qwen2.5-VL's dynamic tokenization scales with visual detail, so some keyframes (text-heavy slides) push more vision tokens than the average. Capping the VLM's max input tokens could reduce this, but the quality cost isn't worth chasing.

Where wall clock goes now

The new long pole isn't Visual any more.

Rough per-phase estimate of the ~3.5–4 min total. Visual dropped out of the top spot; generate_asset × 10 (running serially through the queue) is now the biggest single contributor.

Phase | What it does | Wall | Share
ingest | ffmpeg audio extract + scene detect | ~12 s | 5%
transcribe_audio | MLX Whisper large-v3 | ~45 s | 19%
analyze_visual | OCR ∥ VLM (the optimized phase) | ~65 s | 28%
fuse | pure TS, scene log composition | ~1 s | <1%
analyze_intelligence | 1 LLM call | ~10 s | 4%
generate_asset × 10 | SERIAL through the queue (the new long pole) | ~100 s | 43%
thumbnail_concepts | ffmpeg frame picks + scoring | ~7 s | 3%
Notice: The Visual phase is no longer the bottleneck; generate_asset is. Its ten jobs run back-to-back through the queue, and because they are LLM-bound rather than CPU-bound, they are ideal candidates for concurrency.
Next round

Five more optimizations, ranked by impact-to-effort.

Numbers below are deltas from the current ~3.5–4 min wall time. Building all five would bring it to roughly 2 min.

D · Parallelize generate_asset jobs · ✓ shipped 2026-05-26

Done. workers/runner.ts now accepts --concurrency N (default 3, configurable via WORKER_CONCURRENCY). N independent claim slots run in the same process; SKIP LOCKED on the queue is the only mutex. The 10 generate_asset calls now finish in ~35 s instead of ~100 s. See the deep-dive.

Impact: −65 s (measured) · Effort: ~1 hr (delivered)
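A rough sketch of the "N claim slots, SKIP LOCKED as the only mutex" idea, not the actual workers/runner.ts. The SQL table and column names are hypothetical, and the claim callback stands in for executing that query against Postgres.

```typescript
// Assumed shape of a claim query; `jobs`, `status` and `created_at` are hypothetical.
// SKIP LOCKED lets concurrent claimers pass over rows another slot has locked.
const CLAIM_SQL = `
  UPDATE jobs SET status = 'running'
  WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id`;

type Job = { id: number };

async function runWithSlots(
  claim: () => Job | undefined, // stand-in for running CLAIM_SQL
  handle: (job: Job) => Promise<void>,
  concurrency = 3,
): Promise<number> {
  let done = 0;
  // Each slot loops independently until the queue has nothing left to claim.
  const slot = async (): Promise<void> => {
    let job = claim();
    while (job !== undefined) {
      await handle(job);
      done++;
      job = claim();
    }
  };
  await Promise.all(Array.from({ length: concurrency }, () => slot()));
  return done;
}
```

With ten jobs of roughly 10 s each (the ~100 s serial figure divided by ten), three slots need four rounds, so about 40 s, which is in the same range as the measured ~35 s.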
E · Drop OCR to 0.5 fps · ✓ shipped 2026-05-26

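Done. Profile-aware via OCR_FPS_BY_PROFILE: standard_audio_visual runs at 0.5 fps, premium_multimodal stays at 1 fps. OCR wall on standard drops from ~59 s to ~30 s. Because the phase wall is max(OCR, VLM) and the VLM pass (~65 s) is still the long pole, this does not shorten the Visual phase yet. It pairs well with future (G): once the VLM gets faster, the OCR drop will surface as direct Visual phase savings.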

Impact: −30 s (OCR phase) · Effort: ~10 min (delivered)
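One plausible shape for the profile lookup, as a sketch. OCR_FPS_BY_PROFILE is the name used above; the Record type, the fallback rate and the ocrFrameCount helper are assumptions.

```typescript
// Name from the write-up; the shape and the fallback are assumptions.
const OCR_FPS_BY_PROFILE: Record<string, number> = {
  standard_audio_visual: 0.5,
  premium_multimodal: 1,
};

// Expected number of OCR frames for a clip under a given profile.
function ocrFrameCount(
  profile: string,
  durationSec: number,
  fallbackFps = 1,
): number {
  const fps = OCR_FPS_BY_PROFILE[profile] ?? fallbackFps;
  return Math.floor(durationSec * fps);
}
```

For the 438 s clip this gives 219 frames on standard_audio_visual and 438 on premium_multimodal, matching the ~219 OCR ticks quoted earlier.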
H · Skip-if-artifact-exists on retry

Re-ingesting the same file or hitting Retry re-runs every layer. Each worker could short-circuit when its output artifact is already on disk. Retries become near-instant; first runs see near-zero impact. Add a --force flag for intentional re-runs after model changes.

Impact: −all on re-runs · Effort: ~30 min build
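Since (H) is not built yet, this is only a sketch of what the short-circuit might look like. runUnlessCached and its force option are hypothetical, and a real worker would probably also need to validate that the artifact is complete and not stale.

```typescript
import { existsSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Skip the expensive step when its artifact is already on disk, unless forced.
function runUnlessCached(
  artifactPath: string,
  produce: () => string,
  opts: { force?: boolean } = {},
): "skipped" | "ran" {
  if (!opts.force && existsSync(artifactPath)) return "skipped";
  writeFileSync(artifactPath, produce());
  return "ran";
}

// Usage on a throwaway directory: first run, retry, then a forced re-run.
const workDir = mkdtempSync(join(tmpdir(), "skip-sketch-"));
const artifact = join(workDir, "ocr.json");
let produceCalls = 0;
const produce = (): string => {
  produceCalls++;
  return "{}";
};
const firstRun = runUnlessCached(artifact, produce);
const retry = runUnlessCached(artifact, produce);
const forced = runUnlessCached(artifact, produce, { force: true });
rmSync(workDir, { recursive: true, force: true });
```

Here the retry returns without calling produce, while the forced call re-runs it, which is the behaviour a --force flag would map to.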
F · Combine ffmpeg passes into one

There are currently four ffmpeg invocations on the source: audio, scene detect, OCR frames, VLM frames. Each has ~300 ms startup. Using -map with multiple outputs, we could do everything in a single decode pass. This is minor and only worth doing if batching many videos.

Impact: −2 s per run · Effort: ~30 min build
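As a partial illustration of (F), the sketch below builds the argument list for one ffmpeg process with two outputs, audio and OCR frames. It deliberately leaves out scene detection and the VLM frames, since the VLM timestamps depend on scene cuts. The function name, output filenames and the absence of encoder settings are all assumptions.

```typescript
// Sketch: one ffmpeg invocation, one input, two outputs.
// In ffmpeg, options placed before an output path apply to that output only.
function buildCombinedFfmpegArgs(
  input: string,
  outDir: string,
  ocrFps = 0.5,
): string[] {
  return [
    "-i", input,
    // Output 1: first audio stream only, no video.
    "-map", "0:a:0", "-vn", `${outDir}/audio.wav`,
    // Output 2: first video stream, sampled at ocrFps, written as numbered JPEGs.
    "-map", "0:v:0", "-vf", `fps=${ocrFps}`, `${outDir}/frames_ocr/%06d.jpg`,
  ];
}
```

The resulting array could be passed to a child-process spawn of ffmpeg; the frames_ocr directory would need to exist beforehand.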
G · Swap VLM to Qwen2.5-VL-3B (with caveats)

3B is ~2× faster than 7B at the same input. Brings VLM from 65 s → ~30 s. But descriptions are noticeably shorter ("a man at a desk" vs the more useful "a man in a grey shirt at a wooden desk with two monitors showing code"). Downstream chapter/clip detection benefits from the 7B verbosity. Don't recommend swapping the standard tier — add as a new quick profile if you want a faster lane.

Impact: −30 s, but with a quality drop · Effort: ~5 min build
If we do (D) + (H) next, wall time drops from ~3.5–4 min to roughly 2.5 min on first runs and to near-instant on retries. That's the "by the time you've alt-tabbed to YouTube Studio, your video is ready" threshold.
What we built (recap)

The three changes that produced the 12–14× speedup.

# | Change | What it does | Solo gain
A | Parallel OCR + VLM | Both subprocesses run in Promise.all; no dependency between them. | ~1.5×
B | VLM input downscaled to 768 px | Second ffmpeg pass writes lo-res JPEGs for the VLM. OCR keeps full res. | ~2–4×
C | Scene-cut-driven VLM sampling | VLM describes scene cuts + intro + outro + 30 s gap-fillers, NOT every second. | ~10–15×

The gains compound but do not multiply cleanly: the product of the solo figures would be 30–90×, well above the measured 12–14×, which is consistent with (C) being the dominant contributor. (A) and (B) are still in there; without them the measured ~65 s would be closer to ~120 s.

Also new: analyze_visual now deletes its sampled frames and intermediate JSONs (frames_ocr/, frames_vlm/, ocr.json, frame_descriptions.json, the two manifests) as soon as the merged frame_index lands in Postgres. This is the Stage-1 inline cleanup shipped as part of the storage-lifecycle work. Set KEEP_PIPELINE_ARTIFACTS=1 to retain them for debugging. Details in the storage lifecycle write-up.
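A sketch of how the inline cleanup and the KEEP_PIPELINE_ARTIFACTS escape hatch might fit together. cleanupVisualArtifacts is a hypothetical name, and the two manifests are left out because their filenames are not given here.

```typescript
import { existsSync, mkdirSync, mkdtempSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Names taken from the write-up; manifest filenames are unknown and omitted.
const INTERMEDIATES = ["frames_ocr", "frames_vlm", "ocr.json", "frame_descriptions.json"];

// Removes intermediates under packageDir and returns the names it deleted.
function cleanupVisualArtifacts(
  packageDir: string,
  env: Record<string, string | undefined> = process.env,
): string[] {
  if (env.KEEP_PIPELINE_ARTIFACTS === "1") return []; // debugging: keep everything
  const removed: string[] = [];
  for (const name of INTERMEDIATES) {
    const target = join(packageDir, name);
    if (existsSync(target)) {
      rmSync(target, { recursive: true, force: true });
      removed.push(name);
    }
  }
  return removed;
}

// Usage on a throwaway directory.
const pkgDir = mkdtempSync(join(tmpdir(), "cleanup-sketch-"));
mkdirSync(join(pkgDir, "frames_ocr"));
writeFileSync(join(pkgDir, "ocr.json"), "{}");
const keptRun = cleanupVisualArtifacts(pkgDir, { KEEP_PIPELINE_ARTIFACTS: "1" });
const keptStillThere = existsSync(join(pkgDir, "ocr.json"));
const removedRun = cleanupVisualArtifacts(pkgDir, {});
const ocrGone = !existsSync(join(pkgDir, "ocr.json"));
rmSync(pkgDir, { recursive: true, force: true });
```

In the real pipeline this would only run after the frame_index write has been confirmed, so a failed merge never loses its inputs.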

D + E shipped 2026-05-26 ✓

Pipeline wall is now ~2–2.5 min. Next candidate: (H) Skip-if-artifact-exists for free retries.

See worker-concurrency.html for the D deep-dive.