Visual phase optimization — measured + what's next

~10–15 min → 65 seconds

~12–14× faster visual phase wall time

Update 2026-05-26 · Round 2 Optimizations (D) and (E) have shipped. Combined with the original A+B+C, the full pipeline now runs in ~2–2.5 min end-to-end on the same 7:18 video — down from ~25–30 min before any of this. See the worker concurrency deep-dive for how (D) actually works.

How the Visual phase runs now

Two independent sample passes, in parallel.

analyze_visual no longer walks the video once at fps=1. It samples twice — densely for OCR, sparsely for the VLM — and runs both Python subprocesses inside one Promise.all. The phase finishes when the slower of the two (the VLM) returns.

The two passes never block each other: they use distinct ffmpeg sample dirs and distinct output JSONs. Wall time = max(OCR, VLM), not their sum. After the merge writes frame_index to Postgres, the "Option A" cleanup (a storage-lifecycle step, not change (A) from the recap) deletes frames_ocr/, frames_vlm/, ocr.json and frame_descriptions.json (see the storage-lifecycle note below).
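As a rough illustration of why wall time tracks the slower pass, here is a minimal TypeScript sketch. runPass, analyzeVisualSketch, and the simulated durations are hypothetical stand-ins for the two Python subprocesses, not the worker's actual code.

```typescript
type PassResult = { name: string; frames: number };

// Event log so the overlap between the two passes is observable.
const events: string[] = [];
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Stand-in for spawning one subprocess and waiting for it to exit.
async function runPass(name: string, frames: number, simulatedMs: number): Promise<PassResult> {
  events.push(`${name}:start`);
  await sleep(simulatedMs);
  events.push(`${name}:done`);
  return { name, frames };
}

async function analyzeVisualSketch(): Promise<{ ocr: PassResult; vlm: PassResult }> {
  // Both promises are created before either is awaited,
  // so the phase ends when the slower pass (here the VLM) returns.
  const [ocr, vlm] = await Promise.all([
    runPass("ocr", 219, 10),
    runPass("vlm", 18, 30),
  ]);
  return { ocr, vlm };
}
```

The event log shows both starts ahead of either finish. Note that Promise.all rejects as soon as one pass fails; whether a worker would prefer that or Promise.allSettled is a separate design choice.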

Dense vs sparse: where each pass samples

OCR ticks every 2 s (0.5 fps) because on-screen text can change anywhere. The VLM fires only at editorial beats: the scene cuts that pickVlmTimestamps() reads from packages.intelligence.scene_cuts, plus an intro frame, an outro frame, and one gap-filler for every 30 s (at most) of static footage.
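The selection rule can be sketched roughly as follows. This is a hypothetical reconstruction, not the real pickVlmTimestamps(): the function name, the intro at 0 s, the outro one second before the end, and the even spacing of fillers are all assumptions. On the measured run, 15 cuts + intro + outro + one filler would be consistent with 18 keyframes, though that breakdown is inferred rather than read from the provenance.

```typescript
// Hypothetical sketch of scene-cut-driven sampling.
function sketchPickVlmTimestamps(
  durationSec: number,
  sceneCuts: number[],
  maxGapSec = 30,
): number[] {
  const intro = 0; // assumed intro position
  const outro = Math.max(0, durationSec - 1); // assumed: one second before the end

  // Intro, in-range scene cuts, and outro, de-duplicated and sorted.
  const anchors = Array.from(
    new Set([intro, ...sceneCuts.filter((t) => t > intro && t < outro), outro]),
  ).sort((a, b) => a - b);

  const out: number[] = [];
  for (let i = 0; i < anchors.length; i++) {
    out.push(anchors[i]);
    if (i + 1 >= anchors.length) break;
    const gap = anchors[i + 1] - anchors[i];
    // Add just enough evenly spaced fillers that no stretch exceeds maxGapSec.
    const fillers = Math.ceil(gap / maxGapSec) - 1;
    for (let k = 1; k <= fillers; k++) {
      out.push(anchors[i] + (gap * k) / (fillers + 1));
    }
  }
  return out;
}
```

For a 100 s clip with cuts at 10 s and 80 s, the 70 s static stretch receives two fillers and the other stretches receive none.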

~219 OCR ticks vs 18 VLM keyframes on the same 438 s clip — a ~12× gap in frame count. Because each VLM frame costs ~3.6 s and each OCR frame is cheap, collapsing the VLM frame count is what produced the 10–15× solo gain from change (C).
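The arithmetic behind those figures, written out as a small TypeScript check. All inputs are the numbers quoted in the text; only the variable names are made up here.

```typescript
const clipSec = 438;        // clip length from the measured run
const ocrFps = 0.5;         // one OCR tick every 2 s
const vlmKeyframes = 18;    // keyframes actually described
const vlmSecPerFrame = 3.6; // measured per-frame VLM cost

const ocrTicks = Math.round(clipSec * ocrFps);    // 219
const frameRatio = ocrTicks / vlmKeyframes;       // about 12.2
const vlmWallSec = vlmKeyframes * vlmSecPerFrame; // about 64.8 s, close to the measured 65.4 s
```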

Frame downscale: what the VLM actually sees

Two ffmpeg sample passes write to two dirs at two resolutions. The VLM doesn't need 1080p to describe a scene; OCR does need it to read small lower-thirds. Splitting the resolution per-consumer is change (B).
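A minimal sketch of what the two sampling commands might look like as ffmpeg argument lists. The flags (-ss, -i, -vf, -frames:v, -q:v, the fps and scale filters) are standard ffmpeg options, but the helper names, the reading of 768 px as the output width, the JPEG quality values, and the file naming are assumptions. The text describes a single VLM sampling pass; this sketch uses one call per timestamp purely because it is the simplest form to show.

```typescript
// Dense OCR pass: full resolution, fixed sampling rate.
function ocrSampleArgs(input: string, outDir: string, fps: number): string[] {
  return ["-i", input, "-vf", `fps=${fps}`, "-q:v", "2", `${outDir}/%06d.jpg`];
}

// Sparse VLM pass: one downscaled frame at a chosen timestamp.
// scale=W:-2 keeps the aspect ratio and rounds the height to an even number.
function vlmFrameArgs(input: string, outDir: string, tSec: number, width = 768): string[] {
  const stamp = tSec.toFixed(3);
  return [
    "-ss", stamp, "-i", input,
    "-frames:v", "1",
    "-vf", `scale=${width}:-2`,
    "-q:v", "3",
    `${outDir}/${stamp}.jpg`,
  ];
}
```

Each array would be passed to an ffmpeg child process; keeping the arguments as an array avoids shell quoting issues.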

Measured · package pkg_01KSJ9...

Real numbers from a real run.

Source: 438-second video, standard_audio_visual profile, 15 detected scene cuts. Numbers pulled from frame_index.provenance.

Metric	Measured	Note
VLM keyframes	18	was ~438 (fps=1)
VLM wall time	65.4 s	was ~10–15 min
OCR wall time	59.0 s	unchanged; runs in parallel
Reduction in VLM frames	~24×	every keyframe is meaningful

Visualised

Before vs after, at scale.

The "Before" figures come from the previous run on a similar-sized video (~8 min, fps=1 sampling).

Measure	Before	After
VLM wall time (the dominant cost)	~750 s (~12 min) · 438 frames @ 1080p	65.4 s
Whole pipeline (end-to-end)	~25–30 min (~1,650 s)	~3.5–4 min (~230 s)

Net: The Visual phase went from 80–90% of pipeline wall time to ~28%. The whole pipeline is now ~7× faster end-to-end. A video drop-to-ready cycle that used to be a coffee break is now under 4 minutes.

Prediction vs reality

How close was the forecast?

Metric	Predicted	Measured	Verdict
VLM frames processed	~30 (target)	18	✓ better than expected
VLM wall time	~25–40 s	65.4 s	⚠ slightly slower per frame (3.6 s, vs ~3 s predicted)
Visual phase wall	~30–45 s	~65 s	⚠ bounded by VLM, predicted bounded by OCR
Visual phase speedup	15–20×	~12–14×	✓ within the right order of magnitude

Why slightly below the upper bound

The estimate was ~3 s per VLM frame at 768 px; the measured figure is ~3.6 s. Likely cause: Qwen2.5-VL's dynamic tokenization scales with visual detail, so some keyframes (text-heavy slides) push more vision tokens than the average. Capping the VLM's max input tokens could reduce this, but the quality cost isn't worth chasing.

Where wall clock goes now

The new long pole isn't Visual any more.

Rough per-phase estimate of the ~3.5–4 min total. Visual dropped out of the top spot; generate_asset × 10 (running serially through the queue) is now the biggest single contributor.

Phase	What it does	Wall time	Share
ingest	ffmpeg audio extract + scene detect	~12 s	5%
transcribe_audio	MLX Whisper large-v3	~45 s	19%
analyze_visual	OCR ∥ VLM (the optimized phase)	~65 s	28%
fuse	pure TS, scene log composition	~1 s	<1%
analyze_intelligence	1 LLM call	~10 s	4%
generate_asset × 10	serial through the queue (the new long pole)	~100 s	43%
thumbnail_concepts	ffmpeg frame picks + scoring	~7 s	3%

Notice: The Visual phase is no longer the bottleneck; generate_asset is. Its ten jobs run back-to-back through the queue, and because they are LLM-bound rather than CPU-bound, they are ideal candidates for concurrency.

Next round

Five more optimizations, ranked by impact-to-effort.

Numbers below are deltas from the ~3.5–4 min wall time measured above, before (D) and (E) shipped. Building all five would bring it to roughly 2 min.

(D) Parallelize generate_asset jobs ✓ shipped 2026-05-26

Done. workers/runner.ts now accepts --concurrency N (default 3, configurable via WORKER_CONCURRENCY). N independent claim slots run in the same process; SKIP LOCKED on the queue is the only mutex. The 10 generate_asset calls now finish in ~35 s instead of ~100 s. See the deep-dive.

Impact: −65 s (measured) · Effort: ~1 hr (delivered)
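A minimal sketch of the claim-slot idea, assuming an in-memory queue in place of Postgres. In a real queue the claim step would be a SELECT ... FOR UPDATE SKIP LOCKED so that two slots never receive the same row; runSlots, makeQueue, and the Job shape are hypothetical names, not the contents of workers/runner.ts.

```typescript
type Job = { id: number; kind: string };

// In-memory stand-in for the queue table. shift() plays the role of the
// SKIP LOCKED claim: each job is handed to exactly one slot.
function makeQueue(jobs: Job[]): { claim: () => Job | undefined } {
  const pending = [...jobs];
  return { claim: () => pending.shift() };
}

async function runSlots(
  queue: { claim: () => Job | undefined },
  concurrency: number,
  handle: (job: Job) => Promise<void>,
): Promise<void> {
  // Each slot loops independently: claim a job, process it, claim the next.
  const slot = async (): Promise<void> => {
    for (let job = queue.claim(); job !== undefined; job = queue.claim()) {
      await handle(job);
    }
  };
  // N slots share one process; the queue is the only coordination point.
  await Promise.all(Array.from({ length: concurrency }, () => slot()));
}
```

With 10 jobs of roughly 10 s each and 3 slots, the work falls into ceil(10 / 3) = 4 waves, or about 40 s, which is in line with the ~35 s measured.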

(E) Drop OCR to 0.5 fps ✓ shipped 2026-05-26

Done. Profile-aware via OCR_FPS_BY_PROFILE: standard_audio_visual runs at 0.5 fps, premium_multimodal stays at 1 fps. OCR wall time on standard drops from ~59 s to ~30 s. Because the Visual phase is still bounded by the VLM, this pairs well with a future (G): once the VLM gets faster, the OCR drop will surface as direct Visual phase savings.

Impact: −30 s OCR phase · Effort: ~10 min (delivered)
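One way such a profile-aware lookup could be shaped, as a sketch only. The two profile values come from the text; the map's type, the fallback rate, and the helper names are assumptions.

```typescript
// Per-profile OCR sampling rate (frames per second).
const OCR_FPS_BY_PROFILE: Record<string, number> = {
  standard_audio_visual: 0.5,
  premium_multimodal: 1,
};

// Assumed fallback for profiles not listed above.
const DEFAULT_OCR_FPS = 1;

function ocrFpsFor(profile: string): number {
  return OCR_FPS_BY_PROFILE[profile] ?? DEFAULT_OCR_FPS;
}

// Number of OCR frames a clip of the given length would produce.
function ocrFrameCount(durationSec: number, profile: string): number {
  return Math.floor(durationSec * ocrFpsFor(profile));
}
```

On the 438 s clip this gives 219 frames for standard_audio_visual and 438 for premium_multimodal, which matches the roughly halved OCR wall time.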

(H) Skip-if-artifact-exists on retry

Re-ingesting the same file or hitting Retry re-runs every layer. Each worker could short-circuit when its output artifact is already on disk. Retries become near-instant; first runs are essentially unaffected. Add a --force flag for intentional re-runs after model changes.

Impact: −all on re-runs · Effort: ~30 min build
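Since (H) is not built yet, the following is only a sketch of the short-circuit. runUnlessArtifactExists and its options are hypothetical; the existence check is injected so the example stays self-contained (in Node, fs.existsSync could be passed in).

```typescript
type StepResult = { ran: boolean; artifact: string };

function runUnlessArtifactExists(
  artifact: string,
  opts: { force: boolean; exists: (path: string) => boolean },
  produce: () => void,
): StepResult {
  // Skip the work when the output is already present, unless --force was given.
  if (!opts.force && opts.exists(artifact)) {
    return { ran: false, artifact };
  }
  produce();
  return { ran: true, artifact };
}
```

A check like this is only safe if artifacts are written atomically (for example, written to a temp path and then renamed); otherwise a crashed run could leave a partial file that later retries would trust.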

(F) Combine ffmpeg passes into one

There are currently four ffmpeg invocations on the source: audio, scene detect, OCR frames, VLM frames. Each has ~300 ms of startup. Using -map with multiple outputs, we could do everything in a single decode pass. Minor; only worth doing if batching many videos.

Impact: −2 s per run · Effort: ~30 min build
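A sketch of what a combined invocation might look like, built as an argument list. The flags are standard ffmpeg options (one -i, then per-output -map and filters), but the function name, output paths, audio format, and the 0.3 scene threshold are assumptions. One caveat worth flagging: the VLM timestamps are derived from the scene cuts, so the VLM pass presumably cannot join the same decode; this sketch merges only the other three.

```typescript
function combinedIngestArgs(
  input: string,
  workDir: string,
  ocrFps: number,
  sceneThreshold = 0.3, // assumed threshold
): string[] {
  return [
    "-i", input,
    // Output 1: mono 16 kHz audio for transcription (format is an assumption).
    "-map", "0:a", "-vn", "-ac", "1", "-ar", "16000", `${workDir}/audio.wav`,
    // Output 2: dense OCR frames at full resolution.
    "-map", "0:v", "-an", "-vf", `fps=${ocrFps}`, `${workDir}/frames_ocr/%06d.jpg`,
    // Output 3: scene detection; showinfo logs the selected frames, no file is written.
    "-map", "0:v", "-an", "-vf", `select='gt(scene,${sceneThreshold})',showinfo`,
    "-f", "null", "-",
  ];
}
```

The source is opened once and each output carries its own stream mapping and filter chain, which is where the saved startup time would come from.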

(G) Swap VLM to Qwen2.5-VL-3B (with caveats)

3B is ~2× faster than 7B at the same input, which would bring the VLM from 65 s to ~30 s. But descriptions are noticeably shorter ("a man at a desk" vs the more useful "a man in a grey shirt at a wooden desk with two monitors showing code"), and downstream chapter/clip detection benefits from the 7B verbosity. Not recommended for the standard tier; add it as a new quick profile if you want a faster lane.

Impact: −30 s, but with a quality drop · Effort: ~5 min build

(D) + (H) together: wall time drops from ~3.5–4 min to roughly 2.5 min on first runs and near-instant on retries. (D) has since shipped, so (H) is the remaining half. That's the "by the time you've alt-tabbed to YouTube Studio, your video is ready" threshold.

What we built (recap)

The three changes that produced the 12–14× speedup.

#	Change	What it does	Solo gain
A	Parallel OCR + VLM	Both subprocesses run in `Promise.all` — no dependency between them.	~1.5×
B	VLM input downscaled to 768 px	Second ffmpeg pass writes lo-res JPEGs for the VLM. OCR keeps full res.	~2–4×
C	Scene-cut-driven VLM sampling	VLM describes scene cuts + intro + outro + 30 s gap-fillers, NOT every second.	~10–15×

The gains compound rather than add, but they overlap, so the combined figure is not the straight product of the solo gains; the measured 12–14× speedup confirms (C) was the dominant contributor. (A) and (B) are still in there: without them the measured ~65 s would be closer to ~120 s.

Also new: analyze_visual now deletes its sampled frames and intermediate JSONs (frames_ocr/, frames_vlm/, ocr.json, frame_descriptions.json, the two manifests) the moment the merged frame_index lands in Postgres. This is the Stage-1 inline cleanup shipped as part of the storage-lifecycle work. Set KEEP_PIPELINE_ARTIFACTS=1 to retain them for debugging. Details in the storage lifecycle write-up.
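A rough sketch of how the opt-out could gate the cleanup. The function name, the injected remove callback, and the path layout are assumptions; the two manifests are left out because their file names are not given here.

```typescript
// Intermediates named in the text (manifest file names are not specified, so they are omitted).
const VISUAL_INTERMEDIATES = ["frames_ocr", "frames_vlm", "ocr.json", "frame_descriptions.json"];

function cleanupVisualIntermediates(
  packageDir: string,
  env: Record<string, string | undefined>,
  remove: (path: string) => void,
): string[] {
  // Debug opt-out: keep everything on disk when the flag is set.
  if (env.KEEP_PIPELINE_ARTIFACTS === "1") return [];
  const targets = VISUAL_INTERMEDIATES.map((name) => `${packageDir}/${name}`);
  for (const target of targets) remove(target);
  return targets;
}
```

In a worker, this would be called only after the merged frame_index write has succeeded, with something like a recursive fs removal passed as remove.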

D + E shipped 2026-05-26 ✓

Pipeline wall is now ~2–2.5 min. Next candidate: (H) Skip-if-artifact-exists for free retries.

See worker-concurrency.html for the D deep-dive.

