Three optimizations landed: parallelize OCR + VLM, downscale VLM input to 768 px, and drive VLM sampling from scene cuts instead of fps=1. Here's what the numbers say on a real 7:18 video, and where the next minute of wall time is still hiding.
analyze_visual no longer walks the video once at fps=1. It samples twice — densely
for OCR, sparsely for the VLM — and runs both Python subprocesses inside one
Promise.all. The phase finishes when the slower of the two (the VLM) returns.
The two passes never block each other — distinct ffmpeg sample dirs, distinct output JSONs. Wall time = max(OCR, VLM), not their sum. After the merge writes frame_index to Postgres, Option A cleanup deletes frames_ocr/, frames_vlm/, ocr.json and frame_descriptions.json (see the storage-lifecycle note below).
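As a minimal sketch of that shape, assuming each pass is wrapped in a function that returns a promise (the `runOcr` and `runVlm` parameters and the `sleep` helper are illustrative stand-ins, not the real subprocess launchers):

```typescript
// A pass is anything that returns a promise, e.g. a wrapped subprocess launch.
type Pass<T> = () => Promise<T>;

interface VisualResult<O, V> {
  ocr: O;
  vlm: V;
  wallMs: number;
}

// Start both passes at once and wait for both.
// Wall time tracks the slower pass, not the sum of the two.
async function runVisualPasses<O, V>(
  runOcr: Pass<O>,
  runVlm: Pass<V>,
): Promise<VisualResult<O, V>> {
  const started = Date.now();
  const [ocr, vlm] = await Promise.all([runOcr(), runVlm()]);
  return { ocr, vlm, wallMs: Date.now() - started };
}

// Toy pass that resolves with a value after a delay (illustration only).
const sleep = <T>(ms: number, value: T): Promise<T> =>
  new Promise<T>((resolve) => setTimeout(() => resolve(value), ms));
```

Calling `runVisualPasses(() => sleep(40, "ocr"), () => sleep(80, "vlm"))` should take roughly as long as the slower pass. Note that `Promise.all` rejects as soon as either pass fails, while the other keeps running in the background.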
OCR ticks every 2 s (0.5 fps) because on-screen text can change anywhere. The VLM only fires
at editorial beats: the scene cuts that pickVlmTimestamps() reads from
packages.intelligence.scene_cuts, plus an intro frame, an outro frame, and a
gap-filler wherever a static stretch would otherwise run past 30 s.
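The selection rule can be sketched roughly as follows. This is not the real pickVlmTimestamps(); the function name, the signature, and the even spacing of gap-fillers are all assumptions made for illustration.

```typescript
// Hypothetical sketch: intro + scene cuts + outro, with fillers added so that
// no stretch between two chosen timestamps exceeds maxGapSec.
function sketchVlmTimestamps(
  durationSec: number,
  sceneCuts: number[],
  maxGapSec = 30,
): number[] {
  // Anchors: intro (0), every in-range scene cut, outro (end of video).
  // A real implementation would likely back the outro off from the last frame.
  const inRange = sceneCuts.filter((t) => t > 0 && t < durationSec);
  const anchors = Array.from(new Set([0, ...inRange, durationSec])).sort(
    (a, b) => a - b,
  );

  const out: number[] = [];
  let prev: number | undefined;
  for (const t of anchors) {
    if (prev !== undefined) {
      const gap = t - prev;
      // Split a long static stretch into equal pieces of at most maxGapSec.
      const pieces = Math.ceil(gap / maxGapSec);
      for (let k = 1; k < pieces; k++) {
        out.push(prev + (gap * k) / pieces);
      }
    }
    out.push(t);
    prev = t;
  }
  return out;
}
```

For a 100 s clip with cuts at 20 s and 80 s, the 60 s stretch between the cuts gets a single filler at 50 s, giving five keyframes in total.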
~219 OCR ticks vs 18 VLM keyframes on the same 438 s clip — a ~12× gap in frame count. Because each VLM frame costs ~3.6 s and each OCR frame is cheap, collapsing the VLM frame count is what produced the 10–15× solo gain from change (C).
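The frame-count arithmetic behind those figures, worked through with the measured values reported in this write-up (the variable names are illustrative):

```typescript
// Values taken from the measured run described in this write-up.
const durationSec = 438;
const ocrFps = 0.5;
const vlmKeyframes = 18;
const vlmWallSec = 65.4;

// OCR samples one frame every 2 s across the whole clip.
const ocrTicks = Math.round(durationSec * ocrFps); // 219

// How many more frames OCR sees than the VLM.
const frameRatio = ocrTicks / vlmKeyframes; // about 12.2

// Average VLM cost per keyframe.
const vlmSecPerFrame = vlmWallSec / vlmKeyframes; // about 3.63 s
```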
Two ffmpeg sample passes write to two dirs at two resolutions. The VLM doesn't need 1080p to describe a scene; OCR does need it to read small lower-thirds. Splitting the resolution per-consumer is change (B).
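A rough sketch of how the two argument lists might differ. The helper names, the JPEG quality values, and the choice to treat 768 px as the output width are assumptions; the real sampling code may select VLM frames differently, for example in a single invocation rather than one seek per keyframe.

```typescript
// OCR pass: sample at a fixed rate and keep the source resolution.
function ocrSampleArgs(input: string, outDir: string, fps: number): string[] {
  return ["-i", input, "-vf", `fps=${fps}`, "-q:v", "2", `${outDir}/%06d.jpg`];
}

// VLM pass: grab one frame at a given timestamp and downscale it.
// scale=W:-2 keeps the aspect ratio and rounds the height to an even number.
function vlmFrameArgs(
  input: string,
  outFile: string,
  atSec: number,
  widthPx = 768,
): string[] {
  return [
    "-ss", atSec.toFixed(3), // seek before decoding
    "-i", input,
    "-frames:v", "1",
    "-vf", `scale=${widthPx}:-2`,
    "-q:v", "3",
    outFile,
  ];
}
```

Either array can be handed to `child_process.spawn("ffmpeg", args)`. Because the arguments are passed as an array, no shell quoting is needed.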
Source: 438-second video, standard_audio_visual profile, 15 detected scene cuts. Numbers pulled from frame_index.provenance.
Bars are proportional to actual wall time. The "Before" column is the previous run on a similar-sized video (~8 min, fps=1 sampling).
| Metric | Predicted | Measured | Verdict |
|---|---|---|---|
| VLM frames processed | ~30 (target) | 18 | ✓ better than expected |
| VLM wall time | ~25–40 s | 65.4 s | ⚠ slightly slower per frame (3.6 s, vs ~3 s predicted) |
| Visual phase wall | ~30–45 s | ~65 s | ⚠ bounded by VLM, predicted bounded by OCR |
| Total pipeline speedup | 15–20× | ~12–14× | ✓ within the right order of magnitude |
The estimate was ~3 s per VLM frame at 768 px; the measured figure is ~3.6 s. Likely cause: Qwen2.5-VL's dynamic tokenization scales with visual detail, so some keyframes (text-heavy slides) push more vision tokens than the average. Capping the VLM's max input tokens could reduce this, but the quality cost isn't worth chasing.
Rough per-phase estimate of the ~3.5–4 min total. Visual dropped out of the top spot; generate_asset × 10 (running serially through the queue) is now the biggest single contributor.
generate_asset is the new bottleneck: ten jobs running back-to-back through the queue. They are LLM-bound rather than CPU-bound, which makes them ideal candidates for concurrency.
Numbers below are deltas from the current ~3.5–4 min wall time. Building all five would bring it to roughly 2 min.
Done. workers/runner.ts now accepts --concurrency N (default 3, configurable via WORKER_CONCURRENCY). N independent claim slots run in the same process; SKIP LOCKED on the queue is the only mutex. The 10 generate_asset calls now finish in ~35 s instead of ~100 s. See the deep-dive in worker-concurrency.html.
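A minimal sketch of the claim-slot pattern, under stated assumptions: the `jobs` table and its columns are hypothetical, the flag is assumed to take precedence over the environment variable, and the slots here drain the queue and exit, whereas a long-running worker would presumably keep polling.

```typescript
// Resolve N from --concurrency, then WORKER_CONCURRENCY, then the default of 3.
// The precedence order is an assumption.
function resolveConcurrency(
  argv: string[],
  env: Record<string, string | undefined>,
): number {
  const i = argv.indexOf("--concurrency");
  const raw = i >= 0 ? argv[i + 1] : env.WORKER_CONCURRENCY;
  const n = Number.parseInt(raw ?? "", 10);
  return Number.isInteger(n) && n > 0 ? n : 3;
}

// Hypothetical Postgres claim query. SKIP LOCKED makes concurrent claimers
// pass over rows that another slot has already locked.
const CLAIM_SQL = `
  UPDATE jobs SET status = 'running'
  WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'queued'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
  )
  RETURNING id, kind`;

// A claim returns the next job to run, or null when the queue is empty.
type Job = () => Promise<void>;
type Claim = () => Promise<Job | null>;

// Run n independent slots in one process; each slot claims and runs jobs
// one at a time until the queue is empty.
async function runSlots(n: number, claim: Claim): Promise<void> {
  const slot = async (): Promise<void> => {
    for (let job = await claim(); job !== null; job = await claim()) {
      await job();
    }
  };
  await Promise.all(Array.from({ length: n }, () => slot()));
}
```

In a real worker, `claim` would execute `CLAIM_SQL` and wrap the returned row in a job function. Because the jobs are LLM-bound, the slots spend most of their time waiting on I/O, so a single process can overlap them.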
Done. Profile-aware via OCR_FPS_BY_PROFILE: standard_audio_visual at 0.5 fps, premium_multimodal stays at 1 fps. OCR wall on standard drops from ~59 s to ~30 s. Pairs well with future (G) — once VLM gets faster, the OCR drop will surface as direct Visual phase savings.
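The profile lookup could look something like this. Only the two rates come from the text above; the fallback for unknown profiles and the helper names are assumptions.

```typescript
// OCR sampling rate per profile, in frames per second.
const OCR_FPS_BY_PROFILE: Record<string, number> = {
  standard_audio_visual: 0.5,
  premium_multimodal: 1,
};

// Assumption: unknown profiles fall back to the denser rate.
const DEFAULT_OCR_FPS = 1;

function ocrFpsFor(profile: string): number {
  return OCR_FPS_BY_PROFILE[profile] ?? DEFAULT_OCR_FPS;
}

// Number of OCR frames a clip of the given length produces under a profile.
function ocrFrameCount(durationSec: number, profile: string): number {
  return Math.round(durationSec * ocrFpsFor(profile));
}
```

For the 438 s clip, `ocrFrameCount(438, "standard_audio_visual")` gives 219 frames, half of what premium_multimodal would sample.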
Re-ingesting the same file or hitting Retry re-runs every layer. Each worker could
short-circuit when its output artifact is already on disk. Retries would become
near-instant, with near-zero impact on first runs. Add a --force flag for
intentional re-runs after model changes.
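One possible shape for that check, as a sketch only. The function names and the injectable `exists` parameter are hypothetical, and this version treats any existing file as valid without checking that it is complete.

```typescript
import { existsSync } from "node:fs";

// Decide whether a worker can skip its layer.
// `exists` is injectable so the rule can be tested without touching disk.
function shouldSkip(
  artifactPath: string,
  force: boolean,
  exists: (path: string) => boolean = existsSync,
): boolean {
  if (force) return false; // --force always re-runs
  return exists(artifactPath);
}

// Wrap a layer so it only does work when the artifact is missing or forced.
async function runLayer(
  artifactPath: string,
  force: boolean,
  work: () => Promise<void>,
  exists: (path: string) => boolean = existsSync,
): Promise<"skipped" | "ran"> {
  if (shouldSkip(artifactPath, force, exists)) return "skipped";
  await work();
  return "ran";
}
```

A stricter version might also validate the artifact (non-empty, parseable, produced by the current model version) before trusting it.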
Currently there are 4 ffmpeg invocations on the source: audio, scene detect, OCR frames, VLM
frames. Each pays ~300 ms of startup. Using -map with multiple outputs, we
could do everything in a single decode pass.
Minor — only worth doing if batching many videos.
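A hedged sketch of what a multi-output invocation might look like. The output file names, the scene threshold of 0.3, and the helper name are assumptions. It covers audio, OCR frames, and scene detection only; VLM frames are left out here because their timestamps are currently derived from the scene cuts.

```typescript
// Build one ffmpeg argument list with three outputs sharing a single input.
// Options placed before each output path apply to that output only.
function singlePassArgs(
  input: string,
  workDir: string,
  ocrFps = 0.5,
  sceneThreshold = 0.3,
): string[] {
  return [
    "-i", input,
    // Output 1: first audio stream, no video.
    "-map", "0:a:0", "-vn", `${workDir}/audio.wav`,
    // Output 2: full-resolution OCR frames at a fixed rate.
    "-map", "0:v:0", "-vf", `fps=${ocrFps}`, "-q:v", "2",
    `${workDir}/frames_ocr/%06d.jpg`,
    // Output 3: scene-change detection; showinfo logs the selected frames
    // and the null muxer discards the video itself.
    "-map", "0:v:0", "-vf", `select='gt(scene,${sceneThreshold})',showinfo`,
    "-f", "null", "-",
  ];
}
```

The scene-cut timestamps would then be parsed from the `showinfo` lines that ffmpeg writes to stderr.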
The 3B model is ~2× faster than the 7B at the same input, which would bring VLM from 65 s to ~30 s. But
descriptions are noticeably shorter ("a man at a desk" vs the more useful "a man in a
grey shirt at a wooden desk with two monitors showing code"). Downstream chapter/clip
detection benefits from the 7B verbosity. Not recommended as a swap for
the standard tier; add it as a new quick profile if you want a faster lane.
| # | Change | What it does | Solo gain |
|---|---|---|---|
| A | Parallel OCR + VLM | Both subprocesses run in Promise.all, with no dependency between them. | ~1.5× |
| B | VLM input downscaled to 768 px | Second ffmpeg pass writes lo-res JPEGs for the VLM. OCR keeps full res. | ~2–4× |
| C | Scene-cut-driven VLM sampling | VLM describes scene cuts + intro + outro + 30 s gap-fillers, NOT every second. | ~10–15× |
Combined effect is multiplicative; the measured 12–14× speedup confirms (C) was the dominant contributor. (A) and (B) are still in there — without them the measured ~65 s would be closer to ~120 s.
analyze_visual now deletes its sampled frames and
intermediate JSONs (frames_ocr/, frames_vlm/, ocr.json,
frame_descriptions.json, the two manifests) the moment the merged
frame_index lands in Postgres — the Stage-1 inline cleanup shipped as part of the
storage-lifecycle work. Set KEEP_PIPELINE_ARTIFACTS=1 to retain them for debugging.
Details in the storage lifecycle write-up.
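A minimal sketch of the cleanup step, assuming the artifacts live under a single per-video working directory. The function name and the `workDir` layout are hypothetical, and the two manifests are omitted because their file names are not given here.

```typescript
import { rmSync } from "node:fs";
import { join } from "node:path";

// Intermediates named in this write-up (manifests omitted: names unknown).
const INTERMEDIATES = [
  "frames_ocr",
  "frames_vlm",
  "ocr.json",
  "frame_descriptions.json",
];

// Delete intermediates unless KEEP_PIPELINE_ARTIFACTS=1 is set.
// Intended to be called only after frame_index has been written to Postgres.
// Returns the paths it attempted to remove.
function cleanupIntermediates(
  workDir: string,
  env: Record<string, string | undefined> = process.env,
): string[] {
  if (env.KEEP_PIPELINE_ARTIFACTS === "1") return [];
  const removed: string[] = [];
  for (const name of INTERMEDIATES) {
    const target = join(workDir, name);
    // force: true makes this a no-op when the path is already gone.
    rmSync(target, { recursive: true, force: true });
    removed.push(target);
  }
  return removed;
}
```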
Pipeline wall is now ~2–2.5 min. Next candidate: (H) Skip-if-artifact-exists for free retries.
See worker-concurrency.html for the D deep-dive.