Worker concurrency + OCR fps · 2026-05-26

Three slots, one queue, no locks.

Two performance changes shipped together: the worker runs N concurrent claim slots (default 3) so the 10 generate_asset jobs per package don't serialize, and the OCR sample rate drops to 0.5 fps for the standard profile (premium stays at 1 fps). Combined effect: pipeline wall time falls from ~3.5–4 min to ~2–2.5 min on a typical 8-minute video.

Why concurrency mattered most

generate_asset is LLM-bound, not CPU-bound.

A single package produces 10 generate_asset jobs (one per asset type — title set, description, chapters, tags, linkedin post, x post, x thread, article brief, newsletter summary, short clip plan). Each is a single LLM call. None of them touch any shared mutable state. Serializing them through a single worker slot was leaving ~65 s on the table per package.

| Phase | Before (single slot) | After (3 slots) | Savings |
|---|---|---|---|
| generate_asset × 10 | ~100 s | ~35 s | ~65 s (~3×) |
| analyze_intelligence (1 call) | ~10 s | ~10 s | – |
| thumbnail_concepts | ~7 s | ~7 s | – |
| Total pipeline wall | ~3.5–4 min | ~2–2.5 min | ~30% |

How it works

SKIP LOCKED is the only mutex.

No new tables, no semaphores, no in-process locks. The Postgres queue already does the right thing — we just spawn N independent claim loops in the same process and let them race.
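As a rough illustration of what each claim loop might send to Postgres, here is a minimal sketch of a claim query. It is not the project's actual claim(): the UPDATE wrapper, the 'running' status, and the locked_by / locked_at columns are assumptions, while status, priority, id and kind come from this document. buildClaimQuery is a hypothetical helper name.

```typescript
// Hypothetical sketch of a claim query. Assumed (not from the source):
// the UPDATE wrapper, the 'running' status, and the locked_by / locked_at columns.
interface ClaimQuery {
  text: string;
  values: [string, string[]];
}

function buildClaimQuery(kinds: string[], lockedBy: string): ClaimQuery {
  // The inner SELECT locks at most one pending row and skips rows that
  // another transaction has already locked. If nothing is claimable, the
  // subquery yields NULL, the UPDATE matches no rows, and RETURNING is empty.
  const text = `
    UPDATE jobs
       SET status = 'running', locked_by = $1, locked_at = now()
     WHERE id = (
       SELECT id
         FROM jobs
        WHERE status = 'pending'
          AND kind = ANY($2::text[])
        ORDER BY priority, id
          FOR UPDATE SKIP LOCKED
        LIMIT 1
     )
    RETURNING *`;
  return { text, values: [lockedBy, kinds] };
}

const claimQuery = buildClaimQuery(["generate_asset"], "runner:0");
```

The resulting text and values could be passed to any Postgres client that accepts parameterised queries. Every slot sends the same statement, which is why no slot needs to know about the others.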

The architectural picture

Postgres queue (jobs): SELECT … FOR UPDATE SKIP LOCKED LIMIT 1 atomically returns 1 unclaimed row or null.
10 generate_asset jobs: [#203] [#204] [#205] … [#206] [#207] [#208] …

Worker process (one):
  slot 0 ← claim() → generate_asset    [runner] done job=203 kind=generate_asset
  slot 1 ← claim() → generate_asset    [runner:1] done job=204 kind=generate_asset
  slot 2 ← claim() → generate_asset    [runner:2] done job=205 kind=generate_asset
All three slots race for the next job; SKIP LOCKED guarantees they don't collide.

LLM provider (Anthropic / OpenAI / Codex CLI): N concurrent requests, bounded by provider rate limits, not by ChannelHelm.

N claim slots → one queue → one provider. The queue is the synchronisation primitive; nothing in the worker process needs a lock.

Why SKIP LOCKED is enough

When three slots call claim() at the same instant, each opens its own transaction and runs the identical SELECT … FOR UPDATE SKIP LOCKED LIMIT 1. The first to reach a row locks it; the others skip past the locked row and grab the next unlocked one. No slot ever blocks, and no two slots ever return the same job.

SKIP LOCKED contention between three claim slots

t0: three slots call claim() simultaneously (slot 0, slot 1, slot 2).

jobs (status='pending', ORDER BY priority, id):
  #203 → slot 0 locks it (FOR UPDATE)
  #204 → slot 1 SKIPs #203, locks #204
  #205 → slot 2 SKIPs #203/#204, locks #205
  #206 … #212 (still pending; next claim() round)

The guarantee
• FOR UPDATE takes a row lock; a concurrent txn that would block on it instead SKIPs it.
• LIMIT 1 + ORDER BY priority, id means each slot deterministically grabs the next-best free row.
• Zero in-process state. Add a 4th slot and it just claims #206: no code change, no lock manager.

Same millisecond, three claims, three distinct rows. No retries, no deadlocks, no application-level mutex.
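The contention picture above can be mimicked with a toy in-memory model. This is only a sketch of the skip-if-locked behaviour, not Postgres itself: JobRow, claimNext and the lockedBy field are hypothetical names, and a real row lock is released at transaction end rather than stored on the row.

```typescript
// Toy in-memory model of SKIP LOCKED semantics. All names are hypothetical.
interface JobRow {
  id: number;
  priority: number;
  status: "pending" | "running";
  lockedBy: string | null;
}

function claimNext(rows: JobRow[], slot: string): JobRow | null {
  const candidate = rows
    // Skip anything already locked, as SKIP LOCKED would.
    .filter((r) => r.status === "pending" && r.lockedBy === null)
    // ORDER BY priority, id, then take the first row (LIMIT 1).
    .sort((a, b) => a.priority - b.priority || a.id - b.id)[0];
  if (!candidate) return null; // nothing claimable
  candidate.lockedBy = slot;   // stands in for the FOR UPDATE row lock
  candidate.status = "running";
  return candidate;
}

// Ten pending jobs, #203 through #212, all with the same priority.
const jobs: JobRow[] = Array.from({ length: 10 }, (_, i): JobRow => ({
  id: 203 + i,
  priority: 0,
  status: "pending",
  lockedBy: null,
}));

// Three slots claim back to back; each gets a distinct row.
const claimedIds = ["slot 0", "slot 1", "slot 2"].map(
  (slot) => claimNext(jobs, slot)?.id,
);
// claimedIds is [203, 204, 205]
```

A fourth call to claimNext would return #206, which matches the "add a 4th slot" point in the guarantee list.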

The timeline difference, visualised

BEFORE: single slot (~100 s)
  slot 0: 203 · 204 · 205 · 206 · 207 · 208 · 209 · 210 · 211 · 212

AFTER: 3 slots (~35 s)
  slot 0: 203 · 206 · 209 · 212 · idle
  slot 1: 204 · 207 · 210 · idle
  slot 2: 205 · 208 · 211 · idle

Same 10 jobs, same provider, ~3× faster wall. The "idle" tails happen because there are no more generate_asset jobs queued — the slots wait for the next package or transition kind.
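The timeline above can be approximated with a small scheduling model, sketched here under the assumption that every generate_asset call takes a uniform 10 s (the ~100 s serial figure divided by 10 jobs). wallTimeSeconds is a hypothetical helper, not part of the worker.

```typescript
// Simplified model: whichever slot frees up first claims the next queued job.
// Assumption: job durations are known up front and claiming costs nothing.
function wallTimeSeconds(durations: number[], slots: number): number {
  const busyUntil: number[] = Array.from({ length: slots }, () => 0);
  for (const duration of durations) {
    const next = busyUntil.indexOf(Math.min(...busyUntil)); // earliest-free slot
    busyUntil[next] += duration;
  }
  return Math.max(...busyUntil); // wall time is set by the last slot to finish
}

const tenJobs: number[] = Array.from({ length: 10 }, () => 10); // ~10 s each (assumed)
const serialWall = wallTimeSeconds(tenJobs, 1);   // 100
const parallelWall = wallTimeSeconds(tenJobs, 3); // 40
```

With uniform durations the model gives 40 s for three slots, because one slot has to run four jobs. The ~35 s in the table is a measured figure and sits a little below that; real call durations are not uniform, so the model should be read as an estimate only.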

The code change

// workers/runner.ts — abridged

async function main() {
  const { kinds, concurrency, … } = parseArgs(process.argv.slice(2));

  // One stale-lock reclaim timer for the whole process
  setInterval(reclaim, 30_000);

  // Spawn N independent slots — SKIP LOCKED handles mutex
  await Promise.all(
    Array.from({ length: concurrency }, (_, slot) =>
      runSlot({ slot, lockedBy, kinds, … })
    )
  );
}

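async function runSlot({ slot, lockedBy, kinds, … }) {
  while (!shuttingDown) {
    // claim() runs the SKIP LOCKED query and returns null when nothing is claimable
    const job = await claim(kinds, lockedBy);
    if (!job) { await sleep(idleMs); continue; }   // queue empty: back off, then poll again
    try {
      await HANDLERS[job.kind](job);   // dispatch to the handler for this job kind
      await complete(job.id);
    } catch (err) { … }
  }
}

Why it's safe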

Every handler audited for idempotency.

With N parallel slots, the same handler kind can run concurrently. We checked each handler against the "could two of these stomp on each other?" question.

| Handler | Concurrency-safe? | Why |
|---|---|---|
| generate_asset | ✓ yes | Each job writes ONE asset row per (package_id, asset_type). No two parallel jobs ever target the same row. |
| analyze_intelligence | ✓ yes | Idempotency key analyze_intelligence:{sourceId}:{profile} means at most one runs per source. |
| analyze_visual / fuse / transcribe_audio | ✓ yes | Same reasoning: single-source idempotency keys prevent any two from racing on the same source. |
| dispatch | ✓ yes | One asset = one dispatch row (idempotency key dispatch:{assetId}). The youtube_direct branch reads sibling asset payloads but only runs post-approval, so generate_asset jobs aren't writing concurrently. |
| markReadyForReviewIfComplete (lifecycle helper) | ✓ yes | Pure status update: multiple calls produce the same final state regardless of ordering. |
| recomputePackageDispatchState (lifecycle helper) | ✓ yes | Derives the package's dispatch state from the current assets table, so it is order-independent. |

Rule for new handlers: when adding a new worker kind, set an idempotency key on the enqueue (per CLAUDE.md) and make all writes per-row scoped (no read-modify-write on shared JSONB without conflict handling). A handler that follows both rules is concurrency-safe by construction.

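To make the idempotency-key rule concrete, here is a minimal in-memory sketch. The key formats are the ones listed in the handler table; JobQueue, EnqueuedJob and the example IDs are hypothetical, and the real enqueue API is not shown in this document. In a database-backed queue the same effect would more likely come from a unique constraint on the key, which is also an assumption.

```typescript
// Hypothetical in-memory stand-in for an idempotent enqueue.
interface EnqueuedJob {
  kind: string;
  idempotencyKey: string;
  payload: Record<string, string>;
}

class JobQueue {
  private byKey = new Map<string, EnqueuedJob>();

  // Returns true if the job was added, false if its key was already present.
  enqueue(job: EnqueuedJob): boolean {
    if (this.byKey.has(job.idempotencyKey)) return false;
    this.byKey.set(job.idempotencyKey, job);
    return true;
  }

  get size(): number {
    return this.byKey.size;
  }
}

// Key formats taken from the handler table above.
const analyzeKey = (sourceId: string, profile: string): string =>
  `analyze_intelligence:${sourceId}:${profile}`;
const dispatchKey = (assetId: string): string => `dispatch:${assetId}`;

const queue = new JobQueue();
const analyzeJob: EnqueuedJob = {
  kind: "analyze_intelligence",
  idempotencyKey: analyzeKey("src_1", "standard_audio_visual"),
  payload: { sourceId: "src_1" },
};
const firstEnqueue = queue.enqueue(analyzeJob);  // true: new key
const secondEnqueue = queue.enqueue(analyzeJob); // false: duplicate key is ignored
```

Because a duplicate enqueue is dropped before it reaches a slot, two slots never end up running the same analysis for the same source.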
Tuning

How to raise or lower the slot count.

Default

3 slots. Set in scripts/dev.sh:

WORKER_CONCURRENCY=3  # env var, picked up by --concurrency default

Bump it (more parallelism)

If you see lots of generate_asset jobs queued in the Jobs page:

# env at process start
WORKER_CONCURRENCY=6 pnpm dev:all

# OR via flag
pnpm exec tsx workers/runner.ts --kinds <kinds> --concurrency 6   # <kinds> is a placeholder for the job kinds to run
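The precedence implied above (the flag wins, the env var supplies the flag's default, and 3 is the fallback) might be resolved along these lines. This is a sketch only: resolveConcurrency is a hypothetical helper, the real parseArgs in workers/runner.ts is abridged in this document, and falling back to 3 on invalid input is an assumption.

```typescript
// Hypothetical helper; not the actual parseArgs from workers/runner.ts.
const DEFAULT_CONCURRENCY = 3;

function resolveConcurrency(
  argv: string[],
  env: Record<string, string | undefined>,
): number {
  const flagIndex = argv.indexOf("--concurrency");
  // An explicit flag wins; otherwise WORKER_CONCURRENCY supplies the default.
  const raw = flagIndex >= 0 ? argv[flagIndex + 1] : env["WORKER_CONCURRENCY"];
  const parsed = Number.parseInt(raw ?? "", 10);
  // Assumption: missing, non-numeric, or < 1 values fall back to 3 slots.
  return Number.isInteger(parsed) && parsed >= 1 ? parsed : DEFAULT_CONCURRENCY;
}

const fromDefault = resolveConcurrency([], {});                                           // 3
const fromEnv = resolveConcurrency([], { WORKER_CONCURRENCY: "6" });                      // 6
const fromFlag = resolveConcurrency(["--concurrency", "2"], { WORKER_CONCURRENCY: "6" }); // 2
```

In the worker this would be called with process.argv.slice(2) and process.env.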

Drop it (hitting provider rate limits)

If you see HTTP 429 (Too Many Requests) responses in the dispatch log or on your provider's dashboard:

WORKER_CONCURRENCY=1  # back to serial, no parallelism

| Provider | Reasonable max concurrency | Notes |
|---|---|---|
| Anthropic (Claude) | 5–10 | Default RPM is high enough for a single-creator workload. Bumps available on request. |
| OpenAI | 3–8 | Depends on tier. Tier-1 accounts hit 429 around 3–4 concurrent for gpt-4o. |
| OpenRouter | 3–5 | Varies by upstream model; conservative default recommended. |
| Codex CLI (local subprocess) | 2–4 | Bounded by local CPU more than by API. 3 is comfortable on M-series. |
| LM Studio (local HTTP) | 1–2 | Most local servers queue requests internally; bumping concurrency doesn't help and may slow each call. |

Future work: a per-provider max_concurrent column on llm_providers would let the worker auto-cap based on which provider is serving each call. Not built today; for now, the global WORKER_CONCURRENCY setting applies to all kinds equally.

Side-quest

OCR fps is now profile-aware.

The standard profile drops OCR from 1 fps to 0.5 fps; premium stays at 1 fps. This halves the OCR subprocess wall time on standard content with no measurable quality loss on overlay/lower-third detection. Today the saving shows up only in the OCR step itself, because OCR runs in parallel with VLM and VLM is currently the bottleneck of the Visual phase, but it will pair well with any future VLM-speed work.

| Profile | OCR sample rate | Notes |
|---|---|---|
| fast_audio_only | skipped | Visual phase doesn't run at all under this profile. |
| standard_audio_visual | 0.5 fps | One OCR frame every 2 seconds. Catches overlay text changes; misses sub-2s flashes (rare in talking-head content). |
| premium_multimodal | 1 fps | Dense OCR for high-stakes content where a flashed lower-third or single-frame graphic might matter. |

Effective wall savings: ~30 s on the standard 7:18 measurement video. Currently bounded by VLM (65 s), so it surfaces as direct savings once VLM gets faster too.

// workers/kinds/analyze_visual.ts

const OCR_FPS_BY_PROFILE: Record<string, number> = {
  standard_audio_visual: 0.5,
  premium_multimodal: 1,
};

const ocrFps = OCR_FPS_BY_PROFILE[profile] ?? 0.5;
const ocrFrames = await sampleFrames({
  inputPath: videoPath,
  outputDir: ocrFramesDir,
  fps: ocrFps,
});
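To show what the two rates mean in frames, here is a small worked example for the 7:18 measurement video. It is a sketch: approxOcrFrames is a hypothetical helper, the map is redeclared under another name so the block stands alone, and the count is approximate because the exact number of frames sampleFrames emits is not specified here.

```typescript
// Same mapping as the snippet above, redeclared for a self-contained example.
const OCR_FPS: Record<string, number> = {
  standard_audio_visual: 0.5,
  premium_multimodal: 1,
};

// Approximate number of OCR frames sampled for a video of the given length.
function approxOcrFrames(durationSeconds: number, profile: string): number {
  const fps = OCR_FPS[profile] ?? 0.5; // same fallback as the snippet above
  return Math.floor(durationSeconds * fps);
}

const videoSeconds = 7 * 60 + 18; // 7:18 is 438 s
const standardFrames = approxOcrFrames(videoSeconds, "standard_audio_visual"); // 219
const premiumFrames = approxOcrFrames(videoSeconds, "premium_multimodal");     // 438
```

Half the frames means roughly half the OCR work, which is where the halved OCR subprocess time comes from.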
End-to-end impact

Where we are now.

| Round | What landed | Pipeline wall | Δ from previous |
|---|---|---|---|
| baseline | fps=1 OCR + VLM, sequential | ~25–30 min | – |
| round 1 | A + B + C (scene-cut VLM, 768 px downscale, OCR ∥ VLM) | ~3.5–4 min | ~8× faster |
| round 2 (today) | D + E (worker concurrency, OCR 0.5 fps) | ~2–2.5 min | a further ~30% off |

Throughput, drawn to scale

Bars are pipeline wall time on a typical 8-minute video, log-spaced so the 25-min baseline and the 2-min current state both stay legible. Lower is better.

[Chart: pipeline wall time across optimization rounds, log-scale axis with ticks at 1, 2, 4, 10 and 30 min. Baseline (fps=1 OCR + VLM, serial): ~25–30 min. Round 1 (scene-cut VLM · 768 px · OCR ∥ VLM): ~3.5–4 min. Round 2, today (worker concurrency · OCR 0.5 fps): ~2–2.5 min. Cumulative: ≈ 10–12× faster end-to-end.]
Round 1 did the heavy lifting (~8× from the visual rewrite); round 2's concurrency + OCR change shaves a further ~30% off an already-fast pipeline.
Cumulative: from ~25–30 min down to ~2–2.5 min, roughly 10–12× faster end-to-end. The pipeline is now in "by the time you switch tabs, your video is ready" territory.