Every file ChannelHelm creates has exactly one lifecycle: most are throwaway by the time the pipeline finishes, some are review-only, some are archive-after-publish, a few are permanent. Here's the complete map plus the cleanup work that's now live.
Option B (the archive_package worker + daily cron) moves published media to an external drive; Option C (operator hard-delete) is a Delete video button on the package page. The map and diagrams below reflect what the code actually does today.
Each artifact's stage is determined by its last legitimate consumer. After that read, it's either re-readable later (move to a later stage) or never touched again (delete).
Stage 1 (pipeline): Created during the pipeline, read once by the next step, never touched again. Pure throwaway.
Stage 2 (review): Needed while the operator reviews and edits in the Studio. Deletable after publish.
Stage 3 (post-publish): Useful for re-renders or audit but not for normal operation. Archive to external drive or delete after N days.
Stage 4 (permanent): Single source of truth. The audit trail. Never delete; back up regularly.
Sizes are for a typical 8-minute 1080p YouTube video at standard_audio_visual profile. Stage badge says when each file becomes deletable.
| Artifact | Last consumer | Size | Stage |
|---|---|---|---|
| original.mp4 | clip_render (any future re-render) | 40–60 MB | post-publish |
| audio.wav | transcribe_audio (one read) | 13–15 MB | pipeline |
| frames_ocr/ | ocr.py (one read) | 25–50 MB | pipeline |
| frames_vlm/ | describe_frames.py (one read) | 3–5 MB | pipeline |
| frame_manifest_ocr.json · frame_manifest_vlm.json | The two Python CLIs (one read each) | ~100 KB | pipeline |
| ocr.json | analyze_visual's merge step | ~420 KB | pipeline |
| frame_descriptions.json | analyze_visual's merge step | 30–180 KB | pipeline |
| frame_index.json | fuse (one read) · also mirrored in Postgres | ~800 KB | pipeline |
| scene_log.json | analyze_intelligence (one read) · also mirrored in Postgres | ~730 KB | pipeline |
| transcript.json | Shorts editor word-snap · also mirrored in Postgres | ~150 KB | review |
| thumbs/concept_*.jpg | Studio thumbnail selection + dispatch upload | ~100 KB | review |
| clips/clip_NNN.mp4 | dispatch + Studio preview | 5–15 MB each | post-publish |
| clips/clip_NNN.ass / .vtt | clip_render only (subtitles embedded into MP4) | few KB | post-publish |
| Postgres rows: sources · packages · assets · jobs · dispatches · signals · webhook_events · settings · llm_providers · voice_examples | Everything | ~500 KB / pkg | permanent |
Three artifacts (frame_index.json, scene_log.json, transcript.json) are also mirrored into Postgres via the existing packages.intelligence JSONB column. The disk copies are pure duplication: every downstream reader can use the Postgres version. So deleting them at the right pipeline step is genuinely safe.
Each colour band shows when that artifact is live on disk. The pipeline runs in the first ~3 minutes; the operator reviews in the Studio; eventually the package gets published. The dimmed tails are now real: Stage-1 bands end where Option A deletes them, and the "archived off local" tail is the Option B archive_package worker doing its move.
All three are independent: each can stand alone or stack. A, B and C all shipped (A was the highest ROI for least effort, so it landed first; B followed; C is the operator-triggered escape hatch for purging a specific package's source on demand).
Option A (inline cleanup): An rm at the success tail of each pipeline worker deletes its single-consumer inputs. No policy, no settings, no DB changes. Future ingests average ~40 MB on disk instead of ~85 MB. Escape hatch: KEEP_PIPELINE_ARTIFACTS=1.
Option B (archive job): The archive_package worker (daily cron) moves original.mp4 + clips/ to an external drive N days after the latest dispatch. Sources stay restorable for re-renders via sources.archive_path. Settings: ARCHIVE_ROOT, ARCHIVE_AFTER_DAYS, ARCHIVE_DELETE_CLIPS.
Option C (hard delete): Operator-triggered: a Delete video button (with confirm) on the package page. deleteSourceVideo removes the local media dir (MEDIA_ROOT-guarded) AND the archived copy, then nulls sources.local_media_path + archive_path so future clip_render / Backlog-Revival re-runs fail cleanly. Postgres history kept.
An rm step on the success path of four workers. Every Stage-1 artifact gets removed seconds after the next pipeline step starts. No future code ever reads those files; Postgres holds whatever data downstream readers actually need. This is live in transcribe_audio.ts, analyze_visual.ts, fuse.ts, and analyze_intelligence.ts.
| At the end of… | Delete | Saves |
|---|---|---|
| transcribe_audio | audio.wav, diarization.json | ~13 MB |
| analyze_visual | frames_ocr/, frames_vlm/, frame_manifest_ocr.json, frame_manifest_vlm.json, ocr.json, frame_descriptions.json | ~30–55 MB |
| fuse | frame_index.json (Postgres mirror exists) | ~800 KB |
| analyze_intelligence | scene_log.json (Postgres mirror exists) | ~730 KB |

```ts
// workers/kinds/transcribe_audio.ts (tail, after success)
if (process.env.KEEP_PIPELINE_ARTIFACTS !== '1') {
  await rm(audioPath, { force: true });
  await rm(diarizationPath, { force: true });
}

// workers/kinds/analyze_visual.ts (tail)
if (process.env.KEEP_PIPELINE_ARTIFACTS !== '1') {
  await Promise.all([
    rm(framesOcrDir, { recursive: true, force: true }),
    rm(framesVlmDir, { recursive: true, force: true }),
    rm(ocrPath, { force: true }),
    rm(descriptionsPath, { force: true }),
    rm(ocrManifestPath, { force: true }),
    rm(vlmManifestPath, { force: true }),
  ]);
}
```

Same pattern in fuse.ts and analyze_intelligence.ts. KEEP_PIPELINE_ARTIFACTS=1 is the debugging escape hatch: set it when you want to ls frames_vlm/ to inspect what the VLM saw.
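
The repeated tail pattern could be factored into one small helper. The sketch below is only an illustration: `cleanupArtifacts` is a hypothetical name (the workers described here inline their `rm` calls), and it assumes `rm` comes from Node's `node:fs/promises`.

```typescript
import { rm } from 'node:fs/promises';

/**
 * Hypothetical helper, not part of the codebase described above.
 * Removes single-consumer artifacts unless KEEP_PIPELINE_ARTIFACTS=1.
 * Returns true when cleanup ran, false when the escape hatch skipped it.
 */
async function cleanupArtifacts(
  paths: readonly string[],
  env: Record<string, string | undefined> = process.env,
): Promise<boolean> {
  // Debugging escape hatch: leave every artifact on disk.
  if (env.KEEP_PIPELINE_ARTIFACTS === '1') return false;

  // recursive + force handles files and directories alike,
  // and does not throw for paths that are already gone.
  await Promise.all(paths.map((p) => rm(p, { recursive: true, force: true })));
  return true;
}
```

A worker tail would then reduce to a single call such as `await cleanupArtifacts([audioPath, diarizationPath])`. Passing `env` explicitly keeps the helper easy to exercise without touching the real process environment.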
Re-running analyze_visual on the same source has to re-sample frames (~5 s extra).
The worker kind archive_package (workers/kinds/archive_package.ts) runs on a schedule. For each package whose latest successful dispatch happened more than ARCHIVE_AFTER_DAYS ago AND that hasn't been archived yet, it sets packages.archived_at. The physical file move (original.mp4 + clips/ from MEDIA_ROOT to ARCHIVE_ROOT) only fires when this is the last unarchived package on its source; the worker then records the new location in sources.archive_path so any future re-render still resolves. The move is cross-filesystem-safe: copy, verify size, then delete, never a bare rename.
Eligibility is recomputed by the daily enqueuer; the file move is gated on last-package-wins so a source with several packages keeps its bytes until every package referencing it is archived.
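
The copy, verify size, delete sequence can be sketched for a single file as follows. This is a minimal illustration under assumptions, not the worker's actual code: `moveFileVerified` is a hypothetical name, and moving a whole `clips/` directory would need the same steps applied per file.

```typescript
import { copyFile, mkdir, rm, stat } from 'node:fs/promises';
import { dirname } from 'node:path';

/**
 * Hypothetical sketch of a cross-filesystem-safe move.
 * The source is deleted only after the copy's size has been verified.
 */
async function moveFileVerified(src: string, dest: string): Promise<void> {
  // Create the destination directory first; an unmounted or read-only
  // drive fails here, before anything has been touched.
  await mkdir(dirname(dest), { recursive: true });

  await copyFile(src, dest);

  const [srcStat, destStat] = await Promise.all([stat(src), stat(dest)]);
  if (srcStat.size !== destStat.size) {
    // Drop the partial copy and keep the source in place.
    await rm(dest, { force: true });
    throw new Error(`size mismatch copying ${src} to ${dest}`);
  }

  // Only now is it safe to remove the local copy.
  await rm(src);
}
```

A bare rename fails with `EXDEV` when source and destination sit on different filesystems, which is the usual case for an external drive, so the explicit copy is the portable route.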
scripts/enqueue-recurring.ts runs daily (cron / launchd). It fans out one archive_package job per eligible package.
Eligibility: a dispatch with success = true, AND MAX(dispatched_at) < now() - interval '1 day' × ARCHIVE_AFTER_DAYS, AND packages.archived_at IS NULL.
Jobs are identified as archive_package:{packageId}; reruns are no-ops (and the worker itself early-returns when archived_at is already set).
The worker mkdirs the destination first; if that fails it throws, the queue backs off, and the next cycle retries. The source files stay on local disk until the copy verifies, so there is no data-loss path.
| Key | Purpose | Default |
|---|---|---|
| ARCHIVE_ROOT (boot-only) | Filesystem path to external drive root. Workers cache the absolute path at startup. | unset (disables archiving) |
| ARCHIVE_AFTER_DAYS | Days since latest dispatch before a package is eligible | 14 |
| ARCHIVE_DELETE_CLIPS | Delete rendered clip MP4s instead of moving them (original.mp4 still moves) | false |
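
The eligibility rule can also be expressed as a pure predicate, which is handy for unit-testing the gate without a database. The row shapes and the name `isEligibleForArchive` below are assumptions for illustration; the enqueuer described here evaluates the condition in SQL.

```typescript
// Hypothetical in-memory shapes; the real rows live in Postgres.
interface DispatchRow {
  success: boolean;
  dispatchedAt: Date;
}

interface PackageRow {
  archivedAt: Date | null;
  dispatches: DispatchRow[];
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

/**
 * True when the package has at least one successful dispatch, its latest
 * successful dispatch is older than archiveAfterDays, and it has not
 * been archived yet.
 */
function isEligibleForArchive(
  pkg: PackageRow,
  now: Date,
  archiveAfterDays = 14, // documented default for ARCHIVE_AFTER_DAYS
): boolean {
  if (pkg.archivedAt !== null) return false;

  const successful = pkg.dispatches.filter((d) => d.success);
  if (successful.length === 0) return false;

  const latest = Math.max(...successful.map((d) => d.dispatchedAt.getTime()));
  const cutoff = now.getTime() - archiveAfterDays * MS_PER_DAY;
  return latest < cutoff;
}
```

Failed dispatches are ignored here, following the description of eligibility as keyed on the latest successful dispatch.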
0007_archive_lifecycle.sql (applied)

```sql
ALTER TABLE sources ADD COLUMN archive_path text;
ALTER TABLE packages ADD COLUMN archived_at timestamptz;
-- archive_path: clip_render's fallback for re-renders after the source moves
-- archived_at: the audit timestamp + the eligibility gate (IS NULL)
```

A source can back several packages (re-ingests, multiple profiles). The bytes only leave local once the last package referencing them is archived; until then each archived sibling just flips its own archived_at.
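
The last-package-wins gate reduces to one check over the source's packages. The sketch below assumes a simplified in-memory shape (`PackageRef`) and a hypothetical function name; it only illustrates the rule, not how the worker queries it.

```typescript
// Hypothetical shape: just the fields the gate needs.
interface PackageRef {
  id: string;
  sourceId: string;
  archivedAt: Date | null;
}

/**
 * True when every OTHER package on the same source is already archived,
 * meaning the package being archived now is the last one holding the bytes.
 */
function shouldMoveSourceBytes(
  current: PackageRef,
  allPackages: readonly PackageRef[],
): boolean {
  return allPackages
    .filter((p) => p.sourceId === current.sourceId && p.id !== current.id)
    .every((p) => p.archivedAt !== null);
}
```

A source with a single package passes immediately, because `every` is true for an empty list of siblings.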
A Delete video button (behind a confirm) on the package page. deleteSourceVideo(packageId) removes original.mp4 (the whole local media dir) from MEDIA_ROOT (guarded), AND the archived copy from ARCHIVE_ROOT if Option B already moved it. It then nulls sources.local_media_path + sources.archive_path so re-renders refuse with a friendly "media deleted" error instead of an ENOENT crash.
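
The text does not spell out how the MEDIA_ROOT guard works. One common approach, shown here purely as an assumption, is a path-containment check that refuses any target not strictly inside the root; `isInsideRoot` is a hypothetical name.

```typescript
import { relative, resolve, sep } from 'node:path';

/**
 * Hypothetical guard: true only when target resolves to a location
 * strictly inside root. The root itself is rejected, so a bad path
 * can never cause the whole media root to be deleted.
 */
function isInsideRoot(root: string, target: string): boolean {
  const rel = relative(resolve(root), resolve(target));
  if (rel === '') return false; // target is the root itself
  if (rel === '..' || rel.startsWith('..' + sep)) return false; // escapes the root
  // On Windows, a target on another drive yields an absolute path.
  return resolve(rel) !== rel;
}
```

A delete routine would call this before any recursive removal and throw when it returns false, so a corrupted `local_media_path` such as `/srv/media/../etc` is refused rather than followed.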
It pairs naturally with Option B: archive after 14 days, then let the operator hard-delete once they decide "I genuinely don't need this anymore." This is the on-demand escape hatch for purging a specific package's source for storage or sensitivity reasons.
A confirm step before the irreversible delete. After deletion, any future clip_render or Backlog-Revival re-run on the source fails cleanly with a "media deleted" error rather than crashing.
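
The clean failure can be pictured as a small resolution step in front of any re-render. This sketch is an assumption about shape only: `SourceRow`, `MediaDeletedError`, and `resolveSourceMedia` are hypothetical names, and preferring the local path over the archived one is a guess rather than documented behaviour.

```typescript
// Hypothetical view of the two nullable columns on sources.
interface SourceRow {
  localMediaPath: string | null;
  archivePath: string | null;
}

class MediaDeletedError extends Error {
  constructor() {
    super('media deleted');
    this.name = 'MediaDeletedError';
  }
}

/**
 * Returns the path a re-render should read from.
 * Throws a descriptive error once both columns have been nulled,
 * instead of letting a later file read fail with ENOENT.
 */
function resolveSourceMedia(source: SourceRow): string {
  if (source.localMediaPath !== null) return source.localMediaPath;
  if (source.archivePath !== null) return source.archivePath; // Option B fallback
  throw new MediaDeletedError();
}
```

Callers such as a re-render job can catch the error by name or message and surface it to the operator as-is.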
| | A · Inline cleanup (shipped) | B · Archive job (shipped) | C · Hard delete (shipped) |
|---|---|---|---|
| When it runs | Automatically, at the end of each pipeline step | Daily cron, N days after the latest dispatch | Manual, operator clicks the Delete video button |
| What it touches | Stage-1 artifacts only (audio.wav, diarization.json, frames, intermediate JSONs) | Stage-3 artifacts (original.mp4, clips/); moved only when last package on the source archives | Stage-3 artifacts (original.mp4 + local clips); pairs with B (deletes the archived copy too) |
| Destination | Trash: files deleted forever | External drive: files moved, restorable | Trash: files deleted forever |
| Operator control | Off via env KEEP_PIPELINE_ARTIFACTS=1 | Per-package via cadence config + ad-hoc trigger | Per-package, requires explicit click + confirm |
| Database changes | None | 2 new columns (archived_at, archive_path); migration 0007, applied | None (nulls existing local_media_path + archive_path) |
| Settings keys | 1 env (KEEP_PIPELINE_ARTIFACTS) | 3 (ARCHIVE_ROOT, ARCHIVE_AFTER_DAYS, ARCHIVE_DELETE_CLIPS) | 0 |
| Re-render impact | Re-running visual pipeline re-samples frames (~5 s extra) | Re-render works as long as external drive is mounted | Re-render fails cleanly with "source video deleted" |
| Disk saved per video | ~45 MB | ~45 MB + clip size | ~40 MB + clip size |
| Status | shipped | shipped | shipped (deleteSourceVideo + Delete video button + confirm) |
| Reversibility | Irreversible (but artifacts are derived, so re-runnable) | Fully reversible (files moved, not deleted) | Irreversible |
| Best paired with | Always; independent of B/C | A (already saved transient files; B handles published files) | A + B (post-archive cleanup of selected packages) |
| Videos published | Before cleanup | Option A live | Option A + B live (archive after 14d) | + C delete some live |
|---|---|---|---|---|
| 10 | ~1.7 GB | ~0.9 GB | ~0.4 GB | ~0.4 GB |
| 100 | ~17 GB | ~9 GB | ~4 GB local + 13 GB archive | ~4 GB local + selective archive |
| 1,000 | ~170 GB | ~90 GB | ~40 GB local + 130 GB archive | ~40 GB local + selective archive |
| 10,000 | ~1.7 TB | ~900 GB | ~400 GB local + 1.3 TB archive | ~400 GB local + selective archive |
The "Option A" columns assume Option A is active on every ingest going forward. The "Option A + B" columns show local disk split from the external drive. Postgres footprint stays tiny throughout (~500 KB per package).
The archive_package worker plus the daily recurring enqueuer move published media to ARCHIVE_ROOT once a package's latest dispatch is older than ARCHIVE_AFTER_DAYS (default 14). Option C, the operator-triggered hard delete, is now shipped too: the Delete video button on the package page purges a specific package's source on demand. All three are live.
All three were designed to be independent. A is self-contained (no dependencies on B or C). B adds the archive system that C leverages, deleting the archived copy alongside the local one. C also works standalone: with no ARCHIVE_ROOT configured it just deletes from MEDIA_ROOT directly.