Storage lifecycle

What's temporary, what's forever.

Every file ChannelHelm creates has exactly one lifecycle: most are throwaway by the time the pipeline finishes, some are review-only, some are archive-after-publish, and a few are permanent. Here's the complete map, plus the cleanup work that's now live.

Shipped: all three options below are now in production. Option A (inline Stage-1 cleanup) runs at the tail of every pipeline worker; Option B (the archive_package worker + daily cron) moves published media to an external drive; Option C (operator hard-delete) is a 📼 Delete video button on the package page. The map and diagrams below reflect what the code actually does today.
The four lifecycle stages

One artifact, one stage.

Each artifact's stage is determined by its last legitimate consumer. After that read, it's either re-readable later (move to a later stage) or never touched again (delete).

STAGE 1

Pipeline-only

Created during the pipeline, read once by the next step, never touched again. Pure throwaway.

STAGE 2

Review-only

Needed while the operator reviews and edits in the Studio. Deletable after publish.

STAGE 3

Post-publish

Useful for re-renders or audit but not for normal operation. Archive to external drive or delete after N days.

STAGE 4

Permanent

Single source of truth. The audit trail. Never delete; back up regularly.

Every artifact, classified

The complete map.

Sizes are for a typical 8-minute 1080p YouTube video at the standard_audio_visual profile. The Stage column says when each file becomes deletable.

| Artifact | Last consumer | Size | Stage |
| --- | --- | --- | --- |
| original.mp4 | clip_render (any future re-render) | 40–60 MB | post-publish |
| audio.wav | transcribe_audio (one read) | 13–15 MB | pipeline |
| frames_ocr/ | ocr.py (one read) | 25–50 MB | pipeline |
| frames_vlm/ | describe_frames.py (one read) | 3–5 MB | pipeline |
| frame_manifest_ocr.json · frame_manifest_vlm.json | The two Python CLIs (one read each) | ~100 KB | pipeline |
| ocr.json | analyze_visual's merge step | ~420 KB | pipeline |
| frame_descriptions.json | analyze_visual's merge step | 30–180 KB | pipeline |
| frame_index.json | fuse (one read) · also mirrored in Postgres | ~800 KB | pipeline |
| scene_log.json | analyze_intelligence (one read) · also mirrored in Postgres | ~730 KB | pipeline |
| transcript.json | Shorts editor word-snap · also mirrored in Postgres | ~150 KB | review |
| thumbs/concept_*.jpg | Studio thumbnail selection + dispatch upload | ~100 KB | review |
| clips/clip_NNN.mp4 | dispatch + Studio preview | 5–15 MB each | post-publish |
| clips/clip_NNN.ass / .vtt | clip_render only (subtitles embedded into MP4) | few KB | post-publish |
| Postgres rows: sources · packages · assets · jobs · dispatches · signals · webhook_events · settings · llm_providers · voice_examples | Everything | ~500 KB / pkg | permanent |
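The "one artifact, one stage" rule can be written down as plain data. The sketch below is illustrative only: the `Stage` values mirror the Stage column of the map, while the `Disposal` labels and both lookup tables are made-up names, not identifiers from the ChannelHelm codebase.

```typescript
// The four lifecycle stages, using the labels from the map's Stage column.
type Stage = "pipeline" | "review" | "post-publish" | "permanent";

// Hypothetical labels for what happens once a stage has ended.
type Disposal = "delete-inline" | "delete-after-publish" | "archive-or-delete" | "keep";

const disposalByStage: Record<Stage, Disposal> = {
  pipeline: "delete-inline",
  review: "delete-after-publish",
  "post-publish": "archive-or-delete",
  permanent: "keep",
};

// A few rows copied from the map; extend with the rest as needed.
const stageByArtifact: Record<string, Stage> = {
  "audio.wav": "pipeline",
  "frames_ocr/": "pipeline",
  "transcript.json": "review",
  "original.mp4": "post-publish",
  "clips/clip_NNN.mp4": "post-publish",
  "Postgres rows": "permanent",
};

// Returns undefined for artifacts that are not in the (partial) lookup table.
function disposalFor(artifact: string): Disposal | undefined {
  const stage = stageByArtifact[artifact];
  return stage === undefined ? undefined : disposalByStage[stage];
}
```

Keeping the stage on the artifact, rather than scattering deletion rules through the workers, makes it easy to audit that every file has exactly one answer.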
The duplication insight: three JSON files on disk (frame_index.json, scene_log.json, transcript.json) are also mirrored into Postgres via the existing packages.intelligence JSONB column. The disk copies are pure duplication, because every downstream reader can use the Postgres version. So deleting them at the right pipeline step is genuinely safe.
Visualised

One video's storage over time.

Each colour band shows when that artifact is live on disk. The pipeline runs in the first ~3 minutes; the operator reviews in the Studio; eventually the package gets published. The dimmed tails are now real: Stage-1 bands end where Option A deletes them, and the "archived off local" tail is the Option B archive_package worker doing its move.

[Figure: timeline of one video's on-disk artifacts across the phases Ingest, Transcribe, Analyze visual, Fuse + Intel., Review, and Publish. Bands: original.mp4, audio.wav, frames_ocr/, frames_vlm/, ocr.json + frame_descriptions, frame_index.json, scene_log.json, transcript.json, thumbs/concept_*.jpg, clips/clip_NNN.mp4, Postgres rows. Legend: Stage 1 (pipeline only), Stage 2 (review-only), Stage 3 (archive after publish), Stage 4 (permanent), Archived (off local).]
Three cleanup options · all shipped

Same goal, different scope.

All three are independent: each can stand alone or stack. All three have shipped. A was the highest ROI for the least effort, so it landed first; B followed; C is the operator-triggered escape hatch for purging a specific package's source on demand.

A

Stage 1 inline cleanup ✓ shipped

An rm at the success tail of each pipeline worker deletes its single-consumer inputs. No policy, no settings, no DB changes. Future ingests average ~40 MB on disk instead of ~85 MB. Escape hatch: KEEP_PIPELINE_ARTIFACTS=1.

−45 MB per video
live in production
B

Post-publish archive job ✓ shipped

The archive_package worker (daily cron) moves original.mp4 + clips/ to an external drive N days after the latest dispatch. Sources stay restorable for re-renders via sources.archive_path. Settings: ARCHIVE_ROOT, ARCHIVE_AFTER_DAYS, ARCHIVE_DELETE_CLIPS.

−45 MB+ per archived pkg
live in production
C

Hard delete source video ✓ shipped

Operator-triggered: a 📼 Delete video button (with confirm) on the package page. deleteSourceVideo removes the local media dir (MEDIA_ROOT-guarded) AND the archived copy, then nulls sources.local_media_path + archive_path so future clip_render / Backlog-Revival re-runs fail cleanly. Postgres history kept.

−40 MB+ per deletion
live in production
A
SHIPPED · runs on every pipeline run

Stage 1 inline cleanup

Delete each pipeline artifact the moment its sole consumer is done with it.

What it does

An rm step on the success path of four workers. Each Stage-1 artifact is removed as soon as the worker that last reads it succeeds. No future code ever reads those files; Postgres holds whatever data downstream readers actually need. This is live in transcribe_audio.ts, analyze_visual.ts, fuse.ts, and analyze_intelligence.ts.

What gets deleted, when

| At the end of… | Delete | Saves |
| --- | --- | --- |
| transcribe_audio | audio.wav, diarization.json | ~13 MB |
| analyze_visual | frames_ocr/, frames_vlm/, frame_manifest_ocr.json, frame_manifest_vlm.json, ocr.json, frame_descriptions.json | ~30–55 MB |
| fuse | frame_index.json (Postgres mirror exists) | ~800 KB |
| analyze_intelligence | scene_log.json (Postgres mirror exists) | ~730 KB |

Implementation

// Both excerpts assume `import { rm } from 'node:fs/promises'` at the top of the worker file.
// workers/kinds/transcribe_audio.ts (tail, after success)
if (process.env.KEEP_PIPELINE_ARTIFACTS !== '1') {
  await rm(audioPath, { force: true });
  await rm(diarizationPath, { force: true });
}

// workers/kinds/analyze_visual.ts (tail)
if (process.env.KEEP_PIPELINE_ARTIFACTS !== '1') {
  await Promise.all([
    rm(framesOcrDir, { recursive: true, force: true }),
    rm(framesVlmDir, { recursive: true, force: true }),
    rm(ocrPath, { force: true }),
    rm(descriptionsPath, { force: true }),
    rm(ocrManifestPath, { force: true }),
    rm(vlmManifestPath, { force: true }),
  ]);
}

Same pattern in fuse.ts and analyze_intelligence.ts. KEEP_PIPELINE_ARTIFACTS=1 is the debugging escape hatch: set it when you want to ls frames_vlm/ to inspect what the VLM saw.
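Since the same guard-then-rm pattern repeats in four workers, it could be factored into one helper. This is a sketch of that idea, not how the workers are actually written: `CleanupTarget`, `targetsToRemove` and `cleanupStage1` are hypothetical names, and the only things taken from the source are the `rm` options and the KEEP_PIPELINE_ARTIFACTS flag.

```typescript
import { rm } from "node:fs/promises";

// Hypothetical shape: one entry per Stage-1 artifact a worker owns.
interface CleanupTarget {
  path: string;
  recursive?: boolean; // true for directories such as frames_ocr/
}

type Env = Record<string, string | undefined>;

// Pure decision: honour the KEEP_PIPELINE_ARTIFACTS=1 escape hatch.
function targetsToRemove(env: Env, targets: CleanupTarget[]): CleanupTarget[] {
  return env["KEEP_PIPELINE_ARTIFACTS"] === "1" ? [] : targets;
}

// Effect: remove everything selected. force: true makes a missing path a no-op,
// so re-running a worker after a partial cleanup does not throw.
async function cleanupStage1(env: Env, targets: CleanupTarget[]): Promise<void> {
  await Promise.all(
    targetsToRemove(env, targets).map((t) =>
      rm(t.path, { recursive: t.recursive ?? false, force: true }),
    ),
  );
}
```

At the tail of fuse.ts this would be a single call along the lines of `await cleanupStage1(process.env, [{ path: frameIndexPath }])`, where `frameIndexPath` stands in for whatever variable the worker already holds. Splitting the decision from the deletion keeps the escape hatch testable without touching disk.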

Pros
  • Biggest disk saving per video (~50% reduction)
  • No DB changes, no settings, no migration, no policy decision
  • ~20 lines of code total across 4 files
  • Easy debugging via env escape hatch
  • Functional impact is zero: every consumer is already done
Cons
  • Re-running analyze_visual on the same source has to re-sample frames (~5 s extra)
  • Lose the "ls into the frames folder to see what was sampled" debugging convenience (mitigated by the env flag)
B
SHIPPED · daily cron + archive_package worker

Post-publish archive job

Cron-driven worker moves published media to an external drive N days after the latest dispatch.

What it does

The worker kind archive_package (workers/kinds/archive_package.ts) runs on a schedule. For each package whose latest successful dispatch happened more than ARCHIVE_AFTER_DAYS ago AND that hasn't been archived yet, it sets packages.archived_at. The physical file move (original.mp4 + clips/ from MEDIA_ROOT to ARCHIVE_ROOT) only fires when this is the last unarchived package on its source; the worker then records the new location in sources.archive_path so any future re-render still resolves. The move is cross-filesystem-safe: copy → verify size → delete, never a bare rename.
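The copy → verify size → delete sequence is worth seeing spelled out, because a plain rename fails with EXDEV when source and destination sit on different filesystems. The following is a minimal single-file sketch under stated assumptions: `moveFileVerified` is a hypothetical name, it uses the synchronous fs API for brevity where the real worker is presumably async, and it handles one file rather than a whole directory tree.

```typescript
import { copyFileSync, existsSync, mkdtempSync, rmSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Move one file in three steps so the original survives any failed copy.
function moveFileVerified(src: string, dest: string): void {
  const expected = statSync(src).size;
  copyFileSync(src, dest);
  const actual = statSync(dest).size;
  if (actual !== expected) {
    rmSync(dest, { force: true }); // discard the bad copy, keep the original
    throw new Error(`size mismatch copying ${src}: got ${actual}, expected ${expected}`);
  }
  rmSync(src); // only reached once the copy has been verified
}

// Usage on two scratch files in a temp directory.
const scratchDir = mkdtempSync(join(tmpdir(), "archive-demo-"));
const scratchSrc = join(scratchDir, "original.mp4");
const scratchDest = join(scratchDir, "archived.mp4");
writeFileSync(scratchSrc, "fake video bytes");
moveFileVerified(scratchSrc, scratchDest);
console.log(existsSync(scratchSrc), existsSync(scratchDest)); // false true
```

A size check is cheap but only catches truncated copies; a checksum would catch corruption too, at the cost of reading every byte twice.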

[Figure: archive flow. Dispatch succeeds (dispatches.success) → wait N days (latest dispatch < now() − ARCHIVE_AFTER_DAYS) → eligible package (archived_at IS NULL) → set packages.archived_at (always, per eligible package) → move original.mp4 + clips/ to ARCHIVE_ROOT only if this is the last unarchived package on the source (copy → verify → delete). If a sibling package is still unarchived, only the flag flips; the files stay pinned locally until the last package on the source is archived.]

Eligibility is recomputed by the daily enqueuer; the file move is gated on last-package-wins so a source with several packages keeps its bytes until every package referencing it is archived.
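The eligibility rule reduces to a small predicate. This sketch assumes an in-memory `PackageRow` shape invented for illustration; in production the same check presumably runs as a query against packages and dispatches.

```typescript
// Hypothetical in-memory view of one package; field names are illustrative.
interface PackageRow {
  id: string;
  archivedAt: Date | null;               // packages.archived_at
  latestSuccessfulDispatch: Date | null; // newest successful dispatch, if any
}

const DAY_MS = 24 * 60 * 60 * 1000;

function isEligibleForArchive(pkg: PackageRow, now: Date, archiveAfterDays: number): boolean {
  if (pkg.archivedAt !== null) return false;               // already archived
  if (pkg.latestSuccessfulDispatch === null) return false; // never published
  const cutoff = now.getTime() - archiveAfterDays * DAY_MS;
  // Strictly older than the cutoff: "more than ARCHIVE_AFTER_DAYS ago".
  return pkg.latestSuccessfulDispatch.getTime() < cutoff;
}
```

Because the check uses the latest dispatch, re-publishing a package resets its clock.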

Trigger model

Settings (live on the /settings page)

| Key | Purpose | Default |
| --- | --- | --- |
| ARCHIVE_ROOT (boot-only) | Filesystem path to external drive root. Boot-only: workers cache the absolute path at startup. | unset (disables archiving) |
| ARCHIVE_AFTER_DAYS | Days since latest dispatch before a package is eligible | 14 |
| ARCHIVE_DELETE_CLIPS | Delete rendered clip MP4s instead of moving them (original.mp4 still moves) | false |
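The three keys and their defaults could be normalised in one place along these lines. This is a sketch with assumptions stated up front: `ArchiveSettings` and `parseArchiveSettings` are hypothetical names, the raw values are assumed to arrive as strings, and the boolean is assumed to be stored as the literal string "true".

```typescript
interface ArchiveSettings {
  archiveRoot: string | null; // null means archiving is disabled
  archiveAfterDays: number;
  archiveDeleteClips: boolean;
}

function parseArchiveSettings(raw: Record<string, string | undefined>): ArchiveSettings {
  const root = raw["ARCHIVE_ROOT"]?.trim();
  const daysRaw = raw["ARCHIVE_AFTER_DAYS"]?.trim();
  const days = daysRaw ? Number(daysRaw) : NaN;
  return {
    // Unset or blank ARCHIVE_ROOT disables archiving entirely.
    archiveRoot: root ? root : null,
    // Fall back to the documented default of 14 on missing or invalid input.
    archiveAfterDays: Number.isFinite(days) && days >= 0 ? days : 14,
    // Assumption: anything other than the string "true" counts as false.
    archiveDeleteClips: raw["ARCHIVE_DELETE_CLIPS"] === "true",
  };
}
```

Treating an unset ARCHIVE_ROOT as "off" means a fresh install never tries to move files to a drive that was never configured.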

Two DB columns: migration 0007_archive_lifecycle.sql (applied)

ALTER TABLE sources  ADD COLUMN archive_path text;
ALTER TABLE packages ADD COLUMN archived_at timestamptz;
-- archive_path: clip_render's fallback for re-renders after the source moves
-- archived_at:  the audit timestamp + the eligibility gate (IS NULL)

Last-package-wins: which package triggers the move?

[Figure: last-package-wins decision. Source src_… holds original.mp4 + clips/ on disk, shared by N packages. pkg_A and pkg_B are already archived (flag set, bytes kept); pkg_C is the last one, now being archived. Any other unarchived package? No: move the bytes to ARCHIVE_ROOT and set sources.archive_path. Yes: flip archived_at only and leave the bytes on local disk.]

A source can back several packages (re-ingests, multiple profiles). The bytes only leave local disk once the last package referencing them is archived; until then each archived sibling just flips its own archived_at.
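Last-package-wins is a one-line check over the source's sibling packages. The sketch below uses a hypothetical in-memory `Pkg` shape and made-up action labels to show the decision; it does not perform the move itself.

```typescript
// Hypothetical in-memory view; field names are illustrative.
interface Pkg {
  id: string;
  sourceId: string;
  archivedAt: Date | null; // packages.archived_at
}

type ArchiveAction = "flag-only" | "flag-and-move";

// Decide what archiving `pkg` should do, given every package in the system.
function archiveActionFor(pkg: Pkg, all: Pkg[]): ArchiveAction {
  const siblingStillUnarchived = all.some(
    (p) => p.sourceId === pkg.sourceId && p.id !== pkg.id && p.archivedAt === null,
  );
  // While any sibling still needs the local bytes, only flip this package's flag.
  return siblingStillUnarchived ? "flag-only" : "flag-and-move";
}
```

A source with a single package always gets "flag-and-move", so the common case behaves exactly like a plain archive.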

Pros
  • Real archive workflow with audit trail (archived_at)
  • External drive holds cold-storage backup of published work
  • Re-renders still work as long as the drive is mounted
  • Operator policy is configurable, not hardcoded
Cons
  • One DB migration (two new columns, additive, so safe)
  • External drive must be mounted at the same path consistently for re-renders to find files
  • If the drive is unmounted, ALL re-renders for archived packages refuse cleanly (operator-visible error)
  • More moving parts to test (cron, eligibility, mount-detection)
C
SHIPPED · 📼 Delete video button on the package page

Hard delete source video

Operator-triggered button. Removes the file permanently. Postgres history stays.

What it does

A 📼 Delete video button (behind a confirm) on the package page. deleteSourceVideo(packageId) removes original.mp4 (the whole local media dir) from MEDIA_ROOT (guarded) AND the archived copy from ARCHIVE_ROOT if Option B already moved it. It then nulls sources.local_media_path + sources.archive_path so re-renders refuse with a friendly "media deleted" error instead of an ENOENT crash.

It pairs naturally with Option B: archive after 14 days, then let the operator hard-delete once they decide "I genuinely don't need this anymore." This is the on-demand escape hatch for purging a specific package's source for storage or sensitivity reasons.
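The MEDIA_ROOT guard matters because a recursive delete on a wrong path is unrecoverable. One common way to write such a guard is shown below; `isInsideRoot` is a hypothetical helper, and the source does not say how deleteSourceVideo actually implements its check.

```typescript
import { isAbsolute, relative, resolve, sep } from "node:path";

// True only when target is strictly inside root: not root itself,
// and not reachable only through a ".." escape.
function isInsideRoot(root: string, target: string): boolean {
  const rel = relative(resolve(root), resolve(target));
  return (
    rel !== "" &&                    // target is the root itself
    rel !== ".." &&                  // target is the parent of root
    !rel.startsWith(".." + sep) &&   // target is outside root
    !isAbsolute(rel)                 // different drive (Windows)
  );
}
```

Resolving both paths first means a stored value like `MEDIA_ROOT/../etc` is rejected instead of being trusted on its prefix. Refusing the root itself prevents a blank or malformed path from wiping every source at once.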

What stays

What's lost forever

UX guards

A confirm step before the irreversible delete. After deletion, any future clip_render or Backlog-Revival re-run on the source fails cleanly with a "media deleted" error rather than crashing.
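Failing cleanly comes down to checking the two nulled columns before touching the filesystem. Here is a sketch of how a re-render might resolve its input; `SourceRow`, `resolveSourceMedia` and the injected `exists` callback are all hypothetical, as is the wording of the second error.

```typescript
// Hypothetical view of the two path columns on a sources row.
interface SourceRow {
  localMediaPath: string | null; // sources.local_media_path
  archivePath: string | null;    // sources.archive_path
}

// `exists` is injected so the logic can be exercised without real files.
function resolveSourceMedia(src: SourceRow, exists: (path: string) => boolean): string {
  // Both columns nulled means an operator hard-deleted the media.
  if (src.localMediaPath === null && src.archivePath === null) {
    throw new Error("media deleted");
  }
  // Prefer the local copy, then fall back to the archived one.
  for (const candidate of [src.localMediaPath, src.archivePath]) {
    if (candidate !== null && exists(candidate)) return candidate;
  }
  // A path is recorded but nothing is there, e.g. the archive drive is unmounted.
  throw new Error("media unavailable: is the archive drive mounted?");
}
```

Distinguishing "deleted" from "unavailable" tells the operator whether remounting the drive can help.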

Pros
  • Maximum recovery of disk + archive space
  • Operator-controlled, not automatic
  • Pairs cleanly with Option B (archive first, delete later)
  • Postgres audit trail untouched
Cons
  • Irreversible: once deleted, no re-renders ever
  • Needs careful confirmation UX
  • Re-render flow needs to detect + refuse gracefully (a regression test is essential)
Side by side

How they differ.

| | A · Inline cleanup (shipped) | B · Archive job (shipped) | C · Hard delete (shipped) |
| --- | --- | --- | --- |
| When it runs | Automatically, at the end of each pipeline step | Daily cron, N days after the latest dispatch | Manual, operator clicks the 📼 Delete video button |
| What it touches | Stage-1 artifacts only (audio.wav, diarization.json, frames, intermediate JSONs) | Stage-3 artifacts (original.mp4, clips/), moved only when the last package on the source archives | Stage-3 artifacts (original.mp4 + local clips); pairs with B (deletes the archived copy too) |
| Destination | Trash: files deleted forever | External drive: files moved, restorable | Trash: files deleted forever |
| Operator control | Off via env KEEP_PIPELINE_ARTIFACTS=1 | Per-package via cadence config + ad-hoc trigger | Per-package, requires explicit click + confirm |
| Database changes | None | 2 new columns (archived_at, archive_path), migration 0007, applied | None (nulls existing local_media_path + archive_path) |
| Settings keys | 1 env (KEEP_PIPELINE_ARTIFACTS) | 3 (ARCHIVE_ROOT, ARCHIVE_AFTER_DAYS, ARCHIVE_DELETE_CLIPS) | 0 |
| Re-render impact | Re-running visual pipeline re-samples frames (~5 s extra) | Re-render works as long as external drive is mounted | Re-render fails cleanly with "media deleted" |
| Disk saved per video | ~45 MB | ~45 MB + clip size | ~40 MB + clip size |
| Status | ✓ shipped | ✓ shipped | ✓ shipped (deleteSourceVideo + 📼 button + confirm) |
| Reversibility | Irreversible (but artifacts are derived and re-runnable) | Fully reversible (files moved, not deleted) | Irreversible |
| Best paired with | Always, independent of B/C | A (already saved transient files; B handles published files) | A + B (post-archive cleanup of selected packages) |
What it costs at scale

10, 100, 1,000, 10,000 videos.

| Videos published | Before cleanup | Option A (live) | Option A + B (live, archive after 14d) | + C, delete some (live) |
| --- | --- | --- | --- | --- |
| 10 | ~1.7 GB | ~0.9 GB | ~0.4 GB | ~0.4 GB |
| 100 | ~17 GB | ~9 GB | ~4 GB local + 13 GB archive | ~4 GB local + selective archive |
| 1,000 | ~170 GB | ~90 GB | ~40 GB local + 130 GB archive | ~40 GB local + selective archive |
| 10,000 | ~1.7 TB | ~900 GB | ~400 GB local + 1.3 TB archive | ~400 GB local + selective archive |

The "Option A" column assumes Option A is active on every ingest going forward. The "Option A + B" column shows local disk split from the external drive. Postgres footprint stays tiny throughout (~500 KB per package).

Where this landed: Option A shipped first (biggest win, smallest change, zero policy decisions) and now runs at the tail of every pipeline worker. Option B followed: the archive_package worker plus the daily recurring enqueuer move published media to ARCHIVE_ROOT once a package's latest dispatch is older than ARCHIVE_AFTER_DAYS (default 14). Option C, the operator-triggered hard delete, is now shipped too: the 📼 Delete video button on the package page purges a specific package's source on demand. All three are live.

All three were designed to be independent. A is self-contained (no dependencies on B or C). B adds the archive system that C leverages by deleting the archived copy alongside the local one. C also works standalone: with no ARCHIVE_ROOT configured it just deletes from MEDIA_ROOT directly.