Raw files are transient

Derived artifacts persist

0 assets staged locally

Ingestion pipeline

The app server stays CPU-only. It parses, chunks, ranks, and records provenance; paid model APIs are reached only through LiteLLM.

Transient upload

Accept teacher files and hold them only long enough to parse them. Raw bytes are not a permanent asset class in this system.

Extract page, slide, section, transcript, and layout data first. OCR is the fallback, not the primary path.

Create retrieval chunks with lexical and dense search surfaces, then attach concept metadata and citation offsets.

Teachers review the ingestion summary before the course brain becomes the active retrieval source.
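As a rough illustration of the parse and chunk steps above, the sketch below prefers natively extracted text, falls back to OCR only when nothing was extracted, and splits a page into overlapping windows that carry their own citation offsets. All names here (Chunk, page_text_or_ocr, chunk_page, the size and overlap defaults) are hypothetical and not taken from the actual pipeline; character-window chunking stands in for whatever strategy the system really uses.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Chunk:
    course_id: str  # hypothetical course-space identifier
    asset_id: str   # hypothetical asset identifier
    page: int       # page or slide number the text came from
    start: int      # character offset into the page text (inclusive)
    end: int        # character offset into the page text (exclusive)
    text: str


def page_text_or_ocr(extracted: str, run_ocr: Callable[[], str]) -> str:
    """Use natively extracted text; invoke OCR only when none was found."""
    return extracted if extracted.strip() else run_ocr()


def chunk_page(course_id: str, asset_id: str, page: int, page_text: str,
               size: int = 200, overlap: int = 40) -> List[Chunk]:
    """Split one page into overlapping windows, recording citation offsets."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks: List[Chunk] = []
    step = size - overlap
    start = 0
    while start < len(page_text):
        end = min(start + size, len(page_text))
        chunks.append(Chunk(course_id, asset_id, page, start, end,
                            page_text[start:end]))
        if end == len(page_text):
            break  # the final window reached the end of the page
        start += step
    return chunks
```

Because each chunk stores its page number and offsets, a citation can be resolved back to the exact span with page_text[chunk.start:chunk.end], without needing the raw file.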

Persistent artifacts

asset metadata

page or slide text

layout JSON

chunk embeddings

citation offsets

optional cited-page thumbnails
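One way to picture the persisted record is a small set of dataclasses that hold only derived artifacts and have no field for raw bytes. This is a minimal sketch under assumed names: AssetRecord, PageArtifact, ChunkArtifact, and every field name are hypothetical, and the real storage schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PageArtifact:
    page: int                             # page or slide number
    text: str                             # extracted page or slide text
    layout: dict                          # layout JSON for this page
    thumbnail_path: Optional[str] = None  # set only for cited pages


@dataclass
class ChunkArtifact:
    page: int
    start: int              # citation offset into the page text (inclusive)
    end: int                # citation offset into the page text (exclusive)
    embedding: List[float]  # dense vector for this chunk


@dataclass
class AssetRecord:
    # Asset metadata (all field names are illustrative assumptions).
    asset_id: str
    course_id: str
    filename: str
    media_type: str
    pages: List[PageArtifact] = field(default_factory=list)
    chunks: List[ChunkArtifact] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the derived artifacts; no raw file content is included."""
        return json.dumps(asdict(self), sort_keys=True)
```

Leaving raw bytes out of the type itself means a record cannot accidentally persist the original upload.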

Guardrails

Never keep raw files by default after successful parsing.

Never mix chunks from different course spaces in retrieval.

Never let the frontend call model vendors directly.

Always capture enough provenance to show page-, slide-, or timestamp-level citations.
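Two of these guardrails can be sketched in a few lines. The example below scopes retrieval to a single course space before any ranking happens, and deletes the raw upload only after parsing succeeds. It is an illustration under assumptions, not the system's implementation: IndexedChunk, retrieve, and parse_then_discard are hypothetical names, and a simple shared-term count stands in for the real lexical and dense ranking.

```python
import os
import tempfile
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class IndexedChunk:
    course_id: str  # hypothetical course-space identifier
    page: int       # provenance for page-level citations
    text: str


def retrieve(index: List[IndexedChunk], course_id: str, query: str,
             k: int = 3) -> List[IndexedChunk]:
    """Rank chunks by shared-term count, considering one course space only."""
    terms = set(query.lower().split())
    # Filter by course first so other courses never enter the ranking.
    scoped = [c for c in index if c.course_id == course_id]
    scored = [(len(terms & set(c.text.lower().split())), c) for c in scoped]
    scored = [(s, c) for s, c in scored if s > 0]
    scored.sort(key=lambda sc: sc[0], reverse=True)
    return [c for _, c in scored[:k]]


def parse_then_discard(raw_path: str, parse: Callable[[str], dict]) -> dict:
    """Run the parser, then delete the raw file only if parsing succeeded."""
    result = parse(raw_path)  # an exception here leaves the file in place
    os.remove(raw_path)
    return result
```

Filtering before scoring, rather than after, keeps cross-course chunks out of the candidate set entirely, so a ranking bug cannot leak them.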

Live course asset register

0 pending parse

No course exists yet. Create a course on the dashboard and stage the first asset there.