One of the newest features in Scribe, my instant YouTube transcript web app, is AI-generated chapters. What makes it special is that the entire process runs locally, inside your browser.
No API calls, no server inference, just pure on-device intelligence powered by Transformers.js from Xenova and Hugging Face.
Running locally offers three big wins:
- Privacy: your transcript text never leaves your computer.
- Speed: chapters are generated instantly, without waiting on a server round trip.
- Scalability: there’s no backend load — everything happens on the client side.
The chapter generation process happens inside a web worker to keep the UI responsive. The key logic lives in a function called chapterize, which takes the transcript paragraphs and automatically detects where the topic shifts.
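The post doesn't show the worker wiring itself, but a minimal sketch looks roughly like this (the message shapes, the renderChapter helper, and chapterize being an async generator are my assumptions for illustration, not Scribe's actual code):

```js
// main.js — a minimal sketch; message shapes and renderChapter are hypothetical
const worker = new Worker(new URL('./chapter-worker.js', import.meta.url), { type: 'module' });
worker.postMessage({ type: 'chapterize', paragraphs });
worker.onmessage = (e) => {
  if (e.data.type === 'chapter') renderChapter(e.data.chapter);
};

// chapter-worker.js — all model work stays off the main thread
self.onmessage = async (e) => {
  if (e.data.type !== 'chapterize') return;
  // Assumes chapterize is an async generator yielding chapters as they're ready.
  for await (const chapter of chapterize(e.data.paragraphs)) {
    self.postMessage({ type: 'chapter', chapter });
  }
};
```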
Here’s the high-level pipeline:
The transcript is broken into chunks (≈200 tokens each) with a bit of overlap, ensuring we don’t miss semantic boundaries.
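Here's one way such chunking can be implemented (a sketch that approximates tokens by whitespace-separated words; Scribe's real chunker may count tokens with the model's tokenizer):

```js
// Split paragraphs into overlapping chunks of roughly `size` words.
// The overlap means a sentence straddling two chunks contributes to both embeddings.
function chunkTranscript(paragraphs, size = 200, overlap = 40) {
  const words = paragraphs.join(' ').split(/\s+/);
  const chunks = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // last window already covered the tail
  }
  return chunks;
}
```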
Using the excellent local model Xenova/all-MiniLM-L6-v2, Scribe creates vector embeddings for each chunk directly in the browser:
```js
const embedder = await pipeline(
  'feature-extraction',
  'Xenova/all-MiniLM-L6-v2',
  { device, dtype, quantized }
);
// Mean-pool and normalize so each chunk maps to a single unit-length vector.
const q = await embedder('query: ' + chunk, { pooling: 'mean', normalize: true });
embeddings.push(q.data);
```

Each chunk becomes a 384-dimensional vector representing its semantic meaning.
To find where the video naturally changes topics, Scribe computes the cosine similarity between the mean embeddings of adjacent windows of chunks.
Low similarity means the content probably shifted focus:
```js
const sims = [];
// Compare the mean embedding of the window ending at chunk i
// with the mean embedding of the window starting right after it.
for (let i = kWindow - 1; i < embeddings.length - kWindow; i++) {
  const left = meanVector(embeddings.slice(i - kWindow + 1, i + 1));
  const right = meanVector(embeddings.slice(i + 1, i + 1 + kWindow));
  sims.push(cosine(left, right));
}
```

After smoothing the similarity curve, the algorithm finds local minima (points where the similarity dips), which correspond to semantic boundaries.
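The meanVector and cosine helpers, along with the smoothing and minima scan, aren't shown above; here's one straightforward way to write them (a sketch of the idea, not necessarily Scribe's exact implementation):

```js
// Average a list of equal-length vectors component-wise.
function meanVector(vectors) {
  const out = new Float32Array(vectors[0].length);
  for (const v of vectors) for (let j = 0; j < v.length; j++) out[j] += v[j];
  for (let j = 0; j < out.length; j++) out[j] /= vectors.length;
  return out;
}

// Cosine similarity. Since the embeddings were created with normalize: true,
// the dot product alone would suffice, but the general form is shown.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let j = 0; j < a.length; j++) {
    dot += a[j] * b[j];
    na += a[j] * a[j];
    nb += b[j] * b[j];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Smooth with a small moving average, then keep indices that dip below
// both neighbors: these are candidate chapter boundaries.
function findBoundaries(sims, radius = 1) {
  const smoothed = sims.map((_, i) => {
    const lo = Math.max(0, i - radius), hi = Math.min(sims.length, i + radius + 1);
    return sims.slice(lo, hi).reduce((s, x) => s + x, 0) / (hi - lo);
  });
  const minima = [];
  for (let i = 1; i < smoothed.length - 1; i++) {
    if (smoothed[i] < smoothed[i - 1] && smoothed[i] < smoothed[i + 1]) minima.push(i);
  }
  return minima;
}
```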
Between these split points, Scribe uses another local model, ldenoue/Title_Generation_T5Small_Model, to generate concise, human-readable titles for each chapter.
This model was trained specifically to title text passages, making it ideal for labeling each semantic segment with a short, meaningful summary.
Each generated title is streamed back to the main thread for rendering in real time.
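Inside the worker, that titling loop might look like this (a sketch assuming the model loads with the text2text-generation pipeline; the generation options and the segments variable, holding the chapter texts between split points, are illustrative):

```js
import { pipeline } from '@huggingface/transformers'; // '@xenova/transformers' in v2

// Load the titling model once, then title each segment and
// stream results to the main thread as they complete.
const titler = await pipeline('text2text-generation', 'ldenoue/Title_Generation_T5Small_Model');

for (const [index, segment] of segments.entries()) {
  const [out] = await titler(segment, { max_new_tokens: 16 });
  self.postMessage({ type: 'chapter', chapter: { index, title: out.generated_text } });
}
```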
You end up with a neatly structured list of AI-generated chapters, computed entirely in the browser. No cloud APIs, no GPU servers, just WebAssembly and ONNX magic running locally.
All of this is made possible by Transformers.js, which brings transformer architectures like MiniLM and T5 directly to JavaScript. Whether running with WebGPU acceleration or a WebAssembly fallback, it’s fast enough to handle transcripts of substantial length on typical laptops and tablets.
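Backend selection can be as simple as feature-detecting WebGPU (a sketch using Transformers.js v3 option names; the exact dtype choices here are illustrative, not Scribe's settings):

```js
import { pipeline } from '@huggingface/transformers';

// Prefer WebGPU when the browser exposes it; otherwise use the WASM backend.
const device = navigator.gpu ? 'webgpu' : 'wasm';
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2', {
  device,
  dtype: device === 'webgpu' ? 'fp32' : 'q8', // quantized weights keep WASM fast
});
```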
By combining local embeddings for semantic segmentation and a title generation model for labeling, Scribe delivers a full end-to-end chaptering pipeline that’s private, responsive, and surprisingly lightweight.
AI chaptering was once a cloud-only capability, but modern browser AI tooling has changed the landscape.
With local models like Xenova/all-MiniLM-L6-v2 for embeddings and ldenoue/Title_Generation_T5Small_Model for chapter titles, Scribe now performs intelligent YouTube transcript analysis, instantly and privately, right in your browser.