The audio capture pipeline, Whisper transcription, question classification, RAG from your resume, SSE streaming, and Picture-in-Picture rendering — explained for ML-literate readers.
TL;DR — The 6-Stage Pipeline
Session Lifecycle
Upload resume and job description
Before the interview, upload your resume (PDF or DOCX) and paste the job description into OphyAI. The system chunks both documents and embeds them into a vector index that will be used for retrieval during the interview.
Enable audio capture
OphyAI requests microphone access (getUserMedia) and system audio access (getDisplayMedia with audio:true). Both streams are captured and mixed into a single 16kHz mono PCM stream.
Start real-time transcription
The audio stream is chunked into 1–3 second segments and sent to the transcription service (Whisper-architecture model). The rolling transcript buffer is continuously updated as the interview proceeds.
Question detected and classified
A lightweight classifier (fine-tuned BERT/DeBERTa) monitors the transcript for interview questions. When detected, it classifies the question type (behavioral, technical, situational) and triggers the retrieval step.
Retrieval-augmented generation
The system retrieves semantically relevant chunks from the pre-indexed resume and JD vector store. These chunks are injected into the LLM prompt as context, and the model generates a STAR-structured answer specific to the candidate.
Answer streamed to Whisper Mode overlay
The LLM response is streamed via SSE or WebSockets. Within 1.5–3 seconds of the question ending, tokens appear progressively in the Document Picture-in-Picture overlay window, which is not captured when the user shares a tab or a single application window.
Deep Dive
Stage 1
Two browser APIs capture audio simultaneously: getUserMedia() for the microphone and getDisplayMedia() with audio:true for system audio (which includes the interviewer's voice from the video call). Both streams are mixed client-side using the Web Audio API into a single mono stream and downsampled to 16kHz PCM, the format Whisper-architecture models expect.
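The mix-and-downsample step can be illustrated outside the browser. The following is a minimal sketch, assuming both sources arrive as float samples at 48kHz and that block averaging is an acceptable stand-in for a proper anti-aliasing resampler; the function names are hypothetical and not part of any OphyAI API.

```python
import struct

def mix_to_mono(mic, system):
    """Average two equal-rate float streams sample by sample (values in [-1, 1])."""
    n = min(len(mic), len(system))
    return [(mic[i] + system[i]) / 2.0 for i in range(n)]

def downsample(samples, src_rate=48_000, dst_rate=16_000):
    """Naive decimation by block averaging (assumes src_rate is a multiple of dst_rate)."""
    if src_rate % dst_rate != 0:
        raise ValueError("src_rate must be an integer multiple of dst_rate")
    factor = src_rate // dst_rate
    usable = len(samples) - len(samples) % factor  # drop a trailing partial block
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]

def to_pcm16(samples):
    """Clamp floats to [-1, 1] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

# Example: six 48kHz samples from each source become two 16kHz PCM samples.
pcm = to_pcm16(downsample(mix_to_mono([1.0] * 6, [0.0] * 6)))
```

A production pipeline would normally use a filtered resampler rather than block averaging, which only roughly suppresses aliasing.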
getUserMedia() + getDisplayMedia(audio:true) → Web Audio API mix → 16kHz PCM
Stage 2
The PCM audio stream is chunked into 1–3 second segments and sent over a WebSocket to the transcription service. Server-side Whisper Large v3 (or a fine-tuned derivative) processes each chunk and returns a partial transcript in 300–600ms. The rolling transcript buffer is updated continuously. Speaker diarization — distinguishing the interviewer's voice from the candidate's — uses a lightweight pyannote.audio model to label turns.
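As a rough illustration of the chunking and the rolling buffer, here is a minimal sketch assuming 16-bit samples and a fixed 2-second chunk length. The `RollingTranscript` class and its segment limit are hypothetical, and the WebSocket transport and the transcription call are left out.

```python
from collections import deque

SAMPLE_RATE = 16_000   # Hz, as stated in the text
BYTES_PER_SAMPLE = 2   # assumption: 16-bit PCM

def iter_chunks(pcm: bytes, seconds: float = 2.0):
    """Yield fixed-duration slices of a PCM byte string; the last one may be shorter."""
    size = int(SAMPLE_RATE * seconds) * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), size):
        yield pcm[start:start + size]

class RollingTranscript:
    """Keeps only the most recent diarized segments (hypothetical helper)."""

    def __init__(self, max_segments: int = 50):
        self._segments = deque(maxlen=max_segments)

    def append(self, speaker: str, text: str) -> None:
        self._segments.append((speaker, text.strip()))

    def text(self) -> str:
        return "\n".join(f"{speaker}: {text}" for speaker, text in self._segments)

# Example: five seconds of silence split into 2s + 2s + 1s chunks.
chunk_sizes = [len(c) for c in iter_chunks(bytes(5 * SAMPLE_RATE * BYTES_PER_SAMPLE))]
```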
Audio chunks → WebSocket → Whisper v3 → rolling transcript buffer
Stage 3
A fine-tuned DeBERTa classifier (≈150M parameters) monitors the rolling transcript for interview question signals: interrogative phrasing, turn-ending silence detected by VAD (Voice Activity Detection), and question-type keywords. When a question boundary is detected, the classifier assigns a category — behavioral, technical, situational, competency, or case/framework — that will guide downstream prompt construction. Inference runs in under 50ms on a T4 GPU.
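To make the classifier's inputs and outputs concrete, the sketch below uses a toy rule-based stand-in. It is not the fine-tuned DeBERTa model described above: the keyword lists and the regular expression are illustrative assumptions, and only three of the five categories are covered.

```python
import re

# Hypothetical keyword lists, for illustration only.
CATEGORY_KEYWORDS = {
    "behavioral": ("tell me about a time", "describe a situation", "give me an example"),
    "situational": ("what would you do", "how would you handle", "imagine"),
    "technical": ("implement", "complexity", "design a", "difference between"),
}

# Crude check for interrogative phrasing at the start of an utterance.
INTERROGATIVE = re.compile(
    r"^(who|what|when|where|why|how|can|could|would|do|does|did|is|are|"
    r"tell me|describe|walk me)\b",
    re.IGNORECASE,
)

def looks_like_question(utterance: str) -> bool:
    text = utterance.strip()
    return text.endswith("?") or bool(INTERROGATIVE.match(text))

def classify(utterance: str):
    """Return a category name, 'other' for unmatched questions, or None for non-questions."""
    if not looks_like_question(utterance):
        return None
    lowered = utterance.lower()
    for category, phrases in CATEGORY_KEYWORDS.items():
        if any(phrase in lowered for phrase in phrases):
            return category
    return "other"
```

A learned classifier replaces these rules with a model trained on labeled interview transcripts, but the contract is the same: utterance in, category (or nothing) out.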
Transcript → VAD boundary → DeBERTa classifier → question category
Stage 4
At session setup, the user's resume and JD are chunked (512-token chunks, 64-token overlap) and embedded using text-embedding-3-small. Embeddings are stored in a per-session vector index. When a question arrives, the question text is embedded and a top-k=5 similarity search retrieves the most relevant resume chunks — specific projects, metrics, job titles, technologies. These chunks are injected into the generation prompt as grounding context.
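The chunking and retrieval logic can be sketched in a few lines. This is a minimal illustration that assumes the text is already tokenized and that embeddings are plain lists of floats; the embedding call itself is omitted, and a brute-force scan stands in for a real vector index.

```python
import math

def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=5):
    """Return the texts of the k entries most similar to the query.

    `index` is a list of (chunk_text, embedding) pairs.
    """
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

With 512-token chunks and a 64-token overlap, each new chunk starts 448 tokens after the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.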
Question embed → cosine similarity → top-5 resume chunks → prompt context
Stage 5
The assembled prompt, consisting of system instructions, question type, retrieved resume context, job description summary, and the raw question, is sent to a hosted LLM (GPT-4o, Claude 3.5 Sonnet, or a fine-tuned Llama-3 variant, depending on implementation). Generation is requested with streaming enabled (for example, stream: true in the OpenAI and Anthropic APIs). The first token arrives in 200–400ms. Tokens stream via Server-Sent Events back to the client, which appends them to the overlay in real time. STAR structure (Situation, Task, Action, Result) is enforced via the prompt.
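The prompt assembly and the SSE framing might look roughly like the following. This is a sketch under assumptions: the prompt wording and the `{"token": ...}` JSON payload are hypothetical, and the LLM call is omitted. Only the `data:` line followed by a blank line comes from the Server-Sent Events format itself.

```python
import json

def build_prompt(question, category, resume_chunks, jd_summary):
    """Fill a simple template with the retrieved context (wording is illustrative)."""
    context = "\n".join(f"- {chunk}" for chunk in resume_chunks)
    return (
        "You are an interview assistant. Answer using the STAR structure "
        "(Situation, Task, Action, Result).\n"
        f"Question type: {category}\n"
        f"Job description summary: {jd_summary}\n"
        f"Resume context:\n{context}\n"
        f"Question: {question}"
    )

def sse_event(token: str) -> str:
    """Frame one token as a Server-Sent Event: a `data:` line, then a blank line."""
    return f"data: {json.dumps({'token': token})}\n\n"

def parse_sse(stream: str):
    """Client side: recover the token sequence from concatenated SSE events."""
    tokens = []
    for block in stream.split("\n\n"):
        for line in block.splitlines():
            if line.startswith("data: "):
                tokens.append(json.loads(line[len("data: "):])["token"])
    return tokens
```

JSON-encoding each token keeps newlines inside a token from being mistaken for event boundaries.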
Prompt → LLM (streaming) → SSE → progressive token render
Stage 6
The streamed tokens are rendered into a Document Picture-in-Picture window (documentPictureInPicture.requestWindow()). This is a separate, always-on-top browser window rather than part of the page. It is displayed on the user's screen but is not captured when they share a tab or a single application window in Zoom, Teams, or Meet; sharing the entire screen would still capture it. The PiP window is styled as a compact overlay: bullet-pointed answer structure, dark background, high-contrast text for fast scanning.
documentPictureInPicture.requestWindow() → separate always-on-top window → not captured by tab or window sharing
Performance Engineering
A candidate naturally pauses for 3–5 seconds before answering a difficult question. That is the total budget for the copilot pipeline. Every stage has to fit within that window, or the suggestion arrives too late to be useful.
| Pipeline Stage | Latency | Notes |
|---|---|---|
| Audio capture + chunking | ~0ms | Streaming — no accumulation delay |
| Transcription (1–3s audio chunk) | 300–600ms | Whisper-architecture model, server-side |
| Question classification (BERT/DeBERTa) | ~50ms | Lightweight fine-tuned classifier |
| Vector retrieval (top-k resume chunks) | 20–80ms | Pinecone / FAISS / Chroma |
| LLM prompt construction | ~10ms | Template fill + context injection |
| First LLM token (streaming) | 200–400ms | GPT-4o / Claude 3.5 Sonnet / Llama |
| Total to first visible token | 580ms–1140ms | Within natural thinking pause window |
| Complete answer rendered | 2–4 seconds | Streaming fills the rest progressively |
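The "total to first visible token" row is the sum of the per-stage ranges above it. A short check, using the table's figures and treating the "~" values as exact:

```python
# (best case, worst case) latency per stage in milliseconds, from the table.
STAGES_MS = {
    "audio capture + chunking": (0, 0),
    "transcription": (300, 600),
    "question classification": (50, 50),
    "vector retrieval": (20, 80),
    "prompt construction": (10, 10),
    "first LLM token": (200, 400),
}

def first_token_budget(stages):
    """Sum the best-case and worst-case latencies across sequential stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst

print(first_token_budget(STAGES_MS))  # (580, 1140)
```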
Engineering Challenges
Latency
Latency adds up across all six pipeline stages. Keeping total end-to-end latency under 3 seconds requires server co-location with the transcription service, low-latency LLM routing, and client-side streaming rendering. Most consumer-grade implementations sit at 4–7 seconds, which is too slow to be useful in most interviews.
Speaker diarization
Reliably determining who is speaking, the interviewer or the candidate, in a two-source audio mix under variable acoustic conditions (phone audio, compressed Zoom audio, background noise) remains difficult to do with high accuracy. Misattributed turns put words in the wrong speaker's transcript and feed garbage input to the classifier.
Question boundary detection
Knowing when the interviewer has finished their question (and not just paused mid-sentence) requires VAD with sufficient silence duration thresholds, language model context, and phrasing heuristics. Triggering generation on a half-formed question wastes latency budget and generates irrelevant suggestions.
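The silence-duration part of this can be sketched with a simple energy-based endpoint detector. This is a minimal illustration, not a production VAD: the 20ms frame size, the energy threshold, and the 700ms minimum silence are assumed values, and the language-model and phrasing checks are left out.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of float samples."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0

def find_endpoint(frames, frame_ms=20, energy_threshold=1e-3, min_silence_ms=700):
    """Return the index of the first frame of the first silence run that lasts
    at least `min_silence_ms` after speech has been heard, or None."""
    needed = -(-min_silence_ms // frame_ms)  # ceiling division
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_energy(frame) >= energy_threshold:
            heard_speech = True
            silent_run = 0  # a short pause ended; keep listening
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i - needed + 1
    return None
```

Raising `min_silence_ms` reduces false triggers on mid-sentence pauses but adds the same amount of delay to every question, which is why the threshold is a direct trade-off against the latency budget.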
Context window management
A 60-minute interview generates thousands of transcript tokens. Keeping the accumulated conversation history, resume context, and JD in the LLM prompt without exceeding token limits — while maintaining coherence — requires windowed context strategies or summarization compression.
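One of the simpler windowed strategies is to keep only the most recent turns that fit a token budget. The sketch below assumes whitespace splitting as a stand-in for the model's tokenizer; summarizing the dropped turns is not shown.

```python
def fit_context(turns, budget_tokens, count_tokens=lambda text: len(text.split())):
    """Keep the most recent turns whose combined token count fits the budget.

    Turns are returned in their original order; older turns are dropped first.
    """
    kept = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > budget_tokens:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))

# Example: with a 6-token budget, only the two most recent turns survive.
recent = fit_context(["a b c", "d e", "f g h i"], budget_tokens=6)
```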
Network reliability
The entire pipeline depends on consistent sub-100ms network round-trips to the inference servers. Mobile hotspot, hotel Wi-Fi, and VPN interference all degrade the latency budget significantly. Edge deployments (Cloudflare Workers AI, AWS Lambda@Edge) reduce geographic latency but increase infrastructure complexity.
FAQ
The copilot captures two audio streams in the browser: the candidate's microphone (via getUserMedia()) and the system audio, including the interviewer's voice coming from the video call, via getDisplayMedia() with audio:true. The two streams are mixed on the client with the Web Audio API and sent to the transcription service as a single stream, typically 16kHz mono PCM (WAV).