AI medical scribes turn spoken clinical encounters into structured documentation — SOAP notes, visit summaries, referral letters — by running audio through a multi-stage pipeline. The process starts with capturing sound and ends with a review-ready note formatted for your EHR. Between those points, several AI systems work in sequence: transcription, clinical entity extraction, summarization, and note formatting.
Here's exactly what happens at each stage.
The documentation pipeline
┌────────────────────────────────────────────────────────────────┐
│                                                                │
│  ┌──────────┐    ┌──────────────┐    ┌───────────────────┐     │
│  │  AUDIO   │───▶│ TRANSCRIPTION│───▶│ CLINICAL PARSING  │     │
│  │  INPUT   │    │    (ASR)     │    │  & EXTRACTION     │     │
│  └──────────┘    └──────────────┘    └───────────────────┘     │
│                                                │               │
│                                                ▼               │
│  ┌──────────┐    ┌──────────────┐    ┌───────────────────┐     │
│  │   EHR    │◀───│  CLINICIAN   │◀───│ NOTE STRUCTURING  │     │
│  │  OUTPUT  │    │   REVIEW     │    │  & FORMATTING     │     │
│  └──────────┘    └──────────────┘    └───────────────────┘     │
│                                                                │
└────────────────────────────────────────────────────────────────┘
Each stage adds a layer of intelligence. Let's walk through them.
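As a rough mental model, the boxes in the diagram can be read as functions chained one after another. The sketch below is illustrative only: every name in it (`Encounter`, `transcribe`, `extract`, `structure`, `review`, `run_pipeline`) is hypothetical, each stage is a stub with hard-coded behavior, and it does not describe how Dictum or any other product is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Encounter:
    """Carries the artifacts produced as audio moves through the pipeline."""
    audio: bytes
    transcript: str = ""
    entities: list[str] = field(default_factory=list)
    note: str = ""
    approved: bool = False


# Each stage below is a stub standing in for a real model or UI step.
def transcribe(enc: Encounter) -> Encounter:
    enc.transcript = "Patient reports a cough for three days."
    return enc


def extract(enc: Encounter) -> Encounter:
    enc.entities = ["cough"] if "cough" in enc.transcript else []
    return enc


def structure(enc: Encounter) -> Encounter:
    enc.note = "Subjective: " + ", ".join(enc.entities)
    return enc


def review(enc: Encounter) -> Encounter:
    # In a real product a clinician reads and edits here; the stub just approves.
    enc.approved = True
    return enc


def run_pipeline(audio: bytes) -> Encounter:
    """Pass one encounter through every stage in order."""
    enc = Encounter(audio=audio)
    for stage in (transcribe, extract, structure, review):
        enc = stage(enc)
    return enc


result = run_pipeline(b"\x00\x01")
print(result.note)  # prints "Subjective: cough"
```

Passing a single record through every stage keeps the intermediate artifacts (transcript, entities, draft note) available for inspection, which matters later when a reviewer wants to check a note against what was actually said.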
Stage 1: Audio or dictation input
Everything begins with sound. There are two primary input modes:
Ambient capture records the natural conversation between clinician and patient during the encounter. A microphone (phone, tablet, or dedicated device) picks up both speakers. You don't change how you practice — you just talk normally while the system listens. This is how Dictum's ambient AI scribe works.
Post-visit dictation captures audio after the patient leaves. You speak your note aloud — summarizing what happened, what you found, what you plan to do — and the system processes that monologue into a structured document. Dictum's dictation mode supports this workflow for clinicians who prefer to compose thoughts after the encounter.
The input mode shapes everything downstream. Ambient capture gives the model a richer signal (two speakers, natural dialogue, more clinical detail mentioned in context), but it also introduces noise — small talk, interruptions, background sounds. Dictation input is cleaner and more focused, but only contains what you choose to say.
Stage 2: Transcription
Raw audio becomes text through automatic speech recognition (ASR). Modern clinical ASR models are trained on medical speech data, so they handle terminology that would trip up a general-purpose transcription engine: drug names, anatomical terms, procedure names, and eponymous conditions.
What happens during transcription:
- Speaker diarization separates your voice from the patient's voice (in ambient mode)
- Noise filtering removes background sounds — hallway conversations, equipment beeps, door closings
- Medical vocabulary recognition prioritizes clinical terms over phonetically similar common words ("hypertension" won't become "high pertension")
- Punctuation and segmentation break continuous speech into sentences and speaker turns
The output of this stage is a raw transcript — accurate text, but not yet useful as clinical documentation. It's a record of what was said, not a clinical note.
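To make the "speaker turns" idea concrete, here is a minimal sketch of the segmentation step, assuming the ASR and diarization models have already produced short, speaker-labeled segments. The `Segment` type and `merge_turns` function are hypothetical names chosen for illustration, not part of any specific ASR library.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str   # label assigned by diarization, e.g. "clinician" or "patient"
    start: float   # seconds from the start of the recording
    end: float
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Join consecutive segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the previous turn instead of starting a new one.
            turns[-1] = Segment(last.speaker, last.start, seg.end, f"{last.text} {seg.text}")
        else:
            turns.append(seg)
    return turns


segments = [
    Segment("clinician", 0.0, 1.8, "What brings you in today?"),
    Segment("patient", 2.1, 4.0, "I've had a cough"),
    Segment("patient", 4.2, 5.5, "for about three days."),
]
turns = merge_turns(segments)
for turn in turns:
    print(f"[{turn.speaker}] {turn.text}")
```

The result is still a raw transcript: two turns with timestamps and speaker labels, with no judgment yet about which parts are clinically relevant.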
Stage 3: Clinical entity extraction
This is where general transcription becomes clinical intelligence. Natural language processing (NLP) models scan the transcript and identify medically relevant entities:
- Symptoms and complaints — what the patient reports (chief complaint, HPI elements)
- Findings — what the clinician observes or verbalizes during the exam
- Diagnoses — conditions mentioned, ruled out, or confirmed
- Medications — current, new, changed, or discontinued
- Procedures — performed, ordered, or discussed
- Follow-up plans — return visits, referrals, tests ordered
- Social and family history — relevant context mentioned during the visit
The model also identifies what's not clinically relevant. "How's your daughter doing at school?" is conversation, not documentation. "Any family history of heart disease?" is.
This stage requires models trained specifically on clinical conversations. General-purpose language models struggle here because clinical context determines meaning. "Positive" in a review of systems means a symptom is present. "Positive" in a patient's affect description means something entirely different.
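The shape of the extraction output is easier to see with a toy example. The sketch below swaps the trained clinical NLP model for a plain keyword lookup, which is far too crude for real use; the lexicon, the `Entity` type, and `extract_entities` are all assumptions made for illustration. What it does show is the input (speaker turns), the output (typed entities tied to a speaker), and small talk producing no entities.

```python
from dataclasses import dataclass

# Toy lexicon standing in for a trained clinical NLP model (hypothetical terms).
LEXICON = {
    "cough": "symptom",
    "fever": "symptom",
    "lisinopril": "medication",
    "chest x-ray": "procedure",
}


@dataclass(frozen=True)
class Entity:
    text: str
    category: str
    speaker: str


def extract_entities(turns: list[tuple[str, str]]) -> list[Entity]:
    """Return lexicon matches per turn; turns with no match yield nothing."""
    found: list[Entity] = []
    for speaker, utterance in turns:
        lowered = utterance.lower()
        for term, category in LEXICON.items():
            if term in lowered:
                found.append(Entity(term, category, speaker))
    return found


turns = [
    ("clinician", "How's your daughter doing at school?"),
    ("patient", "She's great. I've had a cough and a fever since Monday."),
    ("clinician", "Let's get a chest X-ray and keep you on lisinopril."),
]
entities = extract_entities(turns)
for e in entities:
    print(e.category, e.text, e.speaker)
```

A keyword lookup cannot tell a confirmed diagnosis from one that was ruled out, or resolve context-dependent words like "positive". That gap is the reason this stage relies on models trained on clinical conversations.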
Stage 4: Note structuring and formatting
Extracted entities get organized into the appropriate documentation format. The most common is SOAP:
- Subjective — patient-reported symptoms, history, and concerns
- Objective — exam findings, vitals, lab results verbalized during the encounter
- Assessment — diagnoses, clinical reasoning, differential
- Plan — medications prescribed, tests ordered, follow-up instructions, referrals
The model maps each extracted entity to the correct section. A symptom mentioned by the patient goes into Subjective. An exam finding goes into Objective. The treatment plan goes into Plan.
Different products structure this differently. Some produce rigid SOAP-only output. Others, like Dictum, generate flexible note formats including H&P notes, specialty-specific templates, and after-visit summaries — all from the same encounter audio.
Formatting also matters at this stage. The note should match your EHR's expected format — headers, bullet points, paragraph structure, standard language for normal findings.
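The section mapping and formatting described above can be sketched as a lookup table plus a renderer. This is a simplified illustration: the category names, the `SECTION_FOR` table, and the plain-text layout are assumptions, and a real product would use specialty-specific templates and the target EHR's own format.

```python
from collections import defaultdict

# Assumed mapping from entity category to SOAP section (illustrative, not a standard).
SECTION_FOR = {
    "symptom": "Subjective",
    "history": "Subjective",
    "finding": "Objective",
    "vital": "Objective",
    "diagnosis": "Assessment",
    "new_medication": "Plan",
    "follow_up": "Plan",
}
SOAP_ORDER = ["Subjective", "Objective", "Assessment", "Plan"]


def build_soap(entities: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (category, text) pairs under SOAP headings; unknown categories are set aside."""
    sections: dict[str, list[str]] = defaultdict(list)
    for category, text in entities:
        sections[SECTION_FOR.get(category, "Unsorted")].append(text)
    return dict(sections)


def render(sections: dict[str, list[str]]) -> str:
    """Emit headings in SOAP order with one bullet per item, skipping empty sections."""
    lines = []
    for name in SOAP_ORDER + ["Unsorted"]:
        if sections.get(name):
            lines.append(f"{name}:")
            lines.extend(f"- {item}" for item in sections[name])
    return "\n".join(lines)


entities = [
    ("symptom", "cough for three days"),
    ("finding", "mild wheeze on exam"),
    ("diagnosis", "acute bronchitis"),
    ("new_medication", "start albuterol inhaler as needed"),
]
note = render(build_soap(entities))
print(note)
```

Routing unrecognized categories to an "Unsorted" bucket, rather than guessing a section, leaves the placement decision to the clinician during review.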
Stage 5: Clinician review
This step is non-negotiable, and any vendor that downplays it is making a mistake.
The generated note appears on your screen — phone, tablet, or desktop — for review. You read through it, checking for:
- Accuracy — did the model capture what actually happened?
- Completeness — is anything missing that you know was discussed?
- Attribution — are symptoms assigned to the right problem?
- Appropriateness — does the language match your documentation style?
You edit inline. Good AI scribe products make this fast — you're correcting and confirming, not rewriting. A well-generated note might need 30 seconds of review. A problematic one might need a minute of editing. Either beats 5-10 minutes of writing from scratch.
Clinicians should review AI-generated documentation before adding it to the medical record and should use Dictum in accordance with their organization's policies and applicable laws.
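One way software can enforce the review step is to make export impossible until a clinician has approved the current text. The sketch below shows that idea under stated assumptions: `DraftNote`, `Status`, and `export_to_ehr` are hypothetical names, and the "export" simply returns the text instead of calling a real EHR.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"


@dataclass
class DraftNote:
    text: str
    status: Status = Status.DRAFT
    reviewer: Optional[str] = None

    def edit(self, new_text: str) -> None:
        # Any edit returns the note to draft so it must be approved again.
        self.text = new_text
        self.status = Status.DRAFT
        self.reviewer = None

    def approve(self, reviewer: str) -> None:
        self.status = Status.APPROVED
        self.reviewer = reviewer


def export_to_ehr(note: DraftNote) -> str:
    """Refuse to release a note that no clinician has approved."""
    if note.status is not Status.APPROVED:
        raise PermissionError("note must be reviewed and approved before export")
    return note.text


note = DraftNote("Subjective: cough for three days")
try:
    export_to_ehr(note)
except PermissionError as err:
    print(err)

note.approve("Dr. Example")
print(export_to_ehr(note))
```

A gate like this can confirm that someone clicked approve, not that they read the note carefully, so it supports the review habit without replacing it.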
Stage 6: EHR-ready output
Once you approve the note, it needs to reach your chart. This happens through:
- Direct integration — the note pushes into your EHR via API (limited to certain EHR/scribe combinations)
- Copy to clipboard — you paste the note into the appropriate EHR field
- Export — the note saves as a file you import
Most clinicians today use copy-paste. Direct EHR integrations are growing but remain limited by EHR vendor cooperation. Dictum supports EHR export workflows designed to minimize friction regardless of which system you use.
Limitations and safeguards
Understanding the pipeline helps you understand where things can go wrong:
| Pipeline stage | Common failure mode | Safeguard |
|----------------|---------------------|-----------|
| Audio input | Background noise, poor mic placement | Noise filtering, mic guidance |
| Transcription | Misheard terms, speaker confusion | Medical ASR, diarization |
| Entity extraction | Misattributed symptoms, missed details | Clinical-specific models |
| Note structuring | Wrong section placement, hallucinated content | Template constraints |
| Clinician review | Skipped or rushed review | Mandatory review step |
| EHR output | Format incompatibility | Flexible export options |
No single stage is perfect, so the safeguards are layered: each stage can catch some of the errors introduced before it. The clinician review stage is the final and most important check.
Hallucination deserves specific mention. Language models sometimes generate plausible-sounding content that wasn't in the encounter. A model might insert "lungs clear to auscultation" because that's statistically common in the training data, even if you didn't examine the lungs. This is why review exists. Read every line of a generated note before signing it.
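Software can help direct a reviewer's attention by flagging note lines that have little support in the transcript. The sketch below uses a deliberately crude word-overlap heuristic; the function names, stopword list, and 0.5 threshold are assumptions for illustration. It cannot detect paraphrases, negation errors, or misattribution, so it is a reading aid and not a substitute for reading every line.

```python
import re

# Small stopword list; "no" is kept as a content word because negation matters clinically.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "on", "for", "with", "is", "in"}


def content_words(text: str) -> set[str]:
    """Lowercase alphabetic tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def flag_unsupported(note_lines: list[str], transcript: str, threshold: float = 0.5) -> list[str]:
    """Return note lines whose content words mostly do not appear in the transcript."""
    spoken = content_words(transcript)
    flagged = []
    for line in note_lines:
        words = content_words(line)
        if not words:
            continue
        overlap = len(words & spoken) / len(words)
        if overlap < threshold:
            flagged.append(line)
    return flagged


transcript = "I've had a dry cough for three days. Any fever? No fever. Throat looks red."
note_lines = [
    "Dry cough for three days",
    "No fever",
    "Lungs clear to auscultation",
    "Throat red",
]
flagged = flag_unsupported(note_lines, transcript)
print(flagged)  # prints ['Lungs clear to auscultation']
```

Here the lung exam line is flagged because none of its words were spoken during the encounter, which mirrors the hallucination example above.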
How Dictum handles this pipeline
Dictum runs this same pipeline with a few architectural choices worth noting:
- Dual input modes — switch between ambient capture and post-visit dictation depending on the encounter
- On-device processing available — in offline mode, transcription and structuring happen locally without sending audio to the cloud
- Specialty-aware extraction — entity extraction models adapt based on your specialty setting, improving accuracy for your specific documentation patterns
- Configurable output formats — SOAP, H&P, after-visit summary, and custom templates from the same encounter data
- Auto-delete — audio and transcripts can be configured to delete immediately after note generation
The goal is the same across every AI scribe product: turn your clinical time into documentation without making you do the typing. The differences are in privacy architecture, specialty accuracy, and workflow flexibility.
Explore how Dictum's specific approach compares to other options on our security page, or see current pricing to try it in your practice.