Do AI medical scribes record the entire visit?

That depends on the product and mode. Ambient scribes typically capture audio from the start of the encounter through the end. Dictation-mode scribes only capture what you actively dictate after the visit. Some products, including Dictum, let you choose either approach.

What happens to the audio after the note is generated?

Policies vary by vendor. Some store audio indefinitely for quality assurance; others auto-delete after processing. Look for configurable retention periods and the option to delete immediately after note generation.

Can AI scribes distinguish between multiple speakers?

Most modern clinical AI scribes use speaker diarization to separate clinician speech from patient speech. Accuracy drops when more than two people are speaking or when speakers talk over each other.

Do AI scribes work with telehealth visits?

Yes. Most products can capture audio from video calls. Some integrate directly with telehealth platforms; others capture system audio. Dictum works with both in-person and telehealth encounters.

What if the AI gets something wrong in the note?

You edit it. The review step exists precisely for this reason. No AI scribe should bypass clinician review before documentation enters the medical record. Good products make editing fast — inline corrections, not rewriting from scratch.

How do AI scribes handle medical terminology and abbreviations?

Clinical language models are trained on medical corpora, so they handle standard terminology well. Rare eponyms, facility-specific abbreviations, and non-English terms may require correction. Specialty-trained models perform better within their domain.

Is there a difference between AI scribes and medical speech-to-text software?

Yes. Speech-to-text converts spoken words to written text verbatim. An AI scribe goes further — it interprets, summarizes, structures, and formats that text into clinical documentation. The output is a note, not a transcript.

How Do AI Medical Scribes Work? Step-by-Step (2026)

AI medical scribes turn spoken clinical encounters into structured documentation — SOAP notes, visit summaries, referral letters — by running audio through a multi-stage pipeline. The process starts with capturing sound and ends with a review-ready note formatted for your EHR. Between those points, several AI systems work in sequence: transcription, clinical entity extraction, summarization, and note formatting.

Here's exactly what happens at each stage.

The documentation pipeline

┌────────────────────────────────────────────────────────────────┐
│                                                                │
│   ┌──────────┐    ┌──────────────┐    ┌───────────────────┐   │
│   │  AUDIO   │───▶│ TRANSCRIPTION│───▶│ CLINICAL PARSING  │   │
│   │  INPUT   │    │   (ASR)      │    │ & EXTRACTION      │   │
│   └──────────┘    └──────────────┘    └───────────────────┘   │
│                                                │               │
│                                                ▼               │
│   ┌──────────┐    ┌──────────────┐    ┌───────────────────┐   │
│   │   EHR    │◀───│  CLINICIAN   │◀───│ NOTE STRUCTURING  │   │
│   │  OUTPUT  │    │   REVIEW     │    │ & FORMATTING      │   │
│   └──────────┘    └──────────────┘    └───────────────────┘   │
│                                                                │
└────────────────────────────────────────────────────────────────┘
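
The flow in the diagram can be sketched as a chain of functions that each take an encounter and return it with one more artifact filled in. This is a minimal illustration, not any vendor's implementation: the `Encounter` fields, the stage names, and the placeholder logic inside each stage are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Encounter:
    """Carries the artifacts produced as an encounter moves through the pipeline."""
    audio: bytes
    transcript: str = ""
    entities: list[str] = field(default_factory=list)
    note: str = ""
    approved: bool = False  # set only by a human reviewer, never by a stage


Stage = Callable[[Encounter], Encounter]


def run_pipeline(encounter: Encounter, stages: list[Stage]) -> Encounter:
    """Run each stage in order, passing the encounter along."""
    for stage in stages:
        encounter = stage(encounter)
    return encounter


def transcribe(enc: Encounter) -> Encounter:
    # Placeholder: pretend the audio bytes already contain the spoken words.
    enc.transcript = enc.audio.decode("utf-8")
    return enc


def extract(enc: Encounter) -> Encounter:
    # Placeholder: keep only words from a tiny hypothetical term list.
    known_terms = {"cough", "fever"}
    enc.entities = [w for w in enc.transcript.split() if w in known_terms]
    return enc


def structure(enc: Encounter) -> Encounter:
    # Placeholder: a real system would build a full note, not one line.
    enc.note = "Subjective: " + ", ".join(enc.entities)
    return enc


result = run_pipeline(
    Encounter(audio=b"patient reports cough and fever"),
    [transcribe, extract, structure],
)
print(result.note)
```

Note that no stage sets `approved`. In this sketch, approval is left to the clinician review step that sits between the generated note and the EHR.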

Each stage adds a layer of intelligence. Let's walk through them.

Stage 1: Audio or dictation input

Everything begins with sound. There are two primary input modes:

Ambient capture records the natural conversation between clinician and patient during the encounter. A microphone (phone, tablet, or dedicated device) picks up both speakers. You don't change how you practice — you just talk normally while the system listens. This is how Dictum's ambient AI scribe works.

Post-visit dictation captures audio after the patient leaves. You speak your note aloud — summarizing what happened, what you found, what you plan to do — and the system processes that monologue into a structured document. Dictum's dictation mode supports this workflow for clinicians who prefer to compose thoughts after the encounter.

The input mode shapes everything downstream. Ambient capture gives the model a richer signal (two speakers, natural dialogue, more clinical detail mentioned in context), but it also introduces noise — small talk, interruptions, background sounds. Dictation input is cleaner and more focused, but only contains what you choose to say.

Stage 2: Transcription

Raw audio becomes text through automatic speech recognition (ASR). Modern clinical ASR models are trained on medical speech data, so they handle terminology that would trip up a general-purpose transcription engine: drug names, anatomical terms, procedure names, and eponymous conditions.

What happens during transcription:

Speaker diarization separates your voice from the patient's voice (in ambient mode)
Noise filtering removes background sounds — hallway conversations, equipment beeps, door closings
Medical vocabulary recognition prioritizes clinical terms over phonetically similar common words ("hypertension" won't become "high pertension")
Punctuation and segmentation break continuous speech into sentences and speaker turns

The output of this stage is a raw transcript. Even when the text is largely accurate, it isn't yet useful as clinical documentation: it's a record of what was said, not a clinical note.
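
To make the segmentation step concrete, here is a small sketch of how diarized segments might be merged into speaker turns. It assumes an upstream diarization step has already labeled each segment with a speaker. The `Segment` type and its fields are hypothetical, not the output format of any particular ASR engine.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # label assumed to come from a diarization step
    start: float  # seconds from the start of the recording
    end: float
    text: str


def merge_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive segments from the same speaker into one turn."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Replace the last turn with a longer one that includes this segment.
            turns[-1] = Segment(
                last.speaker,
                last.start,
                max(last.end, seg.end),
                f"{last.text} {seg.text}",
            )
        else:
            turns.append(seg)
    return turns


segments = [
    Segment("clinician", 0.0, 1.8, "What brings you in today?"),
    Segment("patient", 2.1, 4.0, "I've had a cough"),
    Segment("patient", 4.2, 5.5, "for about two weeks."),
]
turns = merge_turns(segments)
for turn in turns:
    print(f"{turn.speaker}: {turn.text}")
```

The two patient segments collapse into a single turn, which is the shape later stages typically want: one block of text per speaker turn rather than fragments split at every pause.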

Stage 3: Clinical entity extraction

This is where general transcription becomes clinical intelligence. Natural language processing (NLP) models scan the transcript and identify medically relevant entities:

Symptoms and complaints — what the patient reports (chief complaint, HPI elements)
Findings — what the clinician observes or verbalizes during the exam
Diagnoses — conditions mentioned, ruled out, or confirmed
Medications — current, new, changed, or discontinued
Procedures — performed, ordered, or discussed
Follow-up plans — return visits, referrals, tests ordered
Social and family history — relevant context mentioned during the visit

The model also identifies what's not clinically relevant. "How's your daughter doing at school?" is conversation, not documentation. "Any family history of heart disease?" is.

This stage requires models trained specifically on clinical conversations. General-purpose language models struggle here because clinical context determines meaning. "Positive" in a review of systems means a symptom is present. "Positive" in a patient's affect description means something entirely different.
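
As a rough illustration of the input and output of this stage, the sketch below tags utterances using a tiny keyword lexicon. This is a deliberate simplification: the lexicon, the entity type names, and the function are invented for the example, and a keyword lookup cannot resolve context-dependent words like "positive". That gap is exactly why production systems rely on trained clinical models instead.

```python
import re

# Hypothetical, deliberately tiny lexicon for illustration only.
LEXICON = {
    "symptom": ["cough", "chest pain", "shortness of breath"],
    "medication": ["lisinopril", "metformin"],
    "family_history": ["family history"],
}


def extract_entities(utterance: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_term) pairs found in one utterance."""
    text = utterance.lower()
    found = []
    for entity_type, terms in LEXICON.items():
        for term in terms:
            # Whole-word match so "cough" does not fire inside another word.
            if re.search(rf"\b{re.escape(term)}\b", text):
                found.append((entity_type, term))
    return found


small_talk = extract_entities("How's your daughter doing at school?")
history = extract_entities("Any family history of heart disease?")
mixed = extract_entities("I've had a cough since starting lisinopril.")

print(small_talk)
print(history)
print(mixed)
```

The small-talk question produces no entities, while the family history question does, mirroring the relevance filtering described above.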

Stage 4: Note structuring and formatting

Extracted entities get organized into the appropriate documentation format. The most common is SOAP:

Subjective — patient-reported symptoms, history, and concerns
Objective — exam findings, vitals, lab results verbalized during the encounter
Assessment — diagnoses, clinical reasoning, differential
Plan — medications prescribed, tests ordered, follow-up instructions, referrals

The model maps each extracted entity to the correct section. A symptom mentioned by the patient goes into Subjective. An exam finding goes into Objective. The treatment plan goes into Plan.
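
The mapping described above can be sketched as a lookup from entity type to SOAP section. The entity type names and the sample entities are made up for the example, and real products will use richer schemas. The point is only that structuring is a routing step applied to already-extracted entities.

```python
from collections import defaultdict

# Assumed entity types; the section assignments follow the SOAP definitions above.
SECTION_FOR = {
    "symptom": "Subjective",
    "history": "Subjective",
    "finding": "Objective",
    "diagnosis": "Assessment",
    "new_medication": "Plan",
    "follow_up": "Plan",
}
SOAP_ORDER = ["Subjective", "Objective", "Assessment", "Plan"]


def build_soap(entities: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Route (entity_type, text) pairs into SOAP sections, in SOAP order."""
    sections: dict[str, list[str]] = defaultdict(list)
    for entity_type, text in entities:
        section = SECTION_FOR.get(entity_type)
        if section is None:
            continue  # unknown types are skipped here; a real system might flag them
        sections[section].append(text)
    # Always return all four sections so empty ones stay visible to the reviewer.
    return {name: sections[name] for name in SOAP_ORDER}


entities = [
    ("symptom", "cough for two weeks"),
    ("finding", "wheezing on exam"),
    ("diagnosis", "acute bronchitis"),
    ("new_medication", "albuterol inhaler as needed"),
    ("follow_up", "return in 2 weeks if not improved"),
]
note = build_soap(entities)
for section, items in note.items():
    print(section, items)
```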

Different products structure this differently. Some produce rigid SOAP-only output. Others, like Dictum, generate flexible note formats including H&P notes, specialty-specific templates, and after-visit summaries — all from the same encounter audio.

Formatting also matters at this stage. The note should match your EHR's expected format — headers, bullet points, paragraph structure, standard language for normal findings.
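
A minimal sketch of that formatting step might render the structured sections as plain text with headers and bullets. The layout here is an assumption. Actual EHR fields differ, and the header style and empty-section wording would need to match the target system.

```python
def render_note(sections: dict[str, list[str]], empty_text: str = "Not documented") -> str:
    """Render note sections as plain text: an uppercase header, then one bullet per item."""
    blocks = []
    for header, items in sections.items():
        lines = [f"{header.upper()}:"]
        if items:
            lines.extend(f"- {item}" for item in items)
        else:
            # Say so explicitly rather than filling in default normal findings.
            lines.append(f"- {empty_text}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


text = render_note({"Subjective": ["cough for two weeks"], "Objective": []})
print(text)
```

Marking an empty section explicitly, instead of inserting boilerplate normal findings, is a design choice in this sketch: it keeps the renderer from adding content that was never part of the encounter.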

Stage 5: Clinician review

This step is non-negotiable, and any vendor that downplays it is making a mistake.

The generated note appears on your screen — phone, tablet, or desktop — for review. You read through it, checking for:

Accuracy — did the model capture what actually happened?
Completeness — is anything missing that you know was discussed?
Attribution — are symptoms assigned to the right problem?
Appropriateness — does the language match your documentation style?

You edit inline. Good AI scribe products make this fast — you're correcting and confirming, not rewriting. A well-generated note might need 30 seconds of review. A problematic one might need a minute of editing. Either beats 5-10 minutes of writing from scratch.

Clinicians should review AI-generated documentation before adding it to the medical record and should use Dictum in accordance with their organization's policies and applicable laws.
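
One way to picture the review requirement in software terms is a gate: export is refused until a clinician has approved the current text, and any edit sends the note back to draft. The sketch below is an illustration under those assumptions. The class and function names are hypothetical and do not describe how Dictum or any other product implements review.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"


@dataclass
class DraftNote:
    text: str
    status: Status = Status.DRAFT
    reviewer: Optional[str] = None

    def edit(self, new_text: str) -> None:
        """Any edit invalidates a previous approval."""
        self.text = new_text
        self.status = Status.DRAFT
        self.reviewer = None

    def approve(self, reviewer: str) -> None:
        self.status = Status.APPROVED
        self.reviewer = reviewer


def export_note(note: DraftNote) -> str:
    """Return the note text for export, but only if it has been approved."""
    if note.status is not Status.APPROVED:
        raise PermissionError("note must be approved by a clinician before export")
    return note.text


note = DraftNote("Assessment: acute bronchitis")
try:
    export_note(note)
    blocked = False
except PermissionError:
    blocked = True

note.approve("Dr. Example")
exported = export_note(note)
print(blocked, exported)
```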

Stage 6: EHR-ready output

Once you approve the note, it needs to reach your chart. This happens through:

Direct integration — the note pushes into your EHR via API (limited to certain EHR/scribe combinations)
Copy to clipboard — you paste the note into the appropriate EHR field
Export — the note saves as a file you import

Many clinicians today still use copy-paste. Direct EHR integrations are growing but remain limited by EHR vendor cooperation. Dictum supports EHR export workflows designed to minimize friction regardless of which system you use.

Limitations and safeguards

Understanding the pipeline helps you understand where things can go wrong:

| Pipeline stage | Common failure mode | Safeguard |
|----------------|---------------------|-----------|
| Audio input | Background noise, poor mic placement | Noise filtering, mic guidance |
| Transcription | Misheard terms, speaker confusion | Medical ASR, diarization |
| Entity extraction | Misattributed symptoms, missed details | Clinical-specific models |
| Note structuring | Wrong section placement, hallucinated content | Template constraints |
| Clinician review | Skipped or rushed review | Mandatory review step |
| EHR output | Format incompatibility | Flexible export options |

No single stage is perfect. The safeguards are layered: each stage can catch some of the errors introduced before it, but mistakes can also carry forward into the note. That is why the clinician review stage is the final and most important check.

Hallucination deserves specific mention. Language models sometimes generate plausible-sounding content that wasn't in the encounter. A model might insert "lungs clear to auscultation" because that's statistically common in the training data, even if you didn't examine the lungs. This is why review exists. Read every line of a generated note before signing it.
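
As a rough aid to that review, a tool could flag note sentences whose wording has little overlap with the transcript. The sketch below uses simple word overlap, which is a crude heuristic chosen for illustration: it ignores negation and paraphrase, the stopword list and threshold are arbitrary assumptions, and the sample transcript and note are invented. A flag only means "look closer", and an unflagged sentence is not thereby verified.

```python
import re

# Small, arbitrary stopword list for the example.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "are", "on", "in", "with", "no", "for"}


def content_words(text: str) -> set[str]:
    """Lowercased alphabetic words, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def unsupported_sentences(note: str, transcript: str, min_overlap: float = 0.5) -> list[str]:
    """Return note sentences where fewer than min_overlap of the content words appear in the transcript."""
    transcript_words = content_words(transcript)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", note.strip()):
        words = content_words(sentence)
        if not words:
            continue
        overlap = len(words & transcript_words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


transcript = "Patient reports a cough for two weeks. Clinician: I hear wheezing on the right side."
note = (
    "Patient reports cough for two weeks. "
    "Wheezing heard on the right side. "
    "Lungs clear to auscultation."
)
flagged = unsupported_sentences(note, transcript)
print(flagged)
```

Here the lung exam sentence is flagged because none of its content words appear in the transcript, while the other two sentences pass. A check like this can draw attention to suspicious lines, but it does not replace reading the whole note.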

How Dictum handles this pipeline

Dictum runs this same pipeline with a few architectural choices worth noting:

Dual input modes — switch between ambient capture and post-visit dictation depending on the encounter
On-device processing available — in offline mode, transcription and structuring happen locally without sending audio to the cloud
Specialty-aware extraction — entity extraction models adapt based on your specialty setting, improving accuracy for your specific documentation patterns
Configurable output formats — SOAP, H&P, after-visit summary, and custom templates from the same encounter data
Auto-delete — audio and transcripts can be configured to delete immediately after note generation

The goal is the same across every AI scribe product: turn your clinical time into documentation without making you do the typing. The differences are in privacy architecture, specialty accuracy, and workflow flexibility.

Explore how Dictum's specific approach compares to other options on our security page, or see current pricing to try it in your practice.