Do all AI medical scribes train on patient data?

No. Vendor policies vary widely. Some use patient encounters to improve their models, some use de-identified data, and others — like Dictum — do not use patient data for model training at all. You must ask each vendor directly.

What's the difference between data processing and data training?

Processing means using data to perform a task — like converting your audio into a note — and then discarding or storing it per your retention policy. Training means feeding data into the model to change how it performs in the future. Processing is necessary; training is a choice.

Is it legal for a vendor to train on patient data?

It can be, depending on the BAA terms and whether proper de-identification is applied. Under HIPAA, data that has been de-identified using the Safe Harbor or Expert Determination method is no longer treated as protected health information, so the Privacy Rule does not bar its use for training. But just because it's legal doesn't mean it's appropriate for your practice.

How can I tell if a vendor trains on my data?

Check three places: the terms of service, the BAA, and the privacy policy. Look for language about 'product improvement,' 'model enhancement,' or 'aggregate data use.' If you can't find a clear statement, ask the vendor directly and request the answer in writing.

Does de-identification make data training safe?

De-identification reduces risk but doesn't eliminate it. Research has shown that clinical text can sometimes be re-identified, especially when encounters contain rare conditions or unique demographic combinations. The strength of de-identification depends on the methodology used.
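One rough way to see why rare combinations matter is to count how many records share the same set of indirect attributes. The sketch below is a simple k-anonymity style check in Python. The records are synthetic, and the field names (`age_band`, `zip3`, `condition`) and the threshold `k=2` are illustrative assumptions, not a HIPAA standard or any vendor's actual method.

```python
from collections import Counter

def records_below_k(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination is shared by fewer than k records."""
    keys = [tuple(record[field] for field in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return [record for record, key in zip(records, keys) if counts[key] < k]

# Synthetic, made-up records with names and dates already removed.
records = [
    {"age_band": "40-49", "zip3": "941", "condition": "type 2 diabetes"},
    {"age_band": "40-49", "zip3": "941", "condition": "type 2 diabetes"},
    {"age_band": "40-49", "zip3": "941", "condition": "hypertension"},
    {"age_band": "40-49", "zip3": "941", "condition": "hypertension"},
    {"age_band": "80-89", "zip3": "941", "condition": "Fabry disease"},
]

# Flag any record that is the only one with its combination of attributes.
at_risk = records_below_k(records, ["age_band", "zip3", "condition"])
for record in at_risk:
    print("Unique combination:", record)
```

In this toy data, only the last record stands alone, which is the kind of encounter that is easier to link back to a person even after direct identifiers are stripped.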

Can I opt out of data training?

Some vendors offer opt-outs. Others make training the default with no option to decline. If training opt-out matters to your practice, confirm the option exists before signing a contract — not after.

What does Dictum do with patient data?

Dictum processes encounter audio to generate clinical notes and does not use patient data for model training or product improvement. Audio can be processed on-device in offline mode, and auto-delete ensures data doesn't persist beyond your chosen retention window.

Do AI Scribes Train on Patient Data? (2026)

Some do. Some don't. And some are vague enough that you can't tell without reading the fine print. Whether an AI medical scribe uses your patient encounters to train its models is one of the most important questions you can ask a vendor — and one of the hardest to get a straight answer to. The short version: data processing (turning audio into a note) is not the same as data training (using that audio to improve the AI). Every vendor processes your data. Not every vendor trains on it.

Why clinicians are asking this question

The concern is straightforward. When you record a patient encounter and send it through an AI system, you're handing over sensitive clinical information — diagnoses, medications, mental health disclosures, substance use history. If that data gets folded into a training dataset, it becomes part of the model's foundation. Even with de-identification, the idea that your patient's words are shaping a commercial product raises ethical and practical concerns.

Clinicians aren't being paranoid. They're being responsible.

Data processing vs. data training

These two concepts sound similar but are fundamentally different.

Processing

Every AI scribe processes your data. That's the whole point. The system takes audio input, runs it through speech recognition and language models, and produces a structured note. Once the note is generated, the audio and intermediate data can be deleted.

Processing is a one-time, task-specific use. The data goes in, the note comes out, and — ideally — nothing else happens.

Training

Training means the vendor feeds patient data (audio, transcripts, or notes) into a machine learning pipeline to improve the model's performance. This could involve:

Fine-tuning — adjusting model weights using clinical encounter data
Reinforcement learning — using clinician edits to teach the model what "good" output looks like
Evaluation datasets — testing model performance against real encounter data
Aggregate analysis — studying patterns across encounters to inform product decisions

Training creates a lasting relationship between your patient's data and the vendor's product. Even if the original data is later deleted, its influence persists in the trained model's weights until the model is retrained without it.

How vendor policies differ

There's a wide spectrum in how vendors handle this:

| Approach | What it means | Risk level |
|----------|--------------|------------|
| No training on patient data | Data is processed and deleted; never enters training pipelines | Lowest |
| Training on de-identified data only | PHI is stripped before data enters training; de-identification methodology matters | Moderate |
| Opt-out training | Training is the default; customers can request exclusion | Moderate-high (many don't know to opt out) |
| Training included in terms | Terms of service or BAA permit data use for model improvement | Higher |
| Vague or undisclosed | No clear statement on training practices | Highest risk: assume the worst |

The middle categories are where most vendors land. They use patient data in some form, apply some level of de-identification, and bury the details in legal documents that most clinicians never read.

Questions to ask every vendor

Before choosing an AI scribe, get clear answers to these questions. Ask for responses in writing — verbal assurances aren't enough for compliance documentation.

Does your company use patient encounter data to train, fine-tune, or evaluate AI models?
If yes, what de-identification process is applied? Does it follow HIPAA Safe Harbor or Expert Determination standards?
Can I opt out of data training entirely? Is training opt-in or opt-out by default?
Does the BAA specifically address data use for model improvement?
Are sub-processors (third-party LLM providers, cloud services) also restricted from training on the data?
If I delete my data, is it also removed from any training datasets?
Has an independent audit verified your de-identification and data handling practices?
Where is my data stored, and for how long?

If a vendor can't answer these clearly, that tells you something.
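If you are comparing several vendors, it can help to track which of the questions above each one has actually answered in writing. The following is a minimal Python sketch of such a tracker. The `Answer` class and `open_items` function are hypothetical names introduced for illustration, and the question text is shortened from the list above.

```python
from dataclasses import dataclass

# Shortened versions of the vendor questions listed above.
QUESTIONS = [
    "Uses patient encounter data to train, fine-tune, or evaluate models?",
    "De-identification process (Safe Harbor or Expert Determination)?",
    "Can I opt out of training? Opt-in or opt-out by default?",
    "Does the BAA address data use for model improvement?",
    "Are sub-processors restricted from training on the data?",
    "Is deleted data also removed from training datasets?",
    "Has an independent audit verified data handling practices?",
    "Where is data stored, and for how long?",
]

@dataclass
class Answer:
    text: str = ""
    in_writing: bool = False

def open_items(answers):
    """Return (question, reason) pairs for questions that still need follow-up."""
    items = []
    for question in QUESTIONS:
        answer = answers.get(question, Answer())
        if not answer.text.strip():
            items.append((question, "no answer"))
        elif not answer.in_writing:
            items.append((question, "verbal only"))
    return items

# Example: one written answer, one verbal answer, the rest unanswered.
vendor_answers = {
    QUESTIONS[0]: Answer("No training on patient data.", in_writing=True),
    QUESTIONS[1]: Answer("We follow Safe Harbor.", in_writing=False),
}
for question, reason in open_items(vendor_answers):
    print(f"[{reason}] {question}")
```

Treating a verbal answer as still open mirrors the advice above: only written responses count for compliance documentation.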

How to read the fine print

Vendors rarely say "we train on your patient data" in plain language. Look for these phrases in the terms of service, privacy policy, and BAA:

"Product improvement"
"Service enhancement"
"Aggregate data analysis"
"Model performance optimization"
"Quality assurance using customer data"
"De-identified data may be retained for research"

Any of these can mean your data enters a training pipeline. If you see them, ask for clarification about what exactly happens, what de-identification is applied, and whether you can opt out.
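A first pass over a long agreement can be automated. The sketch below, in Python, searches a block of text for the phrases listed above and prints a snippet of context around each hit. The function name `find_red_flags` and the 40-character context window are assumptions for illustration. A match is only a prompt to read that clause closely and ask the vendor about it, and the absence of a match proves nothing, since vendors may word things differently.

```python
import re

# Phrases taken from the list above; matching is case-insensitive.
RED_FLAG_PHRASES = [
    "product improvement",
    "service enhancement",
    "aggregate data analysis",
    "model performance optimization",
    "quality assurance using customer data",
    "de-identified data may be retained for research",
]

def find_red_flags(text, context=40):
    """Return (phrase, snippet) pairs for each listed phrase found in text."""
    # Collapse whitespace so phrases split across line breaks still match.
    normalized = re.sub(r"\s+", " ", text)
    hits = []
    for phrase in RED_FLAG_PHRASES:
        for match in re.finditer(re.escape(phrase), normalized, flags=re.IGNORECASE):
            start = max(0, match.start() - context)
            end = min(len(normalized), match.end() + context)
            hits.append((phrase, normalized[start:end].strip()))
    return hits

# Made-up sample clause, not taken from any real agreement.
sample = (
    "Vendor may use Customer Data for product\n"
    "improvement and Service Enhancement purposes."
)
for phrase, snippet in find_red_flags(sample):
    print(f"{phrase!r}: ...{snippet}...")
```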

Dictum's approach

Dictum does not use patient encounter data for model training or product improvement. When you record an encounter, the audio is processed to generate your clinical note. It is not retained for AI development purposes.

Additional safeguards:

On-device processing — in offline mode, audio is processed locally and never transmitted to external servers
Configurable auto-delete — set your retention window, and data is purged automatically
No sub-processor training — Dictum's data handling agreements with third-party services restrict use of patient data for training

For full details on Dictum's security posture, visit the security overview and HIPAA compliance page.

Clinicians should review AI-generated documentation before adding it to the medical record and should use Dictum in accordance with their organization's policies and applicable laws.

The broader privacy picture

Data training is one piece of a larger privacy conversation. For a complete view, read our guide on AI clinical documentation privacy risks, which covers storage, access controls, and human review processes. If you're evaluating vendors specifically for HIPAA readiness, our article on whether AI medical scribes are HIPAA compliant includes a full vendor evaluation checklist.

You can also compare vendors side by side on our HIPAA-compliant AI medical scribe comparison page.

The bottom line

Not every AI medical scribe treats your patient data the same way. Some vendors train on it. Some de-identify it first. Some don't touch it for training at all. The difference matters — for your patients' privacy, for your compliance posture, and for your peace of mind.

Ask the questions. Read the agreements. And choose a vendor whose data practices match what you'd want if you were the patient.