Dictum Team

Do AI medical scribes train on your patient data?

Some do. Some don't. And some are vague enough that you can't tell without reading the fine print. Whether an AI medical scribe uses your patient encounters to train its models is one of the most important questions you can ask a vendor — and one of the hardest to get a straight answer to. The short version: data processing (turning audio into a note) is not the same as data training (using that audio to improve the AI). Every vendor processes your data. Not every vendor trains on it.

Why clinicians are asking this question

The concern is straightforward. When you record a patient encounter and send it through an AI system, you're handing over sensitive clinical information — diagnoses, medications, mental health disclosures, substance use history. If that data gets folded into a training dataset, it becomes part of the model's foundation. Even with de-identification, the idea that your patient's words are shaping a commercial product raises ethical and practical concerns.

Clinicians aren't being paranoid. They're being responsible.

Data processing vs. data training

These two concepts sound similar but are fundamentally different.

Processing

Every AI scribe processes your data. That's the whole point. The system takes audio input, runs it through speech recognition and language models, and produces a structured note. Once the note is generated, the audio and intermediate data can be deleted.

Processing is a one-time, task-specific use. The data goes in, the note comes out, and — ideally — nothing else happens.
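To make the processing-only pattern concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not any vendor's actual pipeline: `transcribe` and `draft_note` are hypothetical stand-ins for the speech recognition and language model steps, and the temporary file stands in for wherever a real system would hold the audio.

```python
import os
import tempfile


def transcribe(audio_path: str) -> str:
    """Hypothetical stand-in for speech recognition; not a real vendor API."""
    with open(audio_path, "rb") as f:
        size = len(f.read())
    return f"[transcript of {size} bytes of audio]"


def draft_note(transcript: str) -> str:
    """Hypothetical stand-in for the language-model step."""
    return f"Clinical note draft based on: {transcript}"


def process_encounter(audio_bytes: bytes) -> str:
    """Turn audio into a note, then delete the audio file."""
    fd, audio_path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        transcript = transcribe(audio_path)
        note = draft_note(transcript)
    finally:
        # Remove the audio whether or not note generation succeeded.
        if os.path.exists(audio_path):
            os.remove(audio_path)
    # Only the note leaves this function; the transcript goes out of scope
    # and nothing is copied to a training or analytics store.
    return note


note = process_encounter(b"abc")
print(note)
```

The point of the sketch is the shape of the data flow: audio and transcript exist only for the duration of the task. Note that `os.remove` unlinks the file but does not guarantee secure erasure on disk, so a production system would need stronger deletion guarantees than this example provides.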

Training

Training means the vendor feeds patient data (audio, transcripts, or notes) into a machine learning pipeline to improve the model's performance. This could involve:

  • Fine-tuning — adjusting model weights using clinical encounter data
  • Reinforcement learning — using clinician edits to teach the model what "good" output looks like
  • Evaluation datasets — testing model performance against real encounter data
  • Aggregate analysis — studying patterns across encounters to inform product decisions

Training creates a lasting relationship between your patient's data and the vendor's product. When data is used to adjust model weights, its influence on the model persists even if the original data is later deleted.

How vendor policies differ

There's a wide spectrum in how vendors handle this:

| Approach | What it means | Risk level |
|----------|--------------|------------|
| No training on patient data | Data is processed and deleted; never enters training pipelines | Lowest |
| Training on de-identified data only | PHI is stripped before data enters training; de-identification methodology matters | Moderate |
| Opt-out training | Training is the default; customers can request exclusion | Moderate-high (many don't know to opt out) |
| Training included in terms | Terms of service or BAA permit data use for model improvement | Higher |
| Vague or undisclosed | No clear statement on training practices | Highest (assume the worst) |

The middle categories are where most vendors land. They use patient data in some form, apply some level of de-identification, and bury the details in legal documents that most clinicians never read.

Questions to ask every vendor

Before choosing an AI scribe, get clear answers to these questions. Ask for responses in writing — verbal assurances aren't enough for compliance documentation.

  1. Does your company use patient encounter data to train, fine-tune, or evaluate AI models?
  2. If yes, what de-identification process is applied? Does it follow HIPAA Safe Harbor or Expert Determination standards?
  3. Can I opt out of data training entirely? Is it opt-in or opt-out by default?
  4. Does the BAA specifically address data use for model improvement?
  5. Are sub-processors (third-party LLM providers, cloud services) also restricted from training on the data?
  6. If I delete my data, is it also removed from any training datasets?
  7. Has an independent audit verified your de-identification and data handling practices?
  8. Where is my data stored, and for how long?

If a vendor can't answer these clearly, that tells you something.
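If you are evaluating several vendors, it can help to track which questions have been answered in writing. The following is one possible way to do that in Python; the `VendorAnswer` class and `open_items` function are hypothetical names invented for this example, and the sample answers are made up.

```python
from dataclasses import dataclass


@dataclass
class VendorAnswer:
    question: str
    answer: str = ""          # the vendor's response, recorded verbatim
    in_writing: bool = False  # True only if the response was received in writing


def open_items(answers):
    """Return questions that have no answer, or only a verbal one."""
    return [
        a.question
        for a in answers
        if not a.answer.strip() or not a.in_writing
    ]


# Made-up example entries for a single vendor.
answers = [
    VendorAnswer("Do you train on patient encounter data?", "No.", in_writing=True),
    VendorAnswer("Can I opt out of data training?", "Yes, on request.", in_writing=False),
    VendorAnswer("Are sub-processors restricted from training?"),
]

for question in open_items(answers):
    print("Still open:", question)
```

A verbal answer stays on the open list until it is confirmed in writing, which matches the documentation standard described above.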

How to read the fine print

Vendors rarely say "we train on your patient data" in plain language. Look for these phrases in the terms of service, privacy policy, and BAA:

  • "Product improvement"
  • "Service enhancement"
  • "Aggregate data analysis"
  • "Model performance optimization"
  • "Quality assurance using customer data"
  • "De-identified data may be retained for research"

Any of these can mean your data enters a training pipeline. If you see them, ask for clarification about what exactly happens, what de-identification is applied, and whether you can opt out.
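For long agreements, a simple text search can help you find these phrases before you read closely. The sketch below assumes you have the document as plain text; the function name `find_red_flags` and the 60-character context window are choices made for this example. A match only tells you where to look, and a clean result does not mean the vendor avoids training, since the same practice can be worded differently.

```python
import re

# Phrases taken from the list above.
RED_FLAG_PHRASES = [
    "product improvement",
    "service enhancement",
    "aggregate data analysis",
    "model performance optimization",
    "quality assurance using customer data",
    "de-identified data may be retained for research",
]


def find_red_flags(document_text: str):
    """Return (phrase, surrounding context) pairs for each match."""
    # Collapse whitespace so phrases broken across lines still match.
    normalized = re.sub(r"\s+", " ", document_text)
    hits = []
    for phrase in RED_FLAG_PHRASES:
        pattern = re.escape(phrase)
        for match in re.finditer(pattern, normalized, flags=re.IGNORECASE):
            start = max(0, match.start() - 60)
            end = min(len(normalized), match.end() + 60)
            hits.append((phrase, normalized[start:end].strip()))
    return hits


# Made-up excerpt for illustration; not quoted from any real agreement.
sample = (
    "Vendor may use Customer Data for Product\n"
    "Improvement and service enhancement purposes."
)

for phrase, context in find_red_flags(sample):
    print(f"{phrase!r} found in: ...{context}...")
```

Each hit is a prompt to ask the vendor what exactly happens to the data, not a verdict on its practices.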

Dictum's approach

Dictum does not use patient encounter data for model training or product improvement. When you record an encounter, the audio is processed to generate your clinical note. It is not retained for AI development purposes.

Additional safeguards:

  • On-device processing — in offline mode, audio is processed locally and never transmitted to external servers
  • Configurable auto-delete — set your retention window, and data is purged automatically
  • No sub-processor training — Dictum's data handling agreements with third-party services restrict use of patient data for training

For full details on Dictum's security posture, visit the security overview and HIPAA compliance page.

Clinicians should review AI-generated documentation before adding it to the medical record and should use Dictum in accordance with their organization's policies and applicable laws.

The broader privacy picture

Data training is one piece of a larger privacy conversation. For a complete view, read our guide on AI clinical documentation privacy risks, which covers storage, access controls, and human review processes. If you're evaluating vendors specifically for HIPAA readiness, our article on whether AI medical scribes are HIPAA compliant includes a full vendor evaluation checklist.

You can also compare vendors side by side on our HIPAA-compliant AI medical scribe comparison page.

The bottom line

Not every AI medical scribe treats your patient data the same way. Some vendors train on it. Some de-identify it first. Some don't touch it for training at all. The difference matters — for your patients' privacy, for your compliance posture, and for your peace of mind.

Ask the questions. Read the agreements. And choose a vendor whose data practices match what you'd want if you were the patient.