
Explainable AI for Public Health Client Linkage and Deduplication


Public health data is messy in very human ways. Families share phones and email addresses. Clinics rely on placeholder dates so work can keep moving. Organizations sometimes appear where people should. Any patient deduplication approach that ignores these realities will struggle to produce results teams can trust.

Probabilistic deduplication and linkage is a key public health AI use case that supports responsible stewardship of personal health information. Using machine learning, it estimates match likelihood across records that are often incomplete, transcription-heavy, and not consistently validated against authoritative sources. We treat it as governed, explainable decision support aligned with our responsibilities as a health information custodian (HIC), with equity considerations and routine human review.

Governance and stewardship #

Deduplication decisions can change client histories and downstream actions, so we govern linkage as a decision workflow that needs clear accountability, review pathways, and documentation.

PHIPA requires HICs to take reasonable steps to ensure information is as accurate, complete, and up to date as necessary for the purpose. We also use the NIST AI RMF as a practical guide for operating trustworthy AI systems. In practice, that means the system is context-informed, explainable, equity-aware, and supported by monitoring and human oversight throughout its lifecycle. 1

Explainable AI means you can trace a decision back to human-understandable signals. In this pipeline, every match is supported by explicit fields, thresholds, and audit outputs that show why a pair was or was not merged.

Equity considerations #

Equity issues in linkage often show up as systematic differences in missingness, data quality, and how frequently identity fields change across time and across populations. A pipeline tuned for clean identity data will tend to work best for the people already best represented in the system. This is one reason we avoid hard gates on fields that are known to vary, and why we invest in review workflows and audit outputs that make risk visible.

Sex and gender #

Sex and gender are distinct concepts and may be recorded differently across health data systems. A recorded sex or gender value can reflect variations in form design, system definitions, or the timing of updates, and may naturally change over time. Because of this, differences in sex or gender information do not necessarily mean that two records refer to different people. Treating these differences as strong evidence against a match, or using sex or gender as blocking criteria, can unintentionally lead to fragmented records for transgender and gender-diverse clients by preventing meaningful matches from being considered. When sex or gender information is used, any disagreements are best treated as weak negative indicators or reviewed by a human, rather than serving as automatic reasons to reject a potential match.
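As a minimal sketch of that principle, a comparison function can contribute a small adjustment rather than a veto when recorded values disagree. The field handling and weights below are illustrative, not our production configuration.

```python
# Illustrative only: sex/gender disagreement contributes weak negative evidence,
# never a blocking key or an automatic rejection. Weights are assumptions.
def gender_evidence(left, right):
    """Return a small log-odds-style adjustment for the gender comparison."""
    if not left or not right:
        return 0.0   # missing values contribute nothing
    if left == right:
        return 0.5   # mild positive evidence when values agree
    return -0.5      # mild negative evidence; other fields decide the match
```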

The data realities we designed around #

Many deduplication problems come from treating known data issues as edge cases. We chose to design for them instead.

In a PHU electronic medical record, most demographics arrive through transcription at the point of care. In Ontario, there are provincial identity services and registries that can validate demographic and identifier data, but local operational systems are not always integrated in ways that support consistent real-time validation. For example, Ontario’s Provincial Client Registry aggregates identity data from sources including the Registered Persons Database and participating organizations, but that does not mean every local workflow can validate against it at the time data is collected.

These realities show up in predictable ways. Local identifiers can carry typos. Dates of birth can be missing or defaulted. People can have multiple contacts over time, and households can share contact points. Organizations (schools, clinics, teams) can end up in patient tables and share contact information across many individuals. Addresses and names can change over time.

Once these realities are acknowledged, several design decisions become easier to justify and easier to operate.

What we optimize for #

We prioritize precision and explainability over aggressive merging. We aim for operational predictability with fast runs, simple dependencies, and outputs that program teams can validate. We also assume collaboration is required. Engineers need to understand clinic realities and information system constraints, and program, privacy, and analytics partners need to be able to challenge decisions and shape thresholds. We share our approach so others can pilot and adapt it, and so practices can translate across the health system.

Design choices that mattered #

Normalize early and null decisively #

Dates are converted to a single format and obvious placeholders are nulled before they influence any downstream logic. Phones are normalized to remove formatting noise. Emails are lowercased and cleaned. Names are trimmed and standardized enough to support consistent comparisons.

This is the simplest way to reduce avoidable disagreement across systems and across data entry patterns.
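A minimal sketch of this pass, using Pandas, might look like the following. The column names and placeholder dates are illustrative assumptions, not our production rules.

```python
import pandas as pd

# Placeholder dates of birth we null before any downstream logic (illustrative).
PLACEHOLDER_DOBS = {"1900-01-01", "1901-01-01"}

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Dates: coerce to one ISO format, then null obvious placeholders.
    df["dob"] = pd.to_datetime(df["dob"], errors="coerce").dt.strftime("%Y-%m-%d")
    df.loc[df["dob"].isin(PLACEHOLDER_DOBS), "dob"] = pd.NA
    # Phones: keep digits only so exact comparison is meaningful.
    df["phone"] = df["phone"].astype("string").str.replace(r"\D", "", regex=True)
    df.loc[df["phone"] == "", "phone"] = pd.NA
    # Emails: trim and lowercase.
    df["email"] = df["email"].astype("string").str.strip().str.lower()
    df.loc[df["email"] == "", "email"] = pd.NA
    # Names: trim and collapse internal whitespace.
    for col in ("first_name", "last_name"):
        df[col] = df[col].astype("string").str.strip().str.replace(r"\s+", " ", regex=True)
    return df
```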

Distance-measure-aware feature engineering #

Record linkage compares fields using similarity measures, then combines signals into an overall match likelihood. Feature engineering matters because similarity measures behave differently under formatting variation, missingness, and drift.

Feature engineering is the process of preparing and transforming raw fields into comparison-ready signals for linkage. In deduplication, this often includes standardizing formats (case, punctuation, dates), nulling placeholder values, and creating derived representations that make similarity comparisons more reliable (for example, initials from names, birth year from date of birth, or normalized contact keys). The goal is to reduce noise, handle missingness consistently, and make match evidence easier to interpret and audit.

Distance measures quantify similarity between two values. Some are exact (match or no match). Others are approximate (for example, string similarity for names). In linkage, they are inputs to a decision process that still needs governance, review, and auditing.

We focus on feature engineering that supports reliable comparison rather than tool-specific variables. Examples include: normalizing contact fields so exact comparison is meaningful; creating stable variants that reduce sensitivity to common formatting differences; comparing names using both full strings and simplified representations like initials; and using birth year as a resilient signal when full dates are missing or error-prone.
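A short sketch of these derived signals is below. Column names are assumptions, and the standard-library similarity function stands in for the string comparators a linkage library would normally provide.

```python
import pandas as pd
from difflib import SequenceMatcher

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive comparison-ready signals from normalized fields (illustrative)."""
    df = df.copy()
    df["first_initial"] = df["first_name"].str[:1].str.upper()
    df["birth_year"] = pd.to_datetime(df["dob"], errors="coerce").dt.year
    df["contact_key"] = df["phone"].fillna(df["email"])  # normalized contact key
    return df

def name_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1]; a stand-in for Jaro-Winkler."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```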

Prefer high-trust signals and handle field drift deliberately #

Exact matches on date of birth (when not a placeholder), contact fields after normalization, postal or geographic keys, and internal identifiers do most of the work. Names are compared using approximate string similarity, but with high thresholds. A simple deterministic fallback such as last name with first initial can help catch common variants without expanding matching too broadly.

We also design around field stability. Addresses often change. Some phone numbers change. Last names change due to family transitions. Preferred names and aliases may be added over time. When comparison strategies and weights can reflect time and stability, they should.
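A sketch of the deterministic fallback described above, anchored by a stable field such as birth year, could look like this. Field names and the exact rule are illustrative assumptions.

```python
# Illustrative fallback rule: exact last name plus first initial,
# only when a stable anchor (birth year) also agrees.
def deterministic_fallback(a: dict, b: dict) -> bool:
    if not a.get("birth_year") or a.get("birth_year") != b.get("birth_year"):
        return False
    same_last = (a.get("last_name") or "").lower() == (b.get("last_name") or "").lower()
    same_initial = (a.get("first_initial") or "") == (b.get("first_initial") or "")
    return same_last and same_initial
```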

Expand strategically for candidate coverage #

Match candidate generation can benefit from temporarily expanding a single client into multiple rows to ensure all plausible comparisons occur. For example, a client with multiple contact values can be represented in multiple views so contact-based candidate rules are applied comprehensively. Preferred or alias names can be swapped into the comparison view so name-based candidate rules do not miss valid candidates.

The main safeguard is to prevent a client's expanded rows from matching each other. After linkage, results are collapsed back to the canonical client level.
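A minimal sketch of the expand-compare-collapse pattern is below, assuming a `phones` column that holds a list of normalized numbers per client; the column names are illustrative.

```python
import pandas as pd

def expand_contacts(df: pd.DataFrame) -> pd.DataFrame:
    """One row per (client, contact value) so contact-based rules see every pair."""
    expanded = df.explode("phones").rename(columns={"phones": "phone"})
    expanded["row_id"] = range(len(expanded))
    return expanded

def drop_self_pairs(pairs: pd.DataFrame) -> pd.DataFrame:
    """Expanded rows from the same client must never match each other."""
    return pairs[pairs["client_id_l"] != pairs["client_id_r"]]

def collapse_to_clients(pairs: pd.DataFrame) -> pd.DataFrame:
    """After linkage, keep one decision per canonical client pair."""
    return pairs.drop_duplicates(subset=["client_id_l", "client_id_r"])
```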

Candidate generation and blocking #

All linkage systems have the same constraint: comparing all records to all other records is not feasible at scale. Blocking reduces the comparison space by generating candidate pairs that share coarse attributes, then performing detailed comparisons only within those candidates.

Blocking is the step that generates candidate record pairs. Effective blocking keeps the candidate set small enough to run quickly while still finding true matches in the presence of missingness and common data entry variation.

We treat blocking as a set of practical principles. We usually combine multiple blocking strategies and take their union so that missingness in one field does not eliminate coverage. We also design blocking keys around stability and data quality. Birth year is often more robust than full date of birth. Forward sortation area (FSA) can be a useful geographic compromise, but it rarely stands alone. We avoid blocking on defaulted or placeholder values, and we handle organizations separately to prevent institutional contact details from dominating candidate generation.
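A simple Pandas sketch of this union-of-strategies idea follows. The blocking keys (birth year plus FSA, normalized phone, normalized email) and column names are illustrative assumptions; a linkage library would express the same rules more efficiently.

```python
import pandas as pd

def candidates_on(df: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Candidate pairs sharing all the given keys; never block on missing values."""
    blocked = df.dropna(subset=keys)
    pairs = blocked.merge(blocked, on=keys, suffixes=("_l", "_r"))
    pairs = pairs[pairs["client_id_l"] < pairs["client_id_r"]]  # one ordering per pair
    return pairs[["client_id_l", "client_id_r"]]

def candidate_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """Union of several blocking strategies so one missing field does not cost coverage."""
    rules = [["birth_year", "fsa"], ["phone"], ["email"]]
    return pd.concat([candidates_on(df, keys) for keys in rules]).drop_duplicates()
```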

Blocking has to be monitored operationally. Even high-performance linkage libraries will slow down or fail when blocking produces large candidate sets. A concrete trigger helps. We review block size distributions and investigate blocks that grow beyond a few hundred records. As a simple rule of thumb, blocks over 500 records deserve a closer look and usually benefit from splitting by an additional stable attribute or from targeted handling. This keeps runtime predictable and reduces false-positive pressure.
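A block-size probe is easy to run before prediction. The sketch below uses the 500-record rule of thumb mentioned above; key names and the limit are illustrative.

```python
import pandas as pd

def oversized_blocks(df: pd.DataFrame, keys: list, limit: int = 500) -> pd.DataFrame:
    """Return blocks larger than the limit so they can be split or handled separately."""
    sizes = df.dropna(subset=keys).groupby(keys).size().rename("n_records").reset_index()
    return sizes[sizes["n_records"] > limit].sort_values("n_records", ascending=False)
```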

Scoring, clustering, and review #

We use conservative thresholds and strong evidence requirements for automatic merges. Match thresholds start high and move only when we can demonstrate safety with audits and partner review. When evidence is mixed, especially when contact information is shared but name agreement is weak, pairs are routed to review rather than auto-merged. This is one of the most effective ways to prevent household over-merges.
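As a sketch of how that routing might be expressed, the function below sends mixed-evidence pairs (shared contact, weak name agreement) to review before any threshold check. The thresholds and field names are illustrative, not our production values.

```python
# Illustrative decision routing with conservative auto-merge thresholds.
AUTO_MERGE = 0.99
REVIEW = 0.85

def route(pair: dict) -> str:
    shared_contact = pair.get("phone_match") or pair.get("email_match")
    weak_name = pair.get("name_similarity", 0.0) < 0.9
    if shared_contact and weak_name:
        return "review"  # likely household sharing, not the same person
    if pair["match_probability"] >= AUTO_MERGE:
        return "merge"
    if pair["match_probability"] >= REVIEW:
        return "review"
    return "no_match"
```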

Audits as first-class outputs #

Every run produces outputs designed for inspection. We generate summaries, conflict probes that surface high-risk disagreement, and human-readable explanations that link outcomes to the signals that supported them. Upstream identifiers are preserved so decisions remain traceable. This supports program validation, privacy review, and operational troubleshooting without reverse engineering the linkage logic.
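A minimal sketch of one such explanation record is below; the field names are assumptions, and a real run would carry more signals.

```python
def audit_record(pair: dict, decision: str) -> dict:
    """One human-readable row per decision, with upstream identifiers preserved."""
    return {
        "source_id_l": pair["source_id_l"],
        "source_id_r": pair["source_id_r"],
        "decision": decision,
        "match_probability": pair["match_probability"],
        "evidence": {
            "dob_match": pair.get("dob_match"),
            "contact_match": pair.get("phone_match") or pair.get("email_match"),
            "name_similarity": pair.get("name_similarity"),
        },
    }
```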

Tooling and operations #

We have had good results using Splink with a DuckDB backend2, paired with vectorized preprocessing and post-processing in Pandas.
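For orientation, a Splink 3-style configuration might look roughly like the sketch below. Module paths and function names vary across Splink versions (Splink 4 reorganized the API), and the columns, blocking rules, and thresholds here are illustrative rather than our production settings.

```python
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "unique_id_column_name": "client_id",
    "blocking_rules_to_generate_predictions": [
        "l.birth_year = r.birth_year and l.fsa = r.fsa",
        "l.phone = r.phone",
        "l.email = r.email",
    ],
    "comparisons": [
        cl.exact_match("dob"),
        cl.jaro_winkler_at_thresholds("first_name", [0.9]),
        cl.jaro_winkler_at_thresholds("last_name", [0.9]),
        cl.exact_match("phone"),
    ],
}

linker = DuckDBLinker(df, settings)  # df: normalized pandas DataFrame
# Parameter estimation (m/u training) omitted here for brevity.
pairs = linker.predict(threshold_match_probability=0.85)
clusters = linker.cluster_pairwise_predictions_at_threshold(
    pairs, threshold_match_probability=0.99
)
```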

Working in the open #

Deduplication improves when it is designed with the people who understand the data and the work it supports. Engineering decisions need to reflect clinic workflows and information system constraints. Program teams and analysts need outputs they can validate. Privacy and governance partners need traceability and clear limits on automation.

Sharing approaches also matters outside a single organization. Many public health units face similar data realities, and strategies that work under those constraints can often translate across the health system. This post is intended as a reusable pattern that others can pilot, tune, and extend in their own context.