
Designing Data Lakehouses for Public Health


Public health data rarely arrives in a single, analysis-ready form. It moves through administrative systems, partner submissions, spreadsheets, faxed documents, images, audio, and other artifacts that vary widely in structure and quality. Many of these materials are now also inputs to AI workflows that extract, classify, summarize, and link information across formats.

A storage architecture for this environment needs to support intake, preservation, transformation, and retrieval without assuming that every source should be standardized at the moment it arrives. In this article, we describe how a data lakehouse supports that progression while still meeting governance expectations.

Progressive normalization of multimodal data #

Public health organizations need stable analytical datasets to support reporting and operations, but much of what they receive is not ready for structured storage at intake. Traditional data warehouse approaches often assume that value comes only after data has been transformed into a predefined relational model. That works well for mature, repeatable reporting pipelines, but it can slow innovation when teams are still learning what incoming data contains, how it should be interpreted, or how it may be used across operational, analytical, and AI workflows.

For many public health data teams, a better pattern is to accept data as it arrives, preserve original artifacts, and apply structure in stages within a data lakehouse architecture. This allows teams to begin using new sources earlier, refine transformations incrementally, and avoid forcing premature design decisions at the point where the data is least understood.

A faxed document, for example, may first need to be retained as a source record, then processed through extraction or classification steps, and only later represented as structured fields suitable for reporting, search, or application use. Requiring that document to fit a relational schema at intake can delay useful work, increase rework, and narrow what teams are able to test or automate. A lakehouse makes it possible to support that progression while keeping multimodal and analytical workflows within the same governed environment.
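The staged progression described above can be sketched as a small pipeline. The stage names and fields here are hypothetical, chosen to mirror the faxed-document example; a real implementation would plug in the organization's own extraction and validation tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Artifact:
    """A received source record, preserved as-is with provenance."""
    source: str
    received_at: str
    raw_bytes: bytes
    extracted: dict = field(default_factory=dict)  # populated later
    curated: dict = field(default_factory=dict)    # populated later

def ingest(source: str, payload: bytes) -> Artifact:
    # Stage 1: retain the original artifact; no schema is imposed yet.
    return Artifact(source=source,
                    received_at=datetime.now(timezone.utc).isoformat(),
                    raw_bytes=payload)

def extract(artifact: Artifact, fields: dict) -> Artifact:
    # Stage 2: attach extracted or classified fields (e.g. from OCR or an
    # AI extraction step). The raw bytes are kept alongside the output.
    artifact.extracted = fields
    return artifact

def curate(artifact: Artifact, required: list[str]) -> Artifact:
    # Stage 3: promote to structured fields only once the required
    # fields exist, rather than forcing a schema at intake.
    missing = [k for k in required if k not in artifact.extracted]
    if missing:
        raise ValueError(f"not ready for curation, missing: {missing}")
    artifact.curated = {k: artifact.extracted[k] for k in required}
    return artifact

# A faxed lab report moves through the stages incrementally.
fax = ingest("fax:clinic-a", b"%PDF-...")
fax = extract(fax, {"client_id": "123", "test": "measles IgG",
                    "result": "positive"})
fax = curate(fax, required=["client_id", "test", "result"])
```

Nothing is discarded along the way: the raw artifact, the intermediate extraction, and the curated fields coexist, which is what allows transformations to be refined later without re-requesting the source.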

A data warehouse stores curated, structured data in a defined schema for consistent analysis and reporting. It is strongest when the main need is stable, well-understood downstream use.

A data lake stores data in original formats without requiring a schema at intake. This allows flexible ingestion of varied inputs, but it requires governance and later processing to support reliable use.

A data lakehouse combines these models by preserving original data where appropriate and applying validation, structure, and promotion over time so that trusted datasets can support analysis, applications, and AI-assisted workflows.

Zone structure #

Data lakehouses are typically organized into zones that separate received data, transformed data, certified data, exploratory work, and retained copies. A well-governed data lakehouse requires explicit rules for promotion between these zones.

At WDGPH, we have adopted a variation on the common medallion architecture, with the following zone definitions:

| Zone | Purpose | Expected properties |
| --- | --- | --- |
| Landing | Initial receipt, validation, and routing of source data | Original form preserved temporarily; limited guarantees beyond completeness and provenance |
| Processed | Standardization and structural validation | Consistent schema, parsed and cleaned; suitable for downstream transformation but not routine consumption |
| Curated | Approved datasets for analysis and applications | Documented, monitored, attributable, and fit for routine use with defined ownership |
| Experimental | Short-lived analytical and development work | Isolated, time-limited, not relied on for production decisions |
| Archive | Long-term retention for compliance and reference | Durable, auditable, restricted modification; optimized for retrieval rather than active use |
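Explicit promotion rules can be expressed directly in code. The checks below are illustrative, not WDGPH's actual criteria: a dataset moves to the next zone only when it carries the properties that zone expects.

```python
# Hypothetical promotion gate: each zone declares the properties a
# dataset must demonstrate before it may be promoted into that zone.
ZONE_REQUIREMENTS: dict[str, set[str]] = {
    "processed": {"provenance", "checksum_verified", "schema_valid"},
    "curated":   {"provenance", "checksum_verified", "schema_valid",
                  "owner", "documentation", "monitoring"},
}

def can_promote(dataset_properties: set[str], target_zone: str) -> bool:
    """Return True if the dataset meets the target zone's requirements."""
    return ZONE_REQUIREMENTS[target_zone] <= dataset_properties

landing_props = {"provenance", "checksum_verified", "schema_valid"}
assert can_promote(landing_props, "processed")    # ready for Processed
assert not can_promote(landing_props, "curated")  # not yet fit for routine use
```

Encoding the rules this way makes promotion auditable: the reason a dataset is, or is not, in the curated zone is a set difference rather than a judgment call buried in a pipeline.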

Boundaries for external data #

Organizations that receive data from partners need a clear boundary between external submission and internal processing. Treating these as distinct architectural concerns reduces risk and avoids exposing internal storage structures to external parties.

The externally reachable area should have a narrow function: accept submissions, confirm delivery, and retain only what is necessary for transfer. Internal landing is part of the processing environment, where data enters managed workflows. Movement between these areas should be explicit and event-driven, with automated handling for validation, routing, and exception management.
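A minimal sketch of that event-driven boundary, with hypothetical validation rules: each submission event either routes the payload into internal landing or into an exception area, and nothing lingers in the externally reachable area once transfer is confirmed.

```python
import hashlib

def handle_submission(filename: str, payload: bytes,
                      declared_sha256: str) -> str:
    """Route one external submission; returns the internal destination.

    Hypothetical rules for illustration: reject empty payloads and
    verify the sender's declared checksum. Real validation would also
    cover format, size, and sender identity.
    """
    if not payload:
        return "exceptions/empty-payload"
    if hashlib.sha256(payload).hexdigest() != declared_sha256:
        return "exceptions/checksum-mismatch"
    # Transfer confirmed: the file enters the internal landing zone and
    # can now be removed from the external receiving area.
    return f"landing/{filename}"

payload = b"id,result\n123,positive\n"
dest = handle_submission("lab_results.csv", payload,
                         hashlib.sha256(payload).hexdigest())
```

The external party only ever sees the receiving area and a delivery confirmation; the `landing/` and `exceptions/` destinations are internal routing decisions invisible to the submitter.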

Users and interfaces #

A familiar directory structure may help readers picture the zones, with separate areas for intake, transformation, approved use, temporary analysis, and retention. But that analogy should not be taken literally: the lakehouse is meant for controlled system interaction, with most users accessing data through applications, reports, and services rather than by browsing storage directly.

Different users interact with that environment in different ways, with distinct requirements for access, performance, governance, and reliability. Typical public health users, their tasks, and platform requirements are described below:

| User or system | Primary task | Typical interface | Requirement from the platform |
| --- | --- | --- | --- |
| External partner | Submit files or records | Controlled receiving area | Reliable delivery without visibility into internal data |
| Data engineer or data scientist | Build ingestion, validation, and promotion pipelines | Code repository, workflow platform, notebooks | Stable paths, automation hooks, reproducibility, and scoped write access |
| Analyst or epidemiologist | Use approved datasets for reporting and analysis | Notebooks, BI tools | Curated data with consistent structure, lineage, and freshness metadata |
| Operational application | Support client lookup, record retrieval, or workflow logic | Search interface, application service, API | Fast access to approved subsets of data with defined ownership |
| Records or privacy staff | Retrieve retained information for compliance or review | Search and retrieval tooling | Auditability, retention controls, and durable storage |

Relational systems in a lakehouse context #

A lakehouse architecture does not eliminate the role of relational databases. They remain important for operational systems that depend on transactional updates, key-based relationships, and behaviors such as cascading changes. At the same time, lakehouse platforms have matured, and formats such as Delta Lake can now provide ACID transactions and constraints for tabular data stored in the lakehouse.

The difference between these systems is not defined solely by transactional support. Relational databases combine storage and compute in a tightly coupled platform, while lakehouses keep storage and compute more loosely connected. In practice, organizations need to decide what belongs in each environment by considering not only functional requirements, but also whether the workload justifies the cost and responsiveness of a dedicated database platform.

ACID transactions are a set of reliability guarantees for tabular data operations.

  • Atomicity means a change is applied completely or not at all.
  • Consistency means the data must still satisfy defined rules after a change.
  • Isolation means concurrent operations do not interfere with one another in invalid ways.
  • Durability means committed changes persist even after failures.

These guarantees are important in production environments in which tabular datasets may be updated, validated, and consumed by multiple processes at once.
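As a concrete illustration of atomicity and durability (not how any particular lakehouse engine implements them), file systems commonly achieve all-or-nothing updates by writing to a temporary file and renaming it into place: a reader sees either the old version or the new one, never a half-written file.

```python
import json
import os
import tempfile

def atomic_write_json(path: str, record: dict) -> None:
    """Replace the file at `path` with `record`, completely or not at all."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
            f.flush()
            os.fsync(f.fileno())  # durability: force bytes to disk
        os.replace(tmp, path)     # atomicity: one rename, no partial state
    except BaseException:
        os.remove(tmp)            # failed write leaves the old file intact
        raise

atomic_write_json("dataset_state.json", {"version": 2, "rows": 1450})
```

The temporary file must live in the same directory as the target so the final `os.replace` is a same-filesystem rename, which is what makes it atomic on POSIX systems.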

Delta Lake is an open-source table format for lakehouse environments that adds features such as a transaction log, schema enforcement, and versioned table state on top of file-based storage. This allows tabular data in a lakehouse to support capabilities such as ACID transactions and reliable incremental processing.
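The transaction-log idea can be shown with a toy version. This is a deliberate simplification of Delta Lake's actual `_delta_log` protocol, which records JSON action entries per commit: each commit here appends one numbered log file, and the current table state is recovered by replaying the log in order.

```python
import json
import os

def commit(log_dir: str, action: dict) -> int:
    """Append one commit as a numbered JSON file; returns the new version."""
    os.makedirs(log_dir, exist_ok=True)
    version = len(os.listdir(log_dir))
    with open(os.path.join(log_dir, f"{version:020d}.json"), "w") as f:
        json.dump(action, f)
    return version

def table_state(log_dir: str) -> set[str]:
    """Replay the log in order to reconstruct the set of live data files."""
    files: set[str] = set()
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            action = json.load(f)
        if "add" in action:
            files.add(action["add"])
        if "remove" in action:
            files.discard(action["remove"])
    return files

commit("toy_log", {"add": "part-0001.parquet"})
commit("toy_log", {"add": "part-0002.parquet"})
commit("toy_log", {"remove": "part-0001.parquet"})
```

Because every version of the log is retained, earlier table states remain reconstructible by replaying only a prefix of the commits, which is the mechanism behind versioned table state and reliable incremental processing.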

Closing #

The case for a lakehouse is strong for fast-moving public health data teams with multimodal data requirements. Used to its full capability, a lakehouse allows structured and unstructured data to be handled together, with refinement applied progressively rather than imposed at intake. That makes it easier to support reliable reporting, partner data intake, records retrieval, and AI-assisted workflows within a single governed environment.

Author
PhD, MSc · Current Team Member
Justin is Principal Data Scientist at Wellington-Dufferin-Guelph Public Health, where he manages portfolios in data science, data engineering, and AI. He previously completed a PhD (2020) and MSc (2014) in Applied Statistics at the University of Guelph, where he developed machine learning methods for outbreak reconstruction using pathogen genetic data and created models for spatially correlated data. Justin believes that open-source software is a powerful framework for building consistency and trust across the public health system.