Designing Data Lakehouses for Public Health
Public health data rarely arrives in a single, analysis-ready form. It moves through administrative systems, partner submissions, spreadsheets, faxed documents, images, audio, and other artifacts that vary widely in structure and quality. Many of these materials are now also inputs to AI workflows that extract, classify, summarize, and link information across formats.
A storage architecture for this environment needs to support intake, preservation, transformation, and retrieval without assuming that every source should be standardized at the moment it arrives. In this article, we describe how a data lakehouse supports that progression while still meeting governance expectations.
Progressive normalization of multimodal data #
Public health organizations need stable analytical datasets to support reporting and operations, but much of what they receive is not ready for structured storage at intake. Traditional data warehouse approaches often assume that value comes only after data has been transformed into a predefined relational model. That works well for mature, repeatable reporting pipelines, but it can slow innovation when teams are still learning what incoming data contains, how it should be interpreted, or how it may be used across operational, analytical, and AI workflows.
For many public health data teams, a better pattern is to accept data as it arrives, preserve original artifacts, and apply structure in stages within a data lakehouse architecture. This allows teams to begin using new sources earlier, refine transformations incrementally, and avoid forcing premature design decisions at the point where the data is least understood.
A faxed document, for example, may first need to be retained as a source record, then processed through extraction or classification steps, and only later represented as structured fields suitable for reporting, search, or application use. Requiring that document to fit a relational schema at intake can delay useful work, increase rework, and narrow what teams are able to test or automate. A lakehouse makes it possible to support that progression while keeping multimodal and analytical workflows within the same governed environment.
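The staged progression above can be sketched as a minimal pipeline. This is an illustrative sketch, not a production design: the `SourceRecord` type, stage functions, and the trivial "extraction" step are all hypothetical placeholders for whatever retention, OCR, or ML extraction tooling a team actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class SourceRecord:
    """A received artifact, preserved as-is alongside provenance metadata."""
    artifact_id: str
    raw_bytes: bytes
    source: str
    extracted: dict = field(default_factory=dict)   # populated in a later stage
    structured: dict = field(default_factory=dict)  # populated in a later stage

def retain(raw: bytes, source: str, artifact_id: str) -> SourceRecord:
    # Stage 1: keep the original artifact; no schema is imposed at intake.
    return SourceRecord(artifact_id=artifact_id, raw_bytes=raw, source=source)

def extract(record: SourceRecord) -> SourceRecord:
    # Stage 2: placeholder extraction (a real pipeline might use OCR or a
    # classification model); the original bytes remain untouched either way.
    record.extracted = {"text": record.raw_bytes.decode("utf-8", errors="replace")}
    return record

def structure(record: SourceRecord, fields: dict) -> SourceRecord:
    # Stage 3: map extracted content into reporting-ready fields, possibly
    # long after the artifact was first received.
    record.structured = fields
    return record
```

Because each stage adds to the record rather than replacing it, the original artifact stays available for reprocessing when extraction logic improves.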
A data warehouse stores curated, structured data in a defined schema for consistent analysis and reporting. It is strongest when the main need is stable, well-understood downstream use.
A data lake stores data in original formats without requiring a schema at intake. This allows flexible ingestion of varied inputs, but it requires governance and later processing to support reliable use.
A data lakehouse combines these models by preserving original data where appropriate and applying validation, structure, and promotion over time so that trusted datasets can support analysis, applications, and AI-assisted workflows.
Zone structure #
Data lakehouses are typically organized into zones that separate received data, transformed data, certified data, exploratory work, and retained copies. A well-governed data lakehouse requires explicit rules for promotion between these zones.
At WDGPH, we have adopted a variation on the common medallion architecture, with the following zone definitions:
| Zone | Purpose | Expected properties |
|---|---|---|
| Landing | Initial receipt, validation, and routing of source data | Original form preserved temporarily, limited guarantees beyond completeness and provenance |
| Processed | Standardization and structural validation | Consistent schema, parsed and cleaned, suitable for downstream transformation but not routine consumption |
| Curated | Approved datasets for analysis and applications | Documented, monitored, attributable, and fit for routine use with defined ownership |
| Experimental | Short-lived analytical and development work | Isolated, time-limited, not relied on for production decisions |
| Archive | Long-term retention for compliance and reference | Durable, auditable, restricted modification, optimized for retrieval rather than active use |
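Explicit promotion rules can be enforced in code rather than by convention. The sketch below encodes the zone transitions implied by the table; the `PROMOTIONS` map and `promote` function are hypothetical, and a real implementation would attach dataset-specific validation checks rather than a single boolean.

```python
# Allowed promotions between zones; anything else is rejected outright.
PROMOTIONS = {
    "landing": {"processed"},
    "processed": {"curated"},
    "curated": {"archive"},
    "experimental": set(),  # exploratory work is never promoted
}

def promote(zone_from: str, zone_to: str, checks_passed: bool) -> str:
    """Return the destination zone if the move is both allowed and validated."""
    if zone_to not in PROMOTIONS.get(zone_from, set()):
        raise ValueError(f"promotion {zone_from} -> {zone_to} is not permitted")
    if not checks_passed:
        raise ValueError("promotion blocked: validation checks failed")
    return zone_to
```

Keeping the transition map small and explicit means a dataset cannot skip a zone, for example landing directly into curated, without the rule being changed deliberately.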
Boundaries for external data #
Organizations that receive data from partners need a clear boundary between external submission and internal processing. Treating these as distinct architectural concerns reduces risk and avoids exposing internal storage structures to external parties.
The externally reachable area should have a narrow function: accept submissions, confirm delivery, and retain only what is necessary for transfer. Internal landing is part of the processing environment, where data enters managed workflows. Movement between these areas should be explicit and event-driven, with automated handling for validation, routing, and exception management.
Users and interfaces #
A familiar directory structure may help readers picture the zones, with separate areas for intake, transformation, approved use, temporary analysis, and retention. But that analogy should not be taken literally: the lakehouse is meant for controlled system interaction, with most users accessing data through applications, reports, and services rather than by browsing storage directly.
Different users interact with that environment in different ways, with distinct requirements for access, performance, governance, and reliability. Typical public health users, their tasks, and platform requirements are described below:
| User or system | Primary task | Typical interface | Requirement from the platform |
|---|---|---|---|
| External partner | Submit files or records | Controlled receiving area | Reliable delivery without visibility into internal data |
| Data engineer or data scientist | Build ingestion, validation, and promotion pipelines | Code repository, workflow platform, notebooks | Stable paths, automation hooks, reproducibility, and scoped write access |
| Analyst or epidemiologist | Use approved datasets for reporting and analysis | Notebooks, BI tools | Curated data with consistent structure, lineage, and freshness metadata |
| Operational application | Support client lookup, record retrieval, or workflow logic | Search interface, application service, API | Fast access to approved subsets of data with defined ownership |
| Records or privacy staff | Retrieve retained information for compliance or review | Search and retrieval tooling | Auditability, retention controls, and durable storage |
Relational systems in a lakehouse context #
A lakehouse architecture does not eliminate the role of relational databases. They remain important for operational systems that depend on transactional updates, key-based relationships, and behaviors such as cascading changes. At the same time, lakehouse platforms have matured, and formats such as Delta Lake can now provide ACID transactions and constraints for tabular data stored in the lakehouse. The difference between these systems is not defined solely by transactional support. Relational databases combine storage and compute in a tightly coupled platform, while lakehouses keep storage and compute more loosely connected. In practice, organizations need to decide what belongs in each environment by considering not only functional requirements, but also whether the workload justifies the cost and responsiveness of a dedicated database platform.
ACID transactions are a set of reliability guarantees for tabular data operations.
- Atomicity means a change is applied completely or not at all.
- Consistency means the data must still satisfy defined rules after a change.
- Isolation means concurrent operations do not interfere with one another in invalid ways.
- Durability means committed changes persist even after failures.
These guarantees are important in production environments in which tabular datasets may be updated, validated, and consumed by multiple processes at once.
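Atomicity in particular is easy to demonstrate with any transactional store. The sketch below uses Python's built-in `sqlite3` module purely as an illustration: when one statement in a transaction fails, the whole transaction rolls back as a unit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
conn.execute("INSERT INTO cases VALUES (1, 'open')")
conn.commit()

# Atomicity: both statements apply, or neither does.
try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE cases SET status = 'closed' WHERE id = 1")
        conn.execute("INSERT INTO cases VALUES (1, 'duplicate')")  # violates PK
except sqlite3.IntegrityError:
    pass

# The failed transaction was rolled back as a unit: the UPDATE is gone too.
status = conn.execute("SELECT status FROM cases WHERE id = 1").fetchone()[0]
```

After the failed transaction, `status` is still `'open'`: the successful UPDATE inside the same transaction did not survive on its own.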
Delta Lake is an open-source table format for lakehouse environments that adds features such as a transaction log, schema enforcement, and versioned table state on top of file-based storage. This allows tabular data in a lakehouse to support capabilities such as ACID transactions and reliable incremental processing.
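The idea of a transaction log producing versioned table state can be illustrated with a toy implementation. To be clear, this is a simplified teaching sketch and not the Delta Lake protocol, which additionally specifies checkpoints, schema metadata, and optimistic concurrency control; the `TinyTableLog` class and its action format are invented for illustration.

```python
import json
import os

class TinyTableLog:
    """A toy append-only transaction log over file-based storage."""

    def __init__(self, root: str):
        self.log_dir = os.path.join(root, "_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def version(self) -> int:
        # Versions are numbered 0, 1, 2, ...; -1 means no commits yet.
        return len(os.listdir(self.log_dir)) - 1

    def commit(self, actions: list[dict]) -> int:
        v = self.version() + 1
        path = os.path.join(self.log_dir, f"{v:020d}.json")
        # Write then rename, so readers never observe a partial commit file.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(actions, f)
        os.rename(tmp, path)
        return v

    def state_at(self, version: int) -> list[dict]:
        """Reconstruct table state at a version by replaying the log."""
        files: list[dict] = []
        for v in range(version + 1):
            path = os.path.join(self.log_dir, f"{v:020d}.json")
            with open(path) as f:
                for action in json.load(f):
                    if action["op"] == "add":
                        files.append(action)
                    elif action["op"] == "remove":
                        files = [a for a in files if a["file"] != action["file"]]
        return files
```

Replaying the log to any commit is what makes versioned reads ("time travel") possible: earlier table states are derived, not stored separately.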
Closing #
The case for a lakehouse is strong for fast-moving public health data teams with multimodal data requirements. Used to its full capability, a lakehouse allows structured and unstructured data to be handled together, with refinement applied progressively rather than imposed at intake. That makes it easier to support reliable reporting, partner data intake, records retrieval, and AI-assisted workflows within a single governed environment.