Tracing the Line: The Rare Diseases Case, Between the Data We Have and the Patients We're Missing

May 12

In the context of rare diseases, I find the word "rare" somewhat misleading. Taken one by one, each rare disease affects a tiny fraction of the population. Collectively, they affect hundreds of millions of people. In Europe alone, tens of millions live with a rare disease. Most will spend years searching for a diagnosis. And of those lucky enough to receive one, most will spend the rest of their lives without access to a treatment for their condition.

We have more patient data than ever before, and researchers understand the molecular basis of these diseases better than at any point in history. Yet progress for patients remains painfully slow. Where is the bottleneck? As someone who works on data infrastructure for a living, I happen to hold a particular hammer and I cannot help but see a data lineage nail.

Data fragmentation, by design

A rare disease patient's medical history is not held in one place. An EHR record here, a genomic sequence there, a specialist consultation in a system that talks to none of the others. Each source built by a different organisation, for a different purpose, under different assumptions.

Some of that fragmentation is deliberate and necessary. Patient data is sensitive. GDPR and equivalent regulations exist for good reasons, and the friction added by consent frameworks, data access agreements, and de-identification requirements is largely justified.

But the result, across hundreds of institutions and dozens of countries, is a picture that is very hard to read as a whole. The information exists, but the sources were never designed to be interpreted together. In the absence of a shared semantic foundation, even data that can be accessed legally and responsibly cannot always be compared meaningfully.

Meaning fragmentation, by design as well?

If data fragmentation has a rationale, the semantic fragmentation is harder to justify.

The intention to harmonize is there. It is why the major classification systems exist: HPO for phenotypic features, OMIM and Orphanet for disease definitions, ICD codes for clinical encounters. But like all standards, they suffer from drift. Each was built by a different scientific community with different priorities, none was originally designed to interoperate with the others, and the gap has widened with every year of inconsistent application across institutions. The same disease appears under different codes depending on which system a clinician was trained on.

Privacy regulations govern how data is collected, stored, accessed, and shared. They say nothing about how diseases should be named, how variants should be annotated, or how a confirmed diagnosis should be recorded. Semantic fragmentation accumulated as parallel communities each built what they needed, without coordinating around shared definitions. The mappings between those efforts were left as an afterthought.

The impact on rare disease clinical trials

Developing a treatment for a rare disease requires a thorough understanding of the patient population, and for that you first need to find the patients. Rare disease patients are scattered across the globe, and many of them are living with the wrong diagnosis or with no diagnosis at all.

Assembling the rare disease picture means drawing on multiple sources, often with data collected under different protocols and classified using different conventions. This is where the semantic fragmentation described above bites hardest: it compounds the needle-in-the-haystack problem with a problem of needle compatibility…

For a sponsor deciding whether to commit to a trial, uncertainty about how many eligible patients exist translates directly into recruitment risk. If you cannot demonstrate with confidence that enough eligible patients exist and can be reached, the incentive to start is undermined. Many rare disease programmes stall for this reason. More than a quarter of rare disease clinical trials started in recent years were terminated before completion because too few eligible patients could be enrolled.

From fragmentation to orchestrated federation

The data that could support a more accurate estimate already exists, in principle. Genomic biobanks hold large numbers of sequenced individuals whose records include health data. National registries track patient cohorts. These sources collectively contain information that, if queried coherently, would give a far more grounded picture of where eligible patients are and in what numbers.

The barrier is not primarily access. Queried individually, each source has its own access control mechanisms and the data model definitions needed to interpret its results. Queried collectively, however, the results cannot be straightforwardly compared or combined unless the underlying data was structured with that interoperability in mind from the start.

Getting value from those sources requires that they share enough semantic common ground to be interpreted together: consistent use of shared ontologies, explicit tracking of how concepts have been translated between systems, and enough provenance to understand what any given data point actually represents.
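To make "explicit tracking" and "enough provenance" a little more concrete, here is a minimal Python sketch of what a tracked translation between coding systems could look like. Everything in it is an assumption made for illustration: the `ConceptMapping` record, its field names, the relation labels, and the codes themselves are placeholders, not real identifiers from HPO, Orphanet, OMIM, or ICD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptMapping:
    """One documented translation between two coding systems (hypothetical schema)."""
    source_system: str
    source_code: str
    target_system: str
    target_code: str
    relation: str      # "exact", "broader" or "narrower": how faithful the translation is
    asserted_by: str   # who made the mapping
    version: str       # release of the mapping set it came from


# Placeholder systems and codes, invented for illustration only.
MAPPINGS = [
    ConceptMapping("LOCAL_ICD", "X00.1", "SHARED", "DISEASE:0001",
                   "exact", "registry-a-curators", "2024-01"),
    ConceptMapping("LOCAL_OMIM", "900001", "SHARED", "DISEASE:0001",
                   "exact", "biobank-b-curators", "2023-06"),
    ConceptMapping("LOCAL_ICD", "X00", "SHARED", "DISEASE:0001",
                   "broader", "registry-a-curators", "2024-01"),
]


def translate(system, code, mappings=MAPPINGS):
    """Return every mapping recorded for a local code, provenance included."""
    return [m for m in mappings
            if m.source_system == system and m.source_code == code]


def is_comparable(mapping):
    """Treat only exact mappings as safe to pool; anything else needs review."""
    return mapping.relation == "exact"


for m in translate("LOCAL_ICD", "X00"):
    print(m.target_code, m.relation, is_comparable(m))
# prints: DISEASE:0001 broader False
```

The point of the sketch is that a translation is stored as a record with an author, a version, and a stated degree of equivalence, rather than applied silently. A downstream analyst can then decide which translated data points are safe to combine.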

The standards and frameworks that would make this possible already exist, and for the most part rare disease registries and biobanks adhere to them. But at a global, federated scale, it is an oversimplification to assume that sustained commitment to standards will suffice on its own. A proper query orchestration layer is still necessary to make cross-source queries produce results that are actually comparable. That is not the responsibility of any single data source. It is infrastructure that has to be built and maintained independently. Alliances like GA4GH recognise this and have made meaningful progress on the governance and technical frameworks for federated querying in rare disease. What remains is the query execution layer: something that can take a clinical question, translate it across the semantic variations present in each source, and return results that a sponsor or researcher can actually act on.
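As a rough illustration of what such an execution layer does, the Python sketch below takes one shared concept, translates it into each source's local code, collects aggregate counts, and keeps track of how faithful each translation was. It is a toy under stated assumptions: the source names, codes, counts, and the `query_source` stand-in are all invented, and it does not represent GA4GH's or anyone else's actual API.

```python
# Hypothetical per-source crosswalks: shared concept -> (local code, mapping relation).
CROSSWALK = {
    "registry_a": {"DISEASE:0001": ("X00.1", "exact")},
    "biobank_b": {"DISEASE:0001": ("900001", "exact")},
    "hospital_c": {"DISEASE:0001": ("X00", "broader")},  # local code covers more than the concept
}

# Invented stand-ins for remote sources; each answers only with an aggregate count.
SOURCE_COUNTS = {
    "registry_a": {"X00.1": 12},
    "biobank_b": {"900001": 7},
    "hospital_c": {"X00": 40},
}


def query_source(source, local_code):
    """Placeholder for a remote call that would pass through the source's own access controls."""
    return SOURCE_COUNTS[source].get(local_code, 0)


def federated_count(concept):
    """Translate one shared concept per source, query each, and keep each answer's provenance."""
    results = []
    for source, crosswalk in CROSSWALK.items():
        if concept not in crosswalk:
            results.append({"source": source, "count": None, "relation": "unmapped"})
            continue
        local_code, relation = crosswalk[concept]
        results.append({
            "source": source,
            "local_code": local_code,
            "count": query_source(source, local_code),
            "relation": relation,
        })
    return results


def summarise(results):
    """Pool exactly mapped counts; report inexactly mapped counts separately."""
    exact = sum(r["count"] for r in results if r["relation"] == "exact")
    uncertain = sum(r["count"] for r in results
                    if r["relation"] not in ("exact", "unmapped"))
    return {"comparable": exact, "needs_review": uncertain}


print(summarise(federated_count("DISEASE:0001")))
# prints: {'comparable': 19, 'needs_review': 40}
```

Splitting the answer into a comparable total and a figure that needs review is a design choice of this sketch. It reflects the argument above: a sponsor needs a number they can act on and also needs to know which part of it rests on imperfect translations.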

The rare disease use case is a good example because the cost of lacking this federation infrastructure is so visible: patients never make it into the studies designed to help them and, worse, some studies stall altogether as a result.
