
Tracing the Line: Intent in the Agentic AI Era

  • Jan 27
  • 5 min read

Updated: Jan 28

In highly regulated domains like life sciences, data lineage isn’t optional. If you’ve ever worked with clinical trials, even remotely, you know exactly what I mean. Data lineage is baked into the very definition of regulated evidence. Across clinical research, drug development, diagnostics, and similar areas, evidence must be traceable, reproducible, and defensible over time. After all, this is what it takes when improving patients’ lives is at stake. Every time I reflect on this, I think: if that isn’t data lineage, then what is? Yet I’ve seen teams spend weeks herding data through transformation pipelines, only to have to manually trace the resulting evidence back to the raw collected data when responding to a reviewer’s enquiry.


I believe this is because true data lineage, understood not merely as a technical trace of data movements but as an explicit, multi-dimensional property of the data across systems, transformations, and time, is rarely designed as such. Decisions about derivations, transformations, and contextual assumptions are usually scattered across code, documentation, and individual expertise. As a result, the lineage is reconstructed after the fact instead of being preserved from the start.

This is a tension I’ve observed firsthand: teams try to balance strict regulatory demands with local, practical workflows, and this balancing act often creates the very paradox that makes data lineage so hard to get right.


Part of the problem is simply prioritization. Under constant delivery pressure, teams optimize for milestones and submission deadlines. Traceability gets attention when it is required, not because it is valuable or incentivized on its own. Having well-established standards helps. Standards establish shared definitions, expected structures, and baseline transformation rules. They reduce ambiguity and make local decisions more comparable. But standards alone do not eliminate local variation. Transformations, derivations, and interpretations are still applied differently across teams. When those deviations are not tracked explicitly, even reasonable local choices accumulate and start to clash.


This rarely looks like a problem while data remains locked in project silos. It becomes one when data is reused, or when reuse is attempted. That is when inconsistencies surface as conflicting datasets, opaque derivation histories, and duplicated or incompatible transformations across projects. You may be thinking that this is classic ETL territory, but it is not. Lineage is not about the mechanics of moving or transforming data. It is about capturing decisions, assumptions, and context across systems, teams, and time.


Compounding the problem, most stakeholders interact only with fragments of the data. I happen to have an outsider’s perspective as an information architect, which lets me spot patterns across the gaps. A statistician sees an analysis-ready dataset. A programmer sees a transformation script. A reviewer sees a table or figure. No single role naturally spans the full lifecycle, and this, in my opinion, is why lineage matters.


But who is actually responsible for data lineage?


In practice, responsibility is fragmented. Architects design pipelines. Data stewards define standards. Teams implement transformations and document decisions. Platform teams operate infrastructure. Each group owns a piece, yet no one owns lineage end to end. This diffusion of accountability reflects the real effort and complexity of making lineage explicit, coherent, and durable across organizational and technical boundaries. Accountability is still possible, but only if it is deliberately reinforced, a discipline many organizations struggle to put in place.


Many organizations make a pragmatic choice. They do not pursue full traceability. They accept partial lineage. They prioritize delivery and defer coherence, and it works, seemingly at least. But every shortcut has a cost. It is not always immediate; it often manifests later as lost opportunities:

  • Lost opportunities for data reuse: Cleaned datasets often stay locked in their original project. Without lineage capturing how the data was processed, what assumptions were made, and which transformations were applied, reusing it in a new study requires repeating costly curation and integration work. For example, reusing a clinical trial’s lab results across multiple drug studies can double data preparation time if the derivation logic isn’t explicitly recorded.

  • Exposure to GenAI hallucinations: AI models rely on context to interpret and combine data correctly. Missing lineage metadata means models generate outputs without understanding underlying assumptions, leading to incorrect or misleading results. For instance, a model predicting adverse events might misinterpret a lab measurement’s unit or population subset, producing hallucinated correlations.

  • Missed operational insights: Lineage reveals patterns in processes, bottlenecks, and dependencies. Without it, organizations can’t identify where errors propagate or where efficiencies exist. For example, tracing which preprocessing steps consistently introduce variance in assay results can uncover process improvements and reduce downstream rework.


For corporate data strategy, the question is not whether full lineage is desirable; that is hopefully a given. The real challenge is weighing the perceived cost and complexity: how far should we go in pursuing lineage, and conversely, how much semantic and historical fidelity are we willing to sacrifice? Every choice carries trade-offs and shapes the future capabilities we may be giving up as a result.


And what does proper data lineage actually require?


Paradoxically, both more and less than most organizations assume.

It requires more intent, and far less tooling. The instinctive response I’ve most often observed is to look for a technical solution. Data platforms, catalogs, and observability stacks promise visibility and control. These tools are useful, necessary even. But they are not the tipping point. Without explicit intent about what needs to be traced, why it matters, and at what level of granularity, tooling captures activity rather than meaning.


The challenge is that intent is often fragile. For business subject matter experts, it’s taken for granted and assumed obvious. For IT partners building the solutions, it can feel abstract or vague. This is where many lineage initiatives quietly fail. After months of assembling technical solutions, the system may log that a dataset was produced by a job at a specific time, but it fails to capture how, and why, the transformation was applied. For example, a lab results dataset might show that values were scaled or normalized, but without recording the reason (e.g., adjusting for batch effects or standardizing units). A human expert might infer the missing context, but an automated process cannot, making automated downstream analyses unreliable.
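To make this concrete, here is a minimal sketch of what logging the rationale alongside a transformation could look like. All names are illustrative, and the glucose unit conversion is just a stand-in example; the point is that the record carries the "why" and the "how", not only the event:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class TransformationRecord:
    """One lineage entry: not just *what* ran, but *why* it ran."""
    operation: str    # what was done, e.g. "unit_conversion"
    rationale: str    # why it was done -- the part most tooling never captures
    parameters: dict  # how it was done, exactly
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def convert_units(values_mg_dl, record_log):
    """Convert glucose from mg/dL to mmol/L, logging the intent with the event."""
    factor = 1 / 18.016  # common mg/dL -> mmol/L conversion factor for glucose
    record_log.append(TransformationRecord(
        operation="unit_conversion",
        rationale="Standardize lab units to mmol/L so results can be pooled "
                  "across studies that collected glucose in different units.",
        parameters={"from": "mg/dL", "to": "mmol/L", "factor": factor},
    ))
    return [round(v * factor, 2) for v in values_mg_dl]

log: list[TransformationRecord] = []
converted = convert_units([90.0, 126.0], log)  # -> [5.0, 6.99]
```

A downstream consumer, human or automated, can now see not only that values changed, but the assumption behind the change, which is exactly the context an AI model or a reviewer would otherwise have to guess.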


What happens in practice is that local, adaptive usage patterns emerge silently. Users advance their work in the system and, with the best intentions, document the rationale on the side, somewhere, in some form. It could be a cell in a spreadsheet, a note in meeting minutes, or a comment in a shared code repository. Over time, organizations accumulate lineage records that are technically present but practically irrelevant.


To address this, intent has to be made explicit early on. As an information architect, this has consistently been the point I’ve pushed for: making what is assumed, local, or tacit visible and durable. Without a clear purpose behind each transformation, lineage inevitably collapses into a collection of facts without meaning, regardless of how advanced the tooling appears. This is where ontologies, global identifiers, and FAIR data principles add real value, not as academic constructs, but as practical means to formalize contextual intent so it can be shared, reused, and remain interoperable beyond the boundaries of individual systems and teams.
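As a sketch of what formalizing intent with global identifiers might look like, consider the record below. All IRIs and field names are placeholders, not real ontology terms or a standard schema; the idea is that identifiers resolve beyond one team's vocabulary, and that a simple check can reject lineage records lacking identity or rationale:

```python
# A minimal, FAIR-flavored lineage record. Global identifiers make the intent
# machine-resolvable beyond a single project. All IRIs below are placeholders.
dataset_record = {
    "@id": "https://example.org/datasets/trial-042/labs-v2",      # global identifier
    "derivedFrom": "https://example.org/datasets/trial-042/labs-raw",
    "variable": {
        "label": "glucose",
        "concept": "https://example.org/ontology/serum-glucose",  # ontology term
        "unit": "https://example.org/units/mmol-per-L",
    },
    "transformation": {
        "operation": "unit_conversion",
        "rationale": "Standardize units for cross-study pooling.",
    },
}

REQUIRED = ("@id", "derivedFrom", "variable", "transformation")

def is_fair_enough(record: dict) -> bool:
    """Reject lineage records that lack global identifiers or a recorded rationale."""
    return (
        all(key in record for key in REQUIRED)
        and record["@id"].startswith("http")
        and bool(record["transformation"].get("rationale"))
    )
```

A gate like this, applied at pipeline boundaries, is one way intent stops being a side note in a spreadsheet and becomes part of the data itself.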


That said, ontologies and FAIR principles are necessary but not sufficient on their own. Making intent explicit only works when it is supported by complementary organizational structures. Domain data ownership establishes accountability for decisions. Governance models define how local autonomy is balanced with global coherence. Information architecture then provides the connective metadata tissue, translating intent, ownership, and rules into coherent structures that persist over time. It is only when metadata is treated as a first-class asset, rather than a byproduct, that data lineage delivers on its promise.


With these elements in place, lineage becomes something that can survive personnel changes, system migrations, and evolving analytical methods. Without them, even the most sophisticated tooling has nothing stable to anchor to.


Data lineage in regulated environments is almost a moral commitment, and one that becomes even more critical in the GenAI era. Treating it as such is what separates projects that merely pass reviews from those that remain interpretable, defensible, and usable years later as autonomy increases.

