Why Your AI Agents Keep Failing: Fix the Data Model First

There’s a pattern playing out inside almost every enterprise data team right now. The AI proof of concept lands beautifully in a demo environment. Stakeholders nod. Budget gets approved. Then production happens — and the agent starts hallucinating, the recommendations drift, and the data pipeline quietly poisons everything downstream. The model isn’t broken. The data model is.

The Production Gap Nobody Wants to Talk About

Dustin Dorsey at dbt puts it plainly: the gap between a working AI proof of concept and a working AI deployment almost always lives in the data model, not the AI model itself. Barr Moses made the same argument at Snowflake Summit, framing it as a pipeline problem with compounding consequences: garbage data flowing into AI agents doesn’t produce garbage outputs occasionally; it produces them systematically, at scale, with confidence scores attached.

This is particularly acute for brands running first-party data programmes in Southeast Asia, where consent-driven data collection is still maturing and the data arriving into CRMs and CDPs is often inconsistent in structure, language, and completeness. Thai customer records collected via LINE OA look different from Indonesian records pulled through a Shopee affiliate integration. When an AI agent is asked to act on unified profiles that were never truly unified, the agent fails — and the team blames the AI.

The fix isn’t a better model. It’s a defensible data contract at the point of ingestion.

What ‘Garbage Data’ Actually Means in Practice

Data quality issues tend to cluster into three failure modes that Moses identifies as most damaging in production: schema drift (upstream sources change their structure without notice), stale joins (two tables linked on a key that no longer reliably represents the same entity), and silent nulls (missing values that aren’t flagged as missing, so the model treats absence as signal).
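Of the three, silent nulls are the easiest to check for directly. The following is a minimal sketch, assuming rows arrive as Python dictionaries; the placeholder list and the field names are illustrative assumptions, not taken from any specific tool.

```python
# Hypothetical placeholder strings that often stand in for "missing"
# without being flagged as missing.
PLACEHOLDERS = {"", "n/a", "na", "null", "none", "unknown", "-"}


def is_silent_null(value):
    """Treat None and placeholder strings as missing values."""
    if value is None:
        return True
    if isinstance(value, str):
        return value.strip().lower() in PLACEHOLDERS
    return False


def silent_null_rates(rows, fields):
    """Return the share of rows in which each field is effectively missing."""
    rates = {}
    for field in fields:
        missing = sum(1 for row in rows if is_silent_null(row.get(field)))
        rates[field] = missing / len(rows) if rows else 0.0
    return rates


# Made-up sample rows for illustration only.
rows = [
    {"email": "a@example.com", "city": "Bangkok"},
    {"email": "N/A", "city": "unknown"},
    {"email": "", "city": "Jakarta"},
    {"email": "b@example.com"},  # "city" key absent entirely
]
rates = silent_null_rates(rows, ["email", "city"])
```

Tracking these rates per source and per market makes it visible when absence is being passed downstream as if it were signal.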

For a regional brand running a loyalty programme across five markets, schema drift alone can invalidate months of behavioural modelling. A promotional redemption event in Malaysia carries a different event_type string than the same event fired from a Vietnam-localised app version — because no one enforced a canonical event taxonomy at collection time. The recommendation engine then learns two separate behavioural signals from what is functionally identical customer behaviour.
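A canonical event taxonomy can start as something as simple as a lookup applied at collection time. The sketch below assumes hypothetical event_type strings for each market; the names in the mapping are invented for illustration and would come from your own tracking plan.

```python
# Hypothetical mapping from market-specific event_type strings
# to a single canonical name.
CANONICAL_EVENT_TYPES = {
    "promo_redeem": "promotion_redeemed",
    "promotion_redeemed": "promotion_redeemed",
    "voucher_used_vn": "promotion_redeemed",
}


def canonicalise_event_type(raw):
    """Return the canonical event type, or None if the string is unmapped."""
    if raw is None:
        return None
    key = raw.strip().lower()
    return CANONICAL_EVENT_TYPES.get(key)


# Made-up events for illustration only.
events = [
    {"market": "MY", "event_type": "promo_redeem"},
    {"market": "VN", "event_type": "Voucher_Used_VN"},
    {"market": "TH", "event_type": "mystery_event"},
]
normalised = [canonicalise_event_type(e["event_type"]) for e in events]
```

Returning None for unmapped strings, rather than passing them through, is a deliberate choice here: an unknown event type becomes something to review instead of a new behavioural signal the model learns by accident.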

The tactical fix here is unsexy but non-negotiable: data contracts enforced at the source, not patched at the transformation layer. Define what a valid customer event looks like before it enters your warehouse, and reject or quarantine records that don’t conform. Data observability tooling such as Monte Carlo’s complements those contracts with automated monitoring of data freshness, volume, and schema consistency, so anomalies surface before agents consume them.
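As a rough illustration of what reject-or-quarantine can look like at ingestion, here is a minimal sketch using only the Python standard library. The field names, the market codes, and the ingest function are assumptions made for the example, not the API of any particular product.

```python
from datetime import datetime

# Hypothetical contract: required fields and their expected types.
REQUIRED_FIELDS = {
    "customer_id": str,
    "event_type": str,
    "occurred_at": str,
    "market": str,
}
# Hypothetical set of market codes the programme operates in.
ALLOWED_MARKETS = {"TH", "ID", "MY", "VN", "SG"}


def validate_event(record):
    """Return a list of contract violations; an empty list means it conforms."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing:{field}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong_type:{field}")
    # Fields outside the contract are a sign of upstream schema drift.
    unexpected = set(record) - set(REQUIRED_FIELDS)
    if unexpected:
        problems.append("unexpected_fields:" + ",".join(sorted(unexpected)))
    market = record.get("market")
    if isinstance(market, str) and market.strip() and market not in ALLOWED_MARKETS:
        problems.append("unknown_market")
    timestamp = record.get("occurred_at")
    if isinstance(timestamp, str) and timestamp.strip():
        try:
            datetime.fromisoformat(timestamp)
        except ValueError:
            problems.append("bad_timestamp")
    return problems


def ingest(records):
    """Split records into accepted and quarantined, keeping the reasons."""
    accepted, quarantined = [], []
    for record in records:
        problems = validate_event(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            accepted.append(record)
    return accepted, quarantined


# Made-up batch for illustration only.
batch = [
    {"customer_id": "c-1", "event_type": "promotion_redeemed",
     "occurred_at": "2024-05-01T10:00:00", "market": "MY"},
    {"customer_id": "", "event_type": "promotion_redeemed",
     "occurred_at": "2024-05-01T10:05:00", "market": "VN"},
    {"customer_id": "c-3", "event_type": "promotion_redeemed",
     "occurred_at": "yesterday", "market": "TH", "promo_code": "X1"},
]
accepted, quarantined = ingest(batch)
```

Quarantining with the reasons attached, rather than silently dropping records, keeps the failure visible to whoever owns the upstream source.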

LLMs as Precision Tools, Not Magic Wands

One area where the ‘fix the model’ temptation is strongest is recommendation systems. Piero Paialunga’s technical walkthrough in Towards Data Science demonstrates how LLMs can meaningfully increase recommendation precision — specifically by enriching sparse interaction data with semantic understanding of product attributes and user intent signals that traditional collaborative filtering misses.

The Python implementation Paialunga outlines uses LLM embeddings to represent items and users in a shared semantic space, then applies nearest-neighbour retrieval against that space rather than relying solely on historical co-purchase signals. For cold-start scenarios — a chronic problem on Lazada or Shopee where new SKUs have zero transaction history — this approach can surface contextually relevant recommendations from day one.
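The retrieval step can be sketched in a few lines. This is not Paialunga’s code: it is a simplified illustration that assumes embeddings have already been computed, and it uses toy three-dimensional vectors and invented item names in place of real LLM embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_items(query_vector, item_vectors, k=2):
    """Return the ids of the k items most similar to the query vector."""
    scored = [
        (cosine_similarity(query_vector, vector), item_id)
        for item_id, vector in item_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Toy vectors standing in for LLM embeddings of item metadata (assumption).
item_vectors = {
    "trail_shoes": [0.9, 0.1, 0.0],
    "running_socks": [0.8, 0.2, 0.1],
    "rice_cooker": [0.0, 0.1, 0.9],
}
# Toy vector standing in for a user's intent embedding (assumption).
user_vector = [1.0, 0.0, 0.0]
recommended = nearest_items(user_vector, item_vectors, k=2)
```

Because an item needs only metadata to get a vector, a brand-new SKU can be retrieved without any transaction history. A production system would typically use an approximate nearest-neighbour index rather than this exhaustive scan.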

But here’s what the tutorial doesn’t foreground: the quality of the LLM embeddings depends heavily on the richness of the item metadata fed into them. If your product catalogue has inconsistent category labels, missing attribute fields, or multilingual descriptions that haven’t been normalised, the semantic space the LLM constructs is distorted from the start. The algorithm is sound. The data it’s working with isn’t.

This is the same argument in a different register. Investing in LLM-enhanced recommendations without first auditing catalogue data quality is spending on a faster engine before checking whether the tyres are flat.

When Behaviour Shifts, Your Data Model Needs to Shift With It

A Fullstory survey of over 1,000 U.S. consumers found that 31% of travellers are booking earlier specifically to offset rising prices, which points to a structural shift in booking behaviour rather than a seasonal blip. The sample is American, but for travel and hospitality brands in Southeast Asia, where OTA competition is fierce and margin pressure from Agoda, Traveloka, and Trip.com is constant, a behavioural shift of this kind has direct revenue implications.

The strategic point isn’t the statistic itself — it’s what it reveals about the relationship between real-world behaviour change and data model design. If your customer journey model was built around a booking window assumption from 2023 data, your AI-driven pricing and personalisation logic is now operating on a stale behavioural premise. The model hasn’t changed. The world has. And your data architecture needs a mechanism to detect and respond to that drift.
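One simple form such a mechanism could take is a comparison of booking lead times between a baseline window and a recent window. The sketch below is an assumption-laden illustration: the 20% threshold, the function name, and the sample numbers are all invented for the example.

```python
from statistics import median


def lead_time_drift(baseline_days, recent_days, threshold=0.2):
    """Flag drift when the recent median booking lead time differs from the
    baseline median by more than `threshold`, expressed as a fraction."""
    base = median(baseline_days)
    recent = median(recent_days)
    if base == 0:
        # Relative change is undefined; flag any non-zero recent median.
        return {"baseline": base, "recent": recent,
                "relative_change": None, "drifted": recent != 0}
    change = (recent - base) / base
    return {"baseline": base, "recent": recent,
            "relative_change": change, "drifted": abs(change) > threshold}


# Made-up lead times in days between booking and travel, for illustration.
baseline = [20, 25, 30, 35, 40]
recent = [35, 40, 45, 50, 55]
report = lead_time_drift(baseline, recent)
```

A check like this does not say why behaviour moved, only that the premise behind the pricing and personalisation logic may no longer hold and should be reviewed.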

Building temporal sensitivity into your data models — tracking when behavioural patterns shift, not just what they currently are — is the difference between a first-party data programme that compounds in value over time and one that slowly becomes a liability. Consent-based longitudinal data, collected transparently and modelled carefully, is genuinely one of the most durable competitive advantages available to a Southeast Asian brand right now. The brands that will unlock it aren’t the ones with the biggest AI budgets. They’re the ones that got their data contracts right first.

Key Takeaways

Enforce data contracts at the point of ingestion — schema drift and silent nulls kill AI agent performance in production, and patching at the transformation layer is too late.
LLM-enhanced recommendation systems only outperform traditional models when the underlying item and user metadata is structurally clean and semantically consistent across markets and languages.
Design your data models to detect behavioural drift over time, not just reflect current patterns — first-party data programmes that can’t adapt to shifting customer behaviour become a liability, not an asset.

The question worth sitting with: if your data model was designed for a world where customer behaviour was more predictable and privacy constraints were looser, what would it actually cost — in time, trust, and technical debt — to rebuild it for the one you’re operating in now? That’s not a rhetorical question. It’s the scoping exercise most teams are avoiding.

At grzzly, we work with growth and data teams across Southeast Asia to build first-party data programmes that are structurally sound before they’re scaled, designing the data contracts, taxonomy, and observability layer that make AI activation actually work in production. If the gap between your proof of concept and your production results feels uncomfortably familiar, let’s talk.
