
Your data doesn’t need to be perfect. It needs to be in one place.

The ‘we need to clean our data first’ line is how you spend a year not shipping AI. Consolidation is the unlock; cleaning is a nice-to-have.

The most expensive sentence in an AI strategy meeting is “we need to clean our data first.” It sounds responsible. It is actually a stall tactic dressed up as diligence. Every company we have shipped AI for had messy data on day one. None of them cleaned it before we shipped. The unlock was never clean data. The unlock was data in one place.

Recent industry surveys keep putting a hard number on this. Only 7% of enterprises say their data is fully ready for AI, and 73% admit they struggle with preparation. Gartner expects 60% of AI projects to be abandoned by the end of this year, largely because teams tried to boil the ocean on data quality before they had anything running. Meanwhile, the companies actually shipping are the ones who built systems that work with data as it is - not the ones with the cleanest spreadsheets.

What “clean” actually buys you

Clean data helps. It reduces hallucinations on structured extraction. It makes analytics dashboards less embarrassing. It is genuinely required for things like training a classifier or running a regression. But for most of the AI use cases teams want to ship right now - RAG over internal docs, agent-style lookups, summarization, triage, drafting - clean data is a nice-to-have, not a gate.

Modern retrieval handles noise surprisingly well. Vectorization lets a malformed FAQ, a half-finished Notion doc, and a 2019 PDF sit in the same index and still produce a decent answer. Re-ranking models pull the better source to the top. You can even let the model flag which source it trusted and why. All of that works on data that would fail a strict data-quality audit.
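To make that concrete, here is a rough sketch of retrieval plus re-ranking over exactly that kind of mixed bag. The model names and the three "documents" are invented for illustration, not a recommendation:

```python
# A sketch of retrieval plus re-ranking over messy, mixed-quality sources.
# Model names and the example "documents" are placeholders.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

docs = [
    "FAQ (half-finished): refund window is 30 days?? check w/ finance",
    "Notion draft: enterprise customers get a dedicated CSM, SLA tbd",
    "2019 PDF export: Refunds are processed within 30 business days of the request.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = embedder.encode(docs, convert_to_tensor=True)  # the "one place"

query = "How long do refunds take?"
scores = util.cos_sim(embedder.encode(query, convert_to_tensor=True), index)[0]
candidates = [docs[i] for i in scores.argsort(descending=True)[:3].tolist()]

# The re-ranker pulls the better source to the top despite the noise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, d) for d in candidates])
for score, doc in sorted(zip(rerank_scores, candidates), reverse=True):
    print(f"{score:.2f}  {doc[:60]}")
```

Nothing in that sketch required fixing the FAQ or re-exporting the PDF. The index and the re-ranker do the sorting.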

Fragmentation is the real blocker

The top obstacle to AI-ready data in the same surveys is not quality. It is fragmentation: 56% of enterprises cite siloed data and integration difficulty as the primary barrier. That lines up with what we see. The blocker is the customer record that lives half in Salesforce, a quarter in a spreadsheet, and the rest in someone’s inbox. Not the typos in the address field.

Put the same data in one place - even if it is still messy - and three things get easier immediately:

  • Retrieval actually works. The model can find the relevant chunk because the relevant chunk exists in the index it is querying.
  • Access control becomes a single conversation, not a cross-system compliance project.
  • You can measure what the AI is doing. Logs, evals, and audits all assume a single system of record. If there are five, there is no record.

Consolidation is cheaper than cleaning

A data cleaning project is open-ended. It has no obvious end state. Every clean-up reveals three more fields someone never filled out. Consolidation is bounded. Pick a target store - a warehouse, a lakehouse, a document index, a vector database, even a thoughtfully modelled Postgres - and move the relevant data in. You will notice the quality problems during the move, and you can triage them as they appear, but the project has a shippable milestone. The cleaning project does not.

This is the shape of most of our intelligent data analytics engagements. The first phase is almost never “clean the data.” It is “put it in one place, with consistent identifiers, and write down where it came from.” The cleaning that ends up mattering gets done along the way, on the handful of fields the AI actually reads. We do the same on the document side in document processing - unify ingestion, extract what we need, move on.
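As a rough sketch of that first phase - with invented file names and columns, not an actual client pipeline - the move can be as small as this:

```python
# A sketch of "consolidate first": one table, one ID rule, provenance recorded.
# File names, column names, and the ID rule are hypothetical.
import sqlite3
import pandas as pd

def load(path: str, source: str, id_col: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    df["canonical_id"] = df[id_col].astype(str).str.strip().str.lower()  # consistent identifier
    df["source_system"] = source  # write down where it came from
    return df

customers = pd.concat(
    [
        load("salesforce_export.csv", "salesforce", id_col="Email"),
        load("finance_sheet.csv", "spreadsheet", id_col="contact_email"),
    ],
    ignore_index=True,
)

# Messy fields ride along untouched; clean later, and only the fields the AI reads.
with sqlite3.connect("consolidated.db") as conn:
    customers.to_sql("customers", conn, if_exists="replace", index=False)
```

The quality problems surface during the move - a null here, a duplicate there - and each one is a triage decision with a bounded scope, not a new workstream.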

What “in one place” actually means

Not literally one database. One logical surface the AI can query with consistent identifiers, consistent auth, and a consistent idea of which record is the source of truth for which field. A customer has one canonical ID. A contract has one canonical location. A support ticket has one canonical status. The underlying storage can be three systems; what matters is that the AI sees them as one.
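Here is a minimal sketch of that logical surface, with hypothetical backends and field names:

```python
# A sketch of "one logical surface": several backends, one canonical record.
# Backend names, fields, and the source-of-truth mapping are hypothetical.
from typing import Callable

# Which system owns which field - this mapping is the consolidation decision.
SOURCE_OF_TRUTH = {
    "email": "crm",
    "contract_url": "document_store",
    "ticket_status": "helpdesk",
}

def get_customer(canonical_id: str, backends: dict[str, Callable[[str], dict]]) -> dict:
    """Assemble one customer record, field by field, from the owning systems."""
    record = {"canonical_id": canonical_id}
    for field, system in SOURCE_OF_TRUTH.items():
        record[field] = backends[system](canonical_id).get(field)
    return record

# Stub backends standing in for a CRM, a document index, and a ticketing system.
backends = {
    "crm": lambda cid: {"email": f"{cid}@example.com"},
    "document_store": lambda cid: {"contract_url": f"s3://contracts/{cid}.pdf"},
    "helpdesk": lambda cid: {"ticket_status": "open"},
}

print(get_customer("acme-0042", backends))
```

The AI only ever calls the resolver. Whether the fields come from one database or three is an implementation detail it never sees.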

That surface is also what makes the next step - orchestrator agents, multi-step workflows, cross-team tools - possible at all. See our note on orchestrator agents needing a single source of truth. The shortest path to an AI that does real work is a pile of average data with a good index on top of it, not a pristine dataset nobody has looked at in six months.

Stop cleaning. Start consolidating. You can always scrub the typos later - ideally with the AI you shipped in the meantime.