Why data quality is (still) essential for AI success.

TL;DR - Data quality has always been a critical factor for AI success. Today’s state-of-the-art large language models may promise magic-wand solutions, but they bring a whole new set of challenges rooted in their data foundation. Raising the data quality game within an organization means moving beyond reactive fixes and towards proactive design choices, because AI success still depends on something deceptively simple: data quality.

We’ve all heard the phrase: Garbage In, Garbage Out (or GIGO).
It’s a classic one-liner that perfectly captures a fundamental truth: if the input is flawed, the output will be too. In an age of increasingly advanced analytics and AI, this principle is more relevant than ever.

But here’s the twist: While the phrase has been around since the early days of computing, its implications have evolved. Today, with the rise of powerful AI models, we’re seeing a new challenge: the garbage is getting harder to spot.

To exemplify: when several leading AI models were prompted with identical job applicant profiles that differed only by gender, the models consistently advised women to ask for lower salaries than men. At the data volumes on which modern language models are trained, the training corpus will inevitably include controversial and stereotypical content, carrying all sorts of biases with it.

In other words: it is no longer safe to assume that convincing outputs mean the underlying input was high-quality. Moreover, sufficient data quality can no longer be achieved by the same means as several years ago.

That is why in this blog, I’ll discuss the Why, How and What of data quality for modern AI:

  • Why data quality is (still) essential for AI success.
  • How modern AI is redefining “quality”.
  • What organizations can do to keep up.

WHY: data quality is (still) essential for AI success

“While reviewing a seemingly high-quality dataset on plant characteristics, Bob (our lead data scientist) noticed that one field indicated the plant’s color as green 85% of the time. On the surface, the data looked accurate: no missing values, valid entries, and proper formatting. But the field was meant to record the flower color, not the plant’s foliage. The data was complete and correct, but not right. This semantic misunderstanding quietly reduced data quality and, had it gone unnoticed, could have undermined the model’s performance, because the feature did not reflect real-world circumstances.

In another project, while developing a model to generate digital maps from aerial imagery, we ran into persistent alignment issues between the model outputs and our reference maps. It turned out the aerial photos were recent, but the digital map labels were in some places several years out of date. Buildings had been demolished, roads rerouted, and new infrastructure added, none of which was reflected in the training data. The resulting mismatch taught us to critically evaluate the key data quality success factors, in this case timeliness in dynamic environments.”

These experiences are emblematic of the data science zeitgeist of the early 2020s. New technologies made it possible to turn ad hoc, opportunistic AI projects into mature production pipelines, though not without growing pains. Along the way, caution about the real-world applicability and added business value of AI grew as well.

According to a widely cited 2019 Gartner estimate, up to 85% of AI projects would come to fail, often due to poor data quality or insufficient relevant data. When you consider the scope and budget attached to such initiatives, an anticipated 15% success rate would have most managers second-guessing. Now that it is 2025 and we are a bit further into the hype cycle, we have also learned a lot along the way:

Data collection and preparation are often slow and prone to errors. In fact, these steps typically consume around 80% of the total AI development effort, leaving less time for actual modeling and validation.

Training data may not reflect real-world conditions. When datasets don’t match the diversity or complexity of actual use cases, models tend to underperform once deployed.

Unvetted data sources introduce hidden risks. Data pulled from uncontrolled environments like web scraping can carry biases, inaccuracies, or misinformation that go undetected until it’s too late.

Quality is often measured in oversimplified ways. Metrics like sample counts, class balance, or missing values are useful but insufficient for understanding whether a dataset is truly fit for purpose.

Data governance and documentation are often overlooked. Without clear ownership, curation processes, and lifecycle management, data quality tends to degrade with real consequences for AI outcomes.

Mitigating poor data quality has not just been crucial to avoid technical failure; it has also been essential to stave off regulatory risks and societal damage. We have all heard or seen examples where AI models trained on flawed, biased, or incomplete data made incorrect predictions, reinforced existing inequalities, or misled the decision-makers who relied on their outputs.

Want to learn more about our approach for ensuring AI initiatives don’t end up in that 85%?
Look no further than our lead consultant Stefans’ step-by-step run-through for building an AI strategy.

HOW: modern AI is redefining “quality”

“Meet our colleague Savas, who has been working as part of a team tasked with building an automated client helpdesk tool at a grid operator. In this use case, a Large Language Model (LLM) looks up internal documentation in order to answer customers’ questions sent via e-mail. The documentation database presented two main data quality challenges:

  • It contained both instructions for customer service workers (for internal use only) and customer-facing information. This resulted in answers that referenced systems the customer could not access (Relevance).
  • Instructional texts were structured per topic, but answering a question often required information from several different pages (Completeness/Identifiability).

Tackling these challenges involved employing a second LLM to sift out the customer-appropriate information so that answers would not contain impossible instructions, as well as an elaborate chunking strategy to ensure that text belonging to the same topic is provided together when needed (sketched below).

Both of these issues stem from the fact that the documentation was set up for human use (unsurprisingly). Looking beyond this specific use case, documentation could generally be written with usage by LLMs in mind as well: think of deliberate punctuation, alternative text descriptions of images, and better grouping of topics up front.”
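
To make those two mitigations more concrete, here is a minimal sketch of how such a filter-and-chunk step could look. The function names, model choice and page structure are illustrative assumptions, not the actual project code.

```python
# Illustrative sketch: filter out internal-only passages with a second LLM,
# then keep each topic's text together so related instructions are retrieved as one unit.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_customer_facing(passage: str) -> bool:
    """Ask an inexpensive model whether a documentation passage may be shown to customers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small model will do for this yes/no judgement
        messages=[
            {"role": "system",
             "content": "Answer YES if this passage can be shown to a customer, "
                        "NO if it describes internal-only systems or procedures."},
            {"role": "user", "content": passage},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

def chunk_by_topic(pages: dict[str, str]) -> list[dict]:
    """One chunk per topic, containing only the customer-appropriate paragraphs."""
    chunks = []
    for topic, text in pages.items():
        kept = [p for p in text.split("\n\n") if is_customer_facing(p)]
        if kept:
            chunks.append({"topic": topic, "text": "\n\n".join(kept)})
    return chunks
```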

Data quality has never been one-size-fits-all. It depends heavily on the use case, the domain and the risks involved. So what should you focus on? Following traditional data management practices, ensuring quality could start with aligning data to factors such as the following (a small programmatic sketch follows the list):

  • Accuracy: Are the values correct and verified?
  • Consistency: Are definitions and formats used uniformly?
  • Completeness: Are any critical elements missing?
  • Timeliness: Is the data up to date and refreshed as needed?
  • Relevance: Does the data actually support the business or model goal?
  • Uniqueness: Are duplicates avoided or accounted for?
  • Integrity: Are relationships between data sets preserved and coherent?
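
Several of these dimensions translate directly into checks that can be automated. Below is a minimal sketch using pandas and a hypothetical plant dataset; the file name and column names are illustrative assumptions.

```python
# Illustrative sketch: turning a few classic dimensions into automatable checks
# on a hypothetical "plants.csv" with columns flower_color and last_updated.
import pandas as pd

df = pd.read_csv("plants.csv", parse_dates=["last_updated"])
allowed_colors = {"red", "white", "yellow", "purple"}

checks = {
    # Completeness: is the critical field filled in?
    "flower_color_completeness": df["flower_color"].notna().mean(),
    # Uniqueness: are duplicates avoided or at least known?
    "duplicate_rows": int(df.duplicated().sum()),
    # Consistency: do values stay within the agreed vocabulary?
    "invalid_flower_colors": int((~df["flower_color"].dropna().isin(allowed_colors)).sum()),
    # Timeliness: how stale is the most recently refreshed record?
    "days_since_last_refresh": int((pd.Timestamp.now() - df["last_updated"].max()).days),
}
print(checks)
```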

However, there is a seemingly endless list of qualitative and quantitative data quality dimensions available for anyone willing to look for them. 

Figure 1: “Endless” data quality dimensions (via MIOSoft)

Moreover, new rules are being written as the field evolves. Text data, one of the data types used far more heavily nowadays, has different requirements for successful utilization. The actors who directly rely on the data being of good quality are also of another breed: more often than not, a machine. And even though we still use the same terms as before, these dimensions are now recontextualized. To illustrate:

Timeliness is key when working with instruction manuals in retrieval-augmented generation (RAG) use cases, as outdated texts can contain information that is no longer true or relevant and would undermine up-to-date guidance.

Consistency is important when training LLMs because inconsistent terminology or labeling (e.g., using both “customer ID” and “client number” for the same concept) can confuse the model. Ensuring consistent language and structure helps the model learn cleaner patterns and generate more reliable outputs.

Completeness is critical when training autonomous agents. High-quality training sequences need to fully capture start-to-finish processes to ensure agents can operate reliably in real-world environments. It is also essential for RAG, as missing context can cause an LLM to provide insufficient answers.
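
In a RAG setting, timeliness and completeness can even be enforced at retrieval time rather than only measured afterwards. A minimal sketch, assuming each retrieved chunk carries a topic and a last_updated timestamp (both illustrative assumptions rather than any specific product’s API):

```python
# Illustrative sketch: guard the context window for timeliness and completeness
# before the LLM ever sees it.
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=365)

def select_context(retrieved: list[dict], required_topics: set[str]) -> list[dict]:
    # Timeliness: drop chunks whose source text is older than the allowed age.
    fresh = [c for c in retrieved if datetime.now() - c["last_updated"] <= MAX_AGE]
    # Completeness: refuse to answer if a required topic is missing from the context.
    missing = required_topics - {c["topic"] for c in fresh}
    if missing:
        raise ValueError(f"Context incomplete, missing topics: {missing}")
    return fresh
```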

Additionally, the prioritization of these dimensions should not be determined solely by the use case at hand, which is what we are going to look at next.

Another important angle to inform your focus on data quality is the associated risk. The AI Act, which entered into force in 2024, prescribes a hierarchy of risk classifications for AI applications with corresponding obligations. Among the extra obligations for higher-risk applications is the use of high-quality datasets: “training, validation, and testing datasets to be relevant, representative, free of errors and complete”. Stay tuned for our upcoming blog on this topic.

WHAT: organizations can do to keep up

Anticipate downstream usage

Marc, our medior data engineer, has been working on an internal project that employs an LLM to conduct interviews in order to build a database about past projects. During the design phase, deliberate effort went into fine-tuning the format of the resulting qualitative database, making it possible for later projects to use the outputs directly, without any further wrangling. These design choices were driven by experience with similar portfolio databases and all the ifs and buts associated with processing them.

The pipeline is set up as follows: first, the interviewer application collects a large amount of text data. The collected data is then transformed into a schema by enforcing an output format, in this case JSON (with the help of pydantic in an API call to OpenAI). This format is well suited because it is both readable by humans and easily parsed and generated by machines. With regard to data quality, it also enables easy validation of results through familiar programmatic checks, e.g. checking character or word counts in specified fields, which is especially crucial when downstream systems expect a certain number of words or similar constraints. One potential risk, however, is nesting the schema too deeply, as successful data retrieval breaks down at some point.
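
One way to wire this up is sketched below, assuming the structured-output helper in the OpenAI Python SDK; the schema fields, transcript placeholder and word-count check are illustrative assumptions, not Marc’s actual code.

```python
# Illustrative sketch: a pydantic schema forces the LLM output into JSON,
# after which familiar programmatic checks validate the result.
from openai import OpenAI
from pydantic import BaseModel

class ProjectSummary(BaseModel):
    client_name: str
    description: str
    technologies: list[str]   # keep nesting shallow; deeply nested schemas retrieve less reliably

interview_transcript = "…full interview text collected by the interviewer application…"

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
completion = client.beta.chat.completions.parse(   # SDK helper that returns the response parsed into the schema
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Summarize this project interview as structured data."},
        {"role": "user", "content": interview_transcript},
    ],
    response_format=ProjectSummary,
)
summary = completion.choices[0].message.parsed     # a validated ProjectSummary instance (or None on refusal)

# Programmatic check on a specified field, e.g. the word budget a downstream template expects.
if summary and not 30 <= len(summary.description.split()) <= 120:
    raise ValueError("description falls outside the expected word count")
```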

One application that makes use of this portfolio database is a generation tool that creates one-pagers and CVs. In this tool, an LLM provides the content and further scripting parses it into PowerPoint or other (web) formats. Quality output is ensured by cutting the prompting process up into smaller chunks. This mitigates the risk of hallucinations and also enables the use of less advanced (and less expensive) models. Furthermore, extra instructions can be provided and intermediate results can be validated along the way.
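
As a minimal sketch of that chunked prompting pattern (the section names, word budgets and model choice are illustrative assumptions):

```python
# Illustrative sketch: generate a one-pager section by section, validating each
# intermediate result before anything is parsed into slides.
from openai import OpenAI

client = OpenAI()
SECTIONS = {"profile": 80, "key projects": 150, "skills": 60}  # section -> word budget

def generate_section(section: str, word_budget: int, record: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # smaller, focused prompts keep cheaper models reliable
        messages=[
            {"role": "system",
             "content": f"Write only the '{section}' section of a project one-pager, "
                        f"in at most {word_budget} words."},
            {"role": "user", "content": str(record)},
        ],
    )
    text = response.choices[0].message.content
    if len(text.split()) > word_budget:  # intermediate validation before any slide is built
        raise ValueError(f"Section '{section}' exceeds its word budget")
    return text

record = {"client": "ACME", "year": 2024, "summary": "Grid-inspection model rollout"}
one_pager = {name: generate_section(name, budget, record) for name, budget in SECTIONS.items()}
```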

Want to read more on the use of internal LLMs?

Prioritize best practices now to enable modern tools tomorrow

Another example of the importance of a quality data foundation nowadays is given to us by Don, our Lead Data Engineer. Through the continuous development of our Data platforms proposition, this unit is looking to provide our clients with the next step for BI: the enablement of AI/BI tools such as Genie or Cortex Analyst. These tools enable a form of self-service BI not seen before, where insight is provided on the spot by an agentic LLM. Taking this step enables faster time-to-insight, more autonomy for end users and less maintenance of reports, ultimately freeing up more of data engineers’ time (as well as shifting the development focus from custom to more generally usable data products).


These advanced services build on the analytics platforms many organizations already employ. However, there is an important difference in how engineers should approach the typical data processing for these new services. In a normal source-to-end-user pipeline, data engineers and similar professionals build data products for other data engineers and professionals to use. In the case of AI/BI tools, the interfacing LLM agent needs to be able to understand, look up, query and analyse all relevant tables, marts, models, documentation and so on. This greatly heightens the need for consistency and identifiability (for example, column naming conventions) on top of all the requirements of “normal” BI.
Don foresees that businesses that took the extra effort to fully document, rigorously validate and consistently model their data will have a much easier transition to chat interfaces for answering their custom data questions, because in those instances they will actually be able to provide the correct information to the end user in a much shorter timespan.
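
What might that extra effort look like in practice? A small, illustrative readiness check is sketched below; the schema dictionary and the naming convention are assumptions and not tied to any specific platform.

```python
# Illustrative sketch: verify that a mart's columns follow the naming convention and
# carry descriptions an AI/BI agent can rely on when interpreting the data model.
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")

def audit_columns(columns: dict[str, str]) -> list[str]:
    """columns maps column name -> documented description; returns the issues found."""
    issues = []
    for name, description in columns.items():
        if not SNAKE_CASE.match(name):
            issues.append(f"'{name}' breaks the snake_case naming convention")
        if not description.strip():
            issues.append(f"'{name}' has no description for the agent to work from")
    return issues

print(audit_columns({"customer_id": "Unique key of the customer", "ClientNr": ""}))
```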

Because data quality needs differ between contexts, there is still no magic wand. Whether you’re transforming traditional BI into AI/BI, embedding predictive models into daily operations, or experimenting with generative AI, you will need to take measures that go beyond collecting and cleaning data. Such efforts towards building in data quality are often best approached stepwise:

  1. Define: Identify what “quality” means for your organization, your use cases, and your compliance needs
  2. Assess: Evaluate your current data landscape, systems, and quality gaps
  3. Realize: Implement improvements in governance, pipelines, documentation, and validation processes
  4. Sustain: Embed data quality into daily operations, roles, and technology choices

Looking back at the past few years, we’ve often seen organizations succeed in the first two steps for a given AI use case. Defining and assessing quality gets done, sometimes thoroughly. But the realization and sustain phases often remain underdeveloped or disconnected. When those phases are formalized, they are usually geared towards more operational or tactical, day-to-day data needs such as traditional BI and management reporting.

To break that pattern, we recommend rethinking quality not just per AI project, but at the framework level. Build reusable structures, validation patterns, naming conventions and documentation templates that don’t just support today’s dashboards but tomorrow’s agents and assistants too.

Let’s get in touch: chat for free with our professionals.

We support all phases of this journey with a team that understands both the technical and organizational sides of data quality and AI. From consultants and data architects to data engineers and AI specialists, we bring the tools and expertise needed to turn data into a dependable asset for your AI strategy.