AI in business requires control over data quality

These days, companies are increasingly moving beyond simply asking whether they can use artificial intelligence. A different question is becoming far more challenging: do they know what data they are feeding into their AI systems, who is responsible for that data, and whether it can be trusted in business processes?

This is a fundamental shift. In the first wave of generative AI roll-outs, the key priorities were access to models, the speed of pilot projects, and initial applications in customer service, sales, marketing, IT and document analysis. In the second phase, companies are facing a more down-to-earth problem. AI does not scale on the basis of presentations, but on the basis of data: data that is often scattered, inconsistent, partly out of date, unlabelled or subject to legal restrictions.

That is why data governance is no longer a topic confined to a small group of data specialists. It is becoming one of the prerequisites for the secure, compliant and scalable use of AI within an organisation. The greater the role AI plays in decision-making, automation and day-to-day processes, the more important it is to be able to answer a few fundamental questions: where does the data come from, is it permitted to use it, what is its quality, is it representative, who has access to it, and who is responsible for its consequences?

Data is a new checkpoint for AI

In traditional analytics, an error in the data could lead to an inaccurate report or a misleading dashboard. With AI, the same error can be replicated by the model, utilised by an agent, carried over into an operational decision, or hidden within a recommendation that nobody checked in time. AI therefore amplifies not only the value of data, but also its shortcomings.

For this reason, data readiness for AI cannot be understood simply as the availability of large datasets. An organisation may have vast amounts of data, yet still not be ready to scale up AI. Problems may include a lack of data owners, a lack of metadata, unclear definitions, incomplete data lineage, quality gaps, a lack of classification of personal data, or a lack of post-implementation monitoring procedures.

The AI Act reinforces this approach. The European Commission describes the AI Act as the first comprehensive legal framework for AI, built on a risk-based approach. For high-risk systems, it sets out, amongst other things, requirements concerning high-quality datasets, activity logging, documentation, information for deployers, human oversight, and a high level of robustness, cybersecurity and accuracy.

This means that data governance in the context of AI is no longer merely a matter of best practice. In many applications, it is becoming a mechanism for providing evidence: a company must not only claim that it controls its data, but also be able to demonstrate this.

What does data governance mean in practice?

Data governance for AI can be described as a system of policies, roles, processes and controls that enable an organisation to manage data throughout the entire lifecycle of an AI system. It is not just about training data. In corporate deployments, validation and test data, data used in RAG systems, user input, logs, production data, reference data and information used to monitor model performance after deployment are equally important.

A mature organisation should know which data is critical to the operation of the AI system, what its limitations are, who owns it, who is authorised to process it, how often it is updated, and how it affects the model’s output. Without this, scaling up AI simply multiplies the risks.

The AI Act currently provides the strongest basis for this approach with regard to high-risk systems. It requires that training, validation and test data be subject to data governance and data management practices appropriate to the system’s intended purpose. In practice, this means managing data provenance, dataset preparation, representativeness, errors, completeness, bias and data gaps.

Data governance component	Source material	Significance for the organisation
Source of the data and the collection process	The AI Act refers to data collection processes, the origin of the data and, for personal data, the original purpose of collection	A company must know where the data comes from and whether it can use it for a particular AI purpose
Data preparation	The AI Act refers to operations such as annotation, labelling, cleaning, updating, enrichment and aggregation	It is not enough simply to have a dataset; you need to know how it has been processed
Data assumptions	The AI Act requires the formulation of assumptions about what the data is intended to measure and represent	This helps to reduce the risk of spurious proxies and false correlations
Availability, quantity and suitability of data	The AI Act refers to the assessment of the availability, quantity and suitability of datasets	Scaling AI requires an assessment of whether the data is suitable for the system’s purpose
Data bias	The AI Act requires an assessment of potential biases, as well as measures to detect, prevent and mitigate such biases	This is crucial in applications that affect people, such as recruitment, credit, education and public services
Gaps and missing data	The AI Act requires the identification of gaps or missing data that could prevent compliance with the requirements	Data readiness also involves knowing what is missing from the data
Representativeness, completeness and errors	The AI Act requires that data be relevant, sufficiently representative, as free from errors as possible, and complete in relation to the system’s purpose	This is the cornerstone of data quality assessment for AI

This table shows that governance is not an abstract concept. It is a very specific list of areas that must be described, measured or documented before AI is integrated into larger-scale processes.
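One way to make the areas in the table tangible is to keep a small documentation record for each dataset. The sketch below is only an illustration: the class `DatasetRecord`, its field names and the example values are hypothetical and are not taken from the AI Act or from any particular data catalogue. It lists which areas have not yet been described for a given dataset.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class DatasetRecord:
    """Hypothetical documentation record for one dataset used by an AI system."""

    name: str
    source: str = ""                       # where the data comes from
    original_collection_purpose: str = ""  # relevant when personal data is involved
    preparation_steps: List[str] = field(default_factory=list)  # e.g. labelling, cleaning
    measurement_assumptions: str = ""      # what the data is meant to measure and represent
    suitability_assessment: str = ""       # availability, quantity and suitability
    bias_review: str = ""                  # findings and mitigation measures
    gap_assessment: str = ""               # known gaps and how they will be addressed
    representativeness_notes: str = ""     # representativeness, errors, completeness

    def undocumented_areas(self) -> List[str]:
        """Return the names of the governance areas that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Illustrative record: only the source and the preparation steps are documented.
record = DatasetRecord(
    name="loan_applications_2023",
    source="export from the core banking system",
    preparation_steps=["deduplication", "manual labelling"],
)
print(record.undocumented_areas())
```

An empty result would mean that every area has at least some description. It would say nothing about whether that description is adequate, which remains a matter for the data owner and compliance.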

Data readiness must be measured, not merely declared

The biggest mistake made by companies scaling up AI is treating data quality as a matter of opinion. In practice, data readiness should be measured using a set of metrics that make it possible to determine whether a given dataset can be safely used in a specific AI application.

There is no single, universal data readiness threshold that applies to all industries and use cases. Systems supporting marketing will have different requirements to those classifying service requests, whilst high-risk solutions – such as those in recruitment, credit, education, healthcare or critical infrastructure – will have yet other requirements. However, it is possible to identify metrics that stem directly from the logic of the AI Act, the GDPR, the NIST AI RMF and data quality standards for analytics and machine learning.

ISO/IEC 5259-5:2025 sets out a data quality governance framework for analytics and machine learning. The standard is aimed, amongst others, at those responsible for organisational and data quality management, which clearly demonstrates that data quality for AI is not merely a technical task.

Area	Metric	What does it measure?	Basis in the source material
Completeness	Percentage of required fields with no missing entries	Does the dataset contain the data required for the AI’s purpose?	The AI Act addresses data completeness and gaps
Correctness	Percentage of values that comply with business or reference rules	Is the data as free from errors as possible?	The AI Act requires that data be as free from errors as possible
Representativeness	Comparison of training data with the population or context of use	Does the data reflect the conditions under which the system will operate?	The AI Act requires data to be sufficiently representative
Timeliness	Percentage of data falling within the required SLA timeframe	Does the AI use data that is up to date for the process?	ISO/IEC 5259 relates data quality to the analytics and ML lifecycle
Fitness for purpose	Assessment of the suitability of data for a specific AI application	Is the data suitable for the intended purpose, rather than simply being available?	The AI Act requires an assessment of the suitability of datasets
Bias review	A documented overview of possible biases	Could the data lead to biased or discriminatory results?	The AI Act requires the identification and mitigation of biases
Data gaps	The number and significance of missing segments, fields or sources	What is missing for the system to meet the requirements?	The AI Act requires the identification of data gaps and shortcomings
Ownership	Percentage of critical datasets with an assigned owner	Is anyone responsible for quality, definitions and access?	ISO/IEC 5259-5 addresses data quality within the framework of governance

It is important that these metrics are not managed solely at the level of the central data team. When it comes to AI, the business owner should be responsible for the meaning and context of the data, the data team for its quality and flows, IT for the architecture and security, and compliance for adherence to regulations and internal policies. Without this, an organisation may have the technical tools for governance, but still lack actual accountability.
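Several of the percentage-based metrics in the table above (completeness, correctness, timeliness and ownership) can be calculated with very little code. The following is a minimal sketch that assumes the data is available as a list of dictionaries. The function names, field names, business rules and the 30-day SLA are all hypothetical and would need to be agreed between the business owner and the data team.

```python
from datetime import datetime, timedelta


def completeness(records, required_fields):
    """Share of required field slots that are filled (not None or empty string)."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for name in required_fields
        if record.get(name) not in (None, "")
    )
    return filled / total


def correctness(records, rules):
    """Share of rule checks that pass; `rules` maps a field name to a predicate."""
    checks = passed = 0
    for record in records:
        for name, rule in rules.items():
            value = record.get(name)
            if value in (None, ""):
                continue  # missing values are counted by completeness, not here
            checks += 1
            passed += bool(rule(value))
    return passed / checks if checks else 0.0


def timeliness(records, timestamp_field, now, max_age):
    """Share of records whose timestamp falls within the agreed SLA window."""
    if not records:
        return 0.0
    fresh = sum(
        1
        for record in records
        if record.get(timestamp_field) is not None
        and now - record[timestamp_field] <= max_age
    )
    return fresh / len(records)


def ownership_coverage(datasets):
    """Share of datasets flagged as critical that have an assigned owner."""
    critical = [d for d in datasets if d.get("critical")]
    if not critical:
        return 0.0
    return sum(1 for d in critical if d.get("owner")) / len(critical)


# Illustrative data only.
records = [
    {"customer_id": "C1", "country": "PL", "age": 34, "updated_at": datetime(2025, 1, 10)},
    {"customer_id": "C2", "country": "", "age": 210, "updated_at": datetime(2024, 6, 1)},
    {"customer_id": "C3", "country": "DE", "age": 51, "updated_at": datetime(2025, 1, 2)},
    {"customer_id": None, "country": "FR", "age": 28, "updated_at": datetime(2024, 12, 20)},
]
rules = {
    "age": lambda v: 0 <= v <= 120,
    "country": lambda v: v in {"PL", "DE", "FR"},
}
datasets = [
    {"name": "crm_customers", "critical": True, "owner": "Head of Sales Operations"},
    {"name": "support_tickets", "critical": True, "owner": None},
    {"name": "web_logs", "critical": False, "owner": None},
]

report = {
    "completeness": completeness(records, ["customer_id", "country", "age"]),
    "correctness": correctness(records, rules),
    "timeliness": timeliness(records, "updated_at", datetime(2025, 1, 15), timedelta(days=30)),
    "ownership": ownership_coverage(datasets),
}
print(report)
```

The other rows of the table, such as fitness for purpose and the bias review, are documented assessments rather than percentages, so a calculation of this kind does not cover them.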

Regulations and standards all say the same thing: keep your data under control

The AI Act does not operate in a vacuum. Companies scaling up their use of AI must, in parallel, take into account the GDPR, risk management and data quality standards, and internal security policies. In practice, all these frameworks lead to a similar conclusion: an organisation should know what data it is using, why, on what basis, with what risks involved, and under whose responsibility.

The GDPR remains particularly important where AI uses personal data. The European Commission points out that data protection principles include, amongst others, lawfulness, fairness and transparency, purpose limitation, data minimisation, accuracy, storage limitation, integrity and confidentiality, and accountability.

The NIST AI RMF organises AI risk management through the functions of Govern, Map, Measure and Manage. In this framework, governance is a cross-cutting function that supports the other activities throughout the AI lifecycle.

Framework	What it brings to data governance for AI	A question for the board
AI Act	It requires control over data, documentation, logging, dataset quality and risk mitigation in high-risk systems	Do we know where the data comes from, how it was compiled and whether it is representative?
GDPR	It requires purpose limitation, data minimisation, accuracy, security, storage limitation and accountability in relation to personal data	Do we have a legal basis and a clearly defined purpose for the use of personal data in AI?
NIST AI RMF	It structures AI risk management through the Govern, Map, Measure and Manage functions	Is data risk mapped, measured and managed after deployment?
ISO/IEC 5259	It addresses data quality for analytics and machine learning	Is data quality managed systematically rather than on an ad hoc basis?

This perspective is particularly important for boards of directors. Implementing AI is no longer just a technological decision. It is a decision about how an organisation manages risk, accountability and evidence of compliance.

Five questions to ask before scaling up AI

In practice, companies do not need to start with a major data transformation programme. A good first step is to review the most important AI applications and the data that powers them. If an organisation cannot answer the questions below, it is probably not yet ready to scale up securely.

Question	What the answer shows
Do we know where the data used by AI comes from?	Transparency of data sources and the ability to trace their origin
Do we know whether the data is representative of the system’s purpose?	The risk of incorrect results, bias and poor decisions
Is the data as free from errors and as complete as possible?	Actual data quality, not just a claim of quality
Do we have the right to use personal data for this specific purpose?	Compliance with the GDPR, the AI Act and internal policies
Do we monitor the data after the model has been implemented?	The ability to detect drift, errors and incidents following the deployment of AI

These questions are simple, but their organisational implications are serious. If a company does not know where the data comes from, it will not be able to reconstruct how the model arrived at its decisions. If it does not assess representativeness, it may implement a system that works well only in the pilot phase. If it does not monitor the data after implementation, it will not notice that the conditions under which the model operates have changed.
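The last point, noticing that conditions have changed, can be illustrated with a simple drift check. The sketch below uses the Population Stability Index (PSI), a widely used industry heuristic for comparing a baseline distribution with current production data. None of the frameworks discussed here prescribes it. The feature values are made up, and the alert level of 0.25 is an often-quoted rule of thumb, not a regulatory threshold.

```python
import math
from collections import Counter


def population_stability_index(baseline, current, epsilon=1e-6):
    """PSI between two samples of a categorical feature.

    Each category contributes (q - p) * ln(q / p), where p is its share in
    the baseline and q its share in the current data. Shares are floored at
    `epsilon` so that categories absent from one sample do not cause a
    division by zero.
    """
    if not baseline or not current:
        raise ValueError("both samples must be non-empty")
    base_counts = Counter(baseline)
    curr_counts = Counter(current)
    psi = 0.0
    for category in set(baseline) | set(current):
        p = max(base_counts[category] / len(baseline), epsilon)
        q = max(curr_counts[category] / len(current), epsilon)
        psi += (q - p) * math.log(q / p)
    return psi


# Illustrative data: the contact channel seen at training time versus in production.
training_channel = ["email"] * 70 + ["phone"] * 30
production_channel = ["email"] * 40 + ["phone"] * 40 + ["chat"] * 20

ALERT_LEVEL = 0.25  # hypothetical threshold; to be set per use case
psi = population_stability_index(training_channel, production_channel)
if psi > ALERT_LEVEL:
    print(f"Drift alert: PSI = {psi:.2f}, notify the data owner")
```

The same comparison, run between training data and the population in which the system will be used, gives a rough first indication of representativeness before deployment.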

Tools help, but they are no substitute for responsibility

The market offers an ever-increasing number of tools for data cataloguing, quality assurance, lineage tracking, access control, data observability, dataset versioning and drift monitoring. These are necessary, but they do not, on their own, solve the most important problem. A tool can reveal the chaos, but it cannot determine who is responsible for it.

This is a common mistake made by organisations. A company purchases a data platform and implements a catalogue or classification system, but fails to determine which datasets are critical, who owns them, what the minimum quality thresholds are, and what the incident response procedure is. As a result, governance remains a matter of documentation rather than an operational control mechanism.

In the context of AI, this is not enough. Models and agents are increasingly operating in real time, drawing on multiple data sources and influencing the decisions made by people. Data governance must therefore be in place not only at the project preparation stage, but also across data pipelines, access processes, production monitoring and auditing.
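The difference between documentation and an operational control can be shown with a small gate that a data pipeline runs before data reaches a model. This is a sketch under stated assumptions: `DatasetPolicy`, `evaluate_gate`, the dataset name and the threshold values are hypothetical, and the appropriate minimums depend on the use case.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetPolicy:
    """Hypothetical policy: who owns a dataset and which minimums it must meet."""

    name: str
    owner: Optional[str]
    min_thresholds: Dict[str, float] = field(default_factory=dict)


def evaluate_gate(policy: DatasetPolicy, measured: Dict[str, float]) -> List[str]:
    """Return a list of violations; an empty list means the dataset may be used."""
    violations = []
    if not policy.owner:
        violations.append("no assigned owner")
    for metric, minimum in policy.min_thresholds.items():
        value = measured.get(metric)
        if value is None:
            violations.append(f"{metric}: not measured")
        elif value < minimum:
            violations.append(f"{metric}: {value:.2f} below minimum {minimum:.2f}")
    return violations


# Illustrative policy and measurements.
policy = DatasetPolicy(
    name="crm_customers",
    owner="Head of Sales Operations",
    min_thresholds={"completeness": 0.95, "timeliness": 0.90},
)
violations = evaluate_gate(policy, {"completeness": 0.83})
for violation in violations:
    print(f"{policy.name}: {violation}")
```

In this design a missing measurement counts as a violation, so a dataset cannot pass the gate simply because nobody measured it. What happens after a violation (blocking the pipeline, notifying the owner, opening an incident) is the organisational decision that the tool cannot make.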

Governance separates pilot projects from large-scale implementation

An AI pilot can be carried out even on limited, partially manually prepared data. Scaling, however, requires something else: repeatability, control, accountability and evidence. This is why data governance marks the boundary between an impressive experiment and secure business infrastructure.

Companies wishing to move from isolated implementations to the widespread use of AI should start not by asking which models to choose, but by asking which data is actually ready for use. Is it documented? Does it have an owner? Is it up to date? Is it representative? Can its provenance be traced? Can compliance with the law be demonstrated?

In the next phase of AI adoption, the advantage will lie not with those organisations that are quickest to connect their models to all data sources, but with those that know which data can be trusted, which data must not be used, and who is responsible for the consequences of its use.

Data governance is therefore not simply red tape surrounding artificial intelligence. It is a mechanism that determines whether AI will remain a series of pilot projects or become a secure, compliant and scalable part of a company’s operations.