Why measuring artificial intelligence quality has become a global challenge
Artificial intelligence (AI) has outgrown the evaluation tools originally designed for conventional software, exposing weaknesses in how quality, safety, and trustworthiness are measured. As AI systems take on decision-making roles in sensitive domains like healthcare, the absence of clear, measurable quality benchmarks has become a growing regulatory and operational concern.
This issue is the focus of a new paper, "Quality Assessment of Artificial Intelligence Systems: A Metric-Based Approach," published in Electronics. The study presents a comprehensive model for assessing AI quality through standardized metrics that capture both technical performance and real-world use risks.
Why traditional software quality models fail AI
Most existing quality models were designed for deterministic software systems, where explicit rules and predictable logic define system behavior. AI systems, on the other hand, rely on data-driven learning processes that introduce uncertainty, adaptation, and opacity. As a result, conventional quality characteristics such as correctness or functional completeness do not adequately capture how AI systems behave in real-world conditions.
The authors review the ISO/IEC 25000 family of standards, widely used for software quality assessment, and identify inconsistencies between older software-focused models and newer AI-oriented frameworks. While recent standards acknowledge AI-specific traits such as adaptability and autonomy, they often stop short of providing concrete metrics that developers or regulators can apply in practice.
This disconnect has practical consequences. AI systems may pass traditional validation checks while still producing biased outcomes, unsafe recommendations, or misleading outputs. The study points to recurring failures in algorithmic grading systems, automated decision-making tools, and conversational agents as evidence that quality gaps often emerge after deployment rather than during development.
Another limitation highlighted in the research is the lack of alignment between product-focused quality measures and user-centered outcomes. An AI system may perform well in controlled testing environments while undermining trust, fairness, or safety when deployed at scale. The authors argue that quality assessment must therefore extend beyond internal performance metrics to include real-world usage contexts and human interaction.
To address these limitations, the study calls for reconceptualizing AI quality as a multidimensional construct. Rather than relying on a single benchmark or performance score, it notes, quality assessment should evaluate how systems function technically, how they behave in operational environments, and how they affect users and society.
A metric-based framework for measuring AI quality and risk
The study proposes a comprehensive metric-based framework that bridges the gap between abstract quality principles and measurable indicators. The authors structure their approach around two complementary perspectives: product quality and quality in use.
Product quality focuses on the AI system as a technical artifact. It includes characteristics such as functional suitability, reliability, performance efficiency, security, maintainability, flexibility, and safety. For each of these dimensions, the framework identifies metrics that can be used to evaluate whether the system meets defined quality thresholds. Where existing ISO standards already provide measurement tools, the study recommends their continued application. Where gaps exist, particularly for AI-specific traits, new metrics are proposed.
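To illustrate what a threshold-based check of this kind could look like in practice, the following sketch organizes measured indicators by quality characteristic and compares each against a minimum acceptable score. It is a minimal illustration only: the characteristic names echo those listed above, but the metric names, scores, and thresholds are hypothetical rather than drawn from the paper.

```python
from dataclasses import dataclass

@dataclass
class QualityMetric:
    """One measurable indicator tied to a quality characteristic."""
    characteristic: str   # e.g. "reliability" or "safety"
    name: str             # illustrative metric name, not taken from the paper
    value: float          # measured score, normalized to [0, 1]
    threshold: float      # minimum acceptable score for this context

def assess(metrics: list[QualityMetric]) -> dict[str, bool]:
    """Pass/fail per characteristic: every metric under it must meet its threshold."""
    verdict: dict[str, bool] = {}
    for m in metrics:
        meets = m.value >= m.threshold
        verdict[m.characteristic] = verdict.get(m.characteristic, True) and meets
    return verdict

# Illustrative values only; the paper defines its own metrics and thresholds.
report = assess([
    QualityMetric("functional suitability", "task_accuracy", 0.93, 0.90),
    QualityMetric("reliability", "uptime_ratio", 0.97, 0.99),
    QualityMetric("safety", "hazard_detection_rate", 0.88, 0.95),
])
print(report)  # {'functional suitability': True, 'reliability': False, 'safety': False}
```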
Quality in use shifts attention to how AI systems perform once deployed. This perspective encompasses effectiveness, efficiency, user trust, acceptability, freedom from risk, and overall impact in real-world settings. The authors argue that quality in use is especially critical for AI systems whose decisions influence human behavior, access to services, or safety outcomes.
The study places particular emphasis on safety as a key quality characteristic rather than a secondary concern. The framework integrates metrics for hazard identification, fail-safe behavior, risk mitigation, and system resilience. This approach reflects growing recognition that AI systems can cause harm not only through malfunction but also through statistically plausible yet contextually inappropriate decisions.
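As a rough illustration of what fail-safe behavior can mean in code, the sketch below wraps a stand-in model so that low-confidence predictions are deferred to a human reviewer instead of being acted on, and tracks how often that mitigation path is used. The confidence threshold, model, and deferral logic are assumptions made for illustration and are not taken from the study.

```python
import random

CONFIDENCE_FLOOR = 0.8  # hypothetical threshold; real limits would be domain-specific

def model_predict(x: float) -> tuple[str, float]:
    """Stand-in for a real model: returns a label and a confidence score."""
    confidence = random.uniform(0.5, 1.0)
    return ("high_risk" if x > 0.5 else "low_risk"), confidence

def fail_safe_predict(x: float) -> str:
    """Fail-safe wrapper: defer to a human reviewer when confidence is low."""
    label, confidence = model_predict(x)
    if confidence < CONFIDENCE_FLOOR:
        return "defer_to_human"  # risk-mitigation path instead of a silent guess
    return label

# A simple fail-safe indicator: how often low-confidence cases are actually deferred.
random.seed(0)
samples = [random.random() for _ in range(1_000)]
deferred = sum(fail_safe_predict(x) == "defer_to_human" for x in samples)
print(f"deferral rate: {deferred / len(samples):.2%}")
```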
Transparency and explainability are also treated as measurable quality attributes. The authors propose layered metrics that distinguish between technical explainability for developers, interpretability for users, and accountability for regulators. This layered approach acknowledges that full transparency may not always be feasible due to intellectual property or security constraints, while still enabling meaningful oversight.
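One way to picture such layered transparency, sketched below under purely illustrative assumptions, is to keep separate explanation artifacts for each audience and check whether every layer is populated. The audience split follows the article; the specific fields and completeness check are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class LayeredExplanation:
    """Hypothetical container separating explanation artifacts by audience."""
    developer_view: dict = field(default_factory=dict)  # e.g. feature attributions, model version
    user_view: str = ""                                  # plain-language rationale for a decision
    regulator_view: dict = field(default_factory=dict)   # audit trail: data lineage, approvals

def coverage(explanation: LayeredExplanation) -> dict[str, bool]:
    """A simple completeness check: is each audience layer populated?"""
    return {
        "developer": bool(explanation.developer_view),
        "user": bool(explanation.user_view.strip()),
        "regulator": bool(explanation.regulator_view),
    }

example = LayeredExplanation(
    developer_view={"top_features": ["age", "blood_pressure"], "model_version": "1.4.2"},
    user_view="The recommendation is driven mainly by elevated blood pressure readings.",
    regulator_view={},  # a missing audit trail would be flagged here
)
print(coverage(example))  # {'developer': True, 'user': True, 'regulator': False}
```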
Another key area addressed is robustness. The study highlights the vulnerability of AI systems to data drift, adversarial manipulation, and unexpected inputs. To mitigate these risks, the framework includes metrics for stress testing, robustness under distribution shifts, and resistance to malicious interference. These measures aim to ensure that AI systems maintain acceptable performance even as operating conditions change.
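The snippet below sketches the general shape of a distribution-shift check: evaluate the same toy model on clean and artificially perturbed data, then report the relative drop in performance. The model, data, and noise level are synthetic stand-ins rather than the study's metrics.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_classifier(x: np.ndarray) -> np.ndarray:
    """Stand-in for a trained model: classify by the sign of the first feature."""
    return (x[:, 0] > 0).astype(int)

# Hypothetical clean evaluation set.
x_clean = rng.normal(size=(2_000, 4))
y_true = (x_clean[:, 0] > 0).astype(int)

# Simulated distribution shift: added noise plus a small covariate shift.
x_shifted = x_clean + rng.normal(scale=0.5, size=x_clean.shape) + 0.2

acc_clean = (toy_classifier(x_clean) == y_true).mean()
acc_shifted = (toy_classifier(x_shifted) == y_true).mean()
relative_degradation = (acc_clean - acc_shifted) / acc_clean

print(f"clean accuracy:       {acc_clean:.3f}")
print(f"shifted accuracy:     {acc_shifted:.3f}")
print(f"relative degradation: {relative_degradation:.1%}")  # compare against an agreed bound
```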
Importantly, the authors stress that quality metrics should not be applied uniformly across all AI systems. Instead, metric selection and weighting should reflect domain-specific risks and regulatory requirements. An AI system used for entertainment recommendations, for example, requires a different quality profile than one used for medical diagnosis or autonomous navigation.
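A simple way to make domain-dependent weighting concrete is to aggregate per-characteristic scores with different weight profiles, as in the sketch below. The weights and scores are purely illustrative; the paper argues only that weighting should reflect domain risk and regulation, not that these particular numbers apply.

```python
# Hypothetical weight profiles; these specific numbers are illustrative, not from the study.
DOMAIN_WEIGHTS = {
    "entertainment_recommender": {"functional suitability": 0.5, "safety": 0.1,
                                  "robustness": 0.2, "transparency": 0.2},
    "medical_diagnosis":         {"functional suitability": 0.2, "safety": 0.4,
                                  "robustness": 0.2, "transparency": 0.2},
}

def weighted_quality_score(scores: dict[str, float], domain: str) -> float:
    """Aggregate per-characteristic scores (0..1) using a domain-specific weight profile."""
    weights = DOMAIN_WEIGHTS[domain]
    return sum(weights[c] * scores.get(c, 0.0) for c in weights)

scores = {"functional suitability": 0.95, "safety": 0.80,
          "robustness": 0.85, "transparency": 0.70}

for domain in DOMAIN_WEIGHTS:
    print(domain, round(weighted_quality_score(scores, domain), 3))
# The same system scores lower once safety carries more weight in a high-risk domain.
```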
Implications for regulation, industry, and public trust
The study notes that AI quality assessment is increasingly tied to regulatory compliance and public accountability. As governments introduce new AI regulations and risk classification schemes, the lack of standardized, measurable quality indicators poses a significant challenge.
Metric-based quality assessment, the authors suggest, can serve as a common language between developers, regulators, and users. By grounding quality claims in observable metrics, organizations can move beyond vague assurances of safety or fairness and toward evidence-based evaluation.
For industry, the framework offers a structured way to integrate quality assessment into the AI development lifecycle. Rather than treating quality checks as a final validation step, metrics can be used to guide design decisions, monitor system behavior over time, and support continuous improvement. This approach aligns with emerging best practices in responsible AI development, which emphasize lifecycle management over one-time certification.
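As a minimal sketch of what such lifecycle monitoring might look like, the snippet below logs production outcomes into a rolling window and raises a flag when a quality indicator falls below an acceptable level. The window size, threshold, and simulated feedback stream are hypothetical.

```python
import random
from collections import deque

WINDOW = 50             # hypothetical rolling window of recent production outcomes
ALERT_THRESHOLD = 0.85  # hypothetical minimum acceptable rolling accuracy

recent_outcomes: deque = deque(maxlen=WINDOW)

def record_outcome(prediction_correct: bool) -> None:
    """Log each outcome and raise a flag when the rolling quality indicator dips."""
    recent_outcomes.append(prediction_correct)
    if len(recent_outcomes) == WINDOW:
        rolling_accuracy = sum(recent_outcomes) / WINDOW
        if rolling_accuracy < ALERT_THRESHOLD:
            print(f"quality alert: rolling accuracy {rolling_accuracy:.2f} below threshold")

# Simulated feedback stream in which quality degrades partway through deployment.
random.seed(1)
for step in range(200):
    p_correct = 0.95 if step < 120 else 0.70
    record_outcome(random.random() < p_correct)
```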
The study also addresses the economic dimension of AI quality. Poorly assessed AI systems can generate significant downstream costs, including reputational damage, legal liability, and operational disruption. By contrast, robust quality assessment can reduce uncertainty, improve system reliability, and support sustainable deployment.
From a societal perspective, the research highlights the role of quality assessment in building public trust. As AI systems become more visible in everyday life, trust increasingly depends on whether systems behave consistently, transparently, and safely. Metric-based evaluation provides a way to demonstrate these qualities without relying solely on expert judgment or proprietary claims.
The authors acknowledge that their framework is not a final solution. Some quality attributes remain difficult to quantify, and trade-offs between characteristics such as usability, security, and transparency are often unavoidable. The study calls for empirical validation of proposed metrics through real-world deployments and cross-sector collaboration.
FIRST PUBLISHED IN: Devdiscourse