Data, not algorithms, is the true power behind artificial intelligence

CO-EDP, VisionRI | Updated: 23-10-2025 18:57 IST | Created: 23-10-2025 18:57 IST

A new international study has revealed that the real foundation of artificial intelligence lies not in algorithms or model architectures, but in the quality, governance, and ethics of data itself. The research offers rare insight into how leading AI professionals around the world understand, manage, and confront the challenges of data-driven model development.

Published in Machine Learning and Knowledge Extraction, the study "Behind the Algorithm: International Insights into Data-Driven AI Model Development" challenges the long-standing, model-centric paradigm in AI research and practice. It argues that despite advances in machine learning frameworks, the integrity and reliability of AI systems depend primarily on how organizations collect, prepare, secure, and govern data.

Global insight into the hidden work behind AI systems

The study is based on in-depth interviews with 74 senior AI and data professionals across five continents: North America, Europe, Asia, Africa, and Oceania. Participants represented sectors as diverse as healthcare, finance, public administration, and technology.

The authors employed a thematic qualitative analysis to map the recurring patterns and challenges in data-driven AI development. The findings reveal a consistent theme: while organizations invest heavily in model innovation, the underlying data infrastructure often remains fragmented, inconsistent, and poorly regulated.

To capture this complexity, the authors developed what they call the Data-Centric AI Lifecycle Model, a framework describing data as an active infrastructure that shapes every phase of AI development, from collection and preparation to deployment and monitoring. The model reframes data not as a static resource but as a dynamic, evolving ecosystem requiring continuous maintenance, context-awareness, and ethical scrutiny.

The authors report that the overwhelming majority of respondents identified data quality, rather than model sophistication, as the most critical factor influencing AI system performance. Errors, omissions, and inconsistencies in datasets ripple through the entire pipeline, undermining fairness, transparency, and reliability. This recognition marks a significant shift in how the AI community is beginning to value the human and procedural dimensions of data work.

Data quality, governance, and ethics as systemic weak points

The study points out that poor data quality and governance remain the primary obstacles to safe and responsible AI. Respondents described a range of recurring challenges: unstandardized data formats, incomplete labeling, lack of documentation, and difficulties in integrating data from multiple sources. These deficiencies lead to what the authors call "data degradation": a process where errors accumulate invisibly until they manifest as biased or unreliable model outputs.

The study outlines five major domains where data issues affect AI reliability:

  • Data Preparation Challenges: High costs of data cleaning, labeling, and harmonization continue to strain organizational resources.
  • Quality and Reliability Risks: Incomplete or inconsistent datasets undermine both technical performance and ethical fairness.
  • Privacy and Security Concerns: Data leakage, adversarial manipulation, and third-party vendor misuse remain serious threats to public trust.
  • Ethical and Technical Bias: Bias enters through human labeling decisions as well as through feedback loops in algorithmic training.
  • Regulatory Adaptation Gaps: While many organizations comply with privacy laws like the GDPR, few have frameworks tailored to emerging AI-specific regulations, such as the EU AI Act.

Together, these domains reveal a systemic imbalance in how organizations prioritize innovation over integrity. The study's participants repeatedly emphasized that without strong governance structures, even technically advanced AI systems risk amplifying existing social inequalities.

The researchers highlight that data work is not purely technical but deeply social. Every stage, from acquisition to annotation, involves human judgment that reflects organizational culture, power dynamics, and ethical choices. This recognition reframes AI development as a socio-technical process requiring interdisciplinary oversight, not just engineering expertise.

From model-centric to data-centric AI: A necessary shift

The authors argue that the global AI ecosystem must now move from a model-centric paradigm to a data-centric one, where data integrity, transparency, and contextualization are treated as equal or greater priorities than algorithmic optimization.

Under the Data-Centric AI Lifecycle Model, each stage of the AI process is interconnected:

  • Collection demands contextual relevance and informed consent.
  • Preparation requires transparency about how data is cleaned and labeled.
  • Development must ensure that datasets reflect diverse and representative samples.
  • Deployment calls for accountability mechanisms to detect drift, bias, and misuse.
  • Monitoring and Explainability serve as ongoing checks to ensure that real-world feedback continues to improve system reliability.
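The monitoring stage calls for concrete checks that production data still resembles the data a model was trained on. One common, simple form of such a check, offered here as an illustrative sketch rather than anything from the study, compares a feature's live distribution against its training baseline; the values and alert threshold are arbitrary assumptions.

```python
import statistics

def drift_score(baseline, live):
    """Absolute shift in the mean, in units of the baseline's std deviation."""
    mu, sigma = statistics.mean(baseline), statistics.stdev(baseline)
    if sigma == 0:
        return float("inf") if statistics.mean(live) != mu else 0.0
    return abs(statistics.mean(live) - mu) / sigma

baseline = [10.0, 11.0, 9.0, 10.5, 9.5]  # feature values at training time
live = [13.0, 14.0, 12.5, 13.5, 14.5]    # values observed in production

score = drift_score(baseline, live)
print(f"drift score: {score:.2f}")
if score > 2.0:  # alert threshold, an arbitrary choice for illustration
    print("ALERT: feature distribution has drifted")
```

Wiring a check like this into deployment pipelines is one way the "accountability mechanisms to detect drift, bias, and misuse" described above become operational rather than aspirational.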

In this model, data governance is not an add-on - it is the foundation for ethical and effective AI. The researchers note that organizations that embed data stewardship principles early in development cycles achieve greater scalability and public confidence.

Participants from highly regulated sectors such as healthcare and finance demonstrated the most advanced governance maturity, often integrating interdisciplinary teams that combine data engineers, legal experts, and ethicists. In contrast, smaller startups and developing-region institutions expressed difficulty balancing innovation speed with compliance and quality control.

The study also underscores the regulatory uncertainty surrounding AI governance. While global frameworks like the EU AI Act are beginning to shape compliance, implementation remains inconsistent. Most respondents reported reliance on internal ethics committees or corporate policies, which vary widely in rigor and enforcement. This patchwork approach, the authors argue, leaves significant gaps in accountability.

The human dimension of data-centric AI

Perhaps the study's most profound insight is that AI systems mirror the social conditions of their data creation. The researchers describe data as a form of labor, produced, curated, and maintained by people whose perspectives and limitations inevitably shape outcomes. By making this "invisible work" visible, the study calls for recognition of the human expertise embedded in AI pipelines.

The authors propose that universities, corporations, and regulators invest in data literacy and stewardship training to close the expertise gap that currently limits governance implementation. They also advocate for cross-sector collaboration between technologists, social scientists, and policymakers to standardize best practices globally.

Ethical AI, according to their findings, cannot emerge from better code alone; it must emerge from better data practices grounded in social responsibility. The study positions this as the defining challenge of the next phase of AI evolution: ensuring that the data we use to build intelligent systems reflect the diversity, integrity, and values of the societies they aim to serve.

First published in: Devdiscourse