AI oversight changes how employees perform

New research suggests that replacing human judgment with AI may quietly change how people work, and not necessarily for the better.

In When an AI Judges Your Work: The Hidden Costs of Algorithmic Assessment, an arXiv working paper, researchers present experimental evidence that workers produce more output but lower-quality work when they know an AI system, rather than a human, is evaluating them.

AI assessment boosts quantity but weakens quality

To measure how workers respond to AI evaluation, the researchers conducted a controlled online experiment involving 208 participants recruited from the Prolific platform. Participants were paid to complete a real task: writing captions between 150 and 400 characters for 20 images depicting everyday scenes such as children playing, people working, or vendors preparing food.

Each caption was evaluated on three criteria: how interesting, how effortful, and how accurate it was. Participants earned a $1 bonus for each caption that ranked in the top 30 percent among submissions for a given image, allowing them to earn up to $20 in addition to an $8 participation fee. The incentive structure ensured that participants had strong financial motivation to perform well.

Crucially, participants were randomly assigned to one of two evaluation conditions. In one group, they were told their captions would be graded by a college-educated human. In the other, they were told ChatGPT would evaluate their work. Participants were not informed about the existence of the alternate condition, and each remained under the same evaluator for all 20 tasks.

The results show a clear shift in behavior under AI oversight. Participants assigned to AI evaluation produced longer captions on average. Being evaluated by ChatGPT increased output length by roughly 27.8 percent of a standard deviation compared to those assigned a human evaluator. At the same time, AI-assigned participants spent less time per caption. On average, human-evaluated participants took 94 seconds per response, while AI-evaluated participants took 85 seconds.
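Effects reported "as a percent of a standard deviation" are standardized: the raw difference in group means is divided by the spread of the outcome. A minimal sketch of that calculation, using hypothetical caption lengths rather than the study's data:

```python
# Illustrative only: how an effect expressed as a fraction of a
# standard deviation is computed. All numbers here are hypothetical.
import statistics

human_lengths = [220, 180, 250, 200, 190, 240]   # chars, hypothetical
ai_lengths    = [260, 230, 270, 250, 255, 245]   # chars, hypothetical

# Raw difference in mean caption length between conditions
mean_diff = statistics.mean(ai_lengths) - statistics.mean(human_lengths)

# Spread of the outcome across all captions
pooled_sd = statistics.stdev(human_lengths + ai_lengths)

# Standardized effect: mean difference in standard-deviation units
effect_in_sd = mean_diff / pooled_sd
print(f"Effect size: {effect_in_sd:.2f} standard deviations")
```

Standardizing this way lets effects on different outcomes (length, quality grades, time spent) be compared on a common scale.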

The pattern indicates that workers facing AI judgment wrote more text but did so more quickly. Quantity increased, but the additional words did not translate into higher-quality work.

When researchers controlled for caption length, quality declined in the AI treatment. Across multiple quality benchmarks, including grades assigned by both human graduate student raters and GPT-4o itself, captions written under AI evaluation were rated significantly lower once output quantity was held constant. The decline ranged between 12.1 percent and 20.5 percent of a standard deviation depending on the grading metric used.
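"Controlling for caption length" typically means regressing the quality grade on a treatment indicator plus length, so the treatment coefficient reflects the quality gap at a fixed output quantity. A sketch of that setup on simulated data; the variable names, coefficients, and data are illustrative assumptions, not the paper's:

```python
# Sketch of a quality regression with a length control, on simulated
# data. The true treatment effect here is set to -0.3 for illustration.
import numpy as np

rng = np.random.default_rng(0)
n = 200
ai_treated = rng.integers(0, 2, n)                         # 1 = AI evaluator
length = 200 + 30 * ai_treated + rng.normal(0, 40, n)      # chars
quality = 3.0 - 0.3 * ai_treated + 0.004 * length + rng.normal(0, 0.5, n)

# OLS: quality ~ intercept + treatment + length
X = np.column_stack([np.ones(n), ai_treated, length])
beta, *_ = np.linalg.lstsq(X, quality, rcond=None)
print(f"Treatment effect holding length constant: {beta[1]:.3f}")
```

Because longer captions earn higher grades on average, omitting the length control would make the AI condition look better than it is; including it isolates the per-unit quality decline the study reports.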

This means that although workers produced more text when judged by AI, the substance of their work suffered. The increase in quantity did not compensate for a decline in per-unit quality.

The researchers also found that both human and AI graders tended to award higher scores to longer captions. However, human graders placed greater weight on length. Once the influence of length was statistically accounted for, both human and AI graders consistently rated captions written under human evaluation as higher in quality.

External tool use surges under AI oversight

The second major finding concerns workers' reliance on external assistance. The experimental platform tracked whether participants pasted text into the caption field, allowing researchers to detect potential use of outside tools. Pasting served as the primary proxy for external help, including the possible use of AI writing assistants.

Participants assigned to AI evaluation were nearly twice as likely to paste text. About 20.1 percent of captions in the AI condition involved pasted content, compared to 10.4 percent in the human condition. This represents a 93 percent increase in the likelihood of external tool use when workers believed AI was judging their work.
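The 93 percent figure follows directly from the two paste rates, as a quick arithmetic check confirms:

```python
# Relative increase implied by the two reported paste rates.
ai_rate = 0.201      # share of captions with pasted text, AI condition
human_rate = 0.104   # share of captions with pasted text, human condition

relative_increase = (ai_rate - human_rate) / human_rate
print(f"Relative increase: {relative_increase:.0%}")  # → Relative increase: 93%
```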

The researchers investigated whether increased tool use explained the decline in quality. It did not. In fact, captions that involved pasted text were often rated as higher quality by human graders. The quality drop under AI oversight persisted even when separating pasted and non-pasted captions.

To deepen their analysis, the authors introduced a novel incentivized disclosure method. At the end of the experiment, participants were asked whether they had used a tool like ChatGPT. Initially, the question carried no financial incentive. Later, participants were offered a $1 bonus if their answer matched the prediction of AI detection software.

The incentivized question significantly increased admissions of AI use. While only 13 participants initially admitted using external help, 29 acknowledged possible AI assistance when money was involved. The method captured heavy users who pasted text in many rounds and also identified participants who may have consulted AI without directly copying text.

The elicitation technique represents a methodological contribution to experimental research. As generative AI becomes more integrated into everyday tasks, detecting AI assistance poses challenges. The study demonstrates that combining behavioral tracking with incentivized self-reporting can improve measurement accuracy.

The psychology of human versus AI judgment

The study explores why workers respond differently to AI and human evaluators. Since the payment structure was identical across treatments, differences cannot be explained by beliefs about grader leniency or bonus probability.

One possible explanation lies in intrinsic motivation. Participants were asked how happy they felt when receiving a high score from their assigned evaluator. Those evaluated by a human reported significantly greater satisfaction than those evaluated by AI. On a five-point scale, participants in the human condition reported higher levels of positive emotion following validation.

The result suggests that social validation matters. Being judged by a person may trigger reputational concerns, pride, or a desire for approval. AI oversight, by contrast, may reduce emotional engagement and shift focus toward strategic optimization.

The findings align with broader economic literature on monitoring and effort. Previous research has shown that monitoring can either motivate or crowd out intrinsic effort depending on context. The current study adds nuance by distinguishing between human and algorithmic monitoring.

The researchers argue that AI evaluation introduces a new trade-off for firms. On one hand, algorithmic assessment is dramatically cheaper and faster. In the experiment, AI grading cost $11.67 plus programming time, while human grading required three graduate students working 54 hours at a cost of $6,480. The financial incentive to automate is substantial.

On the other hand, the behavioral response to AI oversight may undermine quality in certain contexts. If workers adjust their behavior strategically when monitored by algorithms, organizations may face hidden costs that offset some of the efficiency gains.

Implications for the future of work

AI-based evaluation is already being deployed in hiring, academic grading, call center monitoring, and performance reviews. As generative AI systems become more capable of evaluating complex tasks, algorithmic judgment is likely to expand.

The study suggests that managers should carefully consider how workers respond to different forms of oversight. For routine or high-volume tasks where quantity matters most, AI evaluation may be well suited. For creative, analytical, or high-stakes work where refinement and depth are critical, exclusive reliance on AI assessment could produce unintended consequences.

The research also highlights the evolving relationship between AI as tool and AI as monitor. Generative AI not only evaluates work but also performs the very tasks it assesses. This dual role may increase incentives for workers to seek AI assistance when AI is judging them.

  • FIRST PUBLISHED IN:
  • Devdiscourse
